Building a MicroGPT in Common Lisp
In the next several chapters we will use production-grade Large Language Models such as GPT-5, Gemini, and Claude. This chapter is optional material: Karpathy’s MicroGPT re-implemented in Common Lisp.
Introduction
In recent years, the artificial intelligence landscape has been completely transformed by large language models (LLMs) built upon the foundational Transformer architecture. Models like GPT-5, LLaMA, Gemma 3, and Claude contain tens of billions to trillions of parameters. Their capabilities are profound, ranging from generating sophisticated poetry to writing complex computer code. However, the sheer size and complexity of these production models can make them feel like impenetrable black boxes to developers and researchers trying to grasp their inner workings.
To understand how these massive systems operate, it can be beneficial to strip away the complexities of distributed training, massive datasets, and heavily optimized C++/CUDA backend libraries. One of the most famous educational projects in this domain is Andrej Karpathy’s MicroGPT written in Python (often associated with his broader nanoGPT and micrograd repositories). Karpathy’s work distills the essence of a Generative Pre-trained Transformer into a bare-bones, dependency-free implementation, allowing anyone to trace the mathematical operations from start to finish.
In this chapter, we will explore a complete, dependency-free port of MicroGPT written entirely in Common Lisp. Common Lisp has a rich, storied history in the field of artificial intelligence. Its symbolic nature, dynamic typing, and interactive REPL-driven development cycle make it a uniquely powerful tool for exploring complex algorithmic architectures. By stepping away from the heavy machinery of modern Python frameworks like PyTorch, TensorFlow, or JAX, we are forced to construct the foundational components from scratch. We will build an automatic differentiation engine (autograd), construct neural network layers, implement the multi-head self-attention mechanism, and design the training loop. This hands-on approach not only demystifies how modern AI systems learn but also beautifully demonstrates the expressive power and mathematical elegance of Common Lisp.
Demystifying the Core Components
To build a functioning, end-to-end language model from scratch, we need to orchestrate several interacting systems. Our Common Lisp implementation elegantly encapsulates these systems into a single, cohesive file. Let us break down the architecture into its primary constituents: the autograd engine, model parameter initialization, the Transformer architecture, the training loop, and the inference generator. Understanding each of these pieces is crucial for grasping the holistic behavior of a language model.
The Autograd Engine
At the heart of any modern neural network training process is the ability to automatically compute the gradients of the loss function with respect to the model’s millions (or, in this case, a few thousand) of parameters. This is achieved through a technique known as reverse-mode automatic differentiation.
In our Lisp code, this mechanism is implemented using a simple defstruct called value. A value structure encapsulates a scalar piece of data, its current gradient, the child nodes from which it was computed, and the local gradients with respect to those children. This means that every time we perform a mathematical operation, we are not merely calculating a result; we are dynamically building a computational graph.
When operations like addition (v+), multiplication (v*), exponentiation (vpow), or non-linearities like ReLU (vrelu) are executed, they instantiate new value objects. These objects maintain pointers to their operands. For example, if c = a + b, then c knows that it was created from a and b, and it stores the local derivatives of the addition operation.
Once a forward pass through the network is complete and a scalar loss value is computed (representing how wrong the network’s predictions were), the backward function takes over. It performs a topological sort of the entire computational graph. Then, traversing this sorted list in reverse order, it applies the chain rule from calculus to propagate gradients backward from the loss down to every individual weight of the model. This dependency-free approach, heavily inspired by Karpathy’s micrograd, clearly illustrates that automatic differentiation is fundamentally just an automated, programmatic application of the chain rule over a directed acyclic graph.
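As a minimal, self-contained sketch of this idea (using simplified names like node, n+, and backprop that are my own, not the listing’s value/v+/backward), the whole mechanism fits in a few definitions:

```lisp
;; Each operation records its operands and local derivatives; backprop
;; topologically sorts the graph and applies the chain rule in reverse.
(defstruct (node (:constructor mk-node (data &optional children local-grads)))
  data (grad 0.0) children local-grads)

(defun n+ (a b)
  ;; d(a+b)/da = 1, d(a+b)/db = 1
  (mk-node (+ (node-data a) (node-data b)) (list a b) '(1.0 1.0)))

(defun n* (a b)
  ;; d(a*b)/da = b, d(a*b)/db = a
  (mk-node (* (node-data a) (node-data b)) (list a b)
           (list (node-data b) (node-data a))))

(defun backprop (v)
  (let ((topo nil) (seen (make-hash-table :test 'eq)))
    (labels ((build (n)
               (unless (gethash n seen)
                 (setf (gethash n seen) t)
                 (mapc #'build (node-children n))
                 (push n topo))))
      (build v))
    (setf (node-grad v) 1.0)
    (dolist (n topo)
      (mapc (lambda (c g) (incf (node-grad c) (* g (node-grad n))))
            (node-children n) (node-local-grads n)))))

;; For y = a*b + a with a=2, b=3: dy/da = b + 1 = 4, dy/db = a = 2.
(defun demo-grads ()
  (let* ((a (mk-node 2.0)) (b (mk-node 3.0))
         (y (n+ (n* a b) a)))
    (backprop y)
    (list (node-grad a) (node-grad b))))
```

Note how the gradient of a accumulates contributions from both paths through the graph (the product and the residual addition), exactly the behavior incf gives us in the backward sweep.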
Model Parameters and Initialization
A neural network is essentially an enormous collection of parameters—weights and biases—that are iteratively optimized over time. In our Lisp implementation, these parameters are stored in a centralized hash table called *state-dict*.
The init-model function defines the architectural shape of our MicroGPT. We initialize several distinct sets of matrices:
- wte: The token embedding matrix, which translates discrete character IDs into dense continuous vectors.
- wpe: The positional embedding matrix, which gives the model information about where a token is located in the sequence.
- lm_head: The final language modeling head, a linear layer that projects the internal representations back into the vocabulary space to predict the next character.
- Transformer layers: For each layer, we initialize matrices for the attention mechanism (attn_wq, attn_wk, attn_wv, attn_wo) and the multi-layer perceptron (mlp_fc1, mlp_fc2).
Each parameter is initialized using a Gaussian distribution (random-gauss). This small amount of random noise provides the network with an asymmetrical starting state, breaking symmetry and allowing gradient descent to effectively navigate the loss landscape. Crucially, every weight generated is a value structure, ensuring that every calculation utilizing these weights is automatically tracked by the autograd engine.
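The Gaussian noise comes from the Box-Muller transform, which turns two uniform samples into one normally distributed sample. A standalone sketch of this (with my own names gauss and sample-mean; the listing calls its version random-gauss):

```lisp
;; Box-Muller: sqrt(-2 ln u1) * cos(2 pi u2) is standard normal when
;; u1, u2 are uniform on (0,1); scale by STD and shift by MEAN.
(defun gauss (mean std)
  (+ mean (* std
             (sqrt (* -2.0 (log (max (random 1.0) 1e-15)))) ; clamp avoids log 0
             (cos (* 2.0 pi (random 1.0))))))

;; Empirical sanity check: the mean of many draws should sit near MEAN.
(defun sample-mean (n mean std)
  (/ (loop repeat n sum (gauss mean std)) n))
```

With std 0.08 (the listing’s default for make-matrix), 10,000 draws give a sample mean within roughly a thousandth of zero, which is the small, symmetric noise the initialization relies on.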
The Transformer Architecture
The gpt function serves as the central processing unit of the model. It implements the forward pass of a decoder-only Transformer. As data flows through this function, it undergoes a series of sophisticated transformations:
- Embeddings: Each discrete token (in our case, characters) and its position in the sequence are converted into dense vectors using our embedding matrices. These two vectors are added together to form the initial representation of the input sequence.
- Layer Normalization: We employ RMSNorm (Root Mean Square Normalization) to stabilize the neural activations. RMSNorm is computationally simpler than standard Layer Normalization because it does not require mean-centering, yet it is highly effective. It has become a standard technique in modern, highly optimized models like Meta’s LLaMA.
- Multi-Head Self-Attention: This is the defining feature of the Transformer. The model linearly projects the input into three distinct representations: Queries (Q), Keys (K), and Values (V). For each attention head, the model computes the dot product of the Query and Key. This dot product determines how much “attention” the current token should pay to all past tokens. The attention scores are normalized using the vsoftmax function and are then used to form a weighted sum of the Value vectors. To significantly speed up inference, we maintain a Key-Value (KV) cache. This cache stores past K and V computations, preventing the model from redundantly recalculating past context for every newly generated token.
- Multi-Layer Perceptron (MLP): Following the attention mechanism, the data is passed through a simple feed-forward neural network equipped with a ReLU activation function. While the attention mechanism allows tokens to communicate with one another, the MLP allows each individual token to process and refine the information it has gathered.
- Residual Connections: You will notice in the code that the outputs of the attention layer and the MLP layer are added back to their respective inputs ((setf x (mapcar #'v+ x xr))). These residual connections create “shortcuts” for gradients to flow backward through the network, mitigating the vanishing gradient problem and enabling the training of deeper architectures.
- Language Modeling Head: Finally, the refined representations are projected back into the vocabulary space. The output is a set of raw scores, or “logits,” representing the model’s prediction for what the next token in the sequence should be.
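The core of the attention step can be illustrated on plain floats, independent of the autograd machinery. This sketch (the names dot, softmax, and attend are mine, not the listing’s) scores the current query against every cached key, softmaxes the scores, and blends the cached values:

```lisp
(defun dot (a b) (reduce #'+ (mapcar #'* a b)))

;; Numerically stable softmax: subtract the max before exponentiating.
(defun softmax (xs)
  (let* ((mx (reduce #'max xs))
         (exps (mapcar (lambda (x) (exp (- x mx))) xs))
         (sum (reduce #'+ exps)))
    (mapcar (lambda (e) (/ e sum)) exps)))

(defun attend (query keys values)
  "One head of scaled dot-product attention. QUERY is the current
token's query vector; KEYS and VALUES are lists of cached vectors."
  (let* ((dim (length query))
         (scores (mapcar (lambda (k) (/ (dot query k) (sqrt dim))) keys))
         (weights (softmax scores)))
    ;; Per-dimension weighted sum of the value vectors.
    (apply #'mapcar
           (lambda (&rest col) (reduce #'+ (mapcar #'* weights col)))
           values)))
```

For example, a query aligned with the first of two cached keys pulls the output toward the first value vector; because softmax weights sum to one, the result is always a convex blend of the values.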
Training the Model: Learning from Data
The run-training function orchestrates the intricate dance of the learning process. The model trains by reading lines of text (in our specific example, from a file called names.txt) and processing each line character by character.
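The character-level tokenization this implies is simple enough to sketch on its own (the names build-vocab and encode are mine; the listing inlines this logic in run-microgpt and run-training):

```lisp
(defun build-vocab (docs)
  "Return the sorted string of unique characters across DOCS."
  (coerce (sort (remove-duplicates (apply #'concatenate 'string docs)) #'char<)
          'string))

(defun encode (doc uchars)
  "Encode DOC as BOS + character ids + BOS, where the BOS id is
(length UCHARS), i.e. one id past the last real character."
  (let ((bos (length uchars)))
    (append (list bos)
            (loop for ch across doc collect (position ch uchars))
            (list bos))))
```

Over the toy corpus ("emma" "ava") the vocabulary is "aemv" with BOS id 4, so "ava" encodes to (4 0 3 0 4); with the full names.txt the same scheme yields the 26 letters plus BOS, matching the reported vocab size of 27.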
The training loop proceeds iteratively over many steps:
- The Forward Pass: The model ingests a sequence of characters and attempts to predict the subsequent character for every position in the sequence simultaneously.
- Loss Calculation: We utilize the cross-entropy loss function. By applying the softmax function to our raw logits, we convert them into a normalized probability distribution. The loss is calculated as the negative log-likelihood of the correct next character. If the model assigns a high probability to the correct character, the loss is low. If it assigns a low probability, the loss is high.
- The Backward Pass: The backward function is invoked on the final scalar loss value. This single function call ripples backward through the computational graph, computing the exact gradient for every parameter in the network.
- Optimization: We update the parameters using the Adam optimization algorithm. Unlike simple stochastic gradient descent, Adam maintains running averages of both the gradients and the squared gradients. This allows it to adaptively tune the learning rate for each individual parameter, resulting in faster and more stable convergence.
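Two numerical pieces of this loop are worth isolating on plain floats: the cross-entropy loss and a single bias-corrected Adam update. The names cross-entropy and adam-step below are mine; the hyperparameter defaults mirror those in run-training (lr 0.01, b1 0.85, b2 0.99, eps 1e-8):

```lisp
(defun cross-entropy (logits target)
  "Negative log-probability that softmax(LOGITS) assigns to index TARGET."
  (let* ((mx (reduce #'max logits))                       ; stability shift
         (exps (mapcar (lambda (x) (exp (- x mx))) logits))
         (z (reduce #'+ exps)))
    (- (log (/ (nth target exps) z)))))

(defun adam-step (param grad m v step
                  &key (lr 0.01) (b1 0.85) (b2 0.99) (eps 1e-8))
  "Return the updated (param m v) after one Adam step, STEP counted from 1."
  (let* ((m2 (+ (* b1 m) (* (- 1 b1) grad)))              ; first moment
         (v2 (+ (* b2 v) (* (- 1 b2) (* grad grad))))     ; second moment
         (mh (/ m2 (- 1 (expt b1 step))))                 ; bias correction
         (vh (/ v2 (- 1 (expt b2 step)))))
    (list (- param (/ (* lr mh) (+ (sqrt vh) eps))) m2 v2)))
```

A uniform three-way distribution gives a loss of log 3 ≈ 1.0986, which is why the very first training steps hover near log 27 ≈ 3.3 for our 27-token vocabulary. On the first step the bias corrections cancel the moment decay exactly, so the update size is just the learning rate.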
By executing this loop over hundreds or thousands of steps, the model incrementally adjusts its internal weights. It slowly learns to assign higher probabilities to the correct sequences of characters, thereby absorbing the underlying statistical structure of the training data.
Inference: Hallucinating New Data
Once the model has completed its training phase and internalized the patterns of the dataset, we can use it to generate novel text via the run-inference function.
Starting with a special Beginning-Of-Sequence (BOS) token, we feed this initial context into the model to obtain logits for the first character. These logits are divided by a temperature of 0.5 (which sharpens the distribution) and converted to probabilities, and we then sample from this distribution (using the random-choice function) to select a character. This newly generated character is appended to our context and fed back into the model to predict the next character. This autoregressive generation process continues in a loop until the model predicts the BOS token again, which we use to indicate the end of the sequence.
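The sampling step itself is just a walk along the cumulative weights. A sketch (my name sample-index; the listing’s version is random-choice and operates on a vector):

```lisp
(defun sample-index (weights)
  "Return an index into WEIGHTS, chosen with probability proportional
to each weight. Falls back to the last index to absorb rounding."
  (let ((r (random (reduce #'+ weights)))  ; uniform in [0, total)
        (acc 0.0))
    (loop for w in weights
          for i from 0
          do (incf acc w)
          when (< r acc) return i
          finally (return (1- (length weights))))))
```

A degenerate distribution such as (0.0 1.0 0.0) always yields index 1, while a uniform one picks each index equally often; the fallback return guards against floating-point round-off leaving r fractionally above the final cumulative sum.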
Because we sample from a probability distribution rather than greedily picking the single most likely token at every step, the model can generate diverse, creative, and sometimes surprising outputs. It effectively “hallucinates” new names or words that strongly resemble the stylistic patterns of the training data but were never explicitly contained within it.
Conclusion
This complete Common Lisp implementation of MicroGPT vividly demonstrates that the core concepts underpinning modern large language models are not inscrutable magic. By methodically breaking down the architecture into its fundamental mathematical operations and constructing a rudimentary but fully functional autograd engine from scratch, we gain a profound, demystified understanding of how these powerful systems learn and operate.
Common Lisp, with its unparalleled flexibility and expressive syntax, proves to be an exceptionally capable and elegant language for this endeavor. Its interactive nature makes it a joy to use for defining complex computational graphs and experimenting with novel neural network architectures.
In the following section, you will find the complete source code for our dependency-free MicroGPT. Reading carefully through this program listing will help solidify the theoretical concepts discussed in this chapter and provide you with a robust, transparent foundation for further experimentation and exploration in the fascinating world of language modeling.
Complete Source Code Listing for microgpt.lisp
;;;; microgpt.lisp — Karpathy's microGPT in dependency-free Common Lisp

(defpackage #:microgpt (:use #:cl) (:export #:run-microgpt))
(in-package #:microgpt)

(defun random-gauss (mean std)
  (+ mean (* std (sqrt (* -2 (log (max (random 1.0) 1e-15)))) (cos (* 2 pi (random 1.0))))))

(defun shuffle-list (seq)
  (let ((v (coerce seq 'vector)))
    (loop for i from (1- (length v)) downto 1 for j = (random (1+ i))
          do (rotatef (aref v i) (aref v j)))
    (coerce v 'list)))

(defun read-lines (file)
  (with-open-file (s file :if-does-not-exist nil)
    (when s
      (loop for line = (read-line s nil) while line
            for tr = (string-trim '(#\Space #\Tab #\Newline #\Return) line)
            when (plusp (length tr)) collect tr))))

(defparameter *docs* nil) (defparameter *uchars* nil) (defparameter *bos* nil)
(defparameter *vocab-size* nil) (defparameter *n-layer* 1) (defparameter *n-embd* 16)
(defparameter *block-size* 16) (defparameter *n-head* 4) (defparameter *head-dim* 4)
(defparameter *state-dict* (make-hash-table :test 'equal)) (defparameter *params* nil)

;;; Autograd
(defstruct (value (:print-function print-value) (:constructor %make-value))
  (data 0.0 :type single-float) (grad 0.0 :type single-float)
  (children nil) (local-grads nil))

(defun print-value (v s d) (declare (ignore d))
  (format s "<Value ~A/~A>" (value-data v) (value-grad v)))

(defun ensure (v) (if (value-p v) v (%make-value :data (float v 1.0))))
(defun make-value (d &optional ch lg)
  (%make-value :data (if (value-p d) (value-data d) (float d 1.0)) :children ch :local-grads lg))

(defun v+ (a b)
  (let ((a (ensure a)) (b (ensure b)))
    (make-value (+ (value-data a) (value-data b)) (list a b) '(1.0 1.0))))
(defun v* (a b)
  (let ((a (ensure a)) (b (ensure b)))
    (make-value (* (value-data a) (value-data b)) (list a b) (list (value-data b) (value-data a)))))
(defun vpow (v p)
  (let ((v (ensure v)))
    (make-value (expt (value-data v) p) (list v)
                (list (coerce (* p (expt (value-data v) (1- p))) 'single-float)))))
(defun vlog (v) (let ((v (ensure v))) (make-value (log (value-data v)) (list v) (list (/ 1.0 (value-data v))))))
(defun vexp (v) (let ((v (ensure v))) (make-value (exp (value-data v)) (list v) (list (exp (value-data v))))))
(defun vrelu (v) (let ((v (ensure v))) (make-value (max 0.0 (value-data v)) (list v) (list (if (plusp (value-data v)) 1.0 0.0)))))
(defun vneg (v) (v* v -1.0))
(defun v- (a b) (v+ a (vneg b)))
(defun v/ (a b) (v* a (vpow b -1.0)))

(defun backward (v)
  (let ((topo nil) (seen (make-hash-table :test 'eq)))
    (labels ((build (n)
               (unless (gethash n seen)
                 (setf (gethash n seen) t)
                 (mapc #'build (value-children n))
                 (push n topo))))
      (build v))
    (setf (value-grad v) 1.0)
    (dolist (n topo)
      (mapc (lambda (c g) (incf (value-grad c) (* g (value-grad n))))
            (value-children n) (value-local-grads n)))))

;;; Model Parameters
(defun make-matrix (nout nin &optional (std 0.08))
  (coerce (loop for i below nout collect
                (coerce (loop for j below nin collect
                              (let ((v (make-value (random-gauss 0 std))))
                                (push v *params*) v))
                        'vector))
          'vector))

(defun init-model ()
  (setf *state-dict* (make-hash-table :test 'equal) *params* nil)
  (flet ((mat (k r c) (setf (gethash k *state-dict*) (make-matrix r c))))
    (mat "wte" *vocab-size* *n-embd*) (mat "wpe" *block-size* *n-embd*) (mat "lm_head" *vocab-size* *n-embd*)
    (dotimes (i *n-layer*)
      (flet ((lmat (s r c) (mat (format nil "layer~A.~A" i s) r c)))
        (lmat "attn_wq" *n-embd* *n-embd*) (lmat "attn_wk" *n-embd* *n-embd*)
        (lmat "attn_wv" *n-embd* *n-embd*) (lmat "attn_wo" *n-embd* *n-embd*)
        (lmat "mlp_fc1" (* 4 *n-embd*) *n-embd*) (lmat "mlp_fc2" *n-embd* (* 4 *n-embd*)))))
  (setf *params* (nreverse *params*)))

;;; Architecture
(defun vsum (vs) (reduce #'v+ vs :initial-value (make-value 0.0)))
(defun vlinear (x w)
  (loop for row across w
        collect (reduce #'v+ (mapcar #'v* (coerce row 'list) x) :initial-value (make-value 0.0))))
(defun vsoftmax (logits)
  (let* ((mx (reduce #'max logits :key #'value-data))
         (exps (mapcar (lambda (v) (vexp (v- v mx))) logits))
         (sum (vsum exps)))
    (mapcar (lambda (e) (v/ e sum)) exps)))
(defun vrmsnorm (x)
  (let ((sc (vpow (v+ (v/ (vsum (mapcar (lambda (xi) (v* xi xi)) x)) (length x)) 1e-5) -0.5)))
    (mapcar (lambda (xi) (v* xi sc)) x)))
(defun make-kv-cache ()
  (coerce (loop for i below *n-layer* collect (make-array 0 :fill-pointer 0 :adjustable t)) 'vector))

(defun gpt (tok-id pos-id keys vals)
  (let ((x (mapcar #'v+ (coerce (aref (gethash "wte" *state-dict*) tok-id) 'list)
                   (coerce (aref (gethash "wpe" *state-dict*) pos-id) 'list))))
    (setf x (vrmsnorm x))
    (dotimes (li *n-layer*)
      (flet ((sd (s) (gethash (format nil "layer~A.~A" li s) *state-dict*)))
        (let ((xr x))
          (setf x (vrmsnorm x))
          (let* ((q (vlinear x (sd "attn_wq")))
                 (k (vlinear x (sd "attn_wk")))
                 (v (vlinear x (sd "attn_wv"))))
            (vector-push-extend k (aref keys li))
            (vector-push-extend v (aref vals li))
            (setf x (vlinear
                     (loop for h below *n-head* nconc
                           (let* ((hs (* h *head-dim*)) (he (+ hs *head-dim*))
                                  (q-h (subseq q hs he))
                                  (k-h (map 'list (lambda (ki) (subseq ki hs he)) (aref keys li)))
                                  (v-h (map 'list (lambda (vi) (subseq vi hs he)) (aref vals li)))
                                  (aw (vsoftmax (loop for kt in k-h collect
                                                      (v/ (vsum (mapcar #'v* q-h kt)) (sqrt *head-dim*))))))
                             (loop for j below *head-dim* collect
                                   (vsum (mapcar (lambda (w vt) (v* w (nth j vt))) aw v-h)))))
                     (sd "attn_wo")))
            (setf x (mapcar #'v+ x xr))))
        (let ((xr x))
          (setf x (vrmsnorm x))
          (setf x (mapcar #'vrelu (vlinear x (sd "mlp_fc1"))))
          (setf x (vlinear x (sd "mlp_fc2")))
          (setf x (mapcar #'v+ x xr)))))
    (vlinear x (gethash "lm_head" *state-dict*))))

;;; Training
(defun run-training ()
  (let* ((lr 0.01) (b1 0.85) (b2 0.99) (eps 1e-8) (steps 1000) (np (length *params*))
         (m (make-array np :initial-element 0.0)) (mv (make-array np :initial-element 0.0)))
    (dotimes (step steps)
      (let* ((doc (nth (mod step (length *docs*)) *docs*))
             (tokens (append (list *bos*)
                             (loop for ch across doc collect (position ch *uchars*))
                             (list *bos*)))
             (n (min *block-size* (1- (length tokens))))
             (keys (make-kv-cache)) (vals (make-kv-cache)) (losses nil))
        (dotimes (pos n)
          (let* ((logits (gpt (nth pos tokens) pos keys vals))
                 (probs (coerce (vsoftmax logits) 'vector)))
            (push (vneg (vlog (aref probs (nth (1+ pos) tokens)))) losses)))
        (let* ((loss (v/ (vsum losses) n)) (t+1 (1+ step)))
          (backward loss)
          (loop for i below np for p in *params* do
                (setf (aref m i) (+ (* b1 (aref m i)) (* (- 1 b1) (value-grad p))))
                (setf (aref mv i) (+ (* b2 (aref mv i)) (* (- 1 b2) (expt (value-grad p) 2))))
                (let ((mh (/ (aref m i) (- 1 (expt b1 t+1))))
                      (vh (/ (aref mv i) (- 1 (expt b2 t+1)))))
                  (decf (value-data p) (/ (* lr (- 1 (/ step steps)) mh) (+ (sqrt vh) eps)))
                  (setf (value-grad p) 0.0)))
          (format t "step ~4D/~4D | loss ~,4F~%" t+1 steps (value-data loss))
          (finish-output))))))

;;; Inference
(defun random-choice (weights)
  (let ((r (random (reduce #'+ weights))) (acc 0.0))
    (dotimes (i (1- (length weights)) (1- (length weights)))
      (incf acc (aref weights i))
      (when (< r acc) (return-from random-choice i)))))

(defun run-inference ()
  (format t "~%--- inference (new, hallucinated names) ---~%")
  (dotimes (i 20)
    (let ((keys (make-kv-cache)) (vals (make-kv-cache)) (tok *bos*) (sample nil))
      (dotimes (pos *block-size*)
        (let* ((lt (mapcar (lambda (l) (/ (value-data l) 0.5)) (gpt tok pos keys vals)))
               (mx (reduce #'max lt))
               (exps (mapcar (lambda (v) (exp (- v mx))) lt))
               (probs (coerce (mapcar (lambda (e) (/ e (reduce #'+ exps))) exps) 'vector)))
          (setf tok (random-choice probs))
          (if (= tok *bos*) (return) (push (char *uchars* tok) sample))))
      (format t "sample ~2D: ~A~%" (1+ i) (coerce (reverse sample) 'string)))))

;;; Main
(defun run-microgpt ()
  (setf *random-state* (make-random-state t)
        *docs* (shuffle-list (read-lines "names.txt")))
  (format t "num docs: ~A~%" (length *docs*))
  (let ((chars nil))
    (dolist (doc *docs*) (loop for ch across doc do (pushnew ch chars)))
    (setf *uchars* (coerce (sort chars #'char<) 'string)
          *bos* (length *uchars*)
          *vocab-size* (1+ *bos*)))
  (format t "vocab size: ~A~%" *vocab-size*)
  (init-model)
  (format t "num params: ~A~%" (length *params*))
  (run-training)
  (run-inference))

(microgpt:run-microgpt)
Let’s run the code:
$ sbcl
* (load "microgpt.lisp")
num docs: 32033
vocab size: 27
num params: 4192
step 1/1000 | loss 3.3421
step 2/1000 | loss 3.5779
...
step 997/1000 | loss 2.2916
step 998/1000 | loss 1.7724
step 999/1000 | loss 2.3279
step 1000/1000 | loss 2.3213

--- inference (new, hallucinated names) ---
sample 1: lereri
sample 2: nahand
sample 3: eleni
sample 4: amica
sample 5: calina
sample 6: alla
sample 7: marona
sample 8: chidynn
sample 9: kanan
sample 10: brann
sample 11: kalen
sample 12: jorena
Wrap Up
This concludes our exploration of MicroGPT implemented from scratch in Common Lisp. By studying the complete source code listing provided above, you have seen firsthand how the abstract mathematical concepts of reverse-mode automatic differentiation, self-attention, and autoregressive language modeling translate into concrete, functional code.
I highly encourage you to load microgpt.lisp into your own Common Lisp REPL and run it. Try experimenting with the hyperparameters—adjust the *n-layer*, *n-embd*, or *n-head* variables and observe how they impact both the training time and the quality of the hallucinated outputs. You can also swap out the names.txt dataset for your own text files to see how the model adapts to entirely different linguistic patterns. By actively tinkering with this minimalistic implementation, you will cement your understanding of how modern generative AI truly works under the hood.