logistics supply chain header
Home / AI / RTS Experiment: We Built a Tiny LLM From Scratch
AI / Experiments
ai accountability / AI Consulting / ai logistics

RTS Experiment: We Built a Tiny LLM From Scratch

Published:

TABLE OF CONTENTS

TL;DR

The RTS Labs team built and trained two tiny language models from scratch using Andrej Karpathy’s nanoGPT framework to observe training behavior firsthand on their hardware.

Key Takeaways:

  • Clear Performance Ceiling: The tiny Bigram model (~13k parameters) quickly hit its architectural limit, flatlining at a loss of 2.51 by step 500. Its output remained complete gibberish, establishing the baseline any real model had to beat.

  • The Power of Attention: Upgrading to a Small Transformer (~10.8M parameters) allowed the model to look back 256 characters. This dropped the validation loss to 1.30 and yielded “Twain-flavored” prose that convincingly used real and stylistically accurate made-up words.

  • The Cross-Dataset Illusion: When running the bigram model on a second dataset (TinyStories), it achieved a lower converged loss (2.30) simply because repetitive children’s stories have higher intrinsic predictability than 19th-century prose. Yet, the qualitative output was still total gibberish.

Dataset Baseline Comparison:

Dataset Vocab Size Random Baseline (ln(vocab)) Bigram Converged Loss
Twain (7.4M chars) 114 4.74 2.51
TinyStories (1.9B chars) 174 5.16 2.30

Conclusion: Model size and raw loss numbers do not guarantee cross-dataset comparison reliability. As the experiment shows, a lower raw loss on a simpler corpus doesn’t mean a model is “better,” making it essential to focus on the shape of the train-vs-validation divergence rather than absolute numbers.

Most of what we do day-to-day involves calling LLMs that already exist. But we wanted to know what’s actually happening underneath training, so we spent some time building one from scratch following Andrej Karpathy’s nanoGPT video. The dataset is small (12 Mark Twain books from Project Gutenberg, about 7 megabytes of text). The models are tiny by industry standards. The point was to watch the training behavior with our own eyes on our own hardware, not to produce something extraordinary. This is the first of three posts on that work.

The Setup

Training ran on our NVIDIA DGX Spark, the same workstation we used in the benchmark posts. The model reads text one character at a time, so the “vocabulary” is the 114 unique characters that appear in the Twain corpus (letters, digits, punctuation, whitespace). 90% of the text was used for training, 10% held out for validation.

Two models in this post:

  • bigram model (about 13,000 parameters): given the current character, predict the next one. No memory of anything earlier. The simplest thing that technically counts as a language model.
  • small transformer (about 10.8 million parameters, 830× larger): same task, but with attention. It can look back at the previous 256 characters when predicting the next one.

Both trained for 5,000 iterations on the same data, same split, same random seed.

A note on prompts. Every generated sample shown in this post starts from a single tab character (\t) as the prompt, not a meaningful phrase like “Once upon a time” or “Tom said,”. A tab gives the model a neutral zero-context start, so the output you see is the model freewheeling from nothing rather than reacting to text that might favor a particular dataset or style.

Establishing the Baseline Model – Bigram

The bigram’s loss dropped from 5.22 to 2.51 in the first 500 iterations and then flatlined. We kept training for another 4,500 iterations and the loss did not change.

The plateau is the model’s ceiling. Bigram models only see one character of context, so once the letter-pair statistics are captured there is nothing else to fit. Generation sample after 5,000 iterations:

“hepecirdus de abrlwhinge m the wh g, Th Ke. izeangeseemaury-t thel the r. u—erig thead_hep MGU tas. hebyt bler arerthe ce iminoolapion tond lad t. icoor us thio ROn imaruly tonf…”

That output is what loss = 2.51 sounds like: letter pairs, a handful of incidental real words, no sentence structure, no meaning. It is the baseline any real model on this dataset has to beat.

Adding Attention

The transformer started at the same loss (4.74, basically ln(114)) and dropped past the bigram’s 2.51 baseline by step 500. By step 5,000 it landed at a training loss of 1.21 and a validation loss of 1.30. The full curves:

Generation sample after 5,000 iterations:

“Vindeberk. Where Saw the week is the dramazy roading over the thick wood. He tranquilled to himself, that would have posted one book washed from a snatch coin, committs selled upon his armed in a grance Ameracharlel’s Track command, who yeaes betheld upon on defect sometimation, very dition of tree, called by his wagon and by fleeb-over his lo dogs…”

Many of the words are real and the made-up ones sound like Twain coinages rather than random letter strings (VindeberktranquilledAmeracharlel). The output reads as Twain-flavored, even though it is not saying anything coherent.

A Second Dataset

Everything above is one dataset (Twain). The other two posts in this series compare Twain against a second, much larger dataset called TinyStories: about 1.9 billion characters of synthetic children’s stories, roughly 250 times the size of Twain. Where Twain is a slice of one author’s 19th-century prose, TinyStories is a vast pile of formulaic kid-story text (small vocabulary, repetitive sentence templates, lots of “Once upon a time” openers).

Before we use the two datasets to do anything interesting, one calibration is worth locking in: the absolute loss numbers across these two datasets are not directly comparable, because the two corpora have different intrinsic per-character predictability. To show that concretely, we ran the same bigram model on TinyStories. Same architecture, same training budget, same random seed. Only the data changed.

DatasetVocabRandom baseline ln(vocab)Bigram converged loss
Twain (7.4M chars)1144.742.51
TinyStories (1.9B chars)1745.162.30

At the letter-pair level alone, TinyStories text is more predictable than Twain text by 0.21 nats. That holds even though TinyStories has a larger character set (174 vs 114) and therefore has further to descend from random.

TinyStories bigram sample after 5,000 iterations:

“buredallomy t ay fobapomyo to ser ad wandacret tharom’soo y t se Loy. d “Lo andareckethrtousan theande an ef lomy. nd tog hindinerk! bo Juted…”

Reading the two bigram samples side by side, the 0.21-nat gap does not jump out as readability: both still look like gibberish. The numerical floor differs; the qualitative output does not. Letter-pair statistics alone are not enough to produce real words at either floor.

Why this matters for the rest of the series. The next two posts compare loss numbers between models trained on each dataset, and the intuitive reading of “TinyStories loss 0.73, Twain loss 1.21, therefore the TinyStories model is better” is wrong. The loss gap between datasets is already present at the bigram baseline, before any sophisticated model gets involved. Within a dataset, comparing model A to model B is meaningful. Across datasets, the only meaningful comparison is the shape of the train-vs-val divergence, which is what the next post turns on.

What We’re Not Concluding

This covers two datasets, one tokenization (character-level), and two model classes (a bigram baseline and a small transformer). The transformer sample is what character-level training on 7 megabytes of Twain produces; a different dataset or a larger model produces visibly different prose, which is the subject of the next two posts. We are not making general claims about how language models work from this one run. The point was to give the team a concrete visual for how training actually behaves at the simplest scale, and to set up the cross-dataset comparison that the rest of the series builds on.

Share this guide:

Facebook
LinkedIn
Reddit
X
What to do next?
RTS LABS • AI CONSULTING

AI at scale without the gorvernance headaches?
We fix that...fast.

  • AI governance audit tailored to your stack & compliance posture

  • Green/red zone framework implemented in weeks, not months

  • SOC 2, HIPAA, PCI DSS compliance mapping included

Years Enterprise
Experience
0 +
Clients
Served
0 +
Real Results

Proof of Success. Real AI in Production.

Real engineering teams. Real production systems. Real outcomes you can verify. Browse the case studies for practical proof of enterprise AI adoption — done right, done fast.

Let’s Build Something Great Together!

Have questions or need expert guidance? Reach out to our team and let’s discuss how we can help.