From O(N) to O(log N): A Faster BPE Training Algorithm, Buried and Rediscovered

Background: Why BPE Training Speed Matters In practice, BPE vocabulary training is slow enough that most practitioners sub-sample their data before learning a vocabulary. A common recipe is to randomly sample a few million sentences, train BPE on that sample, and apply the learned vocabulary to the full dataset. For datasets with little diversity, or a single language, this is usually fine — a few million sentences capture the most frequent patterns. But for massively multilingual models, sub-sampling is a serious compromise. In 2020, I was building a many-to-English translation system covering 500+ languages (see our ACL 2021 paper write-up). The full training corpus had roughly half a billion sentences across hundreds of languages. Many of these languages had only a few thousand sentences each. Sub-sampling a few million sentences meant that low-resource languages would be drastically underrepresented in the vocabulary — their unique character sequences and morphological patterns would be treated as rare noise rather than learned as proper subword units. ...

March 30, 2026 · 7 min · TG Gowda

Building a Jinja2 Template Engine from Scratch in C++

Modern LLMs are chat models that use Jinja2 templates to format conversations. But Jinja is Python-centric, making deployment hard in C++ and edge environments. Here's how to build a Jinja2 template engine from scratch in C++ — covering lexing, parsing, and evaluation — so you can render chat templates natively without Python.

March 10, 2026 · 20 min · TG Gowda

I Let Two AI Agents Race to Modernize pigz

I gave Claude Opus 4.6 and GPT 5.4 the same task: rewrite pigz (a parallel gzip tool) in modern C++23. After 70 minutes, one delivered a clean-room rewrite, the other a wrapper around the old code. Then I pushed the winner to beat pigz — and it did.

March 7, 2026 · 11 min · Thamme Gowda

Sequence Transduction: Generalization and Challenges

Sequence-to-sequence transduction, e.g. neural machine translation (NMT), is a general problem: transforming a sequence of symbols into another sequence of symbols, where both input and output can have varying lengths. Challenges: Sequential information: the order of input and output symbols is important. Variable-length sequences with long-term dependencies: sequences can be extremely long, and symbols may contain dependencies that span the sequence — e.g. consider the text of a book, with dependencies across chapters. Unbounded vocabulary, e.g. the vocabulary of natural languages. Imbalanced distribution: some symbols appear frequently while others appear rarely, e.g. the distribution of types in natural languages. ...

May 4, 2021 · 2 min · Thamme Gowda

Many-to-English Machine Translation Tools, Data, and Pretrained Models

April 25, 2021 · 3 min · Thamme Gowda

Macro-Average: Rare Types Are Important Too

March 11, 2021 · 8 min · Thamme Gowda

Finding the Optimal Vocabulary for Neural Machine Translation

November 1, 2020 · 2 min · Thamme Gowda