From O(N) to O(log N): A Faster BPE Training Algorithm, Buried and Rediscovered

Background: Why BPE Training Speed Matters

In practice, BPE vocabulary training is slow enough that most practitioners sub-sample their data before learning a vocabulary. A common recipe is to randomly sample a few million sentences, train BPE on that, and apply the learned vocabulary to the full dataset. For datasets with little diversity, or a single language, this is usually fine: a million sentences captures most of the frequent patterns. But for massively multilingual models, sub-sampling is a serious compromise.

In 2020, I was building a many-to-English translation system covering 500+ languages (see our ACL 2021 paper write-up). The full training corpus had roughly half a billion sentences across hundreds of languages, and many of those languages had only a few thousand sentences each. Sub-sampling a few million sentences meant that low-resource languages would be drastically underrepresented in the vocabulary: their unique character sequences and morphological patterns would be treated as rare noise rather than learned as proper subword units. ...
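The underrepresentation problem can be illustrated with a toy sketch of the sub-sampling recipe. All corpus sizes and language labels below are made up for illustration; they are not figures from the post, and the `subsample` helper is a stand-in for whatever sampling step a real pipeline would use:

```python
import random

def subsample(corpus, k, seed=0):
    """Uniformly sample k sentences -- the common pre-BPE recipe."""
    rng = random.Random(seed)
    return rng.sample(corpus, k)

# Hypothetical corpus: 99,000 sentences in a high-resource language ("hi")
# and 1,000 in a low-resource one ("lo").
corpus = [("hi", i) for i in range(99_000)] + [("lo", i) for i in range(1_000)]

sample = subsample(corpus, 1_000)
lo_count = sum(1 for lang, _ in sample if lang == "lo")
# The low-resource language makes up only ~1% of the sample
# (about 10 sentences in expectation), so its character patterns
# look like rare noise to a BPE learner trained on the sample.
```

Uniform sampling preserves each language's corpus-level proportion, which is exactly why a language with a few thousand sentences nearly vanishes from a few-million-sentence sample.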

March 30, 2026 · 7 min · TG Gowda

Building a Jinja2 Template Engine from Scratch in C++

Modern LLMs are chat models that use Jinja2 templates to format conversations. But Jinja is Python-centric, which makes deployment hard in C++ and edge environments. Here's how to build a Jinja2 template engine from scratch in C++, covering lexing, parsing, and evaluation, so you can render chat templates natively without Python.

March 10, 2026 · 20 min · TG Gowda

I Let Two AI Agents Race to Modernize pigz

I gave Claude Opus 4.6 and GPT 5.4 the same task: rewrite pigz (a parallel gzip tool) in modern C++23. After 70 minutes, one delivered a clean-room rewrite, the other a wrapper around the old code. Then I pushed the winner to beat pigz — and it did.

March 7, 2026 · 11 min · Thamme Gowda