Background: Why BPE Training Speed Matters
In practice, BPE vocabulary training is slow enough that most practitioners sub-sample their data before learning a vocabulary. A common recipe is to randomly sample a few million sentences, train BPE on that, and apply the learned vocabulary to the full dataset. For a single-language or otherwise low-diversity dataset, this is usually fine: a few million sentences capture most of the frequent patterns.
But for massively multilingual models, sub-sampling is a serious compromise. In 2020, I was building a many-to-English translation system covering 500+ languages (see our ACL 2021 paper write-up). The full training corpus had roughly half a billion sentences across hundreds of languages. Many of these languages had only a few thousand sentences each. Sub-sampling a few million sentences meant that low-resource languages would be drastically underrepresented in the vocabulary — their unique character sequences and morphological patterns would be treated as rare noise rather than learned as proper subword units.
I needed to train BPE on the full dataset, or at least a very large fraction of it. SentencePiece — the standard tool — was too slow even on a 100M sentence sample (it would take hours). So I wrote a faster BPE learner. It worked, shipped as part of nlcodec, and I moved on. Five years later, I rediscovered it and realized the algorithmic trick deserved a proper write-up.
How BPE Training Works
Byte Pair Encoding (BPE) [1] learns a subword vocabulary by iteratively merging the most frequent adjacent token pair. Starting from a character-level vocabulary, each iteration:
Counts the frequency of every adjacent pair \((a, b)\) across the corpus.
Selects the pair \((a^*, b^*) = \arg\max_{(a,b)} f(a,b)\).
Replaces all occurrences of \((a^*, b^*)\) with a new token \(r = a^* \| b^*\).
Repeats until \(V\) merge operations have been performed.
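To make these steps concrete, here is a deliberately naive reference implementation in plain Python (the textbook algorithm, not nlcodec's code). Every merge rescans the whole corpus, so each iteration costs \(O(M)\) for a corpus of \(M\) tokens:

```python
from collections import Counter

def train_bpe_naive(corpus, num_merges):
    """Textbook BPE trainer: every merge rescans the corpus, O(M) per iteration."""
    words = [list(w) for w in corpus]            # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()                        # 1. count all adjacent pairs
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]      # 2. pick the arg-max pair
        merges.append((a, b))
        for w in words:                          # 3. replace all occurrences
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                i += 1
    return merges                                # 4. repeat until V merges done

print(train_bpe_naive(["lower", "lowest", "newer", "wider"], 5))
```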
The critical question is: how efficiently can we find the maximum-frequency pair and update counts after each merge?
Why SentencePiece Is Slow
SentencePiece’s BPE trainer uses a symbol graph with lazy frequency recomputation:
Finding the best pair: maintains an "active set" of the top 5% most frequent bigrams. Every 100 iterations, it re-scans all bigrams, recomputes their frequencies (by iterating all of each bigram’s occurrence positions), and partial-sorts to find the top 5%.
Frequency updates: after a merge, neighbor pairs have their frequency set to zero, triggering a full recomputation on next access. Each recomputation iterates the positions list, validates each position against the current symbol array, and erases stale entries.
The cost per merge is dominated by

\[
O(|\mathcal{A}|) \;+\; \frac{1}{100}\, O(|\mathcal{S}| \cdot \bar{P})
\]

where \(|\mathcal{A}|\) is the active set size (~5% of all bigrams), \(|\mathcal{S}|\) is the total symbol cache size, \(\bar{P}\) is the average number of positions per symbol, and the \(1/100\) factor amortizes the periodic rescan. The periodic full rescan makes the total cost superlinear in practice.
The nlcodec Algorithm
The nlcodec algorithm, originally implemented in Python in 2020 as part of the MTData/nlcodec toolchain [2], uses three data structures to replace the linear scans and recomputations with \(O(\log N)\) heap operations:
Data Structure 1: Max-Heap
A standard max-priority queue (binary heap) keyed by bigram frequency. Finding the best pair is \(O(1)\) peek, \(O(\log N)\) pop — not \(O(|\mathcal{A}|)\) linear scan.
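As a sketch of this structure, the following uses Python's heapq, which is a min-heap, so frequencies are negated. This is illustrative only; the actual nlcodec heap is C++, and the class name here is hypothetical:

```python
import heapq

class MaxHeap:
    """Max-priority queue over (pair, frequency), built on Python's min-heap."""
    def __init__(self, freqs):                    # freqs: dict pair -> count
        self._h = [(-f, p) for p, f in freqs.items()]
        heapq.heapify(self._h)                    # O(N) build
    def push(self, pair, freq):                   # O(log N)
        heapq.heappush(self._h, (-freq, pair))
    def pop(self):                                # O(log N)
        neg_f, pair = heapq.heappop(self._h)
        return pair, -neg_f
    def peek(self):                               # O(1)
        neg_f, pair = self._h[0]
        return pair, -neg_f
```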
Data Structure 2: Doubly-Linked Lists
Each word’s token sequence is stored as a doubly-linked list of LnNode objects, arena-allocated for cache efficiency.
Merging adjacent tokens \((a, b) \to r\) is \(O(1)\):
```
Before:  ... ⇄ x ⇄ a ⇄ b ⇄ y ⇄ ...
After:   ... ⇄ x ⇄ r ⇄ y ⇄ ...        (b deleted, a reused as r)
```
No array shifting. Node identity is preserved (hash by pointer address), so existing index entries remain valid.
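A minimal sketch of the \(O(1)\) splice, with node and function names made up for illustration (the real implementation is arena-allocated C++):

```python
class LnNode:
    """One token occurrence inside a word, doubly linked to its neighbors."""
    __slots__ = ("tok", "prev", "next")
    def __init__(self, tok):
        self.tok, self.prev, self.next = tok, None, None

def merge_pair(a, b, r):
    """Rewrite ... x ⇄ a ⇄ b ⇄ y ... as ... x ⇄ r ⇄ y ... in O(1)."""
    assert a.next is b
    a.tok = r                    # reuse node a as the merged token r;
                                 # its pointer identity is unchanged
    a.next = b.next              # unlink b: no array shifting
    if b.next is not None:
        b.next.prev = a
```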
Data Structure 3: Lazy Deletion via "Dirty Map"
After a merge, four neighbor pairs change frequency: \((x, a)\) and \((b, y)\) lose occurrences; \((x, r)\) and \((r, y)\) gain occurrences. Instead of locating and updating entries deep inside the heap (\(O(N)\)):

Increments (new pairs involving \(r\)): pushed directly to the heap.

Decrements (old pairs losing occurrences): accumulated in a hash map: `HeapDirty[pair] += delta`. On pop: if the popped pair exists in `HeapDirty`, apply the correction. If the corrected frequency is still positive, re-push; otherwise discard.
This maintains heap correctness without \(O(N)\) search. Merge \(i\) costs

\[
O(K_i \log N)
\]

where \(K_i\) is the number of positions of the merged pair (the inherent work). Total over all \(V\) merges:

\[
\sum_{i=1}^{V} O(K_i \log N) = O(M \log N)
\]

where \(M\) is the total corpus token count: up to the log factor, the same work as a single corpus scan, amortized across all merges.
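Here is a minimal sketch of the pop-with-corrections step described above, again on Python's heapq with negated counts; `pop_best` and the data layout are illustrative, not nlcodec's actual API:

```python
import heapq

def pop_best(heap, dirty):
    """Pop the true max-frequency pair, applying accumulated decrements lazily.

    heap  : list of (-freq, pair) entries managed by heapq
    dirty : dict pair -> pending delta (negative for decrements)
    """
    while heap:
        neg_f, pair = heapq.heappop(heap)
        freq = -neg_f
        if pair in dirty:                        # stale entry: correct it first
            freq += dirty.pop(pair)
            if freq > 0:                         # still alive: re-queue at true count
                heapq.heappush(heap, (-freq, pair))
            continue                             # either way, pop again
        return pair, freq                        # clean entry: the true maximum
    return None, 0
```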
The Merge Loop (Pseudocode)
```
heap  ← MaxHeap(bigram_frequencies)
dirty ← {}                                # pair → pending decrement
for i in 1..V:
    (pair, freq) ← heap.pop()
    while pair in dirty:                  # apply lazy corrections
        freq += dirty.pop(pair)
        if freq > 0: heap.push(pair, freq)
        (pair, freq) ← heap.pop()
    if freq < threshold: break            # early stop
    r ← new_token(pair.left + pair.right)
    for each node where pair occurs:      # O(K_i)
        delete right_node                 # O(1) linked-list op
        reuse left_node as r              # O(1)
        update neighbor deltas            # O(1) hash map ops
    push new pair frequencies to heap
    accumulate decrements in dirty
```

Comparison
| Operation | SentencePiece | nlcodec |
|---|---|---|
| Find best pair | \(O(|\mathcal{A}|)\) linear scan + lazy freq | \(O(\log N)\) heap pop |
| Update frequencies | Set to 0, recompute on access | ±delta updates, \(O(1)\) |
| Position tracking | Positions list per bigram, validated and pruned on access | Node pointers; identity preserved across merges |
| Periodic rescan | Every 100 iterations: \(O(|\mathcal{S}| \cdot \bar{P})\) | Never |
| Merge operation | Array-based (scan for next/prev) | Doubly-linked list, \(O(1)\) |
Benchmark
Setup
Data: 1M multilingual sentences from CC-100 (en, de, zh-Hans, ar, hi), 200k lines per language, shuffled.
Vocab: 32,000 BPE types.
Hardware: 48-core server, single-threaded (BPE merge loop is inherently sequential).
Tools: SentencePiece (Marian fork), nlcodec C++ with pybind11.
Results: 1M Sentences, 32k Vocabulary
| Tool | Language | Time | Speedup |
|---|---|---|---|
| SentencePiece (default) | C++ | 9m 08s | 1.0× |
| | C++ | 4m 52s | 1.9× |
| nlcodec (original) | Python | 3m 17s | 2.8× |
| nlcodec C++ (standalone) | C++ | 1m 32s | 6.0× |
| SentencePiece + `--nlcodec_bpe` | C++ | 1m 03s | 8.7× |
Results: 200k Sentences, 32k Vocabulary
| Tool | Language | Time | Speedup |
|---|---|---|---|
| SentencePiece (default) | C++ | 2m 34s | 1.0× |
| | C++ | 1m 09s | 2.2× |
| nlcodec (original) | Python | 55s | 2.8× |
| nlcodec C++ (standalone) | C++ | 18.8s | 8.2× |
| SentencePiece + `--nlcodec_bpe` | C++ | 14.1s | 10.9× |
Note: the original Python nlcodec already beats SentencePiece’s C++ implementation by 2.8×. This is a pure algorithmic win — Python’s hash maps and object overhead are slower than C++, yet the better algorithm more than compensates. Rewriting in C++ adds another 2.9-3.9× on top, for a combined 8.7-10.9× improvement.
The trained models produce nearly identical vocabularies (99.3% overlap at 200k sentences, 99.8% at 1M) and equivalent compression. Encoding the full training data with each model:
| Dataset | Model | Total tokens | Mean sentence length |
|---|---|---|---|
| 200k | Default | 8,480,496 | 44.60 |
| 200k | nlcodec | 8,480,629 | 44.60 |
| 1M | Default | 42,547,569 | 44.75 |
| 1M | nlcodec | 42,548,265 | 44.75 |
The tiny vocabulary differences come from tie-breaking in pair frequency ordering; the resulting tokenizations are effectively identical.
The `--nlcodec_bpe` flag replaces only the merge-loop internals; everything else (sentence loading, normalization, whitespace handling, model serialization) uses SentencePiece's native code paths.
Usage
For SentencePiece (with our patch):
```
spm_train --input=data.txt --model_prefix=model \
    --vocab_size=32000 --model_type=bpe \
    --nlcodec_bpe
```

For nlcodec standalone (pip install):

```python
import nlcodec
tok = nlcodec.train_bpe("data.txt", vocab_size=32000)
tok.save("model.json.gz")
```

History
This algorithm was written in Python in late 2020 as part of the nlcodec library, a companion tool for the MTData project (described in our ACL 2021 system demo). The original Python implementation already outperformed SentencePiece’s C++ BPE trainer by ~3× due to the algorithmic advantage. The source code contained this comment:
```python
# TODO: write this in c++ or rust and bind it here
```

That TODO remained unaddressed for five years. In March 2026, I finally rewrote it in C++ (~200 lines for the core algorithm), added pybind11 bindings, and integrated it into SentencePiece as a drop-in replacement.
The C++ rewrite turned a 3× Python advantage into an 8.7–10.9× advantage; Python's hash-map and object-allocation overhead was the remaining bottleneck. Memory usage also dropped from 7.5 GB (Python, 1M lines) to under 2 GB (C++).
Code
Original nlcodec (Python): github.com/isi-nlp/nlcodec
nlcodec C++ rewrite: github.com/thammegowda/nlcodec (`tg/cpp` branch, `csrc/` directory)
SentencePiece integration: `--nlcodec_bpe` flag (PR against Marian fork)
tokenizerpp integration: `tokenizers::trainers::train_bpe()` API
Citation
If you use this work, please cite:
```bibtex
@inproceedings{gowda-etal-2021-many,
    title = "Many-to-{E}nglish Machine Translation Tools, Data, and Pretrained Models",
    author = "Gowda, Thamme and
      Zhang, Zhao and
      Mattmann, Chris and
      May, Jonathan",
    booktitle = "Proceedings of ACL-IJCNLP 2021: System Demonstrations",
    year = "2021",
    url = "https://aclanthology.org/2021.acl-demo.37",
}
```