Background: Why BPE Training Speed Matters

In practice, BPE vocabulary training is slow enough that most practitioners sub-sample their data before learning a vocabulary. A common recipe is to randomly sample a few million sentences, train BPE on that, and apply the learned vocabulary to the full dataset. For datasets with limited diversity, or a single language, this is usually fine: a sample of a few million sentences captures most of the frequent patterns.

But for massively multilingual models, sub-sampling is a serious compromise. In 2020, I was building a many-to-English translation system covering 500+ languages (see our ACL 2021 paper write-up). The full training corpus had roughly half a billion sentences across hundreds of languages. Many of these languages had only a few thousand sentences each. Sub-sampling a few million sentences meant that low-resource languages would be drastically underrepresented in the vocabulary — their unique character sequences and morphological patterns would be treated as rare noise rather than learned as proper subword units.

I needed to train BPE on the full dataset, or at least a very large fraction of it. SentencePiece — the standard tool — was too slow even on a 100M sentence sample (it would take hours). So I wrote a faster BPE learner. It worked, shipped as part of nlcodec, and I moved on. Five years later, I rediscovered it and realized the algorithmic trick deserved a proper write-up.

How BPE Training Works

Byte Pair Encoding (BPE) [1] learns a subword vocabulary by iteratively merging the most frequent adjacent token pair. Starting from a character-level vocabulary, each iteration:

  1. Counts the frequency of every adjacent pair \((a, b)\) across the corpus.

  2. Selects the pair \((a^*, b^*) = \arg\max_{(a,b)} f(a,b)\).

  3. Replaces all occurrences of \((a^*, b^*)\) with a new token \(r = a^* \| b^*\).

  4. Repeats until \(V\) merge operations have been performed.

The critical question is: how efficiently can we find the maximum-frequency pair and update counts after each merge?
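The four steps above can be sketched as a deliberately naive trainer (hypothetical helper names, for illustration only). It rescans the whole corpus on every iteration, which is exactly the inefficiency discussed below:

```python
from collections import Counter

def naive_bpe_train(words, num_merges):
    """Naive BPE trainer: full corpus rescan each iteration (the baseline to beat)."""
    merges = []
    for _ in range(num_merges):
        # Step 1: count every adjacent pair across the corpus
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Step 2: select the most frequent pair
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        # Step 3: replace all occurrences of (a, b) with the merged token a+b
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    # Step 4: stop after num_merges (= V) merge operations
    return merges

# Words represented as symbol tuples mapped to corpus frequency
merges = naive_bpe_train({("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2},
                         num_merges=2)
```

Step 1 alone costs \(O(M)\) per iteration here; the data structures described below avoid paying that cost repeatedly.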

Why SentencePiece Is Slow

SentencePiece’s BPE trainer uses a symbol graph with lazy frequency recomputation:

  • Finding the best pair: maintains an "active set" of the top 5% most frequent bigrams. Every 100 iterations, it re-scans all bigrams, recomputes their frequencies (by iterating all of each bigram’s occurrence positions), and partial-sorts to find the top 5%.

  • Frequency updates: after a merge, neighbor pairs have their frequency set to zero, triggering a full recomputation on next access. Each recomputation iterates the positions list, validates each position against the current symbol array, and erases stale entries.

The cost per merge is dominated by:

\[T_{\text{SP}} = O(|\mathcal{A}| \cdot \bar{P}) \text{ per merge}, \qquad O(|\mathcal{S}| \cdot \bar{P}) \text{ every 100 iterations}\]

where \(|\mathcal{A}|\) is the active set size (~5% of all bigrams), \(|\mathcal{S}|\) is the total symbol cache size, and \(\bar{P}\) is the average number of positions per symbol. The periodic full rescan makes the total cost superlinear in practice.

The nlcodec Algorithm

The nlcodec algorithm, originally implemented in Python in 2020 as part of the MTData/nlcodec toolchain [2], uses three data structures so that each merge costs \(O(\log N)\) per affected position rather than a linear scan:

Data Structure 1: Max-Heap

A standard max-priority queue (binary heap) keyed by bigram frequency. Finding the best pair is \(O(1)\) peek, \(O(\log N)\) pop — not \(O(|\mathcal{A}|)\) linear scan.
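As a minimal illustration (with made-up frequencies), Python's `heapq` min-heap can serve as a max-heap by negating the key:

```python
import heapq

# Toy bigram frequencies (hypothetical values, for illustration)
freqs = {("t", "h"): 120, ("h", "e"): 95, ("i", "n"): 80}

# heapq is a min-heap, so store negated frequencies to pop the max first
heap = [(-f, pair) for pair, f in freqs.items()]
heapq.heapify(heap)                  # O(N) build

neg_f, best = heapq.heappop(heap)    # O(log N) pop of the most frequent pair
```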

Data Structure 2: Doubly-Linked Lists

Each word’s token sequence is stored as a doubly-linked list of LnNode objects, arena-allocated for cache efficiency. Merging adjacent tokens \((a, b) \to r\) is \(O(1)\):

Before: ... ⇄ x ⇄ a ⇄ b ⇄ y ⇄ ...
After:  ... ⇄ x ⇄ r ⇄ y ⇄ ...   (b deleted, a reused as r)

No array shifting. Node identity is preserved (hash by pointer address), so existing index entries remain valid.
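A minimal Python sketch of that splice, using a simplified `LnNode` with string tokens in place of the arena-allocated C++ nodes:

```python
class LnNode:
    """Doubly-linked token node; node identity (its id) survives merges."""
    __slots__ = ("tok", "prev", "next")

    def __init__(self, tok):
        self.tok, self.prev, self.next = tok, None, None

def merge_with_next(a):
    """Splice out a's right neighbor and reuse `a` as the merged token: O(1)."""
    b = a.next
    a.tok = a.tok + b.tok          # a is reused as r = a || b
    a.next = b.next                # unlink b
    if b.next is not None:
        b.next.prev = a
    return a

# Build x <-> a <-> b <-> y, then merge the (a, b) pair
nodes = [LnNode(t) for t in ("x", "a", "b", "y")]
for left, right in zip(nodes, nodes[1:]):
    left.next, right.prev = right, left

merge_with_next(nodes[1])          # nodes[1] now holds token "ab"
```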

Data Structure 3: Lazy Deletion via "Dirty Map"

After a merge, four neighbor pairs change frequency: \((x, a)\) and \((b, y)\) lose occurrences; \((x, r)\) and \((r, y)\) gain occurrences. Instead of locating and updating entries deep inside the heap (\(O(N)\)):

  • Increments (new pairs involving \(r\)): pushed directly to the heap.

  • Decrements (old pairs losing occurrences): accumulated in a hash map HeapDirty[pair] += delta.

  • On pop: if the popped pair exists in HeapDirty, apply the correction. If the corrected frequency is still positive, re-push; otherwise discard.
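A sketch of the pop-with-correction step, using Python's `heapq` with negated frequencies and a plain dict as the dirty map (toy numbers):

```python
import heapq

def pop_best(heap, dirty):
    """Pop the true best pair, applying lazily recorded decrements on the way.

    Heap entries are (-freq, pair); `dirty` maps pair -> pending negative delta.
    """
    while heap:
        neg_f, pair = heapq.heappop(heap)
        freq = -neg_f
        if pair in dirty:                      # stale entry: apply correction
            freq += dirty.pop(pair)
            if freq > 0:                       # still viable: re-insert, keep popping
                heapq.heappush(heap, (-freq, pair))
            continue
        return pair, freq
    return None, 0

# Toy state: ("a","b") nominally has freq 10 but lazily lost 6 occurrences
heap = [(-10, ("a", "b")), (-7, ("b", "c"))]
heapq.heapify(heap)
dirty = {("a", "b"): -6}

pair, freq = pop_best(heap, dirty)
```

After the correction, `("a","b")` re-enters the heap with its true frequency of 4, so `("b","c")` at 7 is correctly returned first.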

This maintains heap correctness without \(O(N)\) search:

\[T_{\text{nlcodec}} = O(K_i \cdot \log N)\]

where \(K_i\) is the number of positions of the merged pair (the inherent work). Total over all \(V\) merges:

\[T_{\text{total}} = O(M \log N)\]

where \(M\) is the total corpus token count — the same work as a single scan, amortized across all merges.
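The amortized bound follows because each corpus position is consumed by at most one merge, so the per-merge position counts sum to at most the corpus size:

\[\sum_{i=1}^{V} K_i \le M \quad\Longrightarrow\quad \sum_{i=1}^{V} O(K_i \log N) = O(M \log N).\]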

The Merge Loop (Pseudocode)

heap ← MaxHeap(bigram_frequencies)
dirty ← {}

for i in 1..V:
    (pair, freq) ← heap.pop()

    while pair in dirty:               # apply lazy corrections
        freq += dirty.pop(pair)
        if freq > 0: heap.push(pair, freq)
        (pair, freq) ← heap.pop()

    if freq < threshold: break          # early stop

    r ← new_token(pair.left + pair.right)

    for each node where pair occurs:    # O(K_i)
        delete right_node              # O(1) linked-list op
        reuse left_node as r           # O(1)
        update neighbor deltas         # O(1) hash map ops

    push new pair frequencies to heap
    accumulate decrements in dirty

Comparison

| Operation | SentencePiece | nlcodec |
|---|---|---|
| Find best pair | \(O(\lvert\mathcal{A}\rvert)\) linear scan + lazy freq | \(O(\log N)\) heap pop |
| Update frequencies | Set to 0, recompute on access | Direct ±delta, \(O(1)\) |
| Position tracking | btree_set<uint64>, validate on every access | unordered_set<LnNode*>, pointer-stable |
| Periodic rescan | Every 100 iterations: \(O(\lvert\mathcal{S}\rvert \cdot \bar{P})\) | Never |
| Merge operation | Array-based (scan for next/prev) | Doubly-linked list, \(O(1)\) |

Benchmark

Setup

  • Data: 1M multilingual sentences from CC-100 (en, de, zh-Hans, ar, hi), 200k lines per language, shuffled.

  • Vocab: 32,000 BPE types.

  • Hardware: 48-core server, single-threaded (BPE merge loop is inherently sequential).

  • Tools: SentencePiece (Marian fork), nlcodec C++ with pybind11.

Results: 1M Sentences, 32k Vocabulary

| Tool | Language | Time | Speedup |
|---|---|---|---|
| SentencePiece (default) | C++ | 9m 08s | 1.0× |
| fastBPE | C++ | 4m 52s | 1.9× |
| nlcodec (original) | Python | 3m 17s | 2.8× |
| nlcodec C++ (standalone) | C++ | 1m 32s | 6.0× |
| SentencePiece + --nlcodec_bpe | C++ | 1m 03s | 8.7× |

Results: 200k Sentences, 32k Vocabulary

| Tool | Language | Time | Speedup |
|---|---|---|---|
| SentencePiece (default) | C++ | 2m 34s | 1.0× |
| fastBPE | C++ | 1m 09s | 2.2× |
| nlcodec (original) | Python | 55s | 2.8× |
| nlcodec C++ (standalone) | C++ | 18.8s | 8.2× |
| SentencePiece + --nlcodec_bpe | C++ | 14.1s | 10.9× |

Note: the original Python nlcodec already beats SentencePiece’s C++ implementation by 2.8×. This is a pure algorithmic win: Python’s hash maps and object overhead are slower than C++’s, yet the better algorithm more than compensates. Rewriting in C++ adds another 2.9–3.9× on top, for a combined 8.7–10.9× improvement.

The trained models produce nearly identical vocabularies (99.3% overlap at 200k sentences, 99.8% at 1M) and equivalent compression. Encoding the full training data with each model:

| Dataset | Model | Total tokens | Mean sent length |
|---|---|---|---|
| 200k | Default | 8,480,496 | 44.60 |
| 200k | Nlcodec | 8,480,629 | 44.60 |
| 1M | Default | 42,547,569 | 44.75 |
| 1M | Nlcodec | 42,548,265 | 44.75 |

The tiny vocabulary differences come from tie-breaking in pair frequency ordering; the resulting tokenizations are effectively identical.

The --nlcodec_bpe flag replaces only the merge loop internals; everything else (sentence loading, normalization, whitespace handling, model serialization) uses SentencePiece’s native code paths.

Usage

For SentencePiece (with our patch):

spm_train --input=data.txt --model_prefix=model \
          --vocab_size=32000 --model_type=bpe \
          --nlcodec_bpe

For nlcodec standalone (pip install):

import nlcodec
tok = nlcodec.train_bpe("data.txt", vocab_size=32000)
tok.save("model.json.gz")

History

This algorithm was written in Python in late 2020 as part of the nlcodec library, a companion tool for the MTData project (described in our ACL 2021 system demo). The original Python implementation already outperformed SentencePiece’s C++ BPE trainer by ~3× due to the algorithmic advantage. The source code contained this comment:

# TODO: write this in c++ or rust and bind it here

That TODO remained unaddressed for five years. In March 2026, I finally rewrote it in C++ (~200 lines for the core algorithm), added pybind11 bindings, and integrated it into SentencePiece as a drop-in replacement.

The C++ rewrite turned a 3× Python advantage into an 8.7–10.9× advantage; Python's hash map and object allocation overhead was the remaining bottleneck. Memory usage also dropped from 7.5 GB (Python, 1M lines) to under 2 GB (C++).

Code

Citation

If you use this work, please cite:

@inproceedings{gowda-etal-2021-many,
    title = "Many-to-{E}nglish Machine Translation Tools, Data, and Pretrained Models",
    author = "Gowda, Thamme  and
      Zhang, Zhao  and
      Mattmann, Chris  and
      May, Jonathan",
    booktitle = "Proceedings of ACL-IJCNLP 2021: System Demonstrations",
    year = "2021",
    url = "https://aclanthology.org/2021.acl-demo.37",
}

1. Sennrich et al., Neural Machine Translation of Rare Words with Subword Units, ACL 2016.
2. Gowda et al., Many-to-English Machine Translation Tools, Data, and Pretrained Models, ACL 2021.