From O(N) to O(log N): A Faster BPE Training Algorithm, Buried and Rediscovered
Background: Why BPE Training Speed Matters

In practice, BPE vocabulary training is slow enough that most practitioners sub-sample their data before learning a vocabulary. A common recipe is to randomly sample a few million sentences, train BPE on that sample, and apply the learned vocabulary to the full dataset. For datasets with limited diversity, or a single language, this is usually fine: a million sentences capture most of the frequent patterns.

But for massively multilingual models, sub-sampling is a serious compromise. In 2020, I was building a many-to-English translation system covering 500+ languages (see our ACL 2021 paper write-up). The full training corpus had roughly half a billion sentences across hundreds of languages. Many of these languages had only a few thousand sentences each. Sub-sampling a few million sentences meant that low-resource languages would be drastically underrepresented in the vocabulary: their unique character sequences and morphological patterns would be treated as rare noise rather than learned as proper subword units. ...
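To make the cost concrete, here is a minimal sketch of the classic, naive BPE training loop (the O(N)-per-merge baseline the title alludes to, not the faster algorithm discussed later). Every function name here is illustrative, not from any particular library; the key point is that each merge re-scans the entire corpus to recount pair frequencies.

```python
from collections import Counter

def most_frequent_pair(words):
    # One full pass over the corpus per merge: this scan is what makes
    # naive BPE training O(N) for every single merge.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    # Replace every adjacent occurrence of `pair` with one merged symbol.
    a, b = pair
    merged = Counter()
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

def train_bpe(corpus, num_merges):
    # Words are tuples of characters plus an end-of-word marker.
    words = Counter(tuple(w) + ("</w>",) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges
```

For example, `train_bpe("low low low lower lowest", 2)` first merges `('l', 'o')`, then `('lo', 'w')`. With half a billion sentences, re-running that counting pass for each of tens of thousands of merges is exactly the cost that forces practitioners into sub-sampling.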