Many-to-English: Data (v1)
Datasets are available at
train.raw.tsv.gz # Training data in raw form, before cleaning, deduping and tokenization
train.v1.eng.tok.gz # English training data, after cleaning and tokenization
train.v1.src.tok.gz # Source training data, after cleaning and tokenization
train.v1.lang.gz # lang ID of source side sentences
train.v1.prov.gz # provenance of each record (to see where the record came from)
train.v1.tok.stats.tsv # stats such as sentence and token count per language
devs-combo-shuf10k-raw+tok.tgz # 10K sentences for validation, randomly sampled from all dev sets
devtests-raw+tok.tgz # all the dev and test data; both raw and tokenized
citations.bib # BibTeX of articles which published the datasets collected in this work
prep.tgz # scripts to prepare datasets from square 1.
`train.v1.{eng.tok,src.tok,lang,prov}` are plain text files after running gunzip. They should have the same number of lines; the line number is the way to cross-reference records between them.
You may also prepare these datasets from scratch, or revise the cleaning mechanisms, starting from `train.raw.tsv.gz`. The `prep.tgz` file has a `datatprep.ipynb` notebook that contains the steps to download, tokenize, deduplicate, and filter out bad records.
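As a rough illustration of the line-number cross-referencing, the sketch below reads the four aligned training files in lockstep without decompressing them to disk first. The function names `iter_records` and `count_by_lang`, the `data_dir` argument, and the UTF-8 encoding are assumptions made for this example; they are not part of the released scripts.

```python
import gzip
from collections import Counter
from itertools import zip_longest
from pathlib import Path

# File suffixes as listed above; the tuple order defines the record layout.
SUFFIXES = ("src.tok", "eng.tok", "lang", "prov")
_MISSING = object()  # sentinel marking a file that ran out of lines early


def iter_records(data_dir=".", prefix="train.v1"):
    """Yield (source, english, lang, provenance) tuples, one per line number.

    Hypothetical helper: assumes the four .gz files sit in data_dir and
    are UTF-8 encoded.
    """
    paths = [Path(data_dir) / f"{prefix}.{suffix}.gz" for suffix in SUFFIXES]
    files = [gzip.open(path, "rt", encoding="utf-8") for path in paths]
    try:
        for lines in zip_longest(*files, fillvalue=_MISSING):
            # The files are only aligned if they all have the same length.
            if any(line is _MISSING for line in lines):
                raise ValueError("files have different numbers of lines")
            yield tuple(line.rstrip("\n") for line in lines)
    finally:
        for handle in files:
            handle.close()


def count_by_lang(records):
    """Count sentences per source-language ID."""
    return Counter(lang for _, _, lang, _ in records)
```

Calling `count_by_lang(iter_records("path/to/data"))` should give per-language sentence counts that can be compared against `train.v1.tok.stats.tsv`, assuming the language IDs in the `lang` file correspond to the codes used in the statistics.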
| # | ISO 639-3 | Name | Sentences | SourceTokens | EnglishTokens |
|---|---|---|---|---|---|
| | | Total | 473,791,285 | 9,001,777,125 | 9,072,884,192 |
| 1 | FRA | French | 33,010,111 | 1,001,694,594 | 862,394,235 |
| 2 | RUS | Russian | 25,564,367 | 640,353,439 | 682,207,974 |
| 3 | ARA | Arabic | 22,679,389 | 588,552,279 | 673,442,592 |
| 4 | ZHO | Chinese | 20,057,876 | 827,184,155 | 563,720,365 |
| 5 | TUR | Turkish | 37,720,743 | 282,136,509 | 386,742,077 |
| 6 | SRP | Serbian | 33,880,099 | 279,793,956 | 343,470,174 |
| 7 | HEB | Hebrew | 25,841,585 | 224,902,346 | 283,316,533 |
| 8 | NLD | Dutch | 12,428,300 | 278,665,986 | 281,604,016 |
| 9 | POR | Portuguese | 10,954,498 | 279,392,943 | 264,796,337 |
| 10 | DEU | German | 12,119,459 | 245,035,715 | 254,126,788 |
| 11 | ITA | Italian | 10,095,386 | 244,588,092 | 245,359,909 |
| 12 | SPA | Spanish | 9,776,966 | 229,401,449 | 212,893,081 |
| 13 | SWE | Swedish | 8,024,230 | 156,259,227 | 175,706,578 |
| 14 | DAN | Danish | 7,682,343 | 163,476,497 | 173,310,159 |
| 15 | FIN | Finnish | 8,187,935 | 131,239,231 | 172,686,863 |
| 16 | POL | Polish | 8,289,276 | 151,588,493 | 169,532,813 |
| 17 | ELL | Modern Greek (1453-) | 6,835,717 | 155,615,617 | 154,230,888 |
| 18 | NOR | Norwegian | 10,860,768 | 126,923,739 | 142,425,026 |
| 19 | HUN | Hungarian | 6,785,904 | 124,670,924 | 140,421,813 |
| 20 | SLV | Slovenian | 6,227,413 | 123,184,373 | 135,429,495 |
| 21 | BOS | Bosnian | 12,903,765 | 108,980,163 | 134,352,300 |
| 22 | SLK | Slovak | 5,689,766 | 111,471,118 | 125,492,209 |
| 23 | EST | Estonian | 5,788,575 | 94,125,966 | 120,585,810 |
| 24 | LIT | Lithuanian | 5,139,565 | 95,332,259 | 113,401,294 |
| 25 | LAV | Latvian | 4,460,210 | 88,412,493 | 104,312,482 |
| 26 | FAS | Persian | 8,054,223 | 96,700,032 | 103,579,635 |
| 27 | JPN | Japanese | 5,379,355 | 111,898,280 | 95,783,174 |
| 28 | VIE | Vietnamese | 6,186,692 | 112,645,134 | 91,410,645 |
| 29 | UKR | Ukrainian | 4,446,827 | 66,084,956 | 75,462,511 |
| 30 | CES | Czech | 3,986,495 | 64,539,275 | 74,129,349 |
| 31 | MLT | Maltese | 3,079,369 | 85,936,755 | 71,180,461 |
| 32 | KOR | Korean | 3,907,008 | 124,773,196 | 67,101,857 |
| 33 | IND | Indonesian | 3,441,203 | 63,578,787 | 64,369,133 |
| 34 | CAT | Catalan | 3,151,150 | 66,838,920 | 60,317,575 |
| 35 | RON | Romanian | 2,871,321 | 52,945,536 | 51,780,651 |
| 36 | BUL | Bulgarian | 2,755,198 | 47,511,112 | 50,192,767 |
| 37 | THA | Thai | 4,003,627 | 55,716,180 | 49,463,563 |
| 38 | GLE | Irish | 1,770,628 | 48,936,403 | 45,855,008 |
| 39 | HRV | Croatian | 2,304,149 | 34,755,332 | 39,315,927 |
| 40 | HIN | Hindi | 2,211,381 | 41,452,435 | 38,163,209 |
| 41 | MKD | Macedonian | 1,898,346 | 29,456,395 | 31,284,559 |
| 42 | EUS | Basque | 2,101,130 | 24,427,549 | 30,827,665 |
| 43 | SQI | Albanian | 1,659,043 | 28,953,259 | 28,389,962 |
| 44 | URD | Urdu | 1,121,988 | 28,181,194 | 26,194,588 |
| 45 | TGL | Tagalog | 1,307,417 | 28,950,146 | 26,087,793 |
| 46 | BEN | Bengali | 1,469,860 | 22,481,515 | 23,509,544 |
| 47 | GLG | Galician | 1,270,160 | 23,281,744 | 22,762,444 |
| 48 | AFR | Afrikaans | 1,164,819 | 22,720,174 | 21,574,598 |
| 49 | CEB | Cebuano | 1,177,127 | 23,461,502 | 21,304,716 |
| 50 | EPO | Esperanto | 1,273,333 | 20,025,116 | 20,887,797 |
| 51 | SWA | Swahili | 975,456 | 17,407,097 | 19,155,309 |
| 52 | ZUL | Zulu | 964,117 | 13,671,887 | 18,412,040 |
| 53 | MSA | Malay | 1,945,672 | 16,671,082 | 18,053,859 |
| 54 | TAM | Tamil | 1,020,167 | 11,400,283 | 17,989,394 |
| 55 | XHO | Xhosa | 993,668 | 13,026,040 | 17,543,214 |
| 56 | MAL | Malayalam | 1,042,337 | 10,594,525 | 17,246,958 |
| 57 | ILO | Iloko | 898,926 | 17,814,296 | 17,022,059 |
| 58 | SIN | Sinhala | 1,141,931 | 11,643,363 | 16,436,631 |
| 59 | MLG | Malagasy | 826,222 | 17,981,469 | 16,408,357 |
| 60 | HIL | Hiligaynon | 807,375 | 17,432,476 | 15,442,425 |
| 61 | SNA | Shona | 763,546 | 11,391,280 | 15,229,553 |
| 62 | NYA | Nyanja | 778,089 | 12,958,088 | 14,921,565 |
| 63 | TSN | Tswana | 780,798 | 20,008,782 | 14,855,070 |
| 64 | TSO | Tsonga | 757,853 | 17,309,341 | 14,474,201 |
| 65 | AMH | Amharic | 669,145 | 9,765,847 | 14,326,330 |
| 66 | ISL | Icelandic | 1,112,770 | 10,987,587 | 13,281,910 |
| 67 | AZE | Azerbaijani | 693,153 | 10,806,686 | 12,639,115 |
| 68 | KAT | Georgian | 677,998 | 7,474,496 | 11,906,197 |
| 69 | MAR | Marathi | 625,462 | 8,315,779 | 11,844,853 |
| 70 | MYA | Burmese | 510,083 | 12,713,472 | 11,545,908 |
| 71 | EWE | Ewe | 588,735 | 12,826,778 | 11,227,502 |
| 72 | SRN | Sranan Tongo | 546,788 | 14,002,996 | 10,719,800 |
| 73 | TAH | Tahitian | 547,403 | 16,474,812 | 10,587,464 |
| 74 | NSO | Pedi | 555,777 | 13,718,112 | 10,567,895 |
| 75 | LIN | Lingala | 536,198 | 10,804,478 | 10,179,883 |
| 76 | TWI | Twi | 537,268 | 11,294,035 | 10,174,586 |
| 77 | TEL | Telugu | 557,715 | 6,533,017 | 9,246,609 |
| 78 | KIN | Kinyarwanda | 488,086 | 8,493,687 | 9,135,649 |
| 79 | BIS | Bislama | 476,064 | 11,972,512 | 9,034,190 |
| 80 | BCL | Central Bikol | 451,274 | 9,946,706 | 8,765,221 |
| 81 | NEP | Nepali | 444,058 | 5,427,354 | 8,183,887 |
| 82 | LOZ | Lozi | 411,874 | 9,495,543 | 7,872,336 |
| 83 | GAA | Ga | 409,659 | 9,284,738 | 7,868,224 |
| 84 | IBO | Igbo | 415,234 | 10,075,710 | 7,737,716 |
| 85 | YOR | Yoruba | 411,461 | 12,671,491 | 7,668,947 |
| 86 | PAN | Panjabi | 394,938 | 6,569,439 | 7,564,168 |
| 87 | HYE | Armenian | 382,378 | 5,779,847 | 7,504,431 |
| 88 | KAN | Kannada | 327,475 | 4,029,042 | 7,429,653 |
| 89 | TAT | Tatar | 378,375 | 6,070,274 | 7,401,723 |
| 90 | PAP | Papiamento | 381,796 | 8,155,628 | 7,213,489 |
| 91 | BEM | Bemba (Zambia) | 381,297 | 6,526,989 | 7,171,421 |
| 92 | TPI | Tok Pisin | 383,675 | 9,120,242 | 7,162,905 |
| 93 | GUJ | Gujarati | 420,729 | 4,899,065 | 6,961,346 |
| 94 | SMO | Samoan | 364,010 | 9,262,113 | 6,940,694 |
| 95 | RUN | Rundi | 364,103 | 6,521,655 | 6,836,527 |
| 96 | FIJ | Fijian | 357,673 | 7,837,220 | 6,726,164 |
| 97 | EFI | Efik | 332,589 | 7,312,421 | 6,298,566 |
| 98 | TIR | Tigrinya | 320,856 | 4,953,825 | 6,288,908 |
| 99 | TON | Tonga (Tonga Islands) | 323,838 | 11,087,182 | 6,085,262 |
| 100 | LUE | Luvale | 317,092 | 4,717,517 | 6,023,708 |
| 101 | HAU | Hausa | 295,829 | 6,459,154 | 5,881,574 |
| 102 | LUA | Luba-Lulua | 292,212 | 5,517,519 | 5,532,234 |
| 103 | KIR | Kirghiz | 283,308 | 3,984,657 | 5,499,207 |
| 104 | TOI | Tonga (Zambia) | 291,857 | 4,344,007 | 5,468,385 |
| 105 | GUW | Gun | 286,899 | 6,650,627 | 5,431,468 |
| 106 | PAG | Pangasinan | 282,341 | 5,602,136 | 5,351,754 |
| 107 | WAR | Waray (Philippines) | 281,941 | 6,216,918 | 5,338,684 |
| 108 | PIS | Pijin | 263,681 | 5,313,880 | 5,010,374 |
| 109 | SWC | Congo Swahili | 271,892 | 4,583,791 | 4,937,615 |
| 110 | TGK | Tajik | 286,675 | 4,184,663 | 4,873,294 |
| 111 | SAG | Sango | 250,019 | 6,554,737 | 4,779,729 |
| 112 | SOM | Somali | 161,865 | 3,646,152 | 4,573,785 |
| 113 | MAH | Marshallese | 233,516 | 5,757,879 | 4,448,457 |
| 114 | OSS | Ossetian | 225,664 | 3,811,618 | 4,404,414 |
| 115 | TUM | Tumbuka | 232,540 | 3,622,880 | 4,347,800 |
| 116 | HMO | Hiri Motu | 227,759 | 4,781,868 | 4,314,100 |
| 117 | LUG | Ganda | 224,749 | 3,731,957 | 4,261,136 |
| 118 | BEL | Belarusian | 290,072 | 3,469,541 | 4,206,848 |
| 119 | PON | Pohnpeian | 218,908 | 4,397,432 | 4,185,579 |
| 120 | TLL | Tetela | 222,225 | 4,233,529 | 4,185,359 |
| 121 | LAT | Latin | 203,175 | 2,912,205 | 4,180,355 |
| 122 | KQN | Kaonde | 219,170 | 3,727,122 | 4,097,908 |
| 123 | YAP | Yapese | 212,548 | 6,219,777 | 4,078,280 |
| 124 | ISO | Isoko | 215,449 | 4,917,558 | 4,047,680 |
| 125 | CHK | Chuukese | 207,347 | 4,363,629 | 4,031,326 |
| 126 | NIU | Niuean | 214,222 | 5,416,864 | 3,993,808 |
| 127 | UMB | Umbundu | 212,228 | 3,988,296 | 3,939,817 |
| 128 | GIL | Gilbertese | 203,252 | 4,762,950 | 3,888,542 |
| 129 | KON | Kongo | 206,234 | 4,417,086 | 3,883,443 |
| 130 | VEN | Venda | 204,407 | 5,057,268 | 3,782,533 |
| 131 | LUB | Luba-Katanga | 197,423 | 3,541,419 | 3,742,176 |
| 132 | HAT | Haitian | 197,201 | 4,438,519 | 3,630,101 |
| 133 | KAL | Kalaallisut | 191,660 | 2,206,906 | 3,610,449 |
| 134 | ZNE | Zande (Individual) | 190,082 | 4,401,103 | 3,602,098 |
| 135 | OCI | Occitan (Post 1500) | 182,542 | 3,606,108 | 3,536,378 |
| 136 | LUS | Lushai | 187,503 | 4,290,199 | 3,534,861 |
| 137 | CRS | Seselwa Creole French | 188,361 | 3,847,882 | 3,528,139 |
| 138 | MOS | Mossi | 186,434 | 4,706,515 | 3,517,104 |
| 139 | TIV | Tiv | 184,113 | 4,810,668 | 3,469,806 |
| 140 | NDS | Low German | 185,909 | 2,927,867 | 3,456,271 |
| 141 | MFE | Morisyen | 181,560 | 4,068,226 | 3,367,397 |
| 142 | FRY | Western Frisian | 174,498 | 2,726,426 | 3,360,333 |
| 143 | MON | Mongolian | 169,290 | 2,406,026 | 3,330,866 |
| 144 | TVL | Tuvalu | 172,371 | 4,970,984 | 3,302,688 |
| 145 | YUA | Yucateco | 168,299 | 3,524,830 | 3,301,268 |
| 146 | KWY | San Salvador Kongo | 169,875 | 2,964,934 | 3,119,715 |
| 147 | WLS | Wallisian | 154,488 | 3,981,535 | 2,871,012 |
| 148 | ORM | Oromo | 155,084 | 2,646,041 | 2,858,967 |
| 149 | GUG | Paraguayan Guaraní | 143,391 | 2,151,216 | 2,742,729 |
| 150 | ZAI | Isthmus Zapotec | 146,783 | 2,741,722 | 2,741,357 |
| 151 | KUR | Kurdish | 111,126 | 3,090,640 | 2,729,572 |
| 152 | AYM | Aymara | 138,760 | 1,939,507 | 2,710,890 |
| 153 | KHM | Khmer | 150,117 | 2,966,056 | 2,683,112 |
| 154 | TZO | Tzotzil | 140,124 | 2,993,791 | 2,679,819 |
| 155 | BCI | Baoulé | 142,169 | 3,685,710 | 2,597,887 |
| 156 | SND | Sindhi | 86,214 | 2,580,903 | 2,566,710 |
| 157 | QUE | Quechua | 134,194 | 1,739,523 | 2,543,319 |
| 158 | LUO | Luo (Kenya And Tanzania) | 136,625 | 2,603,306 | 2,502,410 |
| 159 | LUN | Lunda | 134,578 | 1,857,219 | 2,482,581 |
| 160 | QUZ | Cusco Quechua | 127,408 | 1,649,454 | 2,450,616 |
| 161 | RND | Ruund | 133,631 | 2,443,228 | 2,446,692 |
| 162 | UZB | Uzbek | 137,566 | 2,233,702 | 2,360,392 |
| 163 | DIV | Dhivehi | 85,159 | 2,441,892 | 2,350,351 |
| 164 | WAL | Wolaytta | 120,608 | 1,844,136 | 2,325,431 |
| 165 | UIG | Uighur | 84,928 | 2,088,135 | 2,239,573 |
| 166 | SSW | Swati | 116,170 | 1,690,313 | 2,238,351 |
| 167 | TUK | Turkmen | 121,578 | 1,730,753 | 2,231,716 |
| 168 | QUY | Ayacucho Quechua | 113,702 | 1,402,781 | 2,164,966 |
| 169 | NYK | Nyaneka | 116,364 | 1,753,649 | 2,133,198 |
| 170 | TDT | Tetun Dili | 112,041 | 2,385,376 | 2,107,083 |
| 171 | BZS | Brazilian Sign Language | 110,679 | 2,044,945 | 2,065,232 |
| 172 | KWN | Kwangali | 106,595 | 1,709,040 | 1,939,056 |
| 173 | KAZ | Kazakh | 248,822 | 1,652,871 | 1,906,753 |
| 174 | KEK | Kekchí | 63,350 | 2,231,030 | 1,836,968 |
| 175 | KUA | Kuanyama | 99,227 | 1,932,349 | 1,830,081 |
| 176 | NDO | Ndonga | 99,817 | 1,864,378 | 1,810,583 |
| 177 | MRI | Maori | 62,963 | 2,151,493 | 1,807,107 |
| 178 | PCK | Paite Chin | 61,173 | 1,773,303 | 1,799,998 |
| 179 | PES | Iranian Persian | 64,142 | 1,508,974 | 1,791,906 |
| 180 | PLT | Plateau Malagasy | 60,810 | 1,843,559 | 1,789,347 |
| 181 | DJE | Zarma | 60,515 | 1,924,044 | 1,780,607 |
| 182 | LTZ | Luxembourgish | 92,860 | 1,400,718 | 1,744,759 |
| 183 | KIK | Kikuyu | 94,242 | 1,714,356 | 1,737,121 |
| 184 | NZI | Nzima | 92,884 | 1,798,852 | 1,685,594 |
| 185 | TOP | Papantla Totonac | 86,769 | 1,345,448 | 1,620,933 |
| 186 | KMB | Kimbundu | 90,341 | 1,960,969 | 1,617,965 |
| 187 | BAK | Bashkir | 88,618 | 1,216,200 | 1,580,575 |
| 188 | ARG | Aragonese | 82,038 | 1,641,632 | 1,535,954 |
| 189 | TSC | Tswa | 84,311 | 1,911,340 | 1,534,506 |
| 190 | FAO | Faroese | 75,612 | 1,194,153 | 1,534,477 |
| 191 | JSL | Japanese Sign Language | 83,773 | 2,223,227 | 1,528,544 |
| 192 | ISE | Italian Sign Language | 79,874 | 1,497,912 | 1,527,368 |
| 193 | GYM | Ngäbere | 78,796 | 1,624,979 | 1,459,454 |
| 194 | JAV | Javanese | 73,185 | 1,177,647 | 1,442,806 |
| 195 | ASM | Assamese | 94,568 | 1,027,775 | 1,390,206 |
| 196 | ZLM | Malay (Individual) | 72,676 | 1,139,427 | 1,372,948 |
| 197 | VMW | Makhuwa | 72,847 | 1,181,569 | 1,328,856 |
| 198 | ACH | Acoli | 73,172 | 1,496,706 | 1,325,711 |
| 199 | CHV | Chuvash | 68,211 | 1,032,279 | 1,302,667 |
| 200 | BRE | Breton | 129,742 | 1,301,922 | 1,286,455 |
| 201 | MCO | Coatlán Mixe | 66,222 | 1,102,029 | 1,263,381 |
| 202 | MFS | Mexican Sign Language | 63,494 | 1,247,858 | 1,253,416 |
| 203 | TOG | Tonga (Nyasa) | 67,113 | 1,053,514 | 1,231,614 |
| 204 | MAM | Mam | 57,254 | 1,424,153 | 1,171,359 |
| 205 | RAR | Rarotongan | 66,762 | 1,613,321 | 1,170,921 |
| 206 | ADA | Adangme | 63,021 | 1,669,121 | 1,131,992 |
| 207 | NNO | Norwegian Nynorsk | 139,111 | 1,113,937 | 1,116,261 |
| 208 | CAB | Garifuna | 59,416 | 1,016,417 | 1,095,937 |
| 209 | NCJ | Northern Puebla Nahuatl | 59,251 | 963,782 | 1,092,104 |
| 210 | ARZ | Egyptian Arabic | 54,590 | 927,189 | 1,089,748 |
| 211 | DHV | Dehu | 58,875 | 1,504,920 | 1,078,197 |
| 212 | WUU | Wu Chinese | 46,633 | 1,437,243 | 1,075,055 |
| 213 | DJK | Eastern Maroon Creole | 52,628 | 1,458,098 | 1,038,818 |
| 214 | GUC | Wayuu | 53,537 | 826,821 | 985,121 |
| 215 | CAK | Kaqchikel | 46,427 | 1,293,345 | 962,624 |
| 216 | SEH | Sena | 52,334 | 856,758 | 945,244 |
| 217 | CYM | Welsh | 99,826 | 1,055,852 | 937,929 |
| 218 | KAM | Kamba (Kenya) | 51,054 | 959,249 | 932,972 |
| 219 | SOP | Songe | 51,070 | 938,990 | 927,422 |
| 220 | QVI | Imbabura Highland Quichua | 50,527 | 672,711 | 921,493 |
| 221 | NYN | Nyankole | 50,379 | 806,471 | 912,254 |
| 222 | BAR | Bavarian | 58,409 | 796,255 | 908,588 |
| 223 | RSL | Russian Sign Language | 44,582 | 702,400 | 856,169 |
| 224 | SID | Sidamo | 46,851 | 686,888 | 847,841 |
| 225 | ORI | Oriya | 49,192 | 698,991 | 832,456 |
| 226 | IDO | Ido | 46,163 | 763,729 | 831,713 |
| 227 | LMO | Lombard | 39,461 | 864,899 | 827,408 |
| 228 | YAO | Yao | 43,689 | 675,341 | 791,700 |
| 229 | MGR | Mambwe-Lungu | 43,911 | 731,912 | 785,222 |
| 230 | KRI | Krio | 42,349 | 999,510 | 752,221 |
| 231 | MWL | Mirandese | 31,518 | 742,141 | 746,747 |
| 232 | HMN | Hmong | 41,806 | 906,741 | 729,196 |
| 233 | NGL | Lomwe | 39,339 | 596,790 | 693,966 |
| 234 | KSS | Southern Kisi | 37,693 | 773,744 | 655,897 |
| 235 | NCX | Central Puebla Nahuatl | 36,422 | 523,049 | 654,865 |
| 236 | KOO | Konzo | 36,378 | 575,840 | 642,178 |
| 237 | CJK | Chokwe | 35,767 | 601,420 | 627,383 |
| 238 | TCF | Malinaltepec Me’Phaa | 34,679 | 847,824 | 627,050 |
| 239 | BBC | Batak Toba | 35,181 | 574,823 | 619,966 |
| 240 | TOJ | Tojolabal | 33,805 | 667,730 | 606,099 |
| 241 | NIA | Nias | 34,280 | 582,906 | 604,629 |
| 242 | SRM | Saramaccan | 34,739 | 848,933 | 597,203 |
| 243 | IBA | Iban | 34,577 | 614,600 | 591,325 |
| 244 | NCH | Central Huasteca Nahuatl | 31,018 | 475,377 | 561,258 |
| 245 | FON | Fon | 31,273 | 865,700 | 552,748 |
| 246 | KAB | Kabyle | 38,018 | 758,336 | 550,414 |
| 247 | KSW | S’Gaw Karen | 26,363 | 1,264,404 | 545,235 |
| 248 | IBG | Ibanag | 30,270 | 568,962 | 537,232 |
| 249 | NGU | Guerrero Nahuatl | 29,768 | 462,975 | 535,371 |
| 250 | URH | Urhobo | 29,347 | 593,673 | 530,354 |
| 251 | NDC | Ndau | 30,369 | 488,763 | 527,840 |
| 252 | KBP | Kabiyè | 29,066 | 618,571 | 521,923 |
| 253 | WES | Cameroon Pidgin | 28,159 | 642,670 | 499,592 |
| 254 | MAU | Huautla Mazatec | 27,544 | 496,580 | 499,460 |
| 255 | BAS | Basa (Cameroon) | 27,771 | 616,191 | 496,676 |
| 256 | BUM | Bulu (Cameroon) | 27,996 | 624,348 | 494,140 |
| 257 | CTU | Chol | 26,462 | 547,357 | 478,157 |
| 258 | CNH | Hakha Chin | 27,733 | 554,298 | 477,786 |
| 259 | BTX | Batak Karo | 27,295 | 436,554 | 470,088 |
| 260 | NBA | Nyemba | 27,317 | 553,352 | 469,851 |
| 261 | LAO | Lao | 22,217 | 697,242 | 462,713 |
| 262 | NYU | Nyungwe | 24,491 | 419,713 | 427,580 |
| 263 | ABK | Abkhazian | 23,161 | 292,717 | 423,050 |
| 264 | PUS | Pushto | 28,260 | 483,360 | 421,907 |
| 265 | CHR | Cherokee | 15,746 | 287,938 | 416,623 |
| 266 | COP | Coptic | 15,706 | 256,370 | 416,296 |
| 267 | DOP | Lukpa | 15,711 | 558,508 | 416,290 |
| 268 | SYR | Syriac | 15,747 | 217,478 | 415,892 |
| 269 | QUW | Tena Lowland Quichua | 15,674 | 292,905 | 415,461 |
| 270 | USP | Uspanteco | 15,583 | 500,341 | 412,979 |
| 271 | QUC | K’Iche' | 15,575 | 616,986 | 412,280 |
| 272 | ROM | Romany | 16,048 | 422,362 | 411,993 |
| 273 | AMU | Guerrero Amuzgo | 15,533 | 566,676 | 411,225 |
| 274 | JAK | Jakun | 15,513 | 564,458 | 411,137 |
| 275 | NHG | Tetelcingo Nahuatl | 15,459 | 408,868 | 409,271 |
| 276 | TZH | Tzeltal | 22,481 | 529,093 | 408,029 |
| 277 | SHI | Tachelhit | 15,288 | 637,003 | 404,390 |
| 278 | CNI | Asháninka | 15,264 | 331,249 | 404,004 |
| 279 | WOL | Wolof | 15,230 | 402,534 | 403,295 |
| 280 | OKE | Okpe (Southwestern Edo) | 22,471 | 458,573 | 401,423 |
| 281 | CJP | Cabécar | 15,155 | 608,344 | 400,614 |
| 282 | FSE | Finnish Sign Language | 21,671 | 298,224 | 400,110 |
| 283 | GBI | Galela | 15,023 | 624,914 | 398,145 |
| 284 | SSP | Spanish Sign Language | 21,242 | 387,212 | 395,815 |
| 285 | PCM | Nigerian Pidgin | 22,001 | 465,750 | 394,764 |
| 286 | PPK | Uma | 14,576 | 660,598 | 384,278 |
| 287 | BHW | Biak | 22,261 | 366,474 | 381,127 |
| 288 | PSO | Polish Sign Language | 20,433 | 312,201 | 379,785 |
| 289 | CMN | Mandarin Chinese | 44,110 | 487,138 | 374,935 |
| 290 | CHQ | Quiotepec Chinantec | 14,251 | 912,859 | 366,313 |
| 291 | DIK | Southwestern Dinka | 13,319 | 383,777 | 353,980 |
| 292 | OJB | Northwestern Ojibwa | 13,318 | 290,036 | 353,940 |
| 293 | CHA | Chamorro | 14,539 | 316,758 | 350,061 |
| 294 | QUG | Chimborazo Highland Quichua | 20,272 | 247,947 | 349,776 |
| 295 | CSL | Chinese Sign Language | 17,874 | 494,654 | 348,101 |
| 296 | JIV | Shuar | 12,910 | 272,452 | 342,885 |
| 297 | AGR | Aguaruna | 12,778 | 295,678 | 338,609 |
| 298 | ACU | Achuar-Shiwiar | 12,347 | 349,234 | 328,050 |
| 299 | AKE | Akawaio | 12,346 | 493,819 | 326,593 |
| 300 | CCE | Chopi | 17,935 | 346,295 | 303,153 |
| 301 | CHW | Chuwabu | 17,988 | 252,729 | 299,706 |
| 302 | GSG | German Sign Language | 16,400 | 268,254 | 298,220 |
| 303 | ARN | Mapudungun | 16,737 | 275,559 | 296,456 |
| 304 | BSN | Barasana-Eduria | 11,180 | 681,542 | 291,888 |
| 305 | TTJ | Tooro | 16,442 | 252,981 | 280,979 |
| 306 | SUN | Sundanese | 15,850 | 250,374 | 272,390 |
| 307 | KBH | Camsá | 10,287 | 384,331 | 272,175 |
| 308 | LAM | Lamba | 14,846 | 240,797 | 271,580 |
| 309 | DUA | Duala | 15,351 | 444,826 | 269,185 |
| 310 | HNE | Chhattisgarhi | 52,059 | 322,484 | 263,135 |
| 311 | XMF | Mingrelian | 12,946 | 169,159 | 262,169 |
| 312 | KMR | Northern Kurdish | 14,798 | 273,245 | 260,963 |
| 313 | DYU | Dyula | 14,886 | 322,319 | 258,596 |
| 314 | HSH | Hungarian Sign Language | 13,877 | 214,890 | 256,618 |
| 315 | AED | Argentine Sign Language | 12,390 | 248,120 | 251,486 |
| 316 | NAV | Navajo | 14,626 | 229,364 | 248,195 |
| 317 | TYV | Tuvinian | 12,979 | 197,520 | 245,669 |
| 318 | RMN | Balkan Romani | 14,527 | 257,259 | 241,449 |
| 319 | FCS | Quebec Sign Language | 13,034 | 253,315 | 239,624 |
| 320 | TSS | Taiwan Sign Language | 12,298 | 357,723 | 239,608 |
| 321 | BTS | Batak Simalungun | 14,255 | 233,806 | 238,303 |
| 322 | GLV | Manx | 11,006 | 261,914 | 232,828 |
| 323 | NIJ | Ngaju | 13,158 | 213,322 | 222,299 |
| 324 | CSE | Czech Sign Language | 11,655 | 179,085 | 210,357 |
| 325 | WLN | Walloon | 41,887 | 301,028 | 210,120 |
| 326 | BIN | Bini | 11,635 | 261,717 | 208,326 |
| 327 | SXN | Sangir | 11,668 | 228,088 | 195,463 |
| 328 | KVK | Korean Sign Language | 9,330 | 321,614 | 194,428 |
| 329 | RMS | Romanian Sign Language | 10,434 | 198,109 | 193,459 |
| 330 | KAC | Kachin | 10,930 | 270,098 | 184,701 |
| 331 | SVK | Slovakian Sign Language | 10,141 | 159,092 | 182,557 |
| 332 | AMI | Amis | 9,156 | 186,011 | 175,520 |
| 333 | UDM | Udmurt | 9,394 | 147,593 | 173,612 |
| 334 | MNI | Manipuri | 7,281 | 127,875 | 162,599 |
| 335 | TMH | Tamashek | 5,363 | 168,471 | 152,620 |
| 336 | HER | Herero | 8,179 | 151,589 | 141,558 |
| 337 | GSS | Greek Sign Language | 7,090 | 137,633 | 140,185 |
| 338 | ALZ | Alur | 7,567 | 154,509 | 133,119 |
| 339 | BZJ | Belize Kriol English | 6,905 | 136,165 | 119,827 |
| 340 | IKU | Inuktitut | 5,244 | 66,528 | 113,112 |
| 341 | POT | Potawatomi | 4,113 | 108,987 | 110,373 |
| 342 | MXV | Metlatónoc Mixtec | 5,924 | 176,367 | 105,471 |
| 343 | PDT | Plautdietsch | 6,019 | 115,997 | 104,913 |
| 344 | SME | Northern Sami | 18,433 | 96,152 | 100,433 |
| 345 | INA | Interlingua (International Auxiliary Language Association) | 12,194 | 101,705 | 99,401 |
| 346 | ISH | Esan | 5,221 | 112,815 | 91,400 |
| 347 | KEA | Kabuverdianu | 5,217 | 102,331 | 90,402 |
| 348 | TSZ | Purepecha | 4,939 | 79,259 | 89,766 |
| 349 | GLA | Scottish Gaelic | 8,444 | 108,564 | 87,714 |
| 350 | TLH | Klingon | 12,602 | 76,817 | 87,087 |
| 351 | JBO | Lojban | 11,470 | 88,384 | 83,024 |
| 352 | CSN | Colombian Sign Language | 3,428 | 71,580 | 70,847 |
| 353 | ALT | Southern Altai | 3,764 | 52,407 | 70,386 |
| 354 | PSR | Portuguese Sign Language | 3,655 | 70,060 | 69,174 |
| 355 | TOH | Gitonga | 3,898 | 78,079 | 65,345 |
| 356 | YUE | Yue Chinese | 5,681 | 77,996 | 62,414 |
| 357 | FSL | French Sign Language | 2,862 | 58,931 | 56,227 |
| 358 | AST | Asturian | 9,261 | 59,964 | 54,602 |
| 359 | FIL | Filipino | 2,125 | 51,507 | 47,398 |
| 360 | SRD | Sardinian | 5,836 | 50,985 | 43,786 |
| 361 | SCO | Scots | 853 | 40,927 | 41,327 |
| 362 | ECS | Ecuadorian Sign Language | 2,019 | 38,677 | 39,731 |
| 363 | FUR | Friulian | 5,791 | 42,727 | 37,760 |
| 364 | YID | Yiddish | 4,039 | 32,533 | 33,128 |
| 365 | MEN | Mende (Sierra Leone) | 1,668 | 36,531 | 28,675 |
| 366 | GOM | Goan Konkani | 722 | 24,148 | 26,834 |
| 367 | LIM | Limburgan | 4,491 | 25,040 | 24,410 |
| 368 | LFN | Lingua Franca Nova | 3,443 | 26,131 | 24,227 |
| 369 | COR | Cornish | 4,105 | 24,498 | 23,815 |
| 370 | MAI | Maithili | 4,256 | 23,102 | 20,698 |
| 371 | VSL | Venezuelan Sign Language | 958 | 17,798 | 18,472 |
| 372 | CBK | Chavacano | 2,423 | 17,255 | 16,912 |
| 373 | ILE | Interlingue | 2,570 | 16,439 | 16,599 |
| 374 | VOL | Volapük | 2,338 | 12,510 | 15,481 |
| 375 | DTP | Kadazan Dusun | 1,831 | 12,814 | 12,803 |
| 376 | MIN | Minangkabau | 330 | 11,492 | 11,998 |
| 377 | TET | Tetum | 406 | 11,359 | 10,648 |
| 378 | PAM | Pampanga | 1,448 | 8,955 | 10,276 |
| 379 | ZSM | Standard Malay | 1,158 | 9,170 | 10,106 |
| 380 | PRL | Peruvian Sign Language | 529 | 8,562 | 9,567 |
| 381 | ZIB | Zimbabwe Sign Language | 356 | 9,497 | 9,249 |
| 382 | CRH | Crimean Tatar | 1,361 | 8,351 | 8,996 |
| 383 | KHA | Khasi | 1,277 | 9,156 | 8,623 |
| 384 | ASE | American Sign Language | 538 | 8,038 | 8,382 |
| 385 | BFI | British Sign Language | 384 | 7,005 | 8,302 |
| 386 | ARQ | Algerian Arabic | 919 | 5,587 | 7,531 |
| 387 | BOD | Tibetan | 1,029 | 17,081 | 6,956 |
| 388 | ZPA | Lachiguiri Zapotec | 359 | 6,326 | 6,549 |
| 389 | LZH | Literary Chinese | 531 | 5,024 | 6,379 |
| 390 | GOS | Gronings | 986 | 4,727 | 4,963 |
| 391 | GRC | Ancient Greek (To 1453) | 568 | 3,778 | 4,845 |
| 392 | NST | Tase Naga | 769 | 5,773 | 4,794 |
| 393 | CSG | Chilean Sign Language | 329 | 4,320 | 4,734 |
| 394 | GOR | Gorontalo | 111 | 4,034 | 4,651 |
| 395 | MZY | Mozambican Sign Language | 244 | 4,364 | 4,447 |
| 396 | CKB | Central Kurdish | 1,040 | 4,332 | 4,400 |
| 397 | ANG | Old English (Ca. 450-1100) | 993 | 4,273 | 4,088 |
| 398 | CSB | Kashubian | 892 | 4,280 | 4,070 |
| 399 | OTA | Ottoman Turkish (1500-1928) | 622 | 3,564 | 4,032 |
| 400 | KAS | Kashmiri | 701 | 3,855 | 3,743 |
| 401 | SAT | Santali | 101 | 3,263 | 3,604 |
| 402 | HOC | Ho | 631 | 3,038 | 3,520 |
| 403 | ZZA | Zaza | 505 | 2,895 | 3,194 |
| 404 | COS | Corsican | 75 | 2,984 | 2,880 |
| 405 | DZO | Dzongkha | 449 | 8,250 | 2,871 |
| 406 | INL | Indonesian Sign Language | 206 | 2,459 | 2,814 |
| 407 | DIQ | Dimli (Individual) | 74 | 1,924 | 2,669 |
| 408 | GRN | Guarani | 229 | 1,883 | 2,577 |
| 409 | SWH | Swahili (Individual) | 369 | 1,826 | 2,522 |
| 410 | WAE | Walser | 512 | 2,416 | 2,483 |
| 411 | LAD | Ladino | 371 | 2,196 | 2,412 |
| 412 | ACE | Achinese | 446 | 2,783 | 2,409 |
| 413 | ASF | Auslan | 149 | 2,413 | 2,322 |
| 414 | AKA | Akan | 61 | 2,214 | 2,225 |
| 415 | JAM | Jamaican Creole English | 61 | 2,060 | 2,166 |
| 416 | ORV | Old Russian | 313 | 1,749 | 2,144 |
| 417 | PMS | Piemontese | 263 | 2,626 | 2,133 |
| 418 | GSW | Swiss German | 220 | 1,914 | 2,052 |
| 419 | XAL | Kalmyk | 268 | 1,624 | 2,041 |
| 420 | CSF | Cuba Sign Language | 117 | 1,816 | 1,946 |
| 421 | ZSL | Zambian Sign Language | 96 | 2,299 | 1,915 |
| 422 | INS | Indian Sign Language | 208 | 1,754 | 1,905 |
| 423 | NAN | Min Nan Chinese | 88 | 2,352 | 1,894 |
| 424 | MAX | North Moluccan Malay | 268 | 1,981 | 1,857 |
| 425 | PRG | Prussian | 213 | 1,526 | 1,717 |
| 426 | GOT | Gothic | 207 | 4,340 | 1,698 |
| 427 | BXR | Russia Buriat | 44 | 1,221 | 1,658 |
| 428 | TCY | Tulu | 47 | 1,265 | 1,541 |
| 429 | SAH | Yakut | 65 | 1,042 | 1,537 |
| 430 | FRP | Arpitan | 473 | 1,820 | 1,485 |
| 431 | KAU | Kanuri | 285 | 2,936 | 1,453 |
| 432 | BVL | Bolivian Sign Language | 98 | 1,188 | 1,391 |
| 433 | NOV | Novial | 187 | 1,268 | 1,296 |
| 434 | HRX | Hunsrik | 214 | 1,300 | 1,247 |
| 435 | AWA | Awadhi | 248 | 1,249 | 1,246 |
| 436 | AVK | Kotava | 157 | 933 | 1,221 |
| 437 | PIH | Pitcairn-Norfolk | 39 | 1,425 | 1,195 |
| 438 | PYS | Paraguayan Sign Language | 90 | 971 | 1,150 |
| 439 | NEW | Newari | 38 | 993 | 1,101 |
| 440 | HIF | Fiji Hindi | 45 | 777 | 1,044 |
| 441 | MZN | Mazanderani | 47 | 770 | 1,002 |
| 442 | BHO | Bhojpuri | 55 | 858 | 908 |
| 443 | SAN | Sanskrit | 150 | 725 | 887 |
| 444 | HAW | Hawaiian | 94 | 824 | 874 |
| 445 | DTY | Dotyali | 24 | 874 | 804 |
| 446 | PDC | Pennsylvania German | 65 | 706 | 734 |
| 447 | RUE | Rusyn | 113 | 479 | 683 |
| 448 | KRL | Karelian | 135 | 655 | 682 |
| 449 | DSB | Lower Sorbian | 37 | 471 | 677 |
| 450 | SHN | Shan | 150 | 1,926 | 652 |
| 451 | EXT | Extremaduran | 65 | 588 | 628 |
| 452 | CHO | Choctaw | 107 | 595 | 624 |
| 453 | FKV | Kven Finnish | 56 | 528 | 622 |
| 454 | QYA | Quenya | 100 | 423 | 581 |
| 455 | GLK | Gilaki | 12 | 268 | 521 |
| 456 | TPW | Tupí | 87 | 520 | 518 |
| 457 | TZL | Talossan | 108 | 480 | 483 |
| 458 | MHR | Eastern Mari | 69 | 379 | 482 |
| 459 | RMY | Vlax Romani | 10 | 522 | 467 |
| 460 | NOG | Nogai | 80 | 326 | 464 |
| 461 | NPI | Nepali (Individual) | 98 | 380 | 457 |
| 462 | EGL | Emilian | 81 | 499 | 456 |
| 463 | GCF | Guadeloupean Creole French | 78 | 464 | 446 |
| 464 | LDN | Láadan | 76 | 464 | 430 |
| 465 | SFS | South African Sign Language | 30 | 380 | 428 |
| 466 | MWW | Hmong Daw | 74 | 497 | 408 |
| 467 | LIJ | Ligurian | 52 | 419 | 385 |
| 468 | AFB | Gulf Arabic | 69 | 303 | 370 |
| 469 | KSH | Kölsch | 23 | 385 | 358 |
| 470 | SGS | Samogitian | 43 | 199 | 356 |
| 471 | MGM | Mambae | 33 | 282 | 307 |
| 472 | PNT | Pontic | 5 | 300 | 301 |
| 473 | MYV | Erzya | 33 | 239 | 300 |
| 474 | NAP | Neapolitan | 29 | 299 | 249 |
| 475 | IKE | Eastern Canadian Inuktitut | 43 | 138 | 245 |
| 476 | HSB | Upper Sorbian | 36 | 194 | 235 |
| 477 | LLD | Ladin | 20 | 225 | 227 |
| 478 | FRM | Middle French (Ca. 1400-1600) | 17 | 207 | 220 |
| 479 | ARY | Moroccan Arabic | 41 | 148 | 216 |
| 480 | ROH | Romansh | 16 | 205 | 216 |
| 481 | SMA | Southern Sami | 44 | 178 | 216 |
| 482 | PPL | Pipil | 29 | 163 | 208 |
| 483 | SHS | Shuswap | 39 | 243 | 207 |
| 484 | TLY | Talysh | 43 | 148 | 207 |
| 485 | PNB | Western Panjabi | 31 | 204 | 194 |
| 486 | PMY | Papuan Malay | 41 | 189 | 186 |
| 487 | SJN | Sindarin | 31 | 157 | 186 |
| 488 | SUX | Sumerian | 36 | 203 | 186 |
| 489 | SZL | Silesian | 34 | 153 | 177 |
| 490 | LIV | Liv | 29 | 150 | 176 |
| 491 | RIF | Tarifit | 34 | 145 | 170 |
| 492 | BVY | Baybayanon | 23 | 157 | 163 |
| 493 | MIQ | Mískito | 66 | 165 | 158 |
| 494 | FUV | Nigerian Fulfulde | 29 | 133 | 157 |
| 495 | AIN | Ainu (Japan) | 26 | 117 | 154 |
| 496 | NLV | Orizaba Nahuatl | 14 | 137 | 154 |
| 497 | GBM | Garhwali | 34 | 162 | 144 |
| 498 | HDS | Honduras Sign Language | 12 | 94 | 139 |
| 499 | NON | Old Norse | 13 | 134 | 136 |
| 500 | ALN | Gheg Albanian | 25 | 133 | 134 |
Acknowledgements
All the data consolidated in this work were retrieved from various sources, and we do not own the datasets. If you use this dataset, please cite all the articles in the `citations.bib` file.
We are making this derived dataset easily accessible with the intention of accelerating research on language technologies for low-resource languages. However, if you view this derived dataset as a violation of intellectual property rights, please let us know, and we will be happy to remove the material in question from the corpus.