Many-to-English: Data (v1)

Datasets are available at

train.raw.tsv.gz  # Training data in raw form, before cleaning, deduping and tokenization
train.v1.eng.tok.gz # English training data, after cleaning and tokenization
train.v1.src.tok.gz # Source training data, after cleaning and tokenization
train.v1.lang.gz   # lang ID of source side sentences
train.v1.prov.gz   # provenance of record (to see where this record came from)
train.v1.tok.stats.tsv # stats such as sentence and token count per language
devs-combo-shuf10k-raw+tok.tgz # 10K sentences for validation, randomly sampled from all dev sets
devtests-raw+tok.tgz  # all the dev and test data; both raw and tokenized
citations.bib  # BibTeX of articles which published the datasets collected in this work
prep.tgz  # scripts to prepare datasets from square 1.

After running gunzip, train.v1.{eng.tok,src.tok,lang,prov} are plain text files. They should all have the same number of lines; the line number is how a record is cross-referenced between them.
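The line-number cross-referencing described above can be sketched in Python by reading the files in lockstep, without running gunzip first. This is a minimal sketch, not part of the released scripts: the function names, the UTF-8 encoding, and the language ID value in the usage note are assumptions made for illustration.

```python
import gzip
from itertools import zip_longest


def read_parallel(paths, encoding="utf-8"):
    """Yield one tuple per record, pairing line i of every file in `paths`.

    Assumption: the files are gzip-compressed UTF-8 text, one record per line.
    """
    handles = [gzip.open(p, "rt", encoding=encoding) for p in paths]
    try:
        for row in zip_longest(*handles):
            # zip_longest pads exhausted files with None, which exposes a length mismatch.
            if any(field is None for field in row):
                raise ValueError("files do not have the same number of lines")
            yield tuple(field.rstrip("\n") for field in row)
    finally:
        for handle in handles:
            handle.close()


def records_for_language(paths, lang_column, lang_id):
    """Keep only the records whose entry in column `lang_column` equals `lang_id`."""
    for row in read_parallel(paths):
        if row[lang_column] == lang_id:
            yield row
```

For example, `records_for_language(["train.v1.src.tok.gz", "train.v1.eng.tok.gz", "train.v1.lang.gz"], 2, "fra")` would yield (source, English, language) triples for one language. The value `"fra"` is hypothetical; check train.v1.lang.gz for the exact spelling and casing of the IDs.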

You may also prepare these datasets from scratch, or revise the cleaning steps, starting from train.raw.tsv.gz. The prep.tgz archive includes the datatprep.ipynb notebook, which contains the steps to download, tokenize, deduplicate, and filter out bad records.

Figure 1. V1 Statistics (Get PDF)
Table 1. Many-English V1 training data statistics (Get TSV)
# | ISO 639-3 | Name | Sentences | SourceTokens | EnglishTokens
Total | | | 473,791,285 | 9,001,777,125 | 9,072,884,192
1 | FRA | French | 33,010,111 | 1,001,694,594 | 862,394,235
2 | RUS | Russian | 25,564,367 | 640,353,439 | 682,207,974
3 | ARA | Arabic | 22,679,389 | 588,552,279 | 673,442,592
4 | ZHO | Chinese | 20,057,876 | 827,184,155 | 563,720,365
5 | TUR | Turkish | 37,720,743 | 282,136,509 | 386,742,077
6 | SRP | Serbian | 33,880,099 | 279,793,956 | 343,470,174
7 | HEB | Hebrew | 25,841,585 | 224,902,346 | 283,316,533
8 | NLD | Dutch | 12,428,300 | 278,665,986 | 281,604,016
9 | POR | Portuguese | 10,954,498 | 279,392,943 | 264,796,337
10 | DEU | German | 12,119,459 | 245,035,715 | 254,126,788
11 | ITA | Italian | 10,095,386 | 244,588,092 | 245,359,909
12 | SPA | Spanish | 9,776,966 | 229,401,449 | 212,893,081
13 | SWE | Swedish | 8,024,230 | 156,259,227 | 175,706,578
14 | DAN | Danish | 7,682,343 | 163,476,497 | 173,310,159
15 | FIN | Finnish | 8,187,935 | 131,239,231 | 172,686,863
16 | POL | Polish | 8,289,276 | 151,588,493 | 169,532,813
17 | ELL | Modern Greek (1453-) | 6,835,717 | 155,615,617 | 154,230,888
18 | NOR | Norwegian | 10,860,768 | 126,923,739 | 142,425,026
19 | HUN | Hungarian | 6,785,904 | 124,670,924 | 140,421,813
20 | SLV | Slovenian | 6,227,413 | 123,184,373 | 135,429,495
21 | BOS | Bosnian | 12,903,765 | 108,980,163 | 134,352,300
22 | SLK | Slovak | 5,689,766 | 111,471,118 | 125,492,209
23 | EST | Estonian | 5,788,575 | 94,125,966 | 120,585,810
24 | LIT | Lithuanian | 5,139,565 | 95,332,259 | 113,401,294
25 | LAV | Latvian | 4,460,210 | 88,412,493 | 104,312,482
26 | FAS | Persian | 8,054,223 | 96,700,032 | 103,579,635
27 | JPN | Japanese | 5,379,355 | 111,898,280 | 95,783,174
28 | VIE | Vietnamese | 6,186,692 | 112,645,134 | 91,410,645
29 | UKR | Ukrainian | 4,446,827 | 66,084,956 | 75,462,511
30 | CES | Czech | 3,986,495 | 64,539,275 | 74,129,349
31 | MLT | Maltese | 3,079,369 | 85,936,755 | 71,180,461
32 | KOR | Korean | 3,907,008 | 124,773,196 | 67,101,857
33 | IND | Indonesian | 3,441,203 | 63,578,787 | 64,369,133
34 | CAT | Catalan | 3,151,150 | 66,838,920 | 60,317,575
35 | RON | Romanian | 2,871,321 | 52,945,536 | 51,780,651
36 | BUL | Bulgarian | 2,755,198 | 47,511,112 | 50,192,767
37 | THA | Thai | 4,003,627 | 55,716,180 | 49,463,563
38 | GLE | Irish | 1,770,628 | 48,936,403 | 45,855,008
39 | HRV | Croatian | 2,304,149 | 34,755,332 | 39,315,927
40 | HIN | Hindi | 2,211,381 | 41,452,435 | 38,163,209
41 | MKD | Macedonian | 1,898,346 | 29,456,395 | 31,284,559
42 | EUS | Basque | 2,101,130 | 24,427,549 | 30,827,665
43 | SQI | Albanian | 1,659,043 | 28,953,259 | 28,389,962
44 | URD | Urdu | 1,121,988 | 28,181,194 | 26,194,588
45 | TGL | Tagalog | 1,307,417 | 28,950,146 | 26,087,793
46 | BEN | Bengali | 1,469,860 | 22,481,515 | 23,509,544
47 | GLG | Galician | 1,270,160 | 23,281,744 | 22,762,444
48 | AFR | Afrikaans | 1,164,819 | 22,720,174 | 21,574,598
49 | CEB | Cebuano | 1,177,127 | 23,461,502 | 21,304,716
50 | EPO | Esperanto | 1,273,333 | 20,025,116 | 20,887,797
51 | SWA | Swahili | 975,456 | 17,407,097 | 19,155,309
52 | ZUL | Zulu | 964,117 | 13,671,887 | 18,412,040
53 | MSA | Malay | 1,945,672 | 16,671,082 | 18,053,859
54 | TAM | Tamil | 1,020,167 | 11,400,283 | 17,989,394
55 | XHO | Xhosa | 993,668 | 13,026,040 | 17,543,214
56 | MAL | Malayalam | 1,042,337 | 10,594,525 | 17,246,958
57 | ILO | Iloko | 898,926 | 17,814,296 | 17,022,059
58 | SIN | Sinhala | 1,141,931 | 11,643,363 | 16,436,631
59 | MLG | Malagasy | 826,222 | 17,981,469 | 16,408,357
60 | HIL | Hiligaynon | 807,375 | 17,432,476 | 15,442,425
61 | SNA | Shona | 763,546 | 11,391,280 | 15,229,553
62 | NYA | Nyanja | 778,089 | 12,958,088 | 14,921,565
63 | TSN | Tswana | 780,798 | 20,008,782 | 14,855,070
64 | TSO | Tsonga | 757,853 | 17,309,341 | 14,474,201
65 | AMH | Amharic | 669,145 | 9,765,847 | 14,326,330
66 | ISL | Icelandic | 1,112,770 | 10,987,587 | 13,281,910
67 | AZE | Azerbaijani | 693,153 | 10,806,686 | 12,639,115
68 | KAT | Georgian | 677,998 | 7,474,496 | 11,906,197
69 | MAR | Marathi | 625,462 | 8,315,779 | 11,844,853
70 | MYA | Burmese | 510,083 | 12,713,472 | 11,545,908
71 | EWE | Ewe | 588,735 | 12,826,778 | 11,227,502
72 | SRN | Sranan Tongo | 546,788 | 14,002,996 | 10,719,800
73 | TAH | Tahitian | 547,403 | 16,474,812 | 10,587,464
74 | NSO | Pedi | 555,777 | 13,718,112 | 10,567,895
75 | LIN | Lingala | 536,198 | 10,804,478 | 10,179,883
76 | TWI | Twi | 537,268 | 11,294,035 | 10,174,586
77 | TEL | Telugu | 557,715 | 6,533,017 | 9,246,609
78 | KIN | Kinyarwanda | 488,086 | 8,493,687 | 9,135,649
79 | BIS | Bislama | 476,064 | 11,972,512 | 9,034,190
80 | BCL | Central Bikol | 451,274 | 9,946,706 | 8,765,221
81 | NEP | Nepali | 444,058 | 5,427,354 | 8,183,887
82 | LOZ | Lozi | 411,874 | 9,495,543 | 7,872,336
83 | GAA | Ga | 409,659 | 9,284,738 | 7,868,224
84 | IBO | Igbo | 415,234 | 10,075,710 | 7,737,716
85 | YOR | Yoruba | 411,461 | 12,671,491 | 7,668,947
86 | PAN | Panjabi | 394,938 | 6,569,439 | 7,564,168
87 | HYE | Armenian | 382,378 | 5,779,847 | 7,504,431
88 | KAN | Kannada | 327,475 | 4,029,042 | 7,429,653
89 | TAT | Tatar | 378,375 | 6,070,274 | 7,401,723
90 | PAP | Papiamento | 381,796 | 8,155,628 | 7,213,489
91 | BEM | Bemba (Zambia) | 381,297 | 6,526,989 | 7,171,421
92 | TPI | Tok Pisin | 383,675 | 9,120,242 | 7,162,905
93 | GUJ | Gujarati | 420,729 | 4,899,065 | 6,961,346
94 | SMO | Samoan | 364,010 | 9,262,113 | 6,940,694
95 | RUN | Rundi | 364,103 | 6,521,655 | 6,836,527
96 | FIJ | Fijian | 357,673 | 7,837,220 | 6,726,164
97 | EFI | Efik | 332,589 | 7,312,421 | 6,298,566
98 | TIR | Tigrinya | 320,856 | 4,953,825 | 6,288,908
99 | TON | Tonga (Tonga Islands) | 323,838 | 11,087,182 | 6,085,262
100 | LUE | Luvale | 317,092 | 4,717,517 | 6,023,708
101 | HAU | Hausa | 295,829 | 6,459,154 | 5,881,574
102 | LUA | Luba-Lulua | 292,212 | 5,517,519 | 5,532,234
103 | KIR | Kirghiz | 283,308 | 3,984,657 | 5,499,207
104 | TOI | Tonga (Zambia) | 291,857 | 4,344,007 | 5,468,385
105 | GUW | Gun | 286,899 | 6,650,627 | 5,431,468
106 | PAG | Pangasinan | 282,341 | 5,602,136 | 5,351,754
107 | WAR | Waray (Philippines) | 281,941 | 6,216,918 | 5,338,684
108 | PIS | Pijin | 263,681 | 5,313,880 | 5,010,374
109 | SWC | Congo Swahili | 271,892 | 4,583,791 | 4,937,615
110 | TGK | Tajik | 286,675 | 4,184,663 | 4,873,294
111 | SAG | Sango | 250,019 | 6,554,737 | 4,779,729
112 | SOM | Somali | 161,865 | 3,646,152 | 4,573,785
113 | MAH | Marshallese | 233,516 | 5,757,879 | 4,448,457
114 | OSS | Ossetian | 225,664 | 3,811,618 | 4,404,414
115 | TUM | Tumbuka | 232,540 | 3,622,880 | 4,347,800
116 | HMO | Hiri Motu | 227,759 | 4,781,868 | 4,314,100
117 | LUG | Ganda | 224,749 | 3,731,957 | 4,261,136
118 | BEL | Belarusian | 290,072 | 3,469,541 | 4,206,848
119 | PON | Pohnpeian | 218,908 | 4,397,432 | 4,185,579
120 | TLL | Tetela | 222,225 | 4,233,529 | 4,185,359
121 | LAT | Latin | 203,175 | 2,912,205 | 4,180,355
122 | KQN | Kaonde | 219,170 | 3,727,122 | 4,097,908
123 | YAP | Yapese | 212,548 | 6,219,777 | 4,078,280
124 | ISO | Isoko | 215,449 | 4,917,558 | 4,047,680
125 | CHK | Chuukese | 207,347 | 4,363,629 | 4,031,326
126 | NIU | Niuean | 214,222 | 5,416,864 | 3,993,808
127 | UMB | Umbundu | 212,228 | 3,988,296 | 3,939,817
128 | GIL | Gilbertese | 203,252 | 4,762,950 | 3,888,542
129 | KON | Kongo | 206,234 | 4,417,086 | 3,883,443
130 | VEN | Venda | 204,407 | 5,057,268 | 3,782,533
131 | LUB | Luba-Katanga | 197,423 | 3,541,419 | 3,742,176
132 | HAT | Haitian | 197,201 | 4,438,519 | 3,630,101
133 | KAL | Kalaallisut | 191,660 | 2,206,906 | 3,610,449
134 | ZNE | Zande (Individual) | 190,082 | 4,401,103 | 3,602,098
135 | OCI | Occitan (Post 1500) | 182,542 | 3,606,108 | 3,536,378
136 | LUS | Lushai | 187,503 | 4,290,199 | 3,534,861
137 | CRS | Seselwa Creole French | 188,361 | 3,847,882 | 3,528,139
138 | MOS | Mossi | 186,434 | 4,706,515 | 3,517,104
139 | TIV | Tiv | 184,113 | 4,810,668 | 3,469,806
140 | NDS | Low German | 185,909 | 2,927,867 | 3,456,271
141 | MFE | Morisyen | 181,560 | 4,068,226 | 3,367,397
142 | FRY | Western Frisian | 174,498 | 2,726,426 | 3,360,333
143 | MON | Mongolian | 169,290 | 2,406,026 | 3,330,866
144 | TVL | Tuvalu | 172,371 | 4,970,984 | 3,302,688
145 | YUA | Yucateco | 168,299 | 3,524,830 | 3,301,268
146 | KWY | San Salvador Kongo | 169,875 | 2,964,934 | 3,119,715
147 | WLS | Wallisian | 154,488 | 3,981,535 | 2,871,012
148 | ORM | Oromo | 155,084 | 2,646,041 | 2,858,967
149 | GUG | Paraguayan Guaraní | 143,391 | 2,151,216 | 2,742,729
150 | ZAI | Isthmus Zapotec | 146,783 | 2,741,722 | 2,741,357
151 | KUR | Kurdish | 111,126 | 3,090,640 | 2,729,572
152 | AYM | Aymara | 138,760 | 1,939,507 | 2,710,890
153 | KHM | Khmer | 150,117 | 2,966,056 | 2,683,112
154 | TZO | Tzotzil | 140,124 | 2,993,791 | 2,679,819
155 | BCI | Baoulé | 142,169 | 3,685,710 | 2,597,887
156 | SND | Sindhi | 86,214 | 2,580,903 | 2,566,710
157 | QUE | Quechua | 134,194 | 1,739,523 | 2,543,319
158 | LUO | Luo (Kenya And Tanzania) | 136,625 | 2,603,306 | 2,502,410
159 | LUN | Lunda | 134,578 | 1,857,219 | 2,482,581
160 | QUZ | Cusco Quechua | 127,408 | 1,649,454 | 2,450,616
161 | RND | Ruund | 133,631 | 2,443,228 | 2,446,692
162 | UZB | Uzbek | 137,566 | 2,233,702 | 2,360,392
163 | DIV | Dhivehi | 85,159 | 2,441,892 | 2,350,351
164 | WAL | Wolaytta | 120,608 | 1,844,136 | 2,325,431
165 | UIG | Uighur | 84,928 | 2,088,135 | 2,239,573
166 | SSW | Swati | 116,170 | 1,690,313 | 2,238,351
167 | TUK | Turkmen | 121,578 | 1,730,753 | 2,231,716
168 | QUY | Ayacucho Quechua | 113,702 | 1,402,781 | 2,164,966
169 | NYK | Nyaneka | 116,364 | 1,753,649 | 2,133,198
170 | TDT | Tetun Dili | 112,041 | 2,385,376 | 2,107,083
171 | BZS | Brazilian Sign Language | 110,679 | 2,044,945 | 2,065,232
172 | KWN | Kwangali | 106,595 | 1,709,040 | 1,939,056
173 | KAZ | Kazakh | 248,822 | 1,652,871 | 1,906,753
174 | KEK | Kekchí | 63,350 | 2,231,030 | 1,836,968
175 | KUA | Kuanyama | 99,227 | 1,932,349 | 1,830,081
176 | NDO | Ndonga | 99,817 | 1,864,378 | 1,810,583
177 | MRI | Maori | 62,963 | 2,151,493 | 1,807,107
178 | PCK | Paite Chin | 61,173 | 1,773,303 | 1,799,998
179 | PES | Iranian Persian | 64,142 | 1,508,974 | 1,791,906
180 | PLT | Plateau Malagasy | 60,810 | 1,843,559 | 1,789,347
181 | DJE | Zarma | 60,515 | 1,924,044 | 1,780,607
182 | LTZ | Luxembourgish | 92,860 | 1,400,718 | 1,744,759
183 | KIK | Kikuyu | 94,242 | 1,714,356 | 1,737,121
184 | NZI | Nzima | 92,884 | 1,798,852 | 1,685,594
185 | TOP | Papantla Totonac | 86,769 | 1,345,448 | 1,620,933
186 | KMB | Kimbundu | 90,341 | 1,960,969 | 1,617,965
187 | BAK | Bashkir | 88,618 | 1,216,200 | 1,580,575
188 | ARG | Aragonese | 82,038 | 1,641,632 | 1,535,954
189 | TSC | Tswa | 84,311 | 1,911,340 | 1,534,506
190 | FAO | Faroese | 75,612 | 1,194,153 | 1,534,477
191 | JSL | Japanese Sign Language | 83,773 | 2,223,227 | 1,528,544
192 | ISE | Italian Sign Language | 79,874 | 1,497,912 | 1,527,368
193 | GYM | Ngäbere | 78,796 | 1,624,979 | 1,459,454
194 | JAV | Javanese | 73,185 | 1,177,647 | 1,442,806
195 | ASM | Assamese | 94,568 | 1,027,775 | 1,390,206
196 | ZLM | Malay (Individual) | 72,676 | 1,139,427 | 1,372,948
197 | VMW | Makhuwa | 72,847 | 1,181,569 | 1,328,856
198 | ACH | Acoli | 73,172 | 1,496,706 | 1,325,711
199 | CHV | Chuvash | 68,211 | 1,032,279 | 1,302,667
200 | BRE | Breton | 129,742 | 1,301,922 | 1,286,455
201 | MCO | Coatlán Mixe | 66,222 | 1,102,029 | 1,263,381
202 | MFS | Mexican Sign Language | 63,494 | 1,247,858 | 1,253,416
203 | TOG | Tonga (Nyasa) | 67,113 | 1,053,514 | 1,231,614
204 | MAM | Mam | 57,254 | 1,424,153 | 1,171,359
205 | RAR | Rarotongan | 66,762 | 1,613,321 | 1,170,921
206 | ADA | Adangme | 63,021 | 1,669,121 | 1,131,992
207 | NNO | Norwegian Nynorsk | 139,111 | 1,113,937 | 1,116,261
208 | CAB | Garifuna | 59,416 | 1,016,417 | 1,095,937
209 | NCJ | Northern Puebla Nahuatl | 59,251 | 963,782 | 1,092,104
210 | ARZ | Egyptian Arabic | 54,590 | 927,189 | 1,089,748
211 | DHV | Dehu | 58,875 | 1,504,920 | 1,078,197
212 | WUU | Wu Chinese | 46,633 | 1,437,243 | 1,075,055
213 | DJK | Eastern Maroon Creole | 52,628 | 1,458,098 | 1,038,818
214 | GUC | Wayuu | 53,537 | 826,821 | 985,121
215 | CAK | Kaqchikel | 46,427 | 1,293,345 | 962,624
216 | SEH | Sena | 52,334 | 856,758 | 945,244
217 | CYM | Welsh | 99,826 | 1,055,852 | 937,929
218 | KAM | Kamba (Kenya) | 51,054 | 959,249 | 932,972
219 | SOP | Songe | 51,070 | 938,990 | 927,422
220 | QVI | Imbabura Highland Quichua | 50,527 | 672,711 | 921,493
221 | NYN | Nyankole | 50,379 | 806,471 | 912,254
222 | BAR | Bavarian | 58,409 | 796,255 | 908,588
223 | RSL | Russian Sign Language | 44,582 | 702,400 | 856,169
224 | SID | Sidamo | 46,851 | 686,888 | 847,841
225 | ORI | Oriya | 49,192 | 698,991 | 832,456
226 | IDO | Ido | 46,163 | 763,729 | 831,713
227 | LMO | Lombard | 39,461 | 864,899 | 827,408
228 | YAO | Yao | 43,689 | 675,341 | 791,700
229 | MGR | Mambwe-Lungu | 43,911 | 731,912 | 785,222
230 | KRI | Krio | 42,349 | 999,510 | 752,221
231 | MWL | Mirandese | 31,518 | 742,141 | 746,747
232 | HMN | Hmong | 41,806 | 906,741 | 729,196
233 | NGL | Lomwe | 39,339 | 596,790 | 693,966
234 | KSS | Southern Kisi | 37,693 | 773,744 | 655,897
235 | NCX | Central Puebla Nahuatl | 36,422 | 523,049 | 654,865
236 | KOO | Konzo | 36,378 | 575,840 | 642,178
237 | CJK | Chokwe | 35,767 | 601,420 | 627,383
238 | TCF | Malinaltepec Me’Phaa | 34,679 | 847,824 | 627,050
239 | BBC | Batak Toba | 35,181 | 574,823 | 619,966
240 | TOJ | Tojolabal | 33,805 | 667,730 | 606,099
241 | NIA | Nias | 34,280 | 582,906 | 604,629
242 | SRM | Saramaccan | 34,739 | 848,933 | 597,203
243 | IBA | Iban | 34,577 | 614,600 | 591,325
244 | NCH | Central Huasteca Nahuatl | 31,018 | 475,377 | 561,258
245 | FON | Fon | 31,273 | 865,700 | 552,748
246 | KAB | Kabyle | 38,018 | 758,336 | 550,414
247 | KSW | S’Gaw Karen | 26,363 | 1,264,404 | 545,235
248 | IBG | Ibanag | 30,270 | 568,962 | 537,232
249 | NGU | Guerrero Nahuatl | 29,768 | 462,975 | 535,371
250 | URH | Urhobo | 29,347 | 593,673 | 530,354
251 | NDC | Ndau | 30,369 | 488,763 | 527,840
252 | KBP | Kabiyè | 29,066 | 618,571 | 521,923
253 | WES | Cameroon Pidgin | 28,159 | 642,670 | 499,592
254 | MAU | Huautla Mazatec | 27,544 | 496,580 | 499,460
255 | BAS | Basa (Cameroon) | 27,771 | 616,191 | 496,676
256 | BUM | Bulu (Cameroon) | 27,996 | 624,348 | 494,140
257 | CTU | Chol | 26,462 | 547,357 | 478,157
258 | CNH | Hakha Chin | 27,733 | 554,298 | 477,786
259 | BTX | Batak Karo | 27,295 | 436,554 | 470,088
260 | NBA | Nyemba | 27,317 | 553,352 | 469,851
261 | LAO | Lao | 22,217 | 697,242 | 462,713
262 | NYU | Nyungwe | 24,491 | 419,713 | 427,580
263 | ABK | Abkhazian | 23,161 | 292,717 | 423,050
264 | PUS | Pushto | 28,260 | 483,360 | 421,907
265 | CHR | Cherokee | 15,746 | 287,938 | 416,623
266 | COP | Coptic | 15,706 | 256,370 | 416,296
267 | DOP | Lukpa | 15,711 | 558,508 | 416,290
268 | SYR | Syriac | 15,747 | 217,478 | 415,892
269 | QUW | Tena Lowland Quichua | 15,674 | 292,905 | 415,461
270 | USP | Uspanteco | 15,583 | 500,341 | 412,979
271 | QUC | K’Iche' | 15,575 | 616,986 | 412,280
272 | ROM | Romany | 16,048 | 422,362 | 411,993
273 | AMU | Guerrero Amuzgo | 15,533 | 566,676 | 411,225
274 | JAK | Jakun | 15,513 | 564,458 | 411,137
275 | NHG | Tetelcingo Nahuatl | 15,459 | 408,868 | 409,271
276 | TZH | Tzeltal | 22,481 | 529,093 | 408,029
277 | SHI | Tachelhit | 15,288 | 637,003 | 404,390
278 | CNI | Asháninka | 15,264 | 331,249 | 404,004
279 | WOL | Wolof | 15,230 | 402,534 | 403,295
280 | OKE | Okpe (Southwestern Edo) | 22,471 | 458,573 | 401,423
281 | CJP | Cabécar | 15,155 | 608,344 | 400,614
282 | FSE | Finnish Sign Language | 21,671 | 298,224 | 400,110
283 | GBI | Galela | 15,023 | 624,914 | 398,145
284 | SSP | Spanish Sign Language | 21,242 | 387,212 | 395,815
285 | PCM | Nigerian Pidgin | 22,001 | 465,750 | 394,764
286 | PPK | Uma | 14,576 | 660,598 | 384,278
287 | BHW | Biak | 22,261 | 366,474 | 381,127
288 | PSO | Polish Sign Language | 20,433 | 312,201 | 379,785
289 | CMN | Mandarin Chinese | 44,110 | 487,138 | 374,935
290 | CHQ | Quiotepec Chinantec | 14,251 | 912,859 | 366,313
291 | DIK | Southwestern Dinka | 13,319 | 383,777 | 353,980
292 | OJB | Northwestern Ojibwa | 13,318 | 290,036 | 353,940
293 | CHA | Chamorro | 14,539 | 316,758 | 350,061
294 | QUG | Chimborazo Highland Quichua | 20,272 | 247,947 | 349,776
295 | CSL | Chinese Sign Language | 17,874 | 494,654 | 348,101
296 | JIV | Shuar | 12,910 | 272,452 | 342,885
297 | AGR | Aguaruna | 12,778 | 295,678 | 338,609
298 | ACU | Achuar-Shiwiar | 12,347 | 349,234 | 328,050
299 | AKE | Akawaio | 12,346 | 493,819 | 326,593
300 | CCE | Chopi | 17,935 | 346,295 | 303,153
301 | CHW | Chuwabu | 17,988 | 252,729 | 299,706
302 | GSG | German Sign Language | 16,400 | 268,254 | 298,220
303 | ARN | Mapudungun | 16,737 | 275,559 | 296,456
304 | BSN | Barasana-Eduria | 11,180 | 681,542 | 291,888
305 | TTJ | Tooro | 16,442 | 252,981 | 280,979
306 | SUN | Sundanese | 15,850 | 250,374 | 272,390
307 | KBH | Camsá | 10,287 | 384,331 | 272,175
308 | LAM | Lamba | 14,846 | 240,797 | 271,580
309 | DUA | Duala | 15,351 | 444,826 | 269,185
310 | HNE | Chhattisgarhi | 52,059 | 322,484 | 263,135
311 | XMF | Mingrelian | 12,946 | 169,159 | 262,169
312 | KMR | Northern Kurdish | 14,798 | 273,245 | 260,963
313 | DYU | Dyula | 14,886 | 322,319 | 258,596
314 | HSH | Hungarian Sign Language | 13,877 | 214,890 | 256,618
315 | AED | Argentine Sign Language | 12,390 | 248,120 | 251,486
316 | NAV | Navajo | 14,626 | 229,364 | 248,195
317 | TYV | Tuvinian | 12,979 | 197,520 | 245,669
318 | RMN | Balkan Romani | 14,527 | 257,259 | 241,449
319 | FCS | Quebec Sign Language | 13,034 | 253,315 | 239,624
320 | TSS | Taiwan Sign Language | 12,298 | 357,723 | 239,608
321 | BTS | Batak Simalungun | 14,255 | 233,806 | 238,303
322 | GLV | Manx | 11,006 | 261,914 | 232,828
323 | NIJ | Ngaju | 13,158 | 213,322 | 222,299
324 | CSE | Czech Sign Language | 11,655 | 179,085 | 210,357
325 | WLN | Walloon | 41,887 | 301,028 | 210,120
326 | BIN | Bini | 11,635 | 261,717 | 208,326
327 | SXN | Sangir | 11,668 | 228,088 | 195,463
328 | KVK | Korean Sign Language | 9,330 | 321,614 | 194,428
329 | RMS | Romanian Sign Language | 10,434 | 198,109 | 193,459
330 | KAC | Kachin | 10,930 | 270,098 | 184,701
331 | SVK | Slovakian Sign Language | 10,141 | 159,092 | 182,557
332 | AMI | Amis | 9,156 | 186,011 | 175,520
333 | UDM | Udmurt | 9,394 | 147,593 | 173,612
334 | MNI | Manipuri | 7,281 | 127,875 | 162,599
335 | TMH | Tamashek | 5,363 | 168,471 | 152,620
336 | HER | Herero | 8,179 | 151,589 | 141,558
337 | GSS | Greek Sign Language | 7,090 | 137,633 | 140,185
338 | ALZ | Alur | 7,567 | 154,509 | 133,119
339 | BZJ | Belize Kriol English | 6,905 | 136,165 | 119,827
340 | IKU | Inuktitut | 5,244 | 66,528 | 113,112
341 | POT | Potawatomi | 4,113 | 108,987 | 110,373
342 | MXV | Metlatónoc Mixtec | 5,924 | 176,367 | 105,471
343 | PDT | Plautdietsch | 6,019 | 115,997 | 104,913
344 | SME | Northern Sami | 18,433 | 96,152 | 100,433
345 | INA | Interlingua (International Auxiliary Language Association) | 12,194 | 101,705 | 99,401
346 | ISH | Esan | 5,221 | 112,815 | 91,400
347 | KEA | Kabuverdianu | 5,217 | 102,331 | 90,402
348 | TSZ | Purepecha | 4,939 | 79,259 | 89,766
349 | GLA | Scottish Gaelic | 8,444 | 108,564 | 87,714
350 | TLH | Klingon | 12,602 | 76,817 | 87,087
351 | JBO | Lojban | 11,470 | 88,384 | 83,024
352 | CSN | Colombian Sign Language | 3,428 | 71,580 | 70,847
353 | ALT | Southern Altai | 3,764 | 52,407 | 70,386
354 | PSR | Portuguese Sign Language | 3,655 | 70,060 | 69,174
355 | TOH | Gitonga | 3,898 | 78,079 | 65,345
356 | YUE | Yue Chinese | 5,681 | 77,996 | 62,414
357 | FSL | French Sign Language | 2,862 | 58,931 | 56,227
358 | AST | Asturian | 9,261 | 59,964 | 54,602
359 | FIL | Filipino | 2,125 | 51,507 | 47,398
360 | SRD | Sardinian | 5,836 | 50,985 | 43,786
361 | SCO | Scots | 853 | 40,927 | 41,327
362 | ECS | Ecuadorian Sign Language | 2,019 | 38,677 | 39,731
363 | FUR | Friulian | 5,791 | 42,727 | 37,760
364 | YID | Yiddish | 4,039 | 32,533 | 33,128
365 | MEN | Mende (Sierra Leone) | 1,668 | 36,531 | 28,675
366 | GOM | Goan Konkani | 722 | 24,148 | 26,834
367 | LIM | Limburgan | 4,491 | 25,040 | 24,410
368 | LFN | Lingua Franca Nova | 3,443 | 26,131 | 24,227
369 | COR | Cornish | 4,105 | 24,498 | 23,815
370 | MAI | Maithili | 4,256 | 23,102 | 20,698
371 | VSL | Venezuelan Sign Language | 958 | 17,798 | 18,472
372 | CBK | Chavacano | 2,423 | 17,255 | 16,912
373 | ILE | Interlingue | 2,570 | 16,439 | 16,599
374 | VOL | Volapük | 2,338 | 12,510 | 15,481
375 | DTP | Kadazan Dusun | 1,831 | 12,814 | 12,803
376 | MIN | Minangkabau | 330 | 11,492 | 11,998
377 | TET | Tetum | 406 | 11,359 | 10,648
378 | PAM | Pampanga | 1,448 | 8,955 | 10,276
379 | ZSM | Standard Malay | 1,158 | 9,170 | 10,106
380 | PRL | Peruvian Sign Language | 529 | 8,562 | 9,567
381 | ZIB | Zimbabwe Sign Language | 356 | 9,497 | 9,249
382 | CRH | Crimean Tatar | 1,361 | 8,351 | 8,996
383 | KHA | Khasi | 1,277 | 9,156 | 8,623
384 | ASE | American Sign Language | 538 | 8,038 | 8,382
385 | BFI | British Sign Language | 384 | 7,005 | 8,302
386 | ARQ | Algerian Arabic | 919 | 5,587 | 7,531
387 | BOD | Tibetan | 1,029 | 17,081 | 6,956
388 | ZPA | Lachiguiri Zapotec | 359 | 6,326 | 6,549
389 | LZH | Literary Chinese | 531 | 5,024 | 6,379
390 | GOS | Gronings | 986 | 4,727 | 4,963
391 | GRC | Ancient Greek (To 1453) | 568 | 3,778 | 4,845
392 | NST | Tase Naga | 769 | 5,773 | 4,794
393 | CSG | Chilean Sign Language | 329 | 4,320 | 4,734
394 | GOR | Gorontalo | 111 | 4,034 | 4,651
395 | MZY | Mozambican Sign Language | 244 | 4,364 | 4,447
396 | CKB | Central Kurdish | 1,040 | 4,332 | 4,400
397 | ANG | Old English (Ca. 450-1100) | 993 | 4,273 | 4,088
398 | CSB | Kashubian | 892 | 4,280 | 4,070
399 | OTA | Ottoman Turkish (1500-1928) | 622 | 3,564 | 4,032
400 | KAS | Kashmiri | 701 | 3,855 | 3,743
401 | SAT | Santali | 101 | 3,263 | 3,604
402 | HOC | Ho | 631 | 3,038 | 3,520
403 | ZZA | Zaza | 505 | 2,895 | 3,194
404 | COS | Corsican | 75 | 2,984 | 2,880
405 | DZO | Dzongkha | 449 | 8,250 | 2,871
406 | INL | Indonesian Sign Language | 206 | 2,459 | 2,814
407 | DIQ | Dimli (Individual) | 74 | 1,924 | 2,669
408 | GRN | Guarani | 229 | 1,883 | 2,577
409 | SWH | Swahili (Individual) | 369 | 1,826 | 2,522
410 | WAE | Walser | 512 | 2,416 | 2,483
411 | LAD | Ladino | 371 | 2,196 | 2,412
412 | ACE | Achinese | 446 | 2,783 | 2,409
413 | ASF | Auslan | 149 | 2,413 | 2,322
414 | AKA | Akan | 61 | 2,214 | 2,225
415 | JAM | Jamaican Creole English | 61 | 2,060 | 2,166
416 | ORV | Old Russian | 313 | 1,749 | 2,144
417 | PMS | Piemontese | 263 | 2,626 | 2,133
418 | GSW | Swiss German | 220 | 1,914 | 2,052
419 | XAL | Kalmyk | 268 | 1,624 | 2,041
420 | CSF | Cuba Sign Language | 117 | 1,816 | 1,946
421 | ZSL | Zambian Sign Language | 96 | 2,299 | 1,915
422 | INS | Indian Sign Language | 208 | 1,754 | 1,905
423 | NAN | Min Nan Chinese | 88 | 2,352 | 1,894
424 | MAX | North Moluccan Malay | 268 | 1,981 | 1,857
425 | PRG | Prussian | 213 | 1,526 | 1,717
426 | GOT | Gothic | 207 | 4,340 | 1,698
427 | BXR | Russia Buriat | 44 | 1,221 | 1,658
428 | TCY | Tulu | 47 | 1,265 | 1,541
429 | SAH | Yakut | 65 | 1,042 | 1,537
430 | FRP | Arpitan | 473 | 1,820 | 1,485
431 | KAU | Kanuri | 285 | 2,936 | 1,453
432 | BVL | Bolivian Sign Language | 98 | 1,188 | 1,391
433 | NOV | Novial | 187 | 1,268 | 1,296
434 | HRX | Hunsrik | 214 | 1,300 | 1,247
435 | AWA | Awadhi | 248 | 1,249 | 1,246
436 | AVK | Kotava | 157 | 933 | 1,221
437 | PIH | Pitcairn-Norfolk | 39 | 1,425 | 1,195
438 | PYS | Paraguayan Sign Language | 90 | 971 | 1,150
439 | NEW | Newari | 38 | 993 | 1,101
440 | HIF | Fiji Hindi | 45 | 777 | 1,044
441 | MZN | Mazanderani | 47 | 770 | 1,002
442 | BHO | Bhojpuri | 55 | 858 | 908
443 | SAN | Sanskrit | 150 | 725 | 887
444 | HAW | Hawaiian | 94 | 824 | 874
445 | DTY | Dotyali | 24 | 874 | 804
446 | PDC | Pennsylvania German | 65 | 706 | 734
447 | RUE | Rusyn | 113 | 479 | 683
448 | KRL | Karelian | 135 | 655 | 682
449 | DSB | Lower Sorbian | 37 | 471 | 677
450 | SHN | Shan | 150 | 1,926 | 652
451 | EXT | Extremaduran | 65 | 588 | 628
452 | CHO | Choctaw | 107 | 595 | 624
453 | FKV | Kven Finnish | 56 | 528 | 622
454 | QYA | Quenya | 100 | 423 | 581
455 | GLK | Gilaki | 12 | 268 | 521
456 | TPW | Tupí | 87 | 520 | 518
457 | TZL | Talossan | 108 | 480 | 483
458 | MHR | Eastern Mari | 69 | 379 | 482
459 | RMY | Vlax Romani | 10 | 522 | 467
460 | NOG | Nogai | 80 | 326 | 464
461 | NPI | Nepali (Individual) | 98 | 380 | 457
462 | EGL | Emilian | 81 | 499 | 456
463 | GCF | Guadeloupean Creole French | 78 | 464 | 446
464 | LDN | Láadan | 76 | 464 | 430
465 | SFS | South African Sign Language | 30 | 380 | 428
466 | MWW | Hmong Daw | 74 | 497 | 408
467 | LIJ | Ligurian | 52 | 419 | 385
468 | AFB | Gulf Arabic | 69 | 303 | 370
469 | KSH | Kölsch | 23 | 385 | 358
470 | SGS | Samogitian | 43 | 199 | 356
471 | MGM | Mambae | 33 | 282 | 307
472 | PNT | Pontic | 5 | 300 | 301
473 | MYV | Erzya | 33 | 239 | 300
474 | NAP | Neapolitan | 29 | 299 | 249
475 | IKE | Eastern Canadian Inuktitut | 43 | 138 | 245
476 | HSB | Upper Sorbian | 36 | 194 | 235
477 | LLD | Ladin | 20 | 225 | 227
478 | FRM | Middle French (Ca. 1400-1600) | 17 | 207 | 220
479 | ARY | Moroccan Arabic | 41 | 148 | 216
480 | ROH | Romansh | 16 | 205 | 216
481 | SMA | Southern Sami | 44 | 178 | 216
482 | PPL | Pipil | 29 | 163 | 208
483 | SHS | Shuswap | 39 | 243 | 207
484 | TLY | Talysh | 43 | 148 | 207
485 | PNB | Western Panjabi | 31 | 204 | 194
486 | PMY | Papuan Malay | 41 | 189 | 186
487 | SJN | Sindarin | 31 | 157 | 186
488 | SUX | Sumerian | 36 | 203 | 186
489 | SZL | Silesian | 34 | 153 | 177
490 | LIV | Liv | 29 | 150 | 176
491 | RIF | Tarifit | 34 | 145 | 170
492 | BVY | Baybayanon | 23 | 157 | 163
493 | MIQ | Mískito | 66 | 165 | 158
494 | FUV | Nigerian Fulfulde | 29 | 133 | 157
495 | AIN | Ainu (Japan) | 26 | 117 | 154
496 | NLV | Orizaba Nahuatl | 14 | 137 | 154
497 | GBM | Garhwali | 34 | 162 | 144
498 | HDS | Honduras Sign Language | 12 | 94 | 139
499 | NON | Old Norse | 13 | 134 | 136
500 | ALN | Gheg Albanian | 25 | 133 | 134
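
Per-language counts like those in Table 1 could be approximated from the released files along the following lines. This is a rough sketch, not the authors' procedure: it assumes that tokens in the .tok files are separated by whitespace and that the files are UTF-8, and the function name is hypothetical. The figures it produces may differ from the published ones if the original statistics were computed differently.

```python
import gzip
from collections import Counter


def language_stats(lang_path, src_path, eng_path, encoding="utf-8"):
    """Return {lang_id: (sentences, source_tokens, english_tokens)}.

    Assumptions: the three files are line-aligned and tokens are
    whitespace-separated. zip() stops at the shortest file, so check the
    line counts beforehand if alignment is in doubt.
    """
    sentences, src_tokens, eng_tokens = Counter(), Counter(), Counter()
    with gzip.open(lang_path, "rt", encoding=encoding) as langs, \
            gzip.open(src_path, "rt", encoding=encoding) as srcs, \
            gzip.open(eng_path, "rt", encoding=encoding) as engs:
        for lang, src, eng in zip(langs, srcs, engs):
            lang = lang.strip()
            sentences[lang] += 1
            src_tokens[lang] += len(src.split())
            eng_tokens[lang] += len(eng.split())
    return {
        lang: (sentences[lang], src_tokens[lang], eng_tokens[lang])
        for lang in sentences
    }
```

Calling `language_stats("train.v1.lang.gz", "train.v1.src.tok.gz", "train.v1.eng.tok.gz")` streams the files once and keeps only the counters in memory, which matters at this corpus size.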

Acknowledgements

All the data consolidated in this work were retrieved from various sources; we do not own the datasets. If you use this dataset, please cite all the articles in the citations.bib file. We make this derived dataset easily accessible with the intention of accelerating research on language technologies for low-resource languages. However, if you view this derived dataset as a violation of intellectual property rights, please let us know and we will be happy to remove the material from the corpus.