Beyond Catalogue Counts: the Dataset Visibility Asymmetry in Low-Resource Multilingual NLP

LREC 2026 ∙ Long Paper ∙ Oral

Author(s): Zhiyin Tan, Changxu Duan

What do we mean by “low-resource”?

We compare visible datasets across languages in institutional catalogues and in research-paper circulation, showing where low-resource status reflects scarcity, weak visibility, or fragile access.

Browse the maintained inventory by language, task, modality, and access state.

Missing a dataset? Open an issue to add yours.

Evidence at a glance

Measuring scarcity: Resource Density Index

The Resource Density Index (RDI) normalizes dataset counts by speaker population.

RDI = (dataset records from one source) / (speaker population / 1,000,000)
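As a minimal sketch of this definition, the function below divides a record count from a single source by the speaker population expressed in millions. The function name, argument names, and the example numbers are illustrative assumptions, not taken from the paper or its inventory.

```python
def resource_density_index(dataset_records: int, speaker_population: int) -> float:
    """Dataset records from one source per one million speakers."""
    if speaker_population <= 0:
        raise ValueError("speaker_population must be positive")
    if dataset_records < 0:
        raise ValueError("dataset_records must not be negative")
    return dataset_records / (speaker_population / 1_000_000)


# Hypothetical language: 12 catalogue records, 80 million speakers.
rdi = resource_density_index(dataset_records=12, speaker_population=80_000_000)
print(f"RDI = {rdi:.2f}")  # RDI = 0.15
```

Under this reading, an RDI below 0.1 corresponds to fewer than one dataset record per 10 million speakers.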

Languages: top 200 by speakers (Ethnologue 200, 2025)

Catalogues: records in LRE Map and LDC

118 languages have no records in either catalogue baseline

+23 more have <1 dataset record per 10 million speakers

RDI = 0: 118

0-0.1: 23

0.1-0.2: 14

0.2-0.5: 18

0.5-1.0: 6

> 1.0: 21
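The ranges above can be reproduced from per-language RDI values with a small binning helper. This is a sketch under assumptions: the page does not say how boundary values are assigned, so the code treats each lower bound as inclusive and places exactly 1.0 in the 0.5-1.0 range because the top range is labelled "> 1.0". The language keys and values in the example are placeholders.

```python
from collections import Counter

# Labels follow the ranges shown on the page.
BIN_LABELS = ["RDI = 0", "0-0.1", "0.1-0.2", "0.2-0.5", "0.5-1.0", "> 1.0"]


def rdi_bin(rdi: float) -> str:
    """Map one RDI value to a range label (boundary handling is an assumption)."""
    if rdi < 0:
        raise ValueError("RDI must not be negative")
    if rdi == 0:
        return "RDI = 0"
    if rdi < 0.1:
        return "0-0.1"
    if rdi < 0.2:
        return "0.1-0.2"
    if rdi < 0.5:
        return "0.2-0.5"
    if rdi <= 1.0:
        return "0.5-1.0"
    return "> 1.0"


def bin_counts(rdi_by_language: dict[str, float]) -> dict[str, int]:
    """Count languages per range, keeping empty ranges in the result."""
    counts = Counter(rdi_bin(value) for value in rdi_by_language.values())
    return {label: counts.get(label, 0) for label in BIN_LABELS}


# Placeholder values for illustration only.
example = {"lang_a": 0.0, "lang_b": 0.04, "lang_c": 0.15, "lang_d": 2.3}
print(bin_counts(example))
```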

Is this the whole picture?

Catalogue records are one view of dataset visibility. We add a research-circulation view inspired by M3D's citation-context approach.

Collect citation contexts from related papers

Filter and get dataset citations

Record dataset metadata

Manually validate records

Comparing approaches: literature evidence changes the picture

We compare catalogue baselines with paper-traced datasets. For some languages, papers reveal many more records than LRE Map and LDC register.

35 of 118 zero-catalogue languages have paper-traced datasets

50 of 141 zero/near-zero catalogue languages have more paper-traced records

Indonesian: 196

Marathi: 41

Assamese: 31

Nepali: 30

Setswana: 26

Sindhi: 17
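The comparison between the two views can be sketched as a simple lookup: find languages with no catalogue records, then check which of them have at least one paper-traced record. The function name, the dictionary layout, and the placeholder data are assumptions for illustration; this is not the authors' pipeline.

```python
def compare_views(
    catalogue_counts: dict[str, int],
    paper_counts: dict[str, int],
) -> dict[str, object]:
    """Report zero-catalogue languages and those that appear in paper traces."""
    zero_catalogue = {lang for lang, n in catalogue_counts.items() if n == 0}
    # A language missing from paper_counts is treated as having no paper traces.
    with_paper_traces = sorted(
        lang for lang in zero_catalogue if paper_counts.get(lang, 0) > 0
    )
    return {
        "zero_catalogue": len(zero_catalogue),
        "with_paper_traces": with_paper_traces,
    }


# Placeholder data for illustration only.
catalogue = {"lang_a": 0, "lang_b": 0, "lang_c": 7}
papers = {"lang_a": 12, "lang_c": 3}
print(compare_views(catalogue, papers))
# {'zero_catalogue': 2, 'with_paper_traces': ['lang_a']}
```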

Access persistence: documented is not always reusable

Finding a dataset in papers does not mean it is still publicly accessible. Access may decay through dead links, moved pages, or restricted distribution.

42% (253/609) of paper-traced datasets were identified in papers but not publicly accessible in the paper snapshot

What we find

The audit separates questions that are often collapsed when a language is described as low-resource.

Scarcity Needs a Denominator

Raw dataset counts are hard to compare across languages with very different speaker populations and documentation infrastructures. RDI asks how densely a language is catalogued relative to its speaker population, making catalogue visibility less dominated by scale alone.

Visibility Depends on Where We Look

Catalogues capture resources that have been registered, curated, or institutionally distributed. Paper traces capture datasets that entered scholarly circulation. Comparing both views shows when apparent scarcity is also a visibility gap.

Documentation Is Not Durable Access

A dataset can remain visible in papers after its public access has weakened through dead links, moved pages, restricted distribution, or lack of maintenance. Dataset audits should therefore separate evidence of existence from evidence of present-day reuse.

Low-Resource Is a Diagnosis, Not a Count

Creation, visibility, circulation, and access can fail in different ways. Treating them as separate evidence layers makes it clearer whether a language needs new datasets, better indexing, or better preservation.

Cite this work

If our findings or inventory inform your research, please cite the LREC 2026 paper.

BibTeX

@inproceedings{tan-etal-2026-beyond,
  series = {LREC},
  title = {Beyond Catalogue Counts: The Dataset Visibility Asymmetry in Low-Resource Multilingual NLP},
  ISSN = {2522-2686},
  url = {http://dx.doi.org/10.63317/3bep4yiomtp2},
  DOI = {10.63317/3bep4yiomtp2},
  booktitle = {Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)},
  publisher = {European Language Resources Association (ELRA)},
  author = {Tan, Zhiyin and Duan, Changxu},
  year = {2026},
  month = May,
  pages = {6068--6079},
  collection = {LREC}
}