LREC 2026 ∙ Long Paper ∙ Oral
Beyond Catalogue Counts: The Dataset Visibility Asymmetry in Low-Resource Multilingual NLP
Author(s): Zhiyin Tan, Changxu Duan
We compare visible datasets across languages in institutional catalogues and in research-paper circulation, showing where low-resource status reflects scarcity, weak visibility, or fragile access.
Browse the maintained inventory by language, task, modality, and access state.
Missing a dataset? Open an issue to add yours.
Evidence at a glance
Measuring scarcity: Resource Density Index
The Resource Density Index (RDI) normalizes dataset counts by speaker population.
Is this the whole picture?
Catalogue records are one view of dataset visibility. We add a research-circulation view inspired by M3D's citation-context approach.
Comparing approaches: literature evidence changes the picture
We compare catalogue baselines with paper-traced datasets. For some languages, papers reveal many more datasets than the LRE Map and LDC catalogues register.
Access persistence: documented is not always reusable
Finding a dataset in papers does not mean it is still publicly accessible. Access may decay through dead links, moved pages, or restricted distribution.
What we find
The audit separates questions that are often collapsed when a language is described as low-resource.
Scarcity Needs a Denominator
Raw dataset counts are hard to compare across languages with very different speaker populations and documentation infrastructures. RDI asks how densely a language is catalogued relative to its speaker population, making catalogue visibility less dominated by scale alone.
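The normalization can be illustrated with a minimal sketch. It assumes RDI takes the form "catalogued datasets per million speakers"; the exact formula used in the paper is not given on this page, so the function name, the scaling factor, and the example counts are all hypothetical.

```python
def resource_density_index(dataset_count: int, speakers: int, per: int = 1_000_000) -> float:
    """Datasets per `per` speakers.

    Assumed form of RDI for illustration only; not the paper's exact definition.
    """
    if speakers <= 0:
        raise ValueError("speaker population must be positive")
    return dataset_count * per / speakers


# Hypothetical counts, chosen only to show the effect of the denominator.
languages = {
    "lang_a": {"datasets": 500, "speakers": 500_000_000},
    "lang_b": {"datasets": 20, "speakers": 2_000_000},
}

rdi = {
    name: resource_density_index(v["datasets"], v["speakers"])
    for name, v in languages.items()
}
```

Under these made-up numbers, lang_a has 25 times as many catalogued datasets as lang_b, yet lang_b is catalogued ten times more densely relative to its speaker population.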
Visibility Depends on Where We Look
Catalogues capture resources that have been registered, curated, or institutionally distributed. Paper traces capture datasets that entered scholarly circulation. Comparing both views shows when apparent scarcity is also a visibility gap.
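One way to make the two-view comparison concrete is to count, per language, how many datasets appear in each view and how many appear only in papers. The sketch below is an assumption-laden illustration: the record format (language code, dataset name) and the matching rule (case-insensitive name equality) are hypothetical, and real dataset matching is likely to need more careful name resolution.

```python
from collections import defaultdict


def visibility_gap(catalogue_records, paper_records):
    """Summarize, per language, datasets seen in catalogues, in papers, and in papers only.

    Each record is a (language_code, dataset_name) pair (hypothetical format).
    Names are matched case-insensitively, which is a simplifying assumption.
    """
    catalogue = defaultdict(set)
    papers = defaultdict(set)
    for lang, name in catalogue_records:
        catalogue[lang].add(name.casefold())
    for lang, name in paper_records:
        papers[lang].add(name.casefold())

    summary = {}
    for lang in sorted(set(catalogue) | set(papers)):
        cat = catalogue.get(lang, set())
        pap = papers.get(lang, set())
        summary[lang] = {
            "catalogue": len(cat),
            "papers": len(pap),
            "paper_only": len(pap - cat),  # visible in circulation, absent from catalogues
        }
    return summary


# Hypothetical records for two placeholder language codes.
catalogue_records = [("xx", "Corpus A")]
paper_records = [("xx", "corpus a"), ("xx", "Corpus B"), ("yy", "Corpus C")]
gap = visibility_gap(catalogue_records, paper_records)
```

A language with a large "paper_only" count and a small "catalogue" count is a candidate for a visibility gap rather than pure scarcity.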
Documentation Is Not Durable Access
A dataset can remain visible in papers after its public access has weakened through dead links, moved pages, restricted distribution, or lack of maintenance. Dataset audits should therefore separate evidence of existence from evidence of present-day reuse.
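The distinction between existence and present-day access can be recorded as an explicit access state per dataset. The following sketch maps an already-observed HTTP status code to a coarse state; the state labels, the `redirected` flag, and the mapping itself are illustrative assumptions, not the audit's actual procedure, and the code performs no network requests.

```python
from enum import Enum
from typing import Optional


class AccessState(Enum):
    # Hypothetical labels mirroring the failure modes named in the text.
    ACCESSIBLE = "accessible"
    MOVED = "moved"
    RESTRICTED = "restricted"
    DEAD = "dead"
    UNKNOWN = "unknown"


def classify_access(status: Optional[int], redirected: bool = False) -> AccessState:
    """Map an observed HTTP status to a coarse access state.

    `status` is None when no check was made; `redirected` marks a landing
    page reached at a different URL than the one cited in the paper.
    """
    if status is None:
        return AccessState.UNKNOWN
    if status in (401, 403):
        return AccessState.RESTRICTED
    if status in (404, 410):
        return AccessState.DEAD
    if 200 <= status < 300:
        return AccessState.MOVED if redirected else AccessState.ACCESSIBLE
    # Server errors and other codes may be transient, so they stay undecided.
    return AccessState.UNKNOWN
```

Keeping server errors as "unknown" rather than "dead" is a deliberate choice in this sketch, since a single failed check is weak evidence that a dataset is gone.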
Low-Resource Is a Diagnosis, Not a Count
Creation, visibility, circulation, and access can fail in different ways. Treating them as separate evidence layers makes it clearer whether a language needs new datasets, better indexing, or better preservation.
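Read as a decision procedure, the layered view might look like the sketch below. Everything in it is hypothetical: the field names, the thresholds, and the rules are placeholders meant to show how separate evidence layers lead to different recommendations, not criteria taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LanguageEvidence:
    catalogue_count: int   # datasets registered in catalogues
    paper_count: int       # datasets traced through papers
    accessible_count: int  # paper-traced datasets that are still reachable


def diagnose(
    e: LanguageEvidence,
    min_datasets: int = 10,             # arbitrary illustrative threshold
    min_accessible_share: float = 0.5,  # arbitrary illustrative threshold
) -> List[str]:
    """Return the needs suggested by each evidence layer (illustrative rules)."""
    needs = []
    if max(e.catalogue_count, e.paper_count) < min_datasets:
        needs.append("new datasets")         # scarce in every view
    if e.paper_count > e.catalogue_count:
        needs.append("better indexing")      # circulating but not catalogued
    if e.paper_count > 0 and e.accessible_count / e.paper_count < min_accessible_share:
        needs.append("better preservation")  # documented but hard to reuse
    return needs


# Hypothetical profile: few catalogue entries, many paper traces, weak access.
profile = diagnose(LanguageEvidence(catalogue_count=2, paper_count=40, accessible_count=10))
```

For this made-up profile the rules suggest better indexing and better preservation rather than new datasets, which a raw catalogue count of 2 would not have revealed.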
Cite this work
If our findings or inventory inform your research, please cite the LREC 2026 paper.
BibTeX
@inproceedings{tan-etal-2026-beyond,
series = {LREC},
title = {Beyond Catalogue Counts: The Dataset Visibility Asymmetry in Low-Resource Multilingual NLP},
ISSN = {2522-2686},
url = {http://dx.doi.org/10.63317/3bep4yiomtp2},
DOI = {10.63317/3bep4yiomtp2},
booktitle = {Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)},
publisher = {European Language Resources Association (ELRA)},
author = {Tan, Zhiyin and Duan, Changxu},
year = {2026},
month = May,
pages = {6068--6079},
collection = {LREC}
}