A P\={a}ninian Foundation for Indic Language Processing
Mirrored from arXiv — Machine Learning for archival readability. Support the source by reading on the original site.
Computer Science > Computation and Language
Title:A Pāninian Foundation for Indic Language Processing
Abstract:More than a billion people communicate in Indic languages, yet the natural language processing infrastructure serving them remains fragmented and underdeveloped. The cause is structural: the field organizes its tools and benchmarks around individual languages or small subsets of genealogical language families, building separate analyzers, parsers, and datasets for each language and starting over for the next. This overlooks a deep regularity. Through more than two millennia of convergence around Sanskrit, Indic languages came to share a morphosyntactic architecture formalized in Pānini's grammar, the Astādhyāyī. This cuts across genealogical lines, uniting languages through a common framework. We argue that this Pāninian framework supplies a unifying computational architecture the field has lacked, and that benchmarks grounded explicitly in it would make Indic language systems more accurate, more data-efficient, and more transferable, effectively merging many apparently disparate and sparse Indic language resources into a single high-resource metalanguage bedrock. We propose a four-part benchmark suite to render this shared architecture explicit, measurable, and ready to be leveraged for practical applications. Moreover, we underscore the question it raises for interpretability research: whether neural models trained on these languages come to represent Pānini's categories on their own.
| Comments: | 16 pages, 0 figures |
| Subjects: | Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) |
| ACM classes: | I.2.7; F.4.2 |
| Cite as: | arXiv:2606.24172 [cs.CL] |
| (or arXiv:2606.24172v1 [cs.CL] for this version) | |
| https://doi.org/10.48550/arXiv.2606.24172
arXiv-issued DOI via DataCite (pending registration)
|
Access Paper:
- View PDF
- HTML (experimental)
- TeX Source
Current browse context:
References & Citations
Bibliographic and Citation Tools
Code, Data and Media Associated with this Article
Demos
Recommenders and Search Tools
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs.
More from arXiv — Machine Learning
-
Representation as a Bottleneck for Mechanistic Interpretability: The Manifestation Unit Protocol
Jul 2
-
SNAP-FM: Sparse Nonlinear Accelerated Projection for Physics-Constrained Generative Modeling
Jul 2
-
SemiScope: Disentangling Classifier Tuning and Joint Optimization in Semi-Supervised Security Classification
Jul 2
-
A Filtered Mixture-of-Generators for Fully Synthetic Survival Training
Jul 2
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.