EMNLP 2019 Tutorial : Processing and Understanding Mixed Language Data
Presenters
- Monojit Choudhury : Monojit is a Principal Researcher at Microsoft Research, Bangalore, and works on problems in NLP. One of his current projects is Project Melange, which aims to build systems and tools for understanding code-mixing.
- Anirudh Srinivasan : Anirudh is a Research Fellow at Microsoft Research, Bangalore. His interests are in Deep Learning and NLP, and he works on Project Melange.
- Sandipan Dandapat : Sandipan is a Senior Applied Researcher at Microsoft India Development Center, Hyderabad. He works on Machine Translation and Trustworthy Bing.
Note: Kalika Bali was originally also a presenter for this session, but she was unable to attend.
Abstract
Multilingual communities exhibit code-mixing, that is, mixing of two or more socially stable languages in a single conversation, sometimes even in a single utterance. This phenomenon has been widely studied by linguists and interaction scientists in the spoken language of such communities. However, with the prevalence of social media and other informal interactive platforms, code-switching is now also ubiquitously observed in user-generated text. As multilingual communities are more the norm from a global perspective, it becomes essential that code-switched text and speech are adequately handled by language technologies and NUIs.
Code-mixing is extremely prevalent in all multilingual societies. Current studies have shown that as much as 20% of user-generated content from some geographies, like South Asia, parts of Europe, and Singapore, is code-mixed. Thus, it is very important to handle code-mixed content as a part of NLP systems and applications for these geographies.
In the past 5 years, there has been active interest in computational models for code-mixing, with substantial research output in the form of publications, datasets and systems. However, it is not easy to find a single point of access for a complete and coherent overview of the research. This tutorial aims to fill this gap and provide new researchers in the area with a foundation in both the linguistic and computational aspects of code-mixing. We hope that it becomes a starting point for those who wish to pursue research, design, development and deployment of code-mixed systems in multilingual societies.
Material from the Session
- Slides from the session
- BibTeX file containing a list of papers from the ACL Anthology that are about code-mixing
- GitHub repo with a list of papers on code-mixing, organized by research area, by Genta Indra Winata
Other Links
Citation
If you would like to refer to our tutorial, please use the following citation:
@inproceedings{choudhury-etal-2019-processing,
title = "Processing and Understanding Mixed Language Data",
author = "Choudhury, Monojit and
Srinivasan, Anirudh and
Dandapat, Sandipan",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language
Processing and the 9th International Joint Conference on Natural Language Processing
(EMNLP-IJCNLP): Tutorial Abstracts",
month = nov,
year = "2019",
address = "Hong Kong, China",
publisher = "Association for Computational Linguistics"
}