EMNLP 2019 Tutorial : Processing and Understanding Mixed Language Data

Presenters

  1. Monojit Choudhury : Monojit is a Principal Researcher at Microsoft Research, Bangalore and works on problems in NLP. One of his current projects is Project Melange, which aims to build systems and tools for understanding code-mixing.
  2. Anirudh Srinivasan : Anirudh is a Research Fellow at Microsoft Research, Bangalore. His interests are in Deep Learning and NLP, and he works on Project Melange.
  3. Sandipan Dandapat : Sandipan is a Senior Applied Researcher at Microsoft India Development Center, Hyderabad. Sandipan works on Machine Translation and Trustworthy Bing

Note: Kalika Bali was originally also a presenter for this session, but she was unable to attend.

Abstract

Multilingual communities exhibit code-mixing, that is, the mixing of two or more socially stable languages in a single conversation, sometimes even in a single utterance. This phenomenon has been widely studied by linguists and interaction scientists in the spoken language of such communities. However, with the prevalence of social media and other informal interactive platforms, code-switching is now also ubiquitously observed in user-generated text. Because multilingual communities are, from a global perspective, more the norm than the exception, it is essential that code-switched text and speech are adequately handled by language technologies and natural user interfaces (NUIs).

Code-mixing is extremely prevalent in all multilingual societies. Current studies have shown that as much as 20% of user-generated content from some geographies, such as South Asia, parts of Europe, and Singapore, is code-mixed. Thus, it is very important for NLP systems and applications serving these geographies to handle code-mixed content.

In the past 5 years, there has been active interest in computational models for code-mixing, with substantial research output in terms of publications, datasets and systems. However, it is not easy to find a single point of access for a complete and coherent overview of the research. This tutorial aims to fill this gap and provide new researchers in the area with a foundation in both the linguistic and computational aspects of code-mixing. We hope that it becomes a starting point for those who wish to pursue research, design, development and deployment of code-mixed systems in multilingual societies.

Material from the Session

[EMNLP Site] [ACL Anthology]

Citation

If you would like to refer to our tutorial, please use the following citation:

@inproceedings{choudhury-etal-2019-processing,
    title = "Processing and Understanding Mixed Language Data",
    author = "Choudhury, Monojit  and
    Srinivasan, Anirudh  and
    Dandapat, Sandipan",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language
        Processing and the 9th International Joint Conference on Natural Language Processing
        (EMNLP-IJCNLP): Tutorial Abstracts",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics"
}