SAIL Code Mixed

Sentiment Analysis for Indian Languages (Code Mixed)

NLP Tool Contest @ICON-2017, Jadavpur University

India is a linguistic area with one of the longest histories of contact, influence, use, teaching and learning of English-in-diaspora in the world (Kachru and Nelson, 2006). Thus, a large number of Indians active on the internet can communicate in English to some degree. India also has great linguistic diversity. Apart from Hindi, it has several regional languages that are the primary language of people native to each region. As a result, posts on social media such as Facebook, WhatsApp, and Twitter often contain more than one language; these phenomena are called code-mixing and code-switching. At the same time, the sentiments expressed in such social media texts have created many new opportunities for information access and language technology, along with many new challenges, making this one of the prime present-day research areas. Sentiment analysis of code-mixed data has several real-life applications in opinion mining, from social media campaigns to feedback analysis.

Linguistic processing of such social media datasets, and sentiment analysis of them, is a difficult task. To date, most experiments have focused on language identification (Bali et al., 2014; Das and Gamback, 2014), part-of-speech tagging (Ghosh et al., 2016), etc. A few efforts have also begun on sentiment analysis of code-mixed data such as Hindi-English (Joshi et al., 2016). Therefore, we believe that this contest is the best place to bring more research attention to developing language technologies for identifying sentiment in Indian social media texts.

The main goal of this task is to identify the sentence-level sentiment polarity of code-mixed data in Indian language pairs (Hi-En, Ben-Hi-En) collected from Twitter, Facebook, and WhatsApp. Each sentence is annotated with language information as well as sentence-level polarity. The participants will be provided with development, training, and test datasets.

Each participating team will be allowed to submit two systems for each language pair, and the best result will be considered final. The final evaluation will be based on the macro-averaged F1-measure. The Python code for the evaluation will be provided by the organizers. Initially, each participating team will have access to the development and training data. Later, the unlabeled test data will be provided, and the teams must submit their results within 24 hours. There will be no distinction between constrained and unconstrained systems, but the participants will be asked to report which additional resources they used for each of their submitted runs.
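For orientation, the macro-averaged F1-measure can be sketched as follows. This is only an illustration, not the organizers' evaluation script: the label names ("positive", "negative", "neutral") are assumed for the example, and averaging over the labels that occur in the gold data is a design choice made here.

```python
from collections import Counter


def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over the labels found in `gold`."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed

    scores = []
    for label in sorted(set(gold)):
        pred_count = tp[label] + fp[label]
        gold_count = tp[label] + fn[label]
        precision = tp[label] / pred_count if pred_count else 0.0
        recall = tp[label] / gold_count if gold_count else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)

    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical labels, used only to exercise the function.
gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "negative", "positive", "positive"]
score = macro_f1(gold, pred)
```

Because every class contributes equally to the mean, a system that ignores a rare class is penalized more than it would be under accuracy or micro-averaging.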

The contest will have three prizes:
First Prize: Rs. 10,000/-
Second Prize: Rs. 7,500/-
Third Prize: Rs. 5,000/-

Important Dates

  • Registration Ends: Aug 7, 2017
  • Training Data Released: Aug 10, 2017
  • Test Data Ready: Oct 2, 2017
  • Run Submission: Within 24 hours of receiving the test data
  • Result Announcement: Oct 15, 2017
  • Working Notes Submissions Due: Oct 31, 2017
  • Working Notes Reviews: Nov 10, 2017
  • Working Notes Final Version Due: Nov 15, 2017


Please fill out this form to express your interest in taking part in this contest. Due to the privacy policies of Facebook and WhatsApp, we will not be able to release the data publicly. You need to fill out the copyright form and send it to dipankar{DOT}dipnil2005{AT}gmail{DOT}com. Please mention "Copyright Form for NLP Tool Contest" in the subject line. The request must state your preferred data pair (Hindi-English, Bengali-English, or both).

Please join the SAIL Google group (sail_icon2017@googlegroups.com) for discussion.


Test data will be available upon request from October 02, 2017 to October 09, 2017. Participants must submit their predicted output within 24 hours of receiving the test dataset. The test data will be provided in JSON format, with all sentiment tags set to NA. Participants must change the sentiment values only. The validation and evaluation scripts are available at the following link. A maximum of two submissions per dataset per team will be accepted.
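As a rough illustration of "change the sentiment values only", the sketch below fills in NA tags and leaves every other field untouched. The JSON schema is not described here, so the record layout, the key names "text" and "sentiment", and the label "positive" are all hypothetical; the released data and validation script define the real format.

```python
import copy
import json


def fill_sentiments(records, classify, tag_key="sentiment", text_key="text"):
    """Return a copy of `records` in which each "NA" tag is replaced.

    `classify` is any callable mapping a sentence to a label.
    The key names are assumptions made for this example.
    """
    filled = copy.deepcopy(records)  # never modify the original test data
    for record in filled:
        if record.get(tag_key) == "NA":
            record[tag_key] = classify(record[text_key])
    return filled


# Hypothetical test record in an assumed layout.
raw = '[{"id": 1, "text": "movie bahut accha tha", "sentiment": "NA"}]'
records = json.loads(raw)

# Placeholder classifier; a real system would go here.
predictions = fill_sentiments(records, lambda sentence: "positive")
output_text = json.dumps(predictions, ensure_ascii=False, indent=2)
```

Working on a deep copy makes it easy to compare the output against the input and confirm that nothing except the sentiment field changed.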

We strongly encourage you to submit your runs in the given format.

1. We will accept a maximum of four files (two for each data pair).

2. Participants must submit the predicted output in the following format. The file name should contain TeamName, Data, and RunID joined with underscores. For example, JU_BN-EN_Run1.json.

3. All the runs should be compressed into a single file named TeamID.

4. Participants are asked to send a request email for test data to brajagopal{DOT}cse{AT}gmail{DOT}com, with a cc to dipankar{DOT}dipnil2005{AT}gmail{DOT}com and amitava{DOT}das{AT}iiits{DOT}in. Please mention your preferred language pairs (HI-EN, BN-EN, or BOTH) and use the subject "REQUEST FOR TEST DATA OF SAIL 2017 @ICON".

5. Participants should send the predicted output to the above email addresses within 24 hours of receiving the test data.
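The naming and packaging rules above can be sketched as follows. The helper names run_filename and bundle_runs are made up for this example, and the ZIP format is an assumption: the instructions only ask for a single compressed file named after the team.

```python
import io
import zipfile

MAX_FILES = 4  # at most two runs for each of the two data pairs


def run_filename(team, data, run_id):
    """Join TeamName, Data, and RunID with underscores."""
    return "_".join([team, data, f"Run{run_id}"]) + ".json"


def bundle_runs(team_id, runs):
    """Pack runs (file name -> JSON text) into one in-memory archive.

    Returns the archive name and its bytes. ZIP is assumed here.
    """
    if len(runs) > MAX_FILES:
        raise ValueError(f"at most {MAX_FILES} run files are accepted")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in runs.items():
            archive.writestr(name, content)
    return f"{team_id}.zip", buffer.getvalue()


name = run_filename("JU", "BN-EN", 1)  # "JU_BN-EN_Run1.json"
archive_name, payload = bundle_runs("JU", {name: "[]"})
```

To save the archive to disk, write payload to a file opened in binary mode under archive_name.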

Results: The results are available at the following link. All participants must send us a one-paragraph description of their system on or before Friday, 20 October 2017.

System Description: System description papers are limited to 6 pages of content, with two additional pages allowed for references. The submission format is available at the following link. We encourage you to use the LaTeX format. Submissions must be made in PDF format through the CMT system at: https://cmt3.research.microsoft.com/SAIL2018/. Papers are due by Tuesday, 31 October 2017.

Proceedings: We are planning to publish the system description papers in CEUR Workshop Proceedings. We will let you know the details once it is confirmed.

Naming Convention: Paper titles are required to follow the specific format: "[SystemName]@[SAIL_CodeMixed-2017]:[Insert Paper Title Here]".

Program Schedule

15:30 – 15:40: Opening Remarks

15:40 – 16:10: Invited Talk
Kalika Bali

16:10 – 16:25: IIITH_NBP@SAIL_CodeMixed-2017: Code-Mixed Sentiment Analysis Using Machine Learning and Neural Network Approaches
Pruthwik Mishra, Prathyusha Danda, and Pranav Dhakras

16:25 – 16:40: JU_KS@SAIL_CodeMixed-2017: Sentiment Analysis for Indian Code Mixed Social Media Texts
Kamal Sarkar

16:40 – 16:55: BIT Mesra @ SAIL CodeMixed-2017: Majority Voting of SVM and NB classifiers for Sentiment Analysis of Hindi-English Code-Mixed Text
Ambuj Mishra, Rakesh Ranjan, Neelansh, and Sujan Kr. Saha

16:55 – 17:00: Closing Remarks


Bali, K., Sharma, J., Choudhury, M., and Vyas, Y. (2014). "I am borrowing ya mixing?" An analysis of English-Hindi code mixing in Facebook. In Proceedings of the First Workshop on Computational Approaches to Code Switching, pages 116-126.

Das, A. and Gamback, B. (2014). Identifying languages at the word level in code-mixed Indian social media text. In Proceedings of the 11th International Conference on Natural Language Processing, Goa, India, pages 169-178.

Ghosh, S., Ghosh, S., and Das, D. (2016). Part-of-speech tagging of code-mixed social media text. In Proceedings of the Second Workshop on Computational Approaches to Code Switching, EMNLP, pages 90-97.

Joshi, A., Prabhu, A., Shrivastava, M., and Varma, V. (2016). Towards subword level compositions for sentiment analysis of Hindi-English code mixed text. In Proceedings of the 26th International Conference on Computational Linguistics (COLING-2016), pages 2482-2491.

Kachru, Y. and Nelson, C. L. (2006). World Englishes in Asian Contexts. Hong Kong University Press.

Organizing Committee

Dipankar Das Jadavpur University, Kolkata, India
Amitava Das IIIT Sricity, Andhra Pradesh, India
Braja Gopal Patra UTHealth, Houston, USA


Koustav Rudra IIT Kharagpur, West Bengal, India
Upendra Kumar IIIT Sricity, Andhra Pradesh, India
Sainik Kumar Mahata Jadavpur University, Kolkata, India


Dipankar Das
Amitava Das
Braja Gopal Patra