Sentiment Analysis for Indian Languages (Code Mixed)

NLP Tool Contest @ICON-2017, Jadavpur University

India is a linguistic area with one of the longest histories of contact, influence, use, teaching and learning of English-in-diaspora in the world (Kachru and Nelson, 2006). As a result, a large number of Indians active on the internet can communicate in English to some degree. India also enjoys huge linguistic diversity: apart from Hindi, it has several regional languages that are the primary tongue of people native to each region. Consequently, posts on social media such as Facebook, WhatsApp and Twitter often contain more than one language, phenomena known as code-mixing and code-switching. At the same time, the evolution of sentiment in such social media texts has created many new opportunities for information access and language technology, along with many new challenges, making it one of the prime present-day research areas. Sentiment analysis of code-mixed data has several real-life applications in opinion mining, from social media campaigns to feedback analysis.

Linguistic processing of such social media datasets, and sentiment analysis of them, is a difficult task. To date, most experiments have addressed language identification (Bali et al., 2014; Das and Gamback, 2014), parts-of-speech tagging (Ghosh et al., 2016), etc. A few efforts have also begun on sentiment analysis of code-mixed data such as Hindi-English (Joshi et al., 2016). Therefore, we believe that this contest is the best place to bring more research attention towards developing language technologies for identifying sentiments in Indian social media texts.

The main goal of this task is to identify the sentence-level sentiment polarity of code-mixed datasets of Indian language pairs (Hi-En, Ben-Hi-En) collected from Twitter, Facebook, and WhatsApp. Each sentence is annotated with language information as well as polarity at the sentence level. Participants will be provided with development, training and test datasets.

Each participating team will be allowed to submit two systems for each of the language pairs, and the best result will be considered final. The final evaluation will be based on the macro-averaged F1-measure. The Python code for the evaluation will be provided by the organizers. Initially, each participating team will have access to the development and training data. Later, the unlabeled test data will be provided, and teams must submit their results within 24 hours. There will be no distinction between constrained and unconstrained systems, but participants will be asked to report what additional resources they have used for each of their submitted runs.
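For teams who want to check their development-set scores before the official script is available, the macro-averaged F1-measure can be sketched as below. This is only an illustration, not the organizers' evaluation code: the label names used in the example are assumptions, and the official script may treat absent classes or ties differently.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed

    scores = []
    for label in sorted(set(gold) | set(pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)

    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical labels, for illustration only.
gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "negative", "positive", "positive"]
print(round(macro_f1(gold, pred), 3))
```

Because every class contributes equally to the average, a system that ignores a rare class is penalized more heavily than it would be under accuracy.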

The contest will have three prizes:
First Prize: Rs. 10,000/-
Second Prize: Rs. 7,500/-
Third Prize: Rs. 5,000/-

Important Dates

  • Registration Ends Aug 7, 2017
  • Training Data Released Aug 10, 2017
  • Test Data Ready Sep 28, 2017
  • Submit Run Within 24 hours of receiving the test data
  • Result Announcement Oct 3, 2017
  • Working Notes Submissions Due Oct 15, 2017
  • Working Notes Reviews Nov 1, 2017
  • Working Notes Final Version Due Nov 15, 2017


Please fill out this form to express your interest in taking part in this contest. Due to the privacy policies of Facebook and WhatsApp, we will not be able to release the data publicly. You need to fill out the copyright form and send it to dipankar{DOT}dipnil2005{AT}gmail{DOT}com. Please mention "Copyright Form for NLP Tool Contest" in the subject line. The request must state the preferred data pair (Hindi-English or Bengali-English or both).

Please join the SAIL Google group (sail_icon2017@googlegroups.com) for discussion.


Test data will be available upon request from October 02, 2017 to October 09, 2017. Participants must submit their predicted output within 24 hours of receiving the test dataset. The test data will be provided in JSON format, with all sentiment tags set to NA. Participants should change only the sentiment values. The validation and evaluation scripts are available at the following link. A maximum of two submissions per dataset will be accepted from each team.
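The requirement to change only the sentiment values can be sketched as follows. The record layout here is an assumption made for illustration: the key names `text` and `sentiment`, the `id` field, and the `predict` callable are all hypothetical, so check the released files and the validation script for the actual schema.

```python
import copy
import json


def fill_sentiments(records, predict, key="sentiment", text_key="text"):
    """Return a copy of records with every "NA" sentiment replaced.

    All other fields are left exactly as they were, so the output keeps
    the structure of the test file. Key names are hypothetical.
    """
    filled = copy.deepcopy(records)
    for record in filled:
        if record.get(key) == "NA":
            record[key] = predict(record[text_key])
    return filled


# Hypothetical test record and a dummy classifier, for illustration only.
test_records = json.loads('[{"id": 1, "text": "example sentence", "sentiment": "NA"}]')
output = fill_sentiments(test_records, predict=lambda text: "neutral")
print(json.dumps(output, ensure_ascii=False, indent=2))
```

Working on a deep copy keeps the original test records available for comparison, which makes it easy to confirm that nothing except the sentiment field has changed.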

We strongly encourage you to submit your runs in the given format.

1. We will accept a maximum of four files (2 for each data pair).

2. Participants must submit the predicted output in the following format. The file name should contain TeamName, Data, and RunID joined together with underscores. For example, JU_BN-EN_Run1.json.

3. All the runs should be compressed into a single file named TeamID.

4. Participants are asked to send a request email for the test data to brajagopal{DOT}cse{AT}gmail{DOT}com, with cc to dipankar{DOT}dipnil2005{AT}gmail{DOT}com and amitava{DOT}das{AT}iiits{DOT}in. Please mention the preferred language pairs (HI-EN or BN-EN or BOTH) and use the subject "REQUEST FOR TEST DATA OF SAIL 2017 @ICON".

5. Participants should send the predicted output to the above email addresses within 24 hours of receiving the test data.
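The naming and packaging rules above can be sketched in a few lines of Python. Several details are assumptions: the instructions do not name a compression format, so ZIP is used here only as an example, and limiting RunID to 1 or 2 is inferred from the two-runs-per-pair limit. The helper names `run_filename` and `bundle_runs` are hypothetical.

```python
import zipfile
from pathlib import Path

ALLOWED_DATA = {"HI-EN", "BN-EN"}  # data pair codes used in the request email
MAX_RUNS = 2                       # assumed from "2 for each data pair"


def run_filename(team, data, run_id):
    """Build a name such as JU_BN-EN_Run1.json from its three parts."""
    if data not in ALLOWED_DATA:
        raise ValueError(f"unknown data pair: {data}")
    if not 1 <= run_id <= MAX_RUNS:
        raise ValueError(f"run_id must be between 1 and {MAX_RUNS}")
    return f"{team}_{data}_Run{run_id}.json"


def bundle_runs(team_id, run_files, out_dir="."):
    """Compress all run files into one archive named after the team.

    ZIP is an assumption; the instructions only ask for a single
    compressed file named TeamID.
    """
    archive = Path(out_dir) / f"{team_id}.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in run_files:
            zf.write(path, arcname=Path(path).name)
    return archive


print(run_filename("JU", "BN-EN", 1))
```

Generating the names from one function avoids small mismatches, such as a missing underscore, across the files a team submits.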


Organizing Committee

Dipankar Das, Jadavpur University, Kolkata, India
Amitava Das, IIIT Sricity, Andhra Pradesh, India
Braja Gopal Patra, UTHealth, Houston, USA


Koustav Rudra, IIT Kharagpur, West Bengal, India
Upendra Kumar, IIIT Sricity, Andhra Pradesh, India
Sainik Kumar Mahata, Jadavpur University, Kolkata, India


