I've been doing a lot of different things at work recently but one of them got me thinking about emoticons, or as I think most people call them, smileys.
Smileys (as I'll refer to them throughout) often convey emotive opinions. For example, if you have just been given a gift you might use :) in an e-mail or tweet saying thank you. On the other hand if your flight has been delayed you might well use :( to show your displeasure. It's fairly easy to build two lists of smileys, one positive and one negative, containing all the possible variations of each smiley. These lists can then be used as one feature when trying to classify text and such lists can be found in most opinion mining systems which attempt to label a piece of text as positive, negative or neutral.
It turns out though, that in some cases a sad smiley can actually have a positive meaning. This came to my attention when participating in the profiling task of RepLab 2012. The motivation behind the profiling task is brand management. For example, imagine you are the PR company responsible for a company like Apple. In an ideal world you would read every document ever written about the company and try and address any negative opinions you discover. Of course we don't live in an ideal world and the profiling task aimed to encourage development of systems that could do this task automatically. The task used tweets as the documents, and the aim was to first determine if a tweet was relevant to a given entity (i.e. is it talking about Apple the computer manufacturer rather than the fruit or record label) and then to determine it's polarity, i.e. does the tweet have positive or negative implications for the company's reputation. On the face of it, you would think that the second step could be performed by a standard opinion mining system. Unfortunately I don't think that is necessarily the case.
The example given in the task description makes it quite clear that a sad sounding tweet can actually have a positive effect on a brand: R.I.P. Michael Jackson. We'll miss you. I could easily imagine this tweet being extended with the addition of a sad smiley. So a sad smiley could easily occur in both negative and positive tweets.
As I was intending to use a machine learning algorithm to build a classifier to learn polarity this dual use of smileys (and other words in general) didn't bother me too much as if I could find enough training data then hopefully the algorithm would sort out the contradictions. The problem was that I didn't have that much training data (400 tweets for each of 6 entities). Unfortunately there are many ways to write the same smiley and this variation means that the algorithm might see each version just once and would therefore be unable to draw any strong conclusions from it's presence.
The solution of course is to normalize each smiley to a given form. So for example, I would normalize all the smileys in the title of this post to :) and then feed that to the machine learning algorithm. In GATE this was easy to achieve using a gazetteer with an additional feature to store the normalized version. The gazetteer I've built covers all of the western style (i.e. viewed from the side) smileys I could find as well as the relevant Unicode symbols. To save anyone else the hassle of having to build such a gazetteer I've made it available for anyone who is interested (note when loading it into GATE use a space as the feature separator).
Of course, after the work of assembling the gazetteer it didn't actually make any difference to the performance of my polarity classifier. It turns out that in all the training data I had there weren't many smileys! Given that I'm going to be doing more work in this area over the coming months, I'm hoping that it will eventually turn out to be useful.
This does all lead to a question though -- while there are lots of ways of writing the same smiley do most people use the simplest form, i.e. :) instead of any of the three character versions?