SentiStorm - Real-time Twitter Sentiment Classification based on Apache Storm

SentiStorm is based on Apache Storm and uses different machine learning techniques to identify the sentiment of a tweet. For example, SentiStorm uses Part-of-Speech (POS) tags, Term Frequency-Inverse Document Frequency (TF-IDF) and multiple sentiment lexica to extract a feature vector out of a tweet. This extracted feature vector is processed by a Support Vector Machine (SVM), which predicts the sentiment based on a training dataset.

The full thesis can be found here.

Topology of SentiStorm

Topology of SentiStorm

The figure above illustrates the topology of SentiStorm including its components. The Dataset Spout emits tweets from a local dataset into the Storm pipeline. It can easily be replaced by another Spout. For example, a Twitter Spout can be used to emit tweets directly from the real-time Twitter stream. After tweets have been emitted by a Spout, the Tokenizer replaces possible Unicode or HTML symbols and tokenizes the tweet text using a complex regular expression.

Each token is then processed by the Preprocessor, which tries to unify emoticons, substitute slang expressions, fix gerund forms and remove elongations. The unification of emoticons removes repeating characters to get a consistent set of emoticons. For example, the emoticon :-))) is replaced by :-), so its sentiment can be looked up directly in an emoticon lexicon. Slang expressions such as omg are substituted by oh my god with the help of multiple slang lexica. Gerund forms are fixed by checking the ending of words for an omitted g, such as in goin. The removal of elongations works like the unification of emoticons and tries to eliminate repeating characters, such as in suuuper.

After the Preprocessor, a POS Tagger predicts the part-of-speech label for each token and forwards the tagged tokens to the Feature Vector Generation. The feature extraction process is a key component of SentiStorm. It generates a feature vector for each tweet based on the previously gathered data. The Feature Vector Generation component uses TF-IDF, POS tags and multiple sentiment lexica to map a tweet text to numerical features. Based on this feature vector, the SVM component finally predicts the sentiment of the given tweet.

The figure also illustrates the corresponding parallelism hint of each component. The parallelism value depends on the number of workers or nodes n. For example, the parallelism value of the POS Tagger component is 50 for a 10-node cluster, which means that each node executes 5 threads. These parallelism values fully utilize the 32 cores of a c3.8xlarge instance, because the LIBSVM library uses multiple threads too.

Tokenizer

The Tokenizer is the first Bolt in the SentiStorm topology and splits a tweet text into several tokens. In this process, the Tokenizer uses pattern matching with regular expressions. Furthermore, it replaces Unicode or HTML symbols before tokenizing the tweet text.
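The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SentiStorm itself runs on Storm and uses a far more complex regular expression, the pattern list and the name `tokenize` are hypothetical, and only HTML entities (not other Unicode symbols) are replaced here.

```python
import html
import re
from typing import List

# Illustrative alternatives, tried from left to right (not SentiStorm's actual pattern).
TOKEN_PATTERN = re.compile("|".join([
    r"https?://\S+",              # URLs
    r"[@#]\w+",                   # user mentions and hashtags
    r"<3|[:;=][-']?[()DPp]+",     # a heart and a few western-style emoticons
    r"\w+(?:'\w+)*",              # words, optionally with apostrophes
    r"[^\w\s]",                   # any other single non-space symbol
]))


def tokenize(text: str) -> List[str]:
    """Replace HTML entities, then split the tweet text into tokens."""
    text = html.unescape(text)    # e.g. "&lt;" becomes "<"
    return TOKEN_PATTERN.findall(text)


print(tokenize("I &lt;3 @storm :-))) http://t.co/x omg!"))
```

The order of the alternatives matters: URLs and emoticons are listed before the generic word and symbol patterns so that they are not split into smaller pieces.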

Tokenizer workflow

Preprocessor

The Preprocessor component receives the tokenized tweet from the Tokenizer and prepares the tokens for the POS Tagger. The following figure illustrates the workflow of the Preprocessor, which consists of multiple steps.

Preprocessor workflow

In the first step, the Preprocessor unifies all emoticons. For example, the emoticon :-))) becomes :-) to get a consistent set of emoticons. SentiStorm currently does not differentiate between these two emoticons; both have the same positive sentiment score based on the SentiStrength emoticons lexicon. Future extensions of SentiStorm might differentiate between them by using boosted sentiment scores. In the second step, the Preprocessor tries to substitute slang expressions, which helps the POS Tagger to determine the right POS tag. The next step removes punctuation between single characters. For example, the term L.O.V.E is replaced by the term LOVE. The Preprocessor also fixes incomplete gerund forms such as goin by replacing them with going. For that purpose, it uses the WordNet dictionary to find a valid word. In the last step, elongations such as suuuper are removed. If an elongation has been removed, the Preprocessor checks the resulting term for a slang expression again.
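The steps above can be sketched as follows. This is a minimal Python illustration, not SentiStorm's implementation: the tiny `SLANG` and `VOCABULARY` sets stand in for the real slang lexica and the WordNet dictionary, and all names and regular expressions are assumptions made for the example.

```python
import re
from typing import List

# Toy stand-ins: SentiStorm uses real slang lexica and the WordNet dictionary.
SLANG = {"omg": ["oh", "my", "god"]}
VOCABULARY = {"going", "super", "love"}

EMOTICON = re.compile(r"[:;=][-']?[()DPp]+")
REPEATED = re.compile(r"(.)\1+")                     # a character repeated 2+ times
ELONGATED = re.compile(r"(.)\1{2,}")                 # a character repeated 3+ times
DOTTED = re.compile(r"(?:[A-Za-z]\.){2,}[A-Za-z]?")  # letters separated by dots


def unify_emoticon(token: str) -> str:
    """Collapse repeated characters, e.g. ':-)))' becomes ':-)'."""
    return REPEATED.sub(r"\1", token)


def remove_elongation(token: str) -> str:
    """Shrink long character runs, preferring a form found in the vocabulary."""
    for replacement in (r"\1\1", r"\1"):
        candidate = ELONGATED.sub(replacement, token)
        if candidate.lower() in VOCABULARY:
            return candidate
    return ELONGATED.sub(r"\1", token)


def preprocess(tokens: List[str]) -> List[str]:
    result = []
    for token in tokens:
        if EMOTICON.fullmatch(token):                 # step 1: unify emoticons
            result.append(unify_emoticon(token))
            continue
        if token.lower() in SLANG:                    # step 2: substitute slang
            result.extend(SLANG[token.lower()])
            continue
        if DOTTED.fullmatch(token):                   # step 3: L.O.V.E -> LOVE
            token = token.replace(".", "")
        lowered = token.lower()
        if lowered.endswith("in") and lowered + "g" in VOCABULARY:
            token += "g"                              # step 4: goin -> going
        shortened = remove_elongation(token)          # step 5: suuuper -> super
        if shortened != token and shortened.lower() in SLANG:
            result.extend(SLANG[shortened.lower()])   # re-check slang after step 5
        else:
            result.append(shortened)
    return result


print(preprocess([":-)))", "omg", "L.O.V.E", "goin", "suuuper", "omggg"]))
```

The last token shows why the slang check is repeated: omggg only becomes a known slang expression once the elongation has been removed.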

POS Tagger

The POS Tagger component determines the part-of-speech (POS) labels for the preprocessed tokens. Currently there are two major POS taggers available that are highly specialized for Twitter-specific language. The first POS tagger was presented by Derczynski et al. [1] of the General Architecture for Text Engineering (GATE) group at the University of Sheffield. Owoputi et al. [2] of the ARK research group at Carnegie Mellon University proposed the second major POS tagger.

The first implementation of SentiStorm used the GATE POS tagger because it supports the commonly used PTB tagset. However, the GATE tagger is significantly slower than the ARK tagger and is therefore not applicable in a real-time environment such as Storm, which made a transition to the ARK tagger necessary.

Feature Vector Generation

The feature extraction process is a key component of SentiStorm. It largely determines the prediction quality of the subsequent Support Vector Machine component. The Feature Vector Generation component extracts numerical features from the preprocessed and tagged tweets. For that purpose, it uses a rich feature set, which consists of Term Frequency-Inverse Document Frequency (TF-IDF), POS tags and sentiment lexica.
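To make the TF-IDF part concrete, here is a minimal Python sketch. It uses one common variant (raw term counts multiplied by log(N / df)); the exact weighting scheme used by SentiStorm is not stated here, so the formula and the function names are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def inverse_document_frequency(corpus: List[List[str]]) -> Dict[str, float]:
    """idf(t) = log(N / df(t)), computed over a corpus of tokenized tweets."""
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tf_idf(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Weight each known term of one tweet by its count times its idf."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


training_tweets = [["good", "day"], ["bad", "day"]]
idf = inverse_document_frequency(training_tweets)
print(tf_idf(["good", "good", "day", "new"], idf))
```

Terms that occur in every training tweet (day in this toy corpus) get a weight of zero, and terms that were never seen during training (new) are dropped.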

The following table presents the sentiment lexica used by SentiStorm, including the number of terms and the range of the sentiment scores. Each sentiment lexicon consists of a set of tokens, each of which is assigned a sentiment score.

Sentiment Lexicon	# of Terms	Scores
AFINN-111	2,477 words	[-5, 5]
SentiStrength Emotions	2,544 regex	[-5, 5]
SentiStrength Emoticons	107 emoticons	[-1, 1]
SentiWords	147,292 words	[-0.935, 0.88257]
Sentiment140	62,468 unigrams	[-4.999, 5]
Bing Liu	6,785 words	[positive, negative]
MPQA Subjectivity	6,886 words	[positive, negative]
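Because the lexica use different score ranges, their scores have to be brought onto a comparable scale before they can be combined into one feature vector. The following Python sketch shows one possible way to do this. The lexicon excerpts are toy data, and the rescaling to [-1, 1], the mapping of positive/negative to +1/-1 and the three features per lexicon are assumptions for illustration, not SentiStorm's actual feature set.

```python
from typing import Dict, List, Optional, Tuple

# Toy excerpts for illustration; the score ranges are taken from the table above.
LEXICA = {
    "AFINN-111": ({"good": 3.0, "bad": -3.0}, (-5.0, 5.0)),
    "Bing Liu": ({"good": "positive", "bad": "negative"}, None),
}
POLARITY = {"positive": 1.0, "negative": -1.0}


def normalize(score, score_range: Optional[Tuple[float, float]]) -> float:
    """Map a lexicon score to [-1, 1]; categorical scores become +1 or -1."""
    if score_range is None:
        return POLARITY[score]
    low, high = score_range
    return 2.0 * (score - low) / (high - low) - 1.0


def lexicon_features(tokens: List[str], entries: Dict, score_range) -> List[float]:
    """Three features per lexicon: positive count, negative count, score sum."""
    scores = [normalize(entries[t], score_range) for t in tokens if t in entries]
    positive = sum(1 for s in scores if s > 0)
    negative = sum(1 for s in scores if s < 0)
    return [float(positive), float(negative), sum(scores)]


def feature_vector(tokens: List[str]) -> List[float]:
    """Concatenate the features of all lexica into one numerical vector."""
    vector: List[float] = []
    for entries, score_range in LEXICA.values():
        vector.extend(lexicon_features(tokens, entries, score_range))
    return vector


print(feature_vector(["good", "bad", "good", "day"]))
```

In a full feature vector, these lexicon features would be combined with the TF-IDF weights and POS tag features.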

Support Vector Machine (SVM)

The last component of the SentiStorm topology is the Support Vector Machine. SVM is used to classify the sentiment of a tweet based on its feature vector. It is a supervised learning model and requires a set of training data and associated labels. The training data consist of feature vectors, which are usually defined by numerical values. The SVM tries to find hyperplanes that separate these training vectors based on their associated labels. Then all future feature vectors can be classified. SentiStorm uses the LIBSVM library of Chang et al. [3], which is a well-known SVM implementation in the machine learning area.
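As an illustration of what labeled training vectors look like, LIBSVM's command-line tools read a sparse text format with one vector per line: the label followed by index:value pairs with ascending, 1-based indices, where zero values may be omitted. The Python sketch below writes such a line. The helper name and the label encoding are assumptions, and whether SentiStorm passes its data as files or through the library's API objects is not stated here.

```python
from typing import List


def to_libsvm_line(label: int, features: List[float]) -> str:
    """Format one labeled feature vector in LIBSVM's sparse text format."""
    parts = [str(label)]
    for index, value in enumerate(features, start=1):  # indices are 1-based
        if value != 0.0:                               # zero entries are omitted
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


# Hypothetical label encoding: 0 = negative, 1 = neutral, 2 = positive
print(to_libsvm_line(2, [0.5, 0.0, 2.0]))
```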

Quality of SentiStorm

The quality evaluation compares the sentiment prediction quality of SentiStorm with state-of-the-art sentiment classification systems on the SemEval 2013 dataset. SentiStorm achieves an F_p/n-measure of 66.85%, which would place it second among the top five SemEval 2013 message polarity results [4], shown in the following table.

Team	F_p/n
NRC-Canada	0.6902
GU-MLT-LT	0.6527
teragram	0.6486
BOUNCE	0.6353
KLUE	0.6306
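The F_p/n-measure is the average of the F-measures of the positive and the negative class. As a small worked example in Python, the following sketch recomputes SentiStorm's score from the F_pos and F_neg values of the All Features row in the feature ablation table and ranks it against the five results above. The variable names are illustrative.

```python
def f_pn(f_pos: float, f_neg: float) -> float:
    """Average of the positive-class and negative-class F-measures."""
    return (f_pos + f_neg) / 2


SEMEVAL_2013_TOP5 = {
    "NRC-Canada": 0.6902,
    "GU-MLT-LT": 0.6527,
    "teragram": 0.6486,
    "BOUNCE": 0.6353,
    "KLUE": 0.6306,
}

sentistorm = round(f_pn(0.7080, 0.6290), 4)
# Rank = 1 + number of teams with a strictly higher score
rank = 1 + sum(1 for score in SEMEVAL_2013_TOP5.values() if score > sentistorm)
print(sentistorm, rank)
```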

The feature ablation in the following table illustrates how much each feature contributes to the overall prediction quality. Each row presents the F-measures obtained by removing one feature from the full feature set; the last two columns give the difference to the All Features row. The most important features are the class weights and TF-IDF: removing them lowers F_p/n by 0.0354 and 0.0287, respectively. The sentiment lexica of Bing Liu and MPQA have only a minimal impact on the prediction quality.

Features	SemEval 2013 Test
	F_pos	F_neg	F_ntr	F_all	F_p/n	Acc	D_{F_all}	D_{F_p/n}
All Features	.7080	.6290	.7251	.7012	.6685	.7021
- Class Weights	.7021	.5642	.7302	.7023	.6331	.6974	+.0011	-.0354
- TF-IDF	.6380	.7354	.6689	.6634	.6398	.6666	-.0378	-.0287
- POS Tags	.7049	.6014	.7148	.6903	.6531	.6916	-.0109	-.0154
- AFINN	.6952	.6138	.7082	.6857	.6545	.6869	-.0155	-.0140
- SentiStrength	.7070	.6218	.7247	.6993	.6644	.7002	-.0019	-.0041
- SentiStrength :-)	.6938	.6138	.7180	.6905	.6538	.6910	-.0107	-.0147
- SentiWords	.7003	.6094	.7246	.6951	.6549	.6958	-.0061	-.0136
- Sentiment140	.6972	.6051	.7222	.6918	.6511	.6926	-.0094	-.0174
- Bing Liu	.7031	.6261	.7242	.6989	.6646	.6994	-.0023	-.0039
- MPQA	.7075	.6159	.7279	.7002	.6617	.7010	-.0010	-.0068
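The D_{F_p/n} column can be reproduced directly from the F_p/n column. The following Python sketch recomputes the differences and sorts the features by their impact; the dictionary keys are shortened labels chosen for this example.

```python
# F_p/n values copied from the table above
ALL_FEATURES = 0.6685
WITHOUT_FEATURE = {
    "Class Weights": 0.6331,
    "TF-IDF": 0.6398,
    "POS Tags": 0.6531,
    "AFINN": 0.6545,
    "SentiStrength": 0.6644,
    "SentiStrength Emoticons": 0.6538,
    "SentiWords": 0.6549,
    "Sentiment140": 0.6511,
    "Bing Liu": 0.6646,
    "MPQA": 0.6617,
}

# Difference to the full feature set: a negative value means the feature helps
deltas = {name: round(score - ALL_FEATURES, 4) for name, score in WITHOUT_FEATURE.items()}

# Most important feature (largest loss when removed) first
for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{name:<24}{delta:+.4f}")
```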

Performance of SentiStorm

The performance evaluation analyzes the speed of SentiStorm. The speed is mostly measured in tuples per second, which in this case are tweets per second. The performance evaluations are based on Amazon c3.8xlarge EC2 instances. The Storm multi-node cluster consists of a single worker per node and goes up to 10 nodes.

The following table shows the latency of each SentiStorm component and the complete latency of the topology. The Preprocessor has the lowest latency of about 0.108 ms. The POS Tagger component has the highest latency. It needs about 1.53 ms to process one tweet, which is more than 10 times the latency of the Preprocessor. The SVM is slightly faster than the POS Tagger with a latency of about 1.025 ms. The table also shows only a minimal increase in latency for multiple nodes. The complete latency of the topology, that is, the time a tweet needs to pass through the complete topology, is about 48.2 ms on a single node and about 53.5 ms on the larger clusters. The topology of SentiStorm was optimized for high throughput at the cost of a higher latency.

Nodes	Tokenizer Latency (ms)	Preprocessor Latency (ms)	POS Tagger Latency (ms)	Feature Generation Latency (ms)	SVM Latency (ms)	Complete Latency (ms)
1	0.179	0.108	1.492	0.185	0.953	48.155
2	0.182	0.108	1.514	0.183	0.987	51.048
3	0.189	0.112	1.531	0.183	1.034	52.607
4	0.187	0.109	1.543	0.180	1.023	52.311
5	0.188	0.110	1.536	0.183	1.023	52.657
6	0.184	0.108	1.532	0.179	1.025	53.332
7	0.182	0.109	1.544	0.178	1.022	53.575
8	0.187	0.110	1.549	0.178	1.025	53.359
9	0.180	0.107	1.521	0.177	1.016	54.055
10	0.182	0.107	1.528	0.176	1.031	53.889
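As a quick check of the figures quoted above, the following Python sketch averages each latency column over all cluster sizes and compares the POS Tagger with the Preprocessor. The column names are shortened for the example.

```python
from statistics import mean

COMPONENTS = ["tokenizer", "preprocessor", "pos_tagger", "feature_generation", "svm", "complete"]

# Rows copied from the latency table (1 to 10 nodes), all values in ms
LATENCY_MS = [
    (0.179, 0.108, 1.492, 0.185, 0.953, 48.155),
    (0.182, 0.108, 1.514, 0.183, 0.987, 51.048),
    (0.189, 0.112, 1.531, 0.183, 1.034, 52.607),
    (0.187, 0.109, 1.543, 0.180, 1.023, 52.311),
    (0.188, 0.110, 1.536, 0.183, 1.023, 52.657),
    (0.184, 0.108, 1.532, 0.179, 1.025, 53.332),
    (0.182, 0.109, 1.544, 0.178, 1.022, 53.575),
    (0.187, 0.110, 1.549, 0.178, 1.025, 53.359),
    (0.180, 0.107, 1.521, 0.177, 1.016, 54.055),
    (0.182, 0.107, 1.528, 0.176, 1.031, 53.889),
]

# Average latency per component across all cluster sizes
averages = {
    name: mean(row[i] for row in LATENCY_MS)
    for i, name in enumerate(COMPONENTS)
}
ratio = averages["pos_tagger"] / averages["preprocessor"]

for name, value in averages.items():
    print(f"{name:<20}{value:.3f} ms")
print(f"POS Tagger / Preprocessor: {ratio:.1f}x")
```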

The following table presents the throughput of SentiStorm. The throughput is measured in tweets per second at the end of the topology. The average number of tweets per second decreases only minimally, from 1044 tweets per second at one node to 929 tweets per second at 10 nodes. Overall, a single-node Storm cluster processes 3133 tweets per second, which is only 20% less than the stand-alone performance. Running on Storm, the SentiStorm topology scales almost linearly and achieves 27,876 tweets per second at 10 nodes. This corresponds to 1,672,560 tweets per minute, 100,353,600 tweets per hour and 2,408,486,400 tweets per day. At this rate, SentiStorm is able to predict the sentiment of each tweet of the global Twitter stream in real time.

Nodes	Tweets per Second
1	3133
2	5920
3	8599
4	11528
5	14295
6	17025
7	19735
8	22576
9	25207
10	27876

Throughput of SentiStorm based on the SemEval 2013 dataset and c3.8xlarge EC2 nodes
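The scaling behavior can be quantified from the throughput table. The following Python sketch computes the scaling efficiency relative to the single-node result, defined here as T(n) / (n * T(1)), which is a definition chosen for this example, and converts the 10-node throughput into tweets per minute, hour and day.

```python
# Tweets per second by number of nodes, copied from the throughput table
THROUGHPUT = {
    1: 3133, 2: 5920, 3: 8599, 4: 11528, 5: 14295,
    6: 17025, 7: 19735, 8: 22576, 9: 25207, 10: 27876,
}

baseline = THROUGHPUT[1]
# 1.0 would be perfectly linear scaling
efficiency = {n: tps / (n * baseline) for n, tps in THROUGHPUT.items()}

per_second = THROUGHPUT[10]
per_minute = per_second * 60
per_hour = per_minute * 60
per_day = per_hour * 24

for n, value in efficiency.items():
    print(f"{n:>2} nodes: {value:.1%} of linear scaling")
print(per_minute, per_hour, per_day)
```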

Requirements

You have to download wn3.1.dict.tar.gz into resources/dictionaries/wordnet.
wget -P resources/dictionaries/wordnet/ http://wordnetcode.princeton.edu/wn3.1.dict.tar.gz
Increase supervisor.childopts by updating the conf/storm.yaml and restart supervisors.
echo supervisor.childopts: \"-Xmx4g\" >> conf/storm.yaml

Build and Run

You will need Java 7 and Apache Ant to build SentiStorm.

You can simply build with:

ant jar

You can run SentiStorm with:

ant run

You can use the Twitter live stream with the Streaming API credentials:

ant run -DconsumerKey=XXXX -DconsumerSecret=XXXX -DaccessToken=XXXX -DaccessTokenSecret=XXXX

References

[1] https://gate.ac.uk/sale/ranlp2013/twitter_pos/twitter_pos.pdf

[2] http://www.ark.cs.cmu.edu/TweetNLP/owoputi+etal.naacl13.pdf

[3] http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf

[4] http://www.cs.york.ac.uk/semeval-2013/accepted/101_Paper.pdf