As the confusion matrix shows, this is far from a state-of-the-art model, but it does a decent job of predicting whether a tweet is spam. It achieves 79.5% accuracy with minimal tuning. Since this is our baseline, we use its results to confirm that our more advanced model is performing as expected.
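To make the baseline concrete, here is a minimal sketch of a multinomial Naive Bayes spam classifier with Laplace smoothing. The toy corpus and token lists are hypothetical illustrations, not the actual dataset or preprocessing used above.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Train a multinomial Naive Bayes classifier with Laplace smoothing.
    docs: list of token lists; labels: parallel list of class names."""
    classes = set(labels)
    priors = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, y in zip(docs, labels):
        counts[y].update(doc)
    vocab = {w for c in classes for w in counts[c]}

    def log_likelihood(word, c):
        # Laplace (add-one) smoothing avoids zero probabilities for unseen pairs
        return math.log((counts[c][word] + 1) /
                        (sum(counts[c].values()) + len(vocab)))

    def predict(doc):
        scores = {c: math.log(priors[c]) +
                     sum(log_likelihood(w, c) for w in doc if w in vocab)
                  for c in classes}
        return max(scores, key=scores.get)

    return predict

# Toy training corpus (hypothetical examples only)
docs = [["win", "free", "prize"], ["click", "free", "link"],
        ["meeting", "at", "noon"], ["see", "you", "at", "lunch"]]
labels = ["spam", "spam", "ham", "ham"]
predict = train_nb(docs, labels)
print(predict(["free", "prize", "click"]))  # -> "spam"
```

In practice a library implementation (e.g. scikit-learn's `MultinomialNB` over a bag-of-words matrix) would be used, but the log-probability arithmetic is the same.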
The architecture reaches 85.3% accuracy, a significant improvement over the Naive Bayes baseline (79.5%). The main gain comes from a reduction in false negatives: tweets predicted as non-spam that were actually spam. This suggests the attention mechanism contributes meaningfully by interpreting individual tokens in the context of their surrounding text.
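The relationship between the accuracy gain and the drop in false negatives can be sketched directly from confusion-matrix counts. The counts below are hypothetical, chosen only so the derived accuracies match the reported 79.5% and 85.3%; they are not the actual matrices.

```python
def metrics(tp, fp, fn, tn):
    """Derive accuracy and false-negative rate from binary confusion-matrix
    counts, where tp = spam correctly flagged and fn = spam missed."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    fnr = fn / (tp + fn)  # fraction of actual spam predicted as non-spam
    return accuracy, fnr

# Hypothetical counts for illustration (1000 tweets, 425 of them spam)
baseline_acc, baseline_fnr = metrics(tp=300, fp=80, fn=125, tn=495)
attention_acc, attention_fnr = metrics(tp=370, fp=92, fn=55, tn=483)

print(f"baseline:  acc={baseline_acc:.3f}, fnr={baseline_fnr:.3f}")
print(f"attention: acc={attention_acc:.3f}, fnr={attention_fnr:.3f}")
```

With these numbers the accuracy moves from 0.795 to 0.853 while the false-negative rate drops from roughly 29% to 13%, mirroring the pattern described above.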