F1 BahreinGP tweet sentiment analysis

utassydv
Dec 5, 2020

This article was written by David Utassy as an individual home assignment for the MS in Business Analytics program at Central European University. Its purpose is to show the power of readily available tools such as Amazon Comprehend, R, and the ‘rtweet’ package. The whole experiment is built around the most horrific Formula 1 crash that I have witnessed in the past 20 years.

Romain Grosjean (Photo: Formula1/Twitter)

Introduction

Race day is an especially active day on Twitter for the F1 community. Most tweets about F1 are published on race Sundays, so I decided to use this home assignment to experiment with them. Apologies to F1 fans: this article is somewhat technical, but Romain Grosjean certainly brought some excitement into it.

This project is based on Gerold Csendes’s article and uses mostly the same tools that he describes. I recommend his article if you are interested in football tweet analysis, or just want to read another fascinating story.

Before going into details, here is the link to my code on GitHub, where I completed this analysis.

Getting data from Twitter

As a first step, we need data to analyze. As already mentioned, I decided to use tweets for this purpose, which means using the API provided by Twitter. First, we need to create a regular Twitter account; then, following this link, we can apply for a Twitter developer account. At this point we need to have a sketch of our project in mind, because the application form is quite detailed. For me, filling it in took about 30 minutes, but after answering some additional questions I received by email, I had to wait about a day for approval.

Once we are set up with Twitter, we can use the rtweet R package to interact with our application and retrieve the desired tweets. We have to create a token with the secret keys provided by Twitter in the following way:
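A minimal sketch of creating the token, assuming rtweet 0.7.x (the release line available when this article was written; later versions changed the authentication helpers). The app name and the environment variable names are placeholders, not taken from the original code. Reading the keys from environment variables keeps them out of the script.

```r
# Read the four Twitter credentials from environment variables.
# The variable names are hypothetical; use whatever names you set yourself.
read_twitter_keys <- function() {
  vars <- c(
    consumer_key    = "TWITTER_CONSUMER_KEY",
    consumer_secret = "TWITTER_CONSUMER_SECRET",
    access_token    = "TWITTER_ACCESS_TOKEN",
    access_secret   = "TWITTER_ACCESS_SECRET"
  )
  keys <- Sys.getenv(vars, unset = NA)
  names(keys) <- names(vars)
  if (anyNA(keys)) {
    stop("Missing credentials: ", paste(vars[is.na(keys)], collapse = ", "))
  }
  as.list(keys)
}

# Build the token (requires the rtweet package, version 0.7.x assumed).
# The default app name is a placeholder for the name of your own Twitter app.
make_twitter_token <- function(app_name = "my_f1_sentiment_app") {
  keys <- read_twitter_keys()
  rtweet::create_token(
    app             = app_name,
    consumer_key    = keys$consumer_key,
    consumer_secret = keys$consumer_secret,
    access_token    = keys$access_token,
    access_secret   = keys$access_secret
  )
}
```

Calling `token <- make_twitter_token()` once per session should then be enough for the later search calls.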

With this token, we can use rtweet to get data from Twitter. In the code below, I used type, language, and date filters to get all kinds of English tweets about BahreinGP from the race day.

The keyword I used is “BahreinGP” which is the official hashtag of the GP. Additionally, the code includes some text cleaning to prepare the data for sentiment analysis.
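The search and cleaning steps could look roughly like the sketch below. It assumes rtweet 0.7.x, a race date of 2020-11-29, `type = "mixed"`, and a cap of 18000 tweets; the helper names `build_query()`, `fetch_race_tweets()`, and `clean_tweet_text()` are hypothetical and not from the original code. `plain_tweets()` is assumed to be the built-in cleaner of that rtweet version.

```r
# Build a search query restricted to one calendar day (dates are UTC).
# `since:` and `until:` are Twitter search operators; `until:` is exclusive.
build_query <- function(keyword, day) {
  day <- as.Date(day)
  sprintf("%s since:%s until:%s", keyword, format(day), format(day + 1))
}

# Hypothetical wrapper around rtweet::search_tweets().
fetch_race_tweets <- function(token, keyword = "BahreinGP",
                              day = "2020-11-29", n = 18000) {
  tweets <- rtweet::search_tweets(
    q     = build_query(keyword, day),  # keyword plus date filter
    n     = n,                          # arbitrary upper bound for this sketch
    type  = "mixed",                    # type filter (assumed value)
    lang  = "en",                       # language filter, passed on to the API
    token = token
  )
  # Built-in cleaner that reduces tweets to plainer text.
  tweets$clean_text <- rtweet::plain_tweets(tweets$text)
  tweets
}

# A manual alternative to the built-in cleaner, using base R only.
clean_tweet_text <- function(x) {
  x <- gsub("https?://\\S+", "", x, perl = TRUE)  # drop URLs
  x <- gsub("@\\w+", "", x, perl = TRUE)          # drop @mentions
  x <- gsub("#", "", x, fixed = TRUE)             # keep hashtag words, drop '#'
  x <- gsub("\\s+", " ", x, perl = TRUE)          # collapse whitespace
  trimws(x)
}
```

Keeping both cleaned versions side by side makes it possible to compare how the cleaning method affects the sentiment results.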

Setting up AWS Amazon Comprehend

For sentiment analysis, I used Amazon Comprehend, a natural language processing service provided on Amazon Web Services (AWS). One of its features is sentiment analysis of raw text.

To use Amazon Comprehend, we first need to create an AWS account. Once we have an account, we need to generate an access key in Amazon’s Identity and Access Management (IAM) console.

Under Users, we should find our user; after selecting the Security credentials tab, a Create access key button should appear. With it we can generate the key and save it in the project folder of our R code. (WARNING: never publish this key anywhere.)

To use Amazon Comprehend from R, we can use the “aws.comprehend” package. The following code snippet shows how to install it and configure it with the previously generated access keys.
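A minimal sketch of this setup. It assumes the key pair was saved as a CSV whose first two columns are the key ID and the secret key, and it uses `eu-west-1` as an example region; the helper names and the file layout are assumptions, not taken from the original code. The package picks up credentials from the standard AWS environment variables.

```r
# Install once (commented out so the script does not reinstall on every run):
# install.packages("aws.comprehend",
#                  repos = c(cloudyr = "http://cloudyr.github.io/drat",
#                            getOption("repos")))

# Hypothetical helper: read the key pair from a local CSV with one data row
# whose first two columns are the access key ID and the secret access key.
load_aws_keys <- function(path) {
  keys <- read.csv(path, stringsAsFactors = FALSE)
  list(key_id = keys[[1]][1], secret = keys[[2]][1])
}

# Expose the credentials through the standard AWS environment variables.
# The default region is only an example.
set_aws_credentials <- function(key_id, secret, region = "eu-west-1") {
  Sys.setenv(
    AWS_ACCESS_KEY_ID     = key_id,
    AWS_SECRET_ACCESS_KEY = secret,
    AWS_DEFAULT_REGION    = region
  )
  invisible(TRUE)
}
```

If the key file lives in the project folder, it is worth adding it to `.gitignore` so it never ends up in a public repository.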

Sentiment analysis

Once Amazon Comprehend is set up, we can use it to get the sentiment of every tweet that we collected with the Twitter API. To do this I used the following code:
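One way to score the tweets is sketched below. It assumes that `aws.comprehend::detect_sentiment()` returns a data frame with a `Sentiment` column; the wrapper `score_tweets()` is a hypothetical helper. The `detect` argument exists only so that the loop can be tried with a stand-in function before spending money on real API calls.

```r
# Score each text and collect the results in one data frame.
# `detect` defaults to the real API call; a stub can be passed for dry runs.
score_tweets <- function(texts, detect = aws.comprehend::detect_sentiment) {
  # Skip missing or blank texts, which carry no sentiment to detect.
  texts <- texts[!is.na(texts) & nzchar(trimws(texts))]
  rows <- lapply(texts, function(txt) {
    res <- detect(txt, language = "en")
    data.frame(
      text      = txt,
      sentiment = as.character(res$Sentiment[1]),  # assumed column name
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}
```

With credentials in place, `score_tweets(tweets$clean_text)` would issue one request per tweet, so the cost grows with the number of tweets.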

The output data frame of this code contained about 1600 tweets with their sentiment (Positive, Negative, Neutral, Mixed). I hope you are still reading, because now comes the fun part.

In the tweet-download code earlier in the article, I used rtweet’s built-in text cleaner function to get the plain text out of the tweets. (We need this because URLs, for example, carry no sentiment, so we have to get rid of them.) I also tried to clean the data myself, but the following charts show that the two approaches give similar results on average, so I used the built-in cleaner for the rest of the analysis. (The manually cleaned data has fewer Neutral tweets, which might mean that it is better, but the difference is not significant.)

Race day sentiment distribution of manually cleaned tweets
Race day sentiment distribution of tweets cleaned with the built-in rtweet function
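The numbers behind two such charts can be computed with a few lines of base R. This is a sketch under the assumption that each cleaning method produced its own vector of sentiment labels; `sentiment_share()` and `compare_cleaners()` are hypothetical helper names.

```r
# Percentage of tweets per sentiment class, always in the same class order.
sentiment_share <- function(sentiment) {
  classes <- c("POSITIVE", "NEGATIVE", "NEUTRAL", "MIXED")
  counts <- table(factor(toupper(sentiment), levels = classes))
  round(100 * prop.table(counts), 1)
}

# Put the two cleaning methods side by side, one row per method.
compare_cleaners <- function(manual_sentiment, builtin_sentiment) {
  rbind(
    manual  = sentiment_share(manual_sentiment),
    builtin = sentiment_share(builtin_sentiment)
  )
}
```

A smaller Neutral share in the `manual` row than in the `builtin` row would correspond to the difference described above.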

Both charts show that the majority of the tweets were positive or neutral, but there is a sizeable share of negative tweets as well. I think this is not the result of fans hating certain drivers, but of something else, which is the core of this article.

I believe it is worth watching the video above. For me, as an engineer, it is fascinating that a human being could survive this accident (55G, 20 seconds in the fire), which is the result of a tremendous amount of safety development. If I tell you that the accident happened on the first lap of the race, anyone can tell from the following graph when the race started.

It is easy to see a huge spike in the number of tweets around the accident. After the crash, the race was suspended for about an hour. Furthermore, we can see that afterwards the majority of tweets are positive (except for another small red spike at 5 pm, which was another accident in the race).

Last but not least, I visualized the hourly sentiment distribution of tweets containing “BahreinGP” on the race day. From the following bar chart, I would highlight that the share of neutral tweets is decreasing (is everyone starting to form an opinion?) while the share of positive tweets is increasing (their hero was posting videos from the hospital saying that he was fine).
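The counts behind such a timeline can be sketched in base R as follows, assuming a `created_at` vector of POSIXct timestamps and a matching vector of sentiment labels. The helper names are hypothetical, bins are computed in UTC, and only the time of day is kept in the labels, which is enough for single-day data.

```r
# Count tweets per time bin and sentiment; `minutes` sets the bin width.
count_by_bin <- function(created_at, sentiment, minutes = 60) {
  secs <- minutes * 60
  # Round each timestamp down to the start of its bin.
  start <- as.POSIXct(floor(as.numeric(created_at) / secs) * secs,
                      origin = "1970-01-01", tz = "UTC")
  table(bin = format(start, "%H:%M", tz = "UTC"), sentiment = sentiment)
}

# Label of the bin with the most tweets overall.
busiest_bin <- function(counts) {
  totals <- rowSums(counts)
  names(totals)[which.max(totals)]
}
```

Passing a smaller `minutes` value, for example 10, gives a finer view of the minutes around the crash.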
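The hourly shares shown in a chart like this can be derived with `prop.table()`. A minimal sketch, again assuming POSIXct timestamps in `created_at` and one sentiment label per tweet; `hourly_share()` is a hypothetical helper and hours are taken in UTC.

```r
# Percentage of each sentiment within every hour (each row sums to about 100).
hourly_share <- function(created_at, sentiment) {
  hour <- format(created_at, "%H", tz = "UTC")
  counts <- table(hour = hour, sentiment = sentiment)
  round(100 * prop.table(counts, margin = 1), 1)
}
```

Normalizing within each hour is what makes hours with very different tweet volumes comparable in one chart.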

Summary

To summarize, Amazon Comprehend is a powerful and easy-to-use tool for running sentiment analysis without any background knowledge in the field. I should highlight that it is not free; you can see the pricing at this link.

This project could be developed further by comparing this race day with others; however, that is not easy, as the Twitter API only allows searching tweets from the previous 7 days. Additionally, it might be worth checking smaller time intervals (5–30 minutes) and exploring the effect of emojis on the sentiment analysis.
