Setting new goals…

Oluwafunmilayo C. Sofuwa
6 min read · Apr 27, 2021

An NLP analysis of 2015 New Year's Eve resolution tweets downloaded from the Maven Analytics Digital Playground

Image from a Google photo search

Social media has become an important platform and tool. Individuals and businesses use platforms such as Twitter and Instagram for a variety of purposes: businesses advertise their services and products, gather feedback and reviews, and draw insights about things like peak sales periods and customer segmentation.

It has become something of a norm for individuals to set goals, make plans and write resolutions about things they want to change, do better at or achieve as a new year is about to unfold. This was my first time using NLP to analyze data. My dataset, downloaded from the Maven Analytics Digital Playground, consists of New Year's Eve resolution tweets made by individuals heading into 2015.

DATA ANALYSIS PROCESS

DOWNLOADING THE DATA

The first step was to download the data from the Maven Analytics Digital Playground, which hosts really good datasets for practicing analysis and visualization. I chose the 2015 New Year's Eve resolutions data because I wanted to practice analyzing text data with NLP, and its volume was a manageable size to start with.

DATA CLEANING

The next step was, of course, to clean the data. I dropped columns that would not be useful for my analysis and visualization, replaced the null values in the retweet column with 0, and cleaned the location column using regex and join.
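
The post doesn't include the cleaning code, so here is a minimal pandas sketch of those three steps. The file name and column names (retweet_count, tweet_location, and the dropped columns) are my assumptions, not the dataset's confirmed schema.

```python
import re

import pandas as pd

df = pd.read_csv("new_years_resolutions_2015.csv")  # hypothetical file name

# Drop columns that won't be used downstream (hypothetical names).
df = df.drop(columns=["tweet_id", "tweet_coord"], errors="ignore")

# Null retweet counts become 0.
df["retweet_count"] = df["retweet_count"].fillna(0).astype(int)

def clean_location(loc):
    """Keep only letters, commas and spaces, then collapse whitespace with join."""
    if pd.isna(loc):
        return ""
    loc = re.sub(r"[^A-Za-z,\s]", "", loc)
    return " ".join(loc.split())

df["tweet_location"] = df["tweet_location"].apply(clean_location)
```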

DATA MANIPULATION

The next step was to manipulate the data to extract and create new features from the date, tweet and location columns.

From the date column, I extracted and created two new columns: hour and time of day. After extracting the hour from the date column, I used it to group the hours into the different times of the day (midnight, early morning, morning, afternoon, evening and night).
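
A hedged sketch of that extraction, assuming the timestamp lives in a tweet_created column; the exact bin boundaries below are my guesses, since the post doesn't state them:

```python
import pandas as pd

# 'df' is the cleaned dataset; 'tweet_created' is an assumed column name.
df["tweet_created"] = pd.to_datetime(df["tweet_created"])
df["hour"] = df["tweet_created"].dt.hour

def time_of_day(hour):
    # Cutoffs are assumptions; the post only lists the category names.
    if hour < 1:
        return "midnight"
    elif hour < 6:
        return "early morning"
    elif hour < 12:
        return "morning"
    elif hour < 17:
        return "afternoon"
    elif hour < 21:
        return "evening"
    return "night"

df["time_of_day"] = df["hour"].apply(time_of_day)
```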

In manipulating the tweets column, I wanted to extract the hashtags to see whether any unique tags stood out. Tags such as #newyearnewme and #newyearresolution are commonly used, and that was reflected in the data.
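
Extracting hashtags is essentially a one-line regex; treating 'text' as the tweet column is an assumption:

```python
from collections import Counter

# Pull every hashtag out of each tweet.
df["hashtags"] = df["text"].str.findall(r"#\w+")

# Flatten and count to see which tags dominate.
tag_counts = Counter(tag.lower() for tags in df["hashtags"] for tag in tags)
print(tag_counts.most_common(10))
```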

To manipulate the location data, I employed geocoding with the HERE developer geocoding API. All locations in the data were based in the USA. The dataset had over 4,000 rows, and the geocoding took over an hour to run. I geocoded each location to extract the longitude, latitude, state code and county name. Some lookups returned more than one location; after examining some of those rows, I noticed the actual location was usually listed last, so that is the one I extracted. The data already included a state code, so I compared it with the geocoded state code and kept the geocoded values only when the two matched.

[Screenshot: location geocoding code]
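
Since the screenshot isn't reproduced here, below is a minimal sketch of what such a lookup could look like against HERE's v1 geocode endpoint, with every result checkpointed to a JSON cache so an interrupted run can resume (see lesson 1 below). The caching scheme and field choices are my assumptions, not the post's actual code.

```python
import json
import os

import requests

API_KEY = os.environ["HERE_API_KEY"]   # assumes the key is set in the environment
CACHE_FILE = "geocode_cache.json"

# Resume from previously saved results if the script was interrupted.
cache = {}
if os.path.exists(CACHE_FILE):
    with open(CACHE_FILE) as f:
        cache = json.load(f)

def geocode(location):
    """Look up lat/lng, state code and county for a free-text US location."""
    if location in cache:
        return cache[location]
    resp = requests.get(
        "https://geocode.search.hereapi.com/v1/geocode",
        params={"q": f"{location}, USA", "apiKey": API_KEY},
        timeout=10,
    )
    items = resp.json().get("items", [])
    if not items:
        result = None
    else:
        item = items[-1]  # the actual location tended to be listed last
        result = {
            "lat": item["position"]["lat"],
            "lng": item["position"]["lng"],
            "state_code": item["address"].get("stateCode"),
            "county": item["address"].get("county"),
        }
    cache[location] = result
    with open(CACHE_FILE, "w") as f:  # checkpoint after every lookup
        json.dump(cache, f)
    return result
```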

TWEETS PROCESSING

After the long process of manipulating the data, I had to process the tweets. Tweets contain emojis, punctuation, single letters, numbers, hashtags and sometimes just weird Unicode characters, and I had to clean all of these out first. Once that was done, I applied lemmatization to reduce each word to its root form in the context in which it was used; for example, 'go', 'goes', 'going' and 'went' all share the lemma 'go'. I also filtered out words with a count of 3 or below. The last step was part-of-speech (POS) tagging, which I used to get an idea of the specific words that shaped people's resolutions. I extracted only the words that appeared in the context of a noun, since they gave me more context than other parts of speech. For example, in a tweet like 'I want to get pregnant next year', 'get pregnant' reads as a noun phrase.
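
The post doesn't name a library, but this pipeline maps naturally onto NLTK. Here is a hedged sketch; the cleaning rules and thresholds are approximations of what the post describes, and 'text' is an assumed column name:

```python
import re
from collections import Counter

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

def clean_tweet(text):
    text = re.sub(r"http\S+|#\w+|@\w+", " ", text)   # links, hashtags, mentions
    text = re.sub(r"[^A-Za-z\s]", " ", text)         # emojis, digits, punctuation
    return " ".join(w for w in text.lower().split() if len(w) > 1)

def to_wordnet_pos(tag):
    # Map Penn Treebank tags onto WordNet POS classes for the lemmatizer.
    return {"J": wordnet.ADJ, "V": wordnet.VERB, "R": wordnet.ADV}.get(tag[0], wordnet.NOUN)

def noun_lemmas(text):
    tokens = nltk.word_tokenize(clean_tweet(text))
    tagged = nltk.pos_tag(tokens)
    lemmas = [lemmatizer.lemmatize(w, to_wordnet_pos(t)) for w, t in tagged]
    # Keep only tokens tagged as nouns (NN, NNS, NNP, NNPS).
    return [l for (w, t), l in zip(tagged, lemmas) if t.startswith("NN")]

noun_counts = Counter(n for tweet in df["text"] for n in noun_lemmas(tweet))
noun_counts = Counter({w: c for w, c in noun_counts.items() if c > 3})  # drop rare words
```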

DATA EXPLORATION

After the major heavy lifting had been done, I explored the data. I did a couple of visualizations in Python, but the major visualization was done in Tableau. I created a word cloud to visualize the words used in the context of a noun.
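
For example, with the noun counts from the sketch above, the wordcloud package can render the cloud directly from frequencies:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# 'noun_counts' is the Counter of noun lemmas built earlier.
wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(noun_counts)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```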

TABLEAU DASHBOARD VISUALIZATION

Finally, the last step!!!

To visualize the data in Tableau, I made use of Excel, which was where I stored the final output of my analysis. I also made use of SQL, which I used to query the most retweeted tweet per category. I tried several Tableau formulas for this, but they didn't quite come out right, probably because some categories had more than one tweet with the same maximum number of retweets, and I only wanted categories where a single tweet had the most retweets.

[Screenshot: SQL query]
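
In place of the screenshot, here is a hedged reconstruction of a query with the behavior described above: find each category's maximum retweet count, then keep only categories where exactly one tweet reaches it. Table and column names are assumptions, shown running through sqlite3 and pandas:

```python
import sqlite3

import pandas as pd

# 'df' is the final dataset; column names here are assumptions.
conn = sqlite3.connect(":memory:")
df[["resolution_category", "text", "retweet_count"]].to_sql("tweets", conn, index=False)

query = """
WITH maxes AS (
    SELECT resolution_category AS category, MAX(retweet_count) AS max_rt
    FROM tweets
    GROUP BY resolution_category
),
top AS (
    SELECT t.resolution_category AS category, t.text, t.retweet_count
    FROM tweets t
    JOIN maxes m
      ON t.resolution_category = m.category
     AND t.retweet_count = m.max_rt
)
SELECT * FROM top
WHERE category IN (
    -- keep only categories with a single top tweet
    SELECT category FROM top GROUP BY category HAVING COUNT(*) = 1
)
"""
most_retweeted = pd.read_sql_query(query, conn)
```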

INSIGHTS AND RECOMMENDATIONS

The overall most retweeted tweet came from the finance category, while the personal growth category had the highest number of tweets created. Companies that focus on finance and personal growth (e.g. online courses) can leverage this to better understand which customer segment is tweeting these resolutions. With that, targeted ads or services can be directed at that particular segment, and potential customers can be turned into actual customers.

New York had the highest number of tweets created.

The time of day with the highest tweet activity was the morning, specifically around 9 am. This is not unusual, as most individuals are awake at this time. However, we cannot draw conclusions from this assumption alone, since other details come into play, such as the nature of people's jobs, their age range and so on. Such details would help in making more informed decisions about marketing, advertising and which customers to target.

LESSONS LEARNT

This was my first time analyzing text data with NLP, and I definitely learnt some things:

1. Try as much as possible to save your data after geocoding your locations, preferably in JSON format. That way, if your connection breaks or you need to restart your laptop, you can pick up from the saved data rather than geocoding the locations all over again (the geocoding sketch above checkpoints to a JSON cache for exactly this reason). I learnt this the very hard way.

2. When creating bins, ensure that every value being binned gets a corresponding bin value. I didn't realize until I was visualizing the data in Tableau that hour 0, which was meant to be grouped as 'morning', was showing null instead. Thankfully I was able to rectify this in Tableau by grouping the null values into the morning category. (See the sketch after this list for a likely cause.)

3. Employ other means of extracting data, not just Tableau formulas: for example, SQL, Excel or Google Sheets.
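
On lesson 2: if the bins were built with pandas' pd.cut (an assumption; the post doesn't show that code), the likely culprit is that cut's intervals are open on the left by default, so the lowest value falls outside every bin and comes out as null. Passing include_lowest=True closes the first interval:

```python
import pandas as pd

hours = pd.Series([0, 5, 9, 14, 20, 23])
labels = ["early", "morning", "afternoon", "night"]

# Default: intervals are (0, 6], (6, 12], ... so hour 0 falls through as NaN.
bad = pd.cut(hours, bins=[0, 6, 12, 18, 24], labels=labels)
print(bad.isna().sum())  # 1

# Fix: close the first interval on the left so hour 0 is included.
good = pd.cut(hours, bins=[0, 6, 12, 18, 24], labels=labels, include_lowest=True)
print(good.isna().sum())  # 0
```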

I hope you enjoyed going through this post with me.

To view my notebook, click here

For a visual perspective, click here

Connect with me on LinkedIn
