Since the textbook was written, Twitter has changed its approach to data mining. Beginning in 2013, Twitter made the API more difficult to access: OAuth authentication is now required for almost everything. This means you need to go to Twitter and create an ‘app.’ You won’t actually use the app for anything; you just need the keys and authentication tokens it generates. You can create your app here. For more detailed instructions on creating the app, take a look at this presentation.
Register Your App
In order to have access to Twitter data programmatically, we need to create an app that interacts with the Twitter API.
The first step is registering your app. Point your browser to http://apps.twitter.com, log in to Twitter (if you’re not already logged in), and register a new application. You can now choose a name and a description for your app (for example, “Mining Demo” or similar). You will receive a consumer key and a consumer secret: these are application settings that should always be kept private. From the configuration page of your app, you can also request an access token and an access token secret. Like the consumer keys, these strings must be kept private: they give the application access to Twitter on behalf of your account. The default permissions are read-only, which is all we need in our case, but if you later change your permissions so your app can write, you must negotiate a new access token.
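The supplied script expects exactly these four strings. A minimal pre-flight sketch (the variable and function names are mine, not part of the supplied script) that confirms none of them were left blank before you start streaming:

```python
# Paste in the values from your app's configuration page on apps.twitter.com.
CREDENTIALS = {
    "consumer_key": "",         # Consumer Key (API Key)
    "consumer_secret": "",      # Consumer Secret (API Secret)
    "access_token": "",         # Access Token
    "access_token_secret": "",  # Access Token Secret
}

def missing_credentials(creds):
    """Return the names of any required credentials that are still blank."""
    required = ("consumer_key", "consumer_secret",
                "access_token", "access_token_secret")
    return [name for name in required if not creds.get(name, "").strip()]

if __name__ == "__main__":
    missing = missing_credentials(CREDENTIALS)
    if missing:
        print("Fill these in before running the lab script:", ", ".join(missing))
    else:
        print("All four credentials are present.")
```

A check like this saves a confusing authentication error later: a blank or whitespace-only key simply shows up as a 401 from Twitter.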
I have already done these steps for you. The source code is here in Figure 1:
Figure 1 – Twitter Source Code
You can also download the source code HERE: mis5208-twitter.zip
If you don’t want to run the Python script, here is test data for you to use with the lab.
The supplied script downloads tweets that reference the four top candidates in the US presidential elections. Let the script run for 5 to 10 minutes and monitor it while it runs; this will generate a significant amount of data, so be sure you have about 100 MB of free space on your machine. Then import the data into Splunk and perform the following analysis.
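Before loading the file into Splunk, a quick offline sanity check can confirm the stream captured what you expect. A minimal sketch, assuming the stream wrote one JSON object per line and using `tweets.json` as a stand-in filename (substitute whatever your script, or the supplied test data, actually uses):

```python
import json
from collections import Counter

CANDIDATES = ("clinton", "cruz", "sanders", "trump")

def count_mentions(lines, candidates=CANDIDATES):
    """Count how many tweets mention each candidate (case-insensitive)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except ValueError:
            continue  # skip truncated lines from an interrupted stream
        # Delete events carry no "text" field, so they are skipped here.
        text = tweet.get("text", "").lower()
        for name in candidates:
            if name in text:
                counts[name] += 1
    return counts

if __name__ == "__main__":
    with open("tweets.json") as f:
        for name, n in count_mentions(f).most_common():
            print(name, n)
```

If the counts come back as all zeros, check that the script actually ran long enough and that the file contains tweet objects rather than only delete events.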
- Load the results from the script into Splunk. The default settings should be sufficient.
- Count the number of delete events within the file we collected:
- * | stats count(delete.status.id)
- Count and visualize with a line graph the number of tweets by hour:
- host = "*" | stats count by date_hour
- Count and visualize with a horizontal bar graph the number of tweets by day of the week:
- host = "*" | stats count by date_wday
- Obtain a list of the time zone names by UTC offset. Also count the results.
- host = "*" | stats count(user.time_zone) by user.utc_offset
- Count the number of tweets posted in reply to comments made by TV personality Sean Hannity.
- host="*" in_reply_to_screen_name=seanhannity
- For our next search, we want to find the most popular words in the tweets. We will use a couple of new commands to achieve this. The main idea is to break the text field into a multivalued field in which each value is an individual word; here we consider a word to be any run of characters delimited by spaces.
- Count the top words in the first 5000 tweets
- host = “*” | head 5000 | makemv text | mvexpand text | top text
- Do the same for the first 50000 tweets
- NOTE: If you have gone past the 30-day trial of Splunk Enterprise, you may exceed the 500 MB/day indexing limit of the free Splunk license.
- Combine one or more of these queries to see how each of the candidates is doing:
- Count Tweets for each of the public figures: Clinton, Cruz, Sanders, Trump
- SUBMIT YOUR RESULTS.
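The makemv word count above can also be cross-checked offline. A sketch under the same "words are space-delimited" assumption, roughly mimicking `head 5000 | makemv text | mvexpand text | top text` (the function name and the `tweets.json` filename are mine):

```python
import json
from collections import Counter

def top_words(lines, limit=5000, n=10):
    """Split each tweet's text on spaces and report the most common words."""
    counts = Counter()
    for i, line in enumerate(lines):
        if i >= limit:  # rough stand-in for `head 5000` (counts all events)
            break
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except ValueError:
            continue    # skip truncated lines
        if "text" not in tweet:
            continue    # delete events carry no text field
        # makemv's default delimiter is a single space, so split(" ") only
        counts.update(word for word in tweet["text"].split(" ") if word)
    return counts.most_common(n)

if __name__ == "__main__":
    with open("tweets.json") as f:
        for word, count in top_words(f):
            print(count, word)
```

Expect stop words ("the", "RT", "to") to dominate in both Splunk and this script; if your two top-ten lists disagree wildly, something went wrong with the import.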
Zadrozny, Peter; Kodali, Raghu (2013-05-21). Big Data Analytics Using Splunk: Deriving Operational Intelligence from Social Media, Machine Data, Existing Data Warehouses, and Other Real-Time Streaming Sources (Expert’s Voice in Big Data) (p. 215). Apress. Kindle Edition.