Analyzing Twitter Data - Twitter sentiment analysis using Spark streaming
Analyzing Twitter Data - Twitter sentiment analysis using Spark streaming.
Twitter Spark Streaming - We will be analyzing Twitter data and we will be doing Twitter sentiment analysis using Spark streaming. You can do this in any programming language Python, Scala, Java or R.
Main menu: Spark Scala Tutorial
Spark streaming is very useful in analyzing real time data from IoT technologies which could be your smart watch, Google Home, Amazon Alexa, Fitbit, GPS, home security system, smart cameras or any other device which communicates with internet. Social accounts like Facebook, Twitter, Instagram etc generate enormous amount of data every minute.
Below trend shows interest over time for three of these smart technologies over past 5 years.
In this example we are going to stream Twitter API tweets in real time with OAuth authentication and filter the hashtags which are most famous among them.
Prerequisite
Authentication file setup
Create a text file twitter.txt with Twitter OAuth details and place it anywhere on your local directory system (remember the path).
File content should look like this.
Basically it has two fields separated with single space - first field contains OAuth headers name and second column contains api keys.
Make sure there is no extra space anywhere in your file else you will get authentication errors.
There is a new line character at the end (i.e. hit enter after 4th line). You can see empty 5th line in below screenshot.
Write the code!
Now create a Scala project in Eclipse IDE (see how to create Scala project), refer the following code that prints out live tweets as they stream using Spark Streaming. I have written separate blog to explain what are basic terminologies used in Spark like RDD, SparkContext, SQLContext, various transformations and actions etc. You can go through this for basic understanding.
However, I have explained little bit in comments above each line of code what it actually does. For list of spark functions you can refer this.
// Our package
package com.dataneb.spark
// Twitter libraries used to run spark streaming
import twitter4j._
import twitter4j.auth.Authorization
import twitter4j.auth.OAuthAuthorization
import twitter4j.conf.ConfigurationBuilder
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming.StreamingContext._
import scala.io.Source
import org.apache.spark.internal.Logging
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream._
import org.apache.spark.streaming.receiver.Receiver
/** Listens to a stream of Tweets and keeps track of the most popular
* hashtags over a 5 minute window.
*/
object PopularHashtags {
/** Makes sure only ERROR messages get logged to avoid log spam. */
def setupLogging() = {
import org.apache.log4j.{Level, Logger}
val rootLogger = Logger.getRootLogger()
rootLogger.setLevel(Level.ERROR)
}
/** Configures Twitter service credentials using twiter.txt in the main workspace directory. Use the path where you saved the authentication file */
def setupTwitter() = {
import scala.io.Source
for (line <- Source.fromFile("/Volumes/twitter.txt").getLines) {
val fields = line.split(" ")
if (fields.length == 2) {
System.setProperty("twitter4j.oauth." + fields(0), fields(1))
}
}
}
// Main function where the action happens
def main(args: Array[String]) {
// Configure Twitter credentials using twitter.txt
setupTwitter()
// Set up a Spark streaming context named "PopularHashtags" that runs locally using
// all CPU cores and one-second batches of data
val ssc = new StreamingContext("local[*]", "PopularHashtags", Seconds(1))
// Get rid of log spam (should be called after the context is set up)
setupLogging()
// Create a DStream from Twitter using our streaming context
val tweets = TwitterUtils.createStream(ssc, None)
// Now extract the text of each status update into DStreams using map()
val statuses = tweets.map(status => status.getText())
// Blow out each word into a new DStream
val tweetwords = statuses.flatMap(tweetText => tweetText.split(" "))
// Now eliminate anything that's not a hashtag
val hashtags = tweetwords.filter(word => word.startsWith("#"))
// Map each hashtag to a key/value pair of (hashtag, 1) so we can count them by adding up the values
val hashtagKeyValues = hashtags.map(hashtag => (hashtag, 1))
// Now count them up over a 5 minute window sliding every one second
val hashtagCounts = hashtagKeyValues.reduceByKeyAndWindow( (x,y) => x + y, (x,y) => x - y, Seconds(300), Seconds(1))
// Sort the results by the count values
val sortedResults = hashtagCounts.transform(rdd => rdd.sortBy(x => x._2, false))
// Print the top 10
sortedResults.print
// Set a checkpoint directory, and kick it all off
// I can watch this all day!
ssc.checkpoint("/Volumes/Macintosh HD/Users/Rajput/Documents/checkpoint/")
ssc.start()
ssc.awaitTermination()
}
}
Run it! See the result!
I actually went to my Twitter account to see top result tweet #MTVHottest and it was trending, see the snapshot below.
You might face error if
You are not using proper version of Scala and Spark Twitter libraries. I have listed them below for your reference. You can download these libraries from these two links Link1 & Link2.
Scala version 2.11.11
Spark version 2.3.2
twitter4j-core-4.0.6.jar
twitter4j-stream-4.0.6.jar
spark-streaming-twitter_2.11-2.1.1.jar
Thank you folks. I hope you are enjoying these blogs, if you have any doubt please mention in comments section below.
Navigation menu
1. Apache Spark and Scala Installation
2. Getting Familiar with Scala IDE
3. Spark data structure basics
4. Spark Shell
5. Reading data files in Spark
6. Writing data files in Spark
7. Spark streaming
Comments