Mapping Sentiment with IKANOW and R
With the right methods, mapping sentiment can be a powerful activity for any organization. That’s why we want to show you how to visualize this data with the IKANOW open analytics platform and the popular open source language called R.
In a previous post, our director of product engineering, Alex Piggott, discussed how to use IKANOW’s REST API to support other business analytics and visualization tools. This post continues that theme using the same example problem, aggregating “happiness” from tweets by region, to show how you can use the open source statistical programming language R as a way to visualize data enriched using the IKANOW API. The example was originally described in this Meetup presentation and has proven to be a fairly popular example of the kinds of things you can do with enriched unstructured data.
Mapping Sentiment by Region
There are several potential scenarios where mapping sentiment can be useful. A few of these scenarios are:
- Political candidates wanting to know how constituents feel about particular issues across the regions these candidates represent.
- Businesses wanting to track local/regional attitudes towards their products or towards the brand itself.
- Diplomats wanting to gauge how foreign populations feel about an issue or country when traditional polling methods are not available.
Outside of traditional opinion polls, however, it is difficult to get the kind of sentiment data necessary to perform the analysis described above. Websites and social media provide plenty of raw material to process, but finding the right material in such a large data set and processing it properly is not always an easy task. If you’re an IKANOW enterprise customer or using the open source version, Infinit.e, you can perform all the necessary tasks to map sentiment entirely within the IKANOW GUI. The interface allows users to explore their data using a series of pre-built widgets. Altering these widgets to visualize your data in non-standard ways isn’t difficult, but if you do not have access to the GUI (e.g. you only use our API) or want to perform exploratory analysis on new data sets where you don’t yet know what visualizations will be useful, there are other options.
In particular, R has a flexible set of statistical methods and plotting tools that support exploratory analysis of data sets. Users can quickly use tools like RStudio to install additional R packages to segment, fit, and plot data in a variety of ways. R is also a very popular programming language among statisticians and data scientists, so some users may simply be more comfortable working in it. The IKANOW API lets these users combine the best of both worlds by exporting data from the IKANOW REST API and using R for its wide range of statistical methods and graphical plotting functions.
For this example, we used the IKANOW platform to ingest a random sample of tweets using a non-specific query direct from the Twitter API. We enriched this data to extract basic metadata (e.g. accounts, hashtags, and retweets) and keywords with sentiment using AlchemyAPI. These keywords with sentiment were aggregated on the basis of geographic region and turned into a JSON output, which we will access, visualize, and manipulate using R and several add-on R packages.
Twitter Limitations
Only about 5-10% of tweets are directly tagged with precise geolocation, either from the sending device or from accurate profile information. Users can also provide text descriptions of their locations, but there is no way to confirm the veracity of these values and they often do not provide more than a country or city. We take these nonspecific locations and run them through our built-in geo locator to add additional geographic information where possible, but even this improved set may still not be representative of the complete set of tweets matching various terms or keywords. Care should be taken during analysis to account for this discrepancy.
Additionally, automated measurement of sentiment is not a perfect science. Twitter’s short 140-character format gives natural language processing tools little context to work with around each keyword, and non-standard language (often used to game the character limit) further complicates the process. This means that sentiment measurements against individual tweets may have significant error. Taken in aggregate, these errors may balance out, but the measurements presented are rough estimates at best and should be treated as such in any decision making process that incorporates this kind of Twitter analysis.
Using IKANOW to enrich and aggregate data
In our previous post, Alex wrote a Hadoop map/reduce function to take a non-specific subset of Twitter data and aggregate the sentiment in each tweet across geographic location. We got this data from the Twitter API, but it can also come from another data provider such as Gnip or DataSift. There are several reasons to do this processing. Some of these reasons are described below:
- The query engine returns various simple statistics against documents, “entities” (people, places, companies, technologies, etc), and associations. The fact that the statistics are calculated across all documents matching a query, combined with the power of the query engine (particularly Lucene-powered full text searches) makes it possible to accurately pinpoint document sets.
- For more complex and/or custom analytics, IKANOW provides a plugin interface for custom analytic modules using Hadoop/MapReduce. Again, the query set on which the analytics run can be specified. Since writing Java is slightly hard work for simpler calculations, we also provide a JavaScript scripting engine.
- Finally, the REST API can be used to export JSON or XML that can be imported into other tools, including R, as explained in the next section.
The following code took the sentiment values for keywords we derived from our Twitter sample and aggregated them across a geo grid:
// Map: bin each geotagged tweet into a 5-degree grid cell and emit one
// record per keyword entity that carries a sentiment score.
function map(key, val) {
    var label_lat = Math.round(val.docGeo.lat / 5) * 5;
    var label_lon = Math.round(val.docGeo.lon / 5) * 5;
    var label = label_lat.toString() + ':' + label_lon.toString();
    for (var ent_i in val.entities) {
        var ent = val.entities[ent_i];
        if ((null != ent.sentiment) && (ent.type == "Keyword")) {
            emit({ label: label, label_lat: label_lat, label_lon: label_lon },
                 { sentiment: ent.sentiment, count: 1 });
        }
    }
}
// Reduce: sum the sentiment scores and the counts for each grid cell.
function reduce(key, vals) {
    var retval = { sentiment: 0.0, count: 0 };
    for (var x in vals) {
        retval.sentiment += parseFloat(vals[x].sentiment);
        retval.count += parseInt(vals[x].count, 10);
    }
    emit(key, retval);
}
// The reducer only sums, so it can also serve as the combiner.
combine = reduce;
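To sanity-check the binning logic without running a Hadoop job, the same map and reduce steps can be approximated locally. The following is a minimal sketch in R (to match the rest of this post), not the code IKANOW runs: the helper names snap_to_grid and grid_label and the three sample rows are made up for illustration. It uses floor(x + 0.5) rather than R's round(), because R rounds exact halves to the even neighbor whereas JavaScript's Math.round() rounds them up.

```r
# Snap a coordinate to the nearest multiple of `cell` degrees.
# floor(x + 0.5) mirrors JavaScript's Math.round().
snap_to_grid <- function(x, cell = 5) {
  floor(x / cell + 0.5) * cell
}

# Build the "lat:lon" label that the map step uses as its key.
grid_label <- function(lat, lon, cell = 5) {
  paste(snap_to_grid(lat, cell), snap_to_grid(lon, cell), sep = ":")
}

# Hypothetical sample: one row per keyword sentiment score.
tweets <- data.frame(
  lat       = c(38.9, 40.7, 53.3),
  lon       = c(-77.0, -74.0, -8.2),
  sentiment = c(0.4, -0.1, 0.2)
)

# "Map": assign each row to a grid cell with a count of 1.
tweets$label <- grid_label(tweets$lat, tweets$lon)
tweets$count <- 1

# "Reduce": sum sentiment and count within each cell.
cells <- aggregate(cbind(sentiment, count) ~ label, data = tweets, FUN = sum)
print(cells)
```

The first two sample rows fall in the same 5-degree cell, so they collapse into a single row labeled 40:-75 with a count of 2.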
The above code was loaded into IKANOW using the plugin manager (see screenshot below), together with a MongoDB query that selects tweets with associated geotags. We left the MongoDB query generic for this example; in a production environment, this query could and likely would be targeted based on analytic needs.
Getting a Key for Non-API Users
If you’re not using the developer API (e.g. you are an Enterprise customer on your own IKANOW instance or a user hosting the Community Edition on Amazon EC2), you can set an API key manually using the Manager in the IKANOW GUI. Go to Manager -> People -> Your Profile, enter an API key in the field, and save the value.
The resulting table could then be accessed as JSON by name (JSONView is a useful utility for rendering JSON in Chrome/Firefox):
ROOT_URL/api/custom/mapreduce/getresults/twitterSentiment_geo?infinite_api_key=API_KEY
{
    response: {
        action: "Custom Map Reduce Job Results",
        success: true,
        message: "Map reduce job completed at: Mon Dec 24 14:51:02 EST 2012",
        time: 42
    },
    data: {
        lastCompletionTime: "Dec 24, 2012 2:51:02 PM",
        results: [
            //...
            {
                _id: "50d8b226e4b08323c79d7169",
                sentiment: "0.119314",
                count: "5",
                key: {
                    label_lon: "-10",
                    label: "55:-10",
                    label_lat: "55"
                }
            },
            {
                _id: "50d8b226e4b08323c79d716a",
                sentiment: "-1.1041417",
                count: "30",
                key: {
                    label_lon: "-100",
                    label: "20:-100",
                    label_lat: "20"
                }
            },
            //...
        ]
    }
}
Using R to visualize and manipulate aggregations
I prefer to use RStudio when working with R code. It is not a requirement, but I recommend it over the basic console that ships with R from the CRAN website. Whichever you choose, you’ll need a few user-contributed packages to perform the tasks described below. The three R packages used in this example are:
- rjson. This package includes most of the methods you will need to ingest JSON files.
- plyr. This package offers methods to split, process, and then recombine data, and is used in this example to parse IKANOW’s JSON format.
- googleVis. This example will use googleVis for displaying our data, but the “maps” package offers an alternative way to produce relatively simple plots of this data as well.
To load these packages, use the commands in the example below. If they have not yet been installed, be sure to run the command install.packages("PACKAGENAME") at the RStudio console prior to running the code.
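One way to avoid reinstalling packages that are already present is to check for them first. This is a small convenience sketch rather than part of the original workflow; the helper name missing_packages is made up here, and the install step is left commented out so the snippet has no side effects beyond loading namespaces that are already installed.

```r
# Packages used in this example.
required <- c("rjson", "plyr", "googleVis")

# Return the subset of `pkgs` that is not currently installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```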
Load Libraries
#include required libraries
library("rjson")
library("plyr")
library("googleVis")
Once the libraries are loaded, the code below lets you access the map/reduce output table via the REST API. Be sure to fill in your root URL and API key in the example below.
Access the API
#assign location of JSON file, be sure to insert your root URL and API key
jsonfile <- "http://ROOTURL/api/custom/mapreduce/getresults/twitterSentiment_geo?infinite_api_key=APIKEY"
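Hard-coding the key in a script makes it easy to leak. As an alternative sketch, the URL can be assembled from its parts, with the key read from an environment variable. The helper name build_results_url and the variable name IKANOW_API_KEY are assumptions for illustration and not part of the IKANOW API; only the URL path comes from the example above.

```r
# Assemble the custom map/reduce results URL from its parts.
build_results_url <- function(root_url, job_name, api_key) {
  paste0(
    sub("/+$", "", root_url),                      # drop any trailing slash
    "/api/custom/mapreduce/getresults/", job_name,
    "?infinite_api_key=", utils::URLencode(api_key, reserved = TRUE)
  )
}

# IKANOW_API_KEY is a hypothetical environment variable name;
# the placeholder "APIKEY" is used when it is not set.
api_key  <- Sys.getenv("IKANOW_API_KEY", unset = "APIKEY")
jsonfile <- build_results_url("http://ROOTURL", "twitterSentiment_geo", api_key)
print(jsonfile)
```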
Next, the fromJSON() and ldply() functions are used to get and convert our custom map/reduce output table into a data frame.
Convert to Data Frame
#ingest JSON data from JSON file
json_data <- fromJSON(paste(readLines(jsonfile), collapse=""))
#convert 'results' section of the Ikanow JSON into a data frame format
df <- ldply(json_data$data$results,data.frame)
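The conversion above assumes the request succeeded and that numeric fields arrive as numbers. A more defensive variant might check the response envelope first and coerce the columns explicitly. The sketch below uses base R only, with a hand-built list standing in for the output of fromJSON(); the helper name results_to_df is made up, and whether sentiment and count actually arrive as strings depends on your instance.

```r
# Stand-in for fromJSON() output, mirroring the JSON structure shown earlier.
json_data <- list(
  response = list(success = TRUE, message = "Map reduce job completed"),
  data = list(results = list(
    list(`_id` = "50d8b226e4b08323c79d7169", sentiment = "0.119314", count = "5",
         key = list(label_lon = "-10", label = "55:-10", label_lat = "55")),
    list(`_id` = "50d8b226e4b08323c79d716a", sentiment = "-1.1041417", count = "30",
         key = list(label_lon = "-100", label = "20:-100", label_lat = "20"))
  ))
)

results_to_df <- function(json_data) {
  # Stop early if the API reported a failure.
  if (!isTRUE(json_data$response$success)) {
    stop("IKANOW request failed: ", json_data$response$message)
  }
  # Flatten each result into a one-row data frame, then stack the rows.
  rows <- lapply(json_data$data$results, data.frame, stringsAsFactors = FALSE)
  df <- do.call(rbind, rows)
  # Coerce fields that should be numeric; the label stays as text.
  for (col in c("sentiment", "count", "key.label_lon", "key.label_lat")) {
    df[[col]] <- as.numeric(df[[col]])
  }
  df
}

df <- results_to_df(json_data)
str(df)
```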
With the data in this format, many common functions within R can be used to explore the data. Below, the head() and summary() functions, entered directly at the R console, give us, respectively, a view of the first few entries in the data set and basic summary information. Both of these views give a pretty good overview of the data in its new tabular format.
Summary Information
> head(df)
                      X_id sentiment count key.label_lon key.label key.label_lat
1 512bcfa6e4b0dec51124d431  2.043988   115           -75    40:-75            40
2 512bcfa6e4b0dec51124d432  3.463216    14           -75    45:-75            45
3 512bcfa6e4b0dec51124d433  0.000000     1           -75     5:-75             5
4 512bcfa6e4b0dec51124d434  0.000000     6           -80     0:-80             0
5 512bcfa6e4b0dec51124d435  3.404981    26           -80    25:-80            25
6 512bcfa6e4b0dec51124d436 -0.205775    17           -80    30:-80            30
> summary(df)
                       X_id       sentiment           count        key.label_lon       key.label   key.label_lat
 512bcfa6e4b0dec51124d431:  1   Min.   :-2.0154   Min.   :  1.0   Min.   :-160.000   40:-75 :  1   Min.   :-45.0
 512bcfa6e4b0dec51124d432:  1   1st Qu.: 0.0000   1st Qu.:  2.0   1st Qu.: -80.000   45:-75 :  1   1st Qu.:  5.0
 512bcfa6e4b0dec51124d433:  1   Median : 0.0000   Median :  4.0   Median :   0.000   5:-75  :  1   Median : 30.0
 512bcfa6e4b0dec51124d434:  1   Mean   : 0.2078   Mean   : 10.9   Mean   :  -3.648   0:-80  :  1   Mean   : 23.6
 512bcfa6e4b0dec51124d435:  1   3rd Qu.: 0.1955   3rd Qu.:  9.0   3rd Qu.:  60.000   25:-80 :  1   3rd Qu.: 45.0
 512bcfa6e4b0dec51124d436:  1   Max.   : 5.8862   Max.   :241.0   Max.   : 170.000   30:-80 :  1   Max.   : 65.0
 (Other)                 :190                                                        (Other):190
Each entry in the JSON was converted to an individual observation (row) in the data frame with each sub-object flattened out into multiple columns. For instance, the nested object “key”: {…} in the JSON output becomes three columns: key.label_lon, key.label, and key.label_lat. The tabular format plugs in easily to other visualization packages. For this example, we are going to use Google’s googleVis package, and the gvisGeoChart() method in particular (which this blog post was helpful in implementing).
Create the Geo Chart
#develop a Google Visualization Geo Chart object using the data frame
geo <- gvisGeoChart(df, locationvar = "key.label", colorvar = "sentiment", sizevar = "count",
                    options = list(backgroundColor = "lightblue", height = 500, width = 800,
                                   region = "world", displayMode = "markers",
                                   colorAxis = "{values:[null, 0, null],colors:['red','yellow','green']}"))
#plot the geo chart
plot(geo)
Running the above code produces the map shown above. Each marker is anchored on the geo grid coordinates produced during the map/reduce phase. The colors represent the aggregate sentiment originating from each region: negative values show red, neutral values show yellow, and positive values trend green. Each marker is sized by the count variable for its region, identifying areas where more data was available to make an aggregate judgment. Mousing over an individual marker pops up the specific data point, and a magnify function allows for differentiation in tight areas on the map. This map can then be embedded in a web page or other application, or even back into an IKANOW GUI widget if you so desire.
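Because the reduce step sums sentiment rather than averaging it, cells with many tweets can dominate the color scale simply by volume. One possible refinement, sketched here and not part of the original analysis, is to color by mean sentiment per keyword and to drop cells with very few observations. The sample rows are copied from the head(df) output earlier in this post; the column name mean_sentiment and the threshold min_count are assumptions chosen for illustration.

```r
# A few rows copied from the head(df) output shown earlier.
df <- data.frame(
  sentiment = c(2.043988, 3.463216, 0.000000, -0.205775),
  count     = c(115, 14, 1, 17),
  key.label = c("40:-75", "45:-75", "5:-75", "30:-80"),
  stringsAsFactors = FALSE
)

# Mean sentiment per keyword in each grid cell.
df$mean_sentiment <- df$sentiment / df$count

# Keep only cells with enough observations to be worth plotting.
min_count <- 5   # arbitrary threshold for illustration
df_plot <- df[df$count >= min_count, ]

# Most positive cells first.
df_plot <- df_plot[order(-df_plot$mean_sentiment), ]
print(df_plot)
```

Passing df_plot with colorvar = "mean_sentiment" to gvisGeoChart() would then color markers by average rather than total sentiment.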
The complete R code used to ingest, process, and display our data is below:
Final R Code
#include required libraries
library("rjson")
library("plyr")
library("googleVis")
#assign location of JSON file, be sure to insert your root URL and API key
jsonfile <- "http://ROOTURL/api/custom/mapreduce/getresults/twitterSentiment_geo?infinite_api_key=APIKEY"
#ingest JSON data from JSON file
json_data <- fromJSON(paste(readLines(jsonfile), collapse=""))
#convert 'results' section of the Ikanow JSON into a data frame format
df <- ldply(json_data$data$results,data.frame)
#develop a Google Visualization Geo Chart using the data frame
geo <- gvisGeoChart(df, locationvar = "key.label", colorvar = "sentiment", sizevar = "count",
                    options = list(backgroundColor = "lightblue", height = 500, width = 800,
                                   region = "world", displayMode = "markers",
                                   colorAxis = "{values:[null, 0, null],colors:['red','yellow','green']}"))
#plot the geo chart
plot(geo)
Conclusions
IKANOW’s unstructured text analysis and NoSQL data models make it relatively simple to connect with other tools like R to perform your own statistical analysis and data visualization. The worked example, mapping sentiment by geographic region, shows how you can turn Twitter data into potentially actionable insights for business or government.
While we used a random sample of data, targeted queries let you select, process, and visualize the data you care about by exporting it via the REST API into R.
In future posts, we’ll show additional ways to take dynamic query results and custom map-reduce output tables as input to use against some of R’s other powerful features.
About the Author: Andrew Strite
As a Solutions Architect, Andrew works directly with clients and IKANOW’s delivery team to create analytic solutions and manages the execution of the solutions from start to finish. Andrew has 6 years of experience in program management, strategic analysis, and requirements development. Before joining IKANOW, Andrew was a U.S. Air Force Intelligence Officer. Andrew holds a M.A. in Intelligence Studies from American Military University and a B.A. in History from the University of Delaware. Andrew describes his hobbies stating, “I’m an avid gamer; I especially love strategy games or those with open worlds. Lately, I’ve also been wedding planning with my fiancée.”