Integrating Infinit.e with Pentaho - whys and hows

Jan 08, 2013

Alex Piggott

1 Comment

Introduction

Pentaho is a popular open source Business Intelligence (BI) tool. It doesn’t have a huge overlap with Infinit.e because we focus primarily on unstructured text analysis and “NoSQL” data models (precisely because of the existence of mature BI/relational tools like Pentaho!)

There are, however, a number of ways in which Infinit.e can add value to traditional BI analysis (e.g. via integration with Pentaho), and also ways in which Pentaho can be used as a utility for typical Infinit.e activities. As we’ll see, Infinit.e’s logical REST API makes integration with tools like Pentaho easy:

Generating business intelligence from text

Most importantly, unstructured data can yield really valuable structured data that can then be combined with traditional metrics (with care, because of the loss of dimensionality!) to provide additional insight. Our worked example below will focus on a semi-fictional example of this, aggregating “happiness” by region, originally described in this Meetup presentation.

Infinit.e provides several techniques to enable this:

1) Our query engine returns various simple statistics against documents, “entities” (people, places, companies, technologies, etc), and associations. The fact that the statistics are calculated across the query set (i.e. matching documents), combined with the power of the query engine (particularly Lucene-powered full text searches) makes it possible to control the query set with more accuracy.

2) For more complex and/or custom analytics, Infinit.e provides a plugin interface for custom analytic modules using Hadoop/MapReduce. Again, the query set on which the analytics run can be specified. Since writing Java is slightly hard work for simpler calculations, we provide a JavaScript scripting engine. The worked example below demonstrates how a few simple lines of script can generate structured insight from semi-structured sources.

3) Finally, the REST API can be used to export JSON or XML that can be imported into other relational or big data tools (including Pentaho, as touched on below).

After a very brief description of other ways in which Infinit.e and Pentaho can work together, with Pentaho acting as a utility, the remainder of this post comprises a worked example of generating a structured table from Twitter and importing it into Pentaho in a way which would make it easy to combine with traditional BI data.

Using Pentaho as an Infinit.e utility

For Infinit.e users with Pentaho installations (or who don’t mind installing it: Pentaho community edition and postgresql -or MySQL is a good alternative), there are a few ways in which Pentaho’s mature and community-supported capabilities can be utilized.

1) Leaving aside the integration of traditional BI data and Infinit.e-derived insight, Pentaho can be used as an easy way of visualizing the results of custom plugins.

Currently, to do this inside of Infinit.e’s GUI requires modifying the code of one of the existing widgets, recompiling, etc; which is not difficult, but an undesirable level of effort for prototyping and experimentation. (Coming soon - we’ll be providing versions of the widgets that ingest custom tables, unless an existing Open Source alternative proves sufficient in the meantime.)

An alternative is to import the data into Pentaho, use Mondrian to turn the fields into dimensions and measures, and then use the available visualizations. All of this can be done graphically. Although the built-in views aren’t very exciting, there is a community dashboard project, including a port of Protoviz which should satisfy most needs. A previous blog post described how to view Infinit.e data directly in Protoviz, now D3.js, and a future blog post will show Infinit.e data rendered in the community dashboard.

Note that the only difference between this and the worked example below is the intent, and the techniques described there all apply.

2) Infinit.e has a pretty powerful ETL capability, covering files (JSON, XML, “office“), databases, HTTP/RSS, and providing the ability to create metadata using regex, xpath, and JavaScript.

However, as you would expect from a popular and mature product, Pentaho’s Kettle ETL tool provides a wide range of functions that a newer and smaller platform like Infinit.e cannot yet match, and is significantly more user-friendly (this is a weakness of the Infinit.e ETL, we’re working on it!)

So in some cases (i.e. importing from sources Infinit.e doesn’t support but Pentaho does; fusing multiple sources into a single set of records) it makes sense to use Kettle to do the heavy lifting and then output to XML/JSON files or SQL for Infinit.e to ingest (i.e. formats where metadata is automatically generated from structured fields and can be converted into entities and associations with the Infinit.e-side configuration).

This function is a pretty standard Kettle configuration, only using the output to JSON or XML as a (possible) alternative to writing to a DB table.

3) Finally, the same Kettle ETL application described above can be used to export data from the Infinit.e REST API (as described in the example below), taking advantage of Pentaho both for its wide range of export connectors and its graphical transform functions.

Worked example

The main thrust of this section is the integration with Pentaho, but we’ll first briefly describe how we use the Infinit.e custom engine to generate some structured data, starting from some Twitter data, e.g. as provided from either Gnip or DataSift(Note: we will soon be releasing a version of our Amazon Marketplace PaaS that integrates directly with DataSift - contact us for more details).

Our document-oriented data model is described here. Relevant to the scenario: AlchemyAPI (or Lexalytics, for DataSift) extracts an approximate sentiment for each tweet, decomposed per keyword; in addition geo information tagged to the tweet is written into the “docGeo” field.

(The most usual case is that a general location is given rather than the precise coordinates - the built-in geo locator can be used where there’s no lat/long, though Pentaho also has some geo location anyway).

Here’s a few lines of code that aggregates sentiment by geo grid:

function map(key, val) {

var label_lat = Math.round(val.docGeo.lat/5)*5;

var label_lon = Math.round(val.docGeo.lon/5)*5;

var label = label_lat.toString()+ ',' + label_lon.toString();

for (ent_i in val.entities) {

var ent = val.entities[ent_i];

if ((null != ent.sentiment) && (ent.type == "Keyword")) {

emit({label: label, label_lat: label_lat , label_lon: label_lon }, {sentiment: ent.sentiment, count: 1});

}

function reduce(key, vals) {

var retval = { sentiment: 0.0, count: 0 };

for (x in vals) {

retval.sentiment += parseFloat(vals[x].sentiment);

retval.count += parseInt(vals[x].count);

}

emit(key,retval);

}

combine = reduce;

There are a number of plausible scenarios in which this could be useful: for example comparing tweets about a set of topics related to a product to regional sales trends.

(Warning: there are many issues with naively approaching using Twitter for opinion mining, particularly in commercial contexts such as brand management, as described in this presentation. We are currently developing an “Insight-as-a-Service” that addresses some of these issues - contact us for more details.).

The above code is loaded into Infinit.e using the plugin manager (see screenshot below), together with a MongoDB query that selects tweets with associated geotags (for actual business cases this query could be more targeted).

The resulting table can then be accessed as JSON by name (Note I use JSONView as a handy utility for rendering JSON in Chrome/Firefox):

ROOT_URL/api/custom/mapreduce/getresults/twitterSentiment_geo?infinite_api_key=API_KEY

{

response: {

action: "Custom Map Reduce Job Results",

success: true,

message: "Map reduce job completed at: Mon Dec 24 14:51:02 EST 2012",

time: 42

},

data: {

lastCompletionTime: "Dec 24, 2025 2:51:02 PM",

results: [

//...

{

_id: "50d8b226e4b08323c79d7169",

sentiment: "0.119314",

count: "5",

key: {

label_lon: "-10",

label: "55,-10",

label_lat: "55"

}

},

{

_id: "50d8b226e4b08323c79d716a",

sentiment: "-1.1041417",

count: "30",

key: {

label_lon: "-100",

label: "20,-100",

label_lat: "20"

}

},

//...

]

}

OK, on to the Pentaho integration! Acknowledgement: this blog post made working through the integration steps much easier.

The easiest way of handling authentication is to specify an API key. This key is generated automatically for Mashery users, or AMS/PaaS/OSS users can be assigned from the manager UI (see screenshot below).

The following screen capture shows an instance of the Kettle ETL pipeline for Infinit.e integration (eg after launching “Spoon.bat” from the community Pentaho install).

The pipeline comprises the following steps (the orphaned “Output>Text File Output” step was just used for debugging the pipeline and plays no functional role):

1) Input>Generate row: for simplicity, I added a single row “url” containing the entire URL (“ROOT_URL/api/custom/mapreduce/getresults/twitterSentiment_geo?infinite_api_key=API_KEY”). As described in the aforementioned blog post, a more maintainable approach is to use multiple fields to split the URL into component parts (e.g. root URL, REST address, custom table name, and API key) and then to recombine them using a transform step.

2) Transform>REST client: a simple GET using the “url” field, write it into a field we’ll call “result” - note “result” is treated as a giant string containing the JSON, it is not deserialized until step 4.

3) Transform>Select values: remove the “url” field.

4) Input>JSON input: In this step we use JSONPath to map the JSON record output from the REST “get” call into SQL-like fields. The precise fields are obviously scenario-specific (aside from part of the URL in step 1, this is the only scenario-specific step), for this example we used the following:

(Note that I renamed the lat/long fields - for some reason the visualizer step doesn’t work if the Pentaho-side fields are named “label_lat” and “label_lon”).

5) Transform>Select values: remove the “result” field (this is important otherwise the giant JSON string is tagged to each of the hundreds of records in the table).

6) Output>Table output: Connect to the SQL database that backs the Pentaho implementation (e.g. in my case, a localhost postgresql on the default port of 5432), pick a table name, select “truncate table”, and then (the first time/any time the JSON mapping changes) press the “SQL” button at the bottom (which generates the SQL commands necessary to configure the database) and press the “execute” button.

Once the data had been imported, the model can be built, right-click the “write to table” step, then analyze>visualize.

The following dimensions and measured are specified in the “Analysis” tab:

Measures:
- Sentiment
- Count
Dimensions:
- Label: as shown above, select “Lat/Long” for the geography type, then assign “Latitude” and “Longitude” fields to the “latitude” and “longitude” attributes respectively.

Finally for the model, select the “Reporting” tab and drag (eg) the “id” field as the lone category (in this example we don’t use reporting so this is just necessary for the model to be complete).

Then save the model and launch the visualization by pressing the “Go” button (top left).

The above screenshot shows the completed visualization, getting to that point from the model is easy:

From the chart button (top right), select the map view.
Drag the “Label” dimension into the “Geography” layout.
Drag the “Sentiment” measure into the “Color By” layout.
Drag the “Count” measure into the “Size By” layout.

Voila! The above map is generated, with a grid of circles showing the aggregate “happiness” (as measured by AlchemyAPI’s sentiment extraction) of areas where people tweet. The greener the circle the “happier” they are, the larger the circle, the more tweets are generated.

Conclusions

This article provided a brief overview of why you might want to integrate Pentaho and Infinit.e, and how it can be easily achieved using Kettle and Infinit.e’s REST API.

A real example was shown requiring only a few lines of simple code to generate a meaningful aggregation using Infinit.e’s custom processing engine, and no lines of code at all to display the results.

The example provides a template which can be re-used for other analytics. Future blog posts will give more examples of useful analytics and how they can be visualized in open source platforms like Pentaho.

About the Author:

Alex Piggott is IKANOW’s Director of Product Development. Alex runs IKANOW’s research and development activities and leads the development team. He has 15 years of experience designing and developing complex real-time mission-critical distributed systems. In addition, Alex graduated from Oxford University with a first-class degree in Mathematics.

1 Comment. Leave new

Visualizing text analytics using jsfiddle.net (and friends!) - IKANOW

July 30, 2025 7:32 am

[…] the results of various simple analytics generated using Infinit.e plugins: geographic analyses in Pentaho and R, and more complex temporal analytics in R. In this post, we’ll show one of our favorite […]

Integrating Infinit.e with Pentaho - whys and hows

Introduction

Generating business intelligence from text

Using Pentaho as an Infinit.e utility

Worked example

Conclusions

1 Comment. Leave new

Leave a Reply Cancel reply

Subscribe

Recent Posts

Contact Us

Recent Posts

About Us