Aliasing Big Data - Use Cases
Introduction
As anyone conducting research can attest, variations in spelling, nicknames, misspellings, transliterations from foreign languages, and true aliases make it difficult to compile a comprehensive set of data about a person, organization, or product. Natural language processing tools can identify names, but disambiguating them accurately can be a challenge. In some cases, they over-correct, saying two things are the same when they are not; in other cases, they fail completely to match two entities. Between these two extremes, there’s a sweet spot. To reach this sweet spot, we’ve created and deployed a widget and entity recommendation system that lets analysts implement the aliases they know will meet their project needs.
Watch the webinar on Aliasing Big Data:
Aliasing Use Cases - How to Use IKANOW to Crunch Big Data
Identifying the Need for Aliasing
As we noted, the need to alias entities can arise for many reasons. An example where many of those factors were in play is the former Libyan dictator Moammar Gaddafi. The transliteration of his Arabic name is largely open to interpretation, and several dozen variations of his name therefore evolved over the course of his dictatorship, from Moammar Qudhafi to Moamar El Kadhafy.
Though he is no longer alive, his digital footprint across global media still exists and many outstanding issues following the collapse of his regime require research into these documents. Effectively searching for information of value requires reconciling the dozens of variations, misspellings, and monikers he went by to track his activity across the African continent and elsewhere.
Our aliasing process starts with a cursory search for just one of Gaddafi’s names against several of our generic news communities, which consist of hundreds of RSS feeds of international news and blogs. As a user begins to type a search term into the query bar, our entity suggest feature recommends entities from the communities being queried that may match the partial query. As I enter one of the less common spellings of his first name, I am offered a long list of entities to choose from: about a dozen person entities as well as some organization and position entities. This is my first indication of what will become a larger issue.
Searching against any individual entity will only return documents containing that variation. An exact or free text search would yield slightly wider, but mostly similar results. Selecting one of these entities at random, ‘muammar qadhafi’ for example, my query yields only 20 results. An event graph of these results displays only 12 events and very few insightful associations (seen below). Testing a second variation of Gaddafi’s name yields only 1 result.
Realizing that the scope of our broad news communities is not producing the results I’m looking for, I created a specific community focusing on Libya, Northern Africa, and Middle Eastern news sources. Running a query against this new community using one of the same entities from above now yields 144 results. However, the entity suggest list of alternate Gaddafi spellings has grown, and I have reason to believe there are many more.
Gaddafi happens to be a particularly obvious case. The need to alias entities will not always stand out as clearly from the entity suggest list or from low result counts. Luckily, we offer several other ways an analyst or user can identify the need for aliasing.
Our Query Metrics widget breaks down all of the entities and associations contained within the results returned by a query and lists them by type. In this case, I search under entities, select person as the entity type and rank by entity spelling, then scroll down to the M’s where I find groupings of similar Gaddafi spellings as well as their document counts. Again, as an analyst interested in this individual, I would not want to waste my time querying against every one of these entities, or building long query strings not knowing whether or not I’ve accounted for all variations.
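The grouping this widget performs can be approximated in a few lines of Python. This is only a sketch of the idea, not IKANOW’s implementation: the `entities` rows and the `entities_by_type` helper are hypothetical, and the document counts are placeholders.

```python
from collections import defaultdict

# Hypothetical extracted entities: (name, entity type, document count).
# The spellings appear in this post; the counts are placeholders.
entities = [
    ("muammar qadhafi", "person", 20),
    ("moammar gaddafi", "person", 144),
    ("moamar el kadhafy", "person", 1),
    ("libya", "place", 300),
]


def entities_by_type(rows):
    """Group entities by type and sort each group alphabetically by spelling."""
    grouped = defaultdict(list)
    for name, etype, count in rows:
        grouped[etype].append((name, count))
    for etype in grouped:
        grouped[etype].sort(key=lambda pair: pair[0])
    return dict(grouped)


# Similar spellings end up next to each other, each with its document count.
people = entities_by_type(entities)["person"]
for name, count in people:
    print(f"{name}: {count} documents")
```

Sorting by spelling is what makes near-duplicate names cluster together, which is why scrolling to the M’s surfaces the Gaddafi variants as a group.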
Another option is the entity significance widget, which presents a bar graph visualization of entities within the query set in increments of 10 and ranked by significance or frequency. In this view I find four entities for Qaddafi in the top ten alone, and many more as I continue scrolling.
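A rough sketch of that paging behavior, assuming entities are ranked by document frequency alone (how the significance score is calculated is not described here, so it is left out). The `ranked_page` helper and the entity names are hypothetical.

```python
def ranked_page(doc_counts, page=0, page_size=10):
    """Return one page of (entity, count) pairs, most frequent first."""
    ranked = sorted(doc_counts.items(), key=lambda item: (-item[1], item[0]))
    start = page * page_size
    return ranked[start:start + page_size]


# Placeholder counts for 25 made-up entities; entity_00 is the most frequent.
doc_counts = {f"entity_{i:02d}": 100 - i for i in range(25)}

first_page = ranked_page(doc_counts)          # ranks 1 to 10
third_page = ranked_page(doc_counts, page=2)  # ranks 21 to 25, a short last page
```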
So now that I’ve identified probable aliases for my target of interest, I can deploy the entity alias creator widget. The widget provides a similar layout to query metrics and allows you to convert an extracted entity into a master entity. This master entity can then be reconciled with all associated entities as determined by an analyst, whether alternate spellings or true aliases. On the left side of the widget the entities returned by the query results are listed out and can be filtered by type or name. On the right side is a master entity box on top and the alias box below it, where users are able to drag and drop entities accordingly.
Aliasing
To begin the aliasing process, I first filter the entities on the left by selecting the person entity type from the drop-down menu, then scroll through the entity list to find all of the entities created for Gaddafi. I found a total of 32 different person entities, with document counts ranging from 1 to 1,700. ‘Moammar Qaddafi’ was the most commonly occurring spelling, so I selected it as the master entity by dragging and dropping it into the master entity box. I then dragged all other matching aliases to the alias list. Once done, I selected the ‘Save Aliases’ button at the bottom right. A message appears along the bottom left of the widget indicating that the results will propagate in approximately 1 minute.
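The selection rule used here, with the most frequent spelling as master and everything else as an alias, can be sketched as follows. In the widget the analyst makes this choice by hand; the `build_alias_group` helper and the returned layout are hypothetical and are not the format the widget saves.

```python
def build_alias_group(doc_counts):
    """Pick the most frequent spelling as master; the rest become aliases."""
    master = max(doc_counts, key=doc_counts.get)
    aliases = sorted(name for name in doc_counts if name != master)
    return {"master": master, "aliases": aliases}


# Three of the spellings mentioned in this post; the counts are placeholders
# chosen within the 1 to 1,700 range reported above.
gaddafi_counts = {
    "moammar qaddafi": 1700,
    "muammar qadhafi": 20,
    "moamar el kadhafy": 1,
}

group = build_alias_group(gaddafi_counts)
print(group["master"], "<-", group["aliases"])
```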
Once the aliasing is complete, future queries against the master entity will search against each and every alias entity and therefore return all associated documents. If a user enters any one of the alias spellings into the query bar, the master entity will appear at the top of the entity suggest list. The user still has the option to select and search against any individual alias; however, only the documents associated with that entity will be returned. Running a query against the Moammar Qaddafi master entity now yields over 10,000 results, a drastic increase from my initial query, which produced only 20 results.
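The lookup behavior described above can be illustrated with a toy in-memory alias table. This is a minimal sketch under stated assumptions, not IKANOW’s data model: the `AliasIndex` class and its `add`, `expand`, and `suggest` methods are hypothetical names, and prefix matching stands in for whatever the entity suggest feature really does.

```python
class AliasIndex:
    """Toy alias table mapping a master entity to its aliases and back."""

    def __init__(self):
        self.aliases_of = {}  # master -> list of alias names
        self.master_of = {}   # alias -> master

    def add(self, master, aliases):
        self.aliases_of[master] = list(aliases)
        for alias in aliases:
            self.master_of[alias] = master

    def expand(self, entity):
        """Return every name a query on `entity` should match."""
        if entity in self.aliases_of:
            return [entity] + self.aliases_of[entity]
        return [entity]  # an individual alias still searches only itself

    def suggest(self, partial, known_entities):
        """Return prefix matches, listing any related master entity first."""
        partial = partial.lower()
        hits = [e for e in known_entities if e.lower().startswith(partial)]
        masters = []
        for hit in hits:
            master = self.master_of.get(hit)
            if master is None and hit in self.aliases_of:
                master = hit
            if master is not None and master not in masters:
                masters.append(master)
        return masters + [h for h in hits if h not in masters]


index = AliasIndex()
index.add("moammar qaddafi", ["muammar qadhafi", "moamar el kadhafy"])
known = ["moammar qaddafi", "muammar qadhafi", "moamar el kadhafy"]

master_terms = index.expand("moammar qaddafi")  # master plus both aliases
alias_terms = index.expand("muammar qadhafi")   # just the one spelling
suggestions = index.suggest("mu", known)        # master first, then the match
```

Keeping the alias-to-master map separate from the master-to-alias map makes both directions a single dictionary lookup: expansion at query time, and promotion of the master entity while the user is still typing.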
Now that I am confident I have a complete data set surrounding my subject, I can continue my research and use some of our other widgets to piece together a 360-degree view of my subject. Taking a look at the results using the Event Graph widget, which previously displayed only 12 events for one Gaddafi entity, I now have over 1,022 associations and facts to investigate. Within minutes these graphs can be refined into finished products for reporting purposes or simply used to generate leads for further analysis.
Alternate Use Cases
The need for aliasing applies to just about any Big Data problem out there. It goes beyond trying to reconcile multiple names assigned to a single person. Organizations are composite entities made up of many different components. For instance, if a financial analyst were building a company profile of a major corporation, such as JP Morgan Chase Inc., they should expect several versions of the company name to appear in news articles and blogs. Additionally, they may want to look for the company’s stock ticker, the names of high-profile executives, and the names of subsidiaries or sister companies. Querying against each of these items at the same time would capture the whole picture, but aliasing lets you do so quickly and easily by creating an abstract master entity that represents all of the pieces in aggregate.
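As an illustration of such an abstract master entity, the sketch below groups the pieces of a company profile under one name and checks whether a document’s extracted entities touch any of them. The `profile` layout and the `matches_profile` helper are hypothetical, and the specific ticker, executive, and subsidiary values are examples added here rather than taken from the webinar.

```python
# Hypothetical composite master entity: one profile, several kinds of alias.
profile = {
    "master": "JP Morgan Chase (company profile)",
    "aliases": {
        "company": ["jp morgan chase inc.", "jpmorgan chase", "jpmorgan"],
        "ticker": ["jpm"],
        "person": ["jamie dimon"],
        "subsidiary": ["chase bank"],
    },
}


def matches_profile(doc_entities, profile):
    """True if any entity extracted from a document is part of the profile."""
    wanted = {name for names in profile["aliases"].values() for name in names}
    return bool(wanted & {entity.lower() for entity in doc_entities})


ticker_story = matches_profile(["JPM", "Federal Reserve"], profile)  # True
unrelated_story = matches_profile(["Ford Fusion"], profile)          # False
```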
Aliases can also serve a purpose in market research. A single product line can often have many variations and consumers may refer to the line by its full name or by a single product or feature within the line. For instance, if Ford wanted to keep their finger on the pulse surrounding the Ford Fusion line as a whole, they’d need to account for each and every product model that will be singled out during the extraction process as an individual product.
Closing
We hope you’ll find this new aliasing feature as useful as we have, as we come up with new and interesting ways to use it just about every day! The aliasing capability is available to all Ikanow users: by manually uploading a simple JSON document for Community Edition users, or using the demonstrated alias widget for Enterprise Edition users. Please feel free to send us any questions or comments regarding aliasing, or anything for that matter, as we are always looking for ways to improve our software.
Download or view the slides from our Aliasing webinar: