Sunday, July 3, 2011

Semantic gap between text and semantic markup

No matter how advanced our Semantic Web technology becomes, we still have an inherent problem, namely, the semantic gap between simple, plain text and our semantic markup. How do we correlate a textual representations and semantically marked-up representations?
At the most basic level, we need to be able to correlate semantic entities with textual references to them. Sometimes that can be a simple text lookup, but often there are multiple semantic entities that have similar if not identical textual representations, especially when the textual representations are frequently shorthand notations rather full, detailed entity references.
Lookups are complicated by the fact that some entities have names that are raw natural language prose so that they cannot be unambiguously distinguished from simple prose. For example, names of bands, songs, plays, books, movies, parks, etc. As an even more complex example, a movie based on a book may have the same name.
Even for references to people, people use nick names and some people have the same name. For examples, "Krupansky, J." may be a reference to me in the bibliography of a technical paper, or it may be a reference in a legal document to one of two court judges. This particular example suggests that context can aid in the identification process, but with the two judges even context can be problematic. A human can tell the two judges apart since one was at the state level and the other at the federal level, but both were in Ohio. They in fact were brother and sister, but with no apparent relation to me. How a computer would differentiate those two or even all three of us without significant guidance or hand-coded "intelligence" is an open question.
One simple identification issue is the use of articles in entity names. Technically, the Beatles are really "The Beatles" and "The" is quite significant when referring to "The Office." A lot of traditional text processing algorithms like to ignore punctuation, articles, and so-called "stop words", but increasingly these ephemera are becoming more significant. Yahoo is really "Yahoo!". And then there is the musician formerly known as "The Artist Formerly Known as Prince" with a non-textual symbol as his formal entity name. The point is that casual and even somewhat formal textual references to entities can be quite far from the pure, true, formal, literal entity identifier.
References to the works of an entity or to characteristics of an entity can be similarly problematic in raw text representations. Ultimately there may be a single, hard URI for the referenced entity, but getting from raw text to URI can be a real challenge.
In some cases, even our best computational efforts may still result in ambiguous references. Then we have a really tough choice, either to pick the "best" reference by some measure or heuristic, or to simply represent a list of possible references. The latter works semi-well for display for a human user, such as the results from a search engine, but is somewhat problematic when a computer program is processing the results and expecting a singular result.
The good news is that in many cases just a little context can go a long way. If someone is querying about computers and software, I would have a higher probability of being a match than the judges. If someone is querying about legal cases, then Krupansky the judge(s) could be selected, although even in that case we still have an ambiguity.
Correlating bands and songs is at least superficially a slam dunk since the mapping between bands and songs tends to be relatively sparse, but there are no guarantees and the state of the art for automated software is that some form of guarantee is needed.
Misspelling of entity names is also a problem. If you know the category of the entity, such as that it is a band or a song, then traditional spell-checking algorithms may be sufficient, but if you are just looking at a fragment of raw text with no context or category, the problem becomes much harder. A mis-phrased song or book title can look a lot like a lot of raw prose. Still, traditional phrase matching algorithms may do reasonably well telling you if a fragment of text happens to match up with one or more entity names, but you could also get a lot of false positives when the user is simply making a casual statement rather than intentionally referring to a named entity. Still, alerting the user to the possible entity reference can have at least some value even if it may not be 100% relevant. The harder problem is if there are a very large number of partial matches; then the user could well  be overwhelmed rather than aided.
A simple solution is faceting where the user is told not the list of all possible matches, but the categories of matches. This can dramatically reduce the amount of information to be presented to the user. The user can then drill down for more detail. Still, even this approach may result in information overload.
Another tool is a user-generated dictionary that fills in the particular user's preference for a partial or ambiguous entity reference the first time it needs to be resolved. Not that any user would necessarily need to manually create such a dictionary. In fact a collection of such resolution dictionaries may be automatically supplied with just a little context about the user and their tasks. Once source is to find other users of similar characteristics and then offer the dictionaries for that other user as a starting point. Maybe the user could supply a list of people they "think like" or are interesting in following and that can be used to seed the user's resolution dictionary collection.
In summary, matching textual entity references embedded in raw text is an open problem. Yes, there are a lot of tools readily available that may address the problem, more work in this area may be quite helpful. And, most importantly, bridging the semantic gap between the worlds of text and semantic entities is an important goal.


At October 13, 2011 at 7:33 AM , Blogger nano said...

I love all your posts and blogs.

At March 27, 2015 at 11:30 PM , Blogger Sharmayne said...

I enjoyed reading your blog and learned a lot from it..
Informative URL for Alaska Fishing Trips packages

At October 24, 2015 at 10:49 PM , Blogger Chelseayhun said...

the semantic gap between the two is very different.

Veritable Discount Laminate Flooring

At October 28, 2015 at 2:20 AM , Blogger Catrina Rodgers said...

Very useful information. Thanks for bringing this up!

Veritable Cloud Hosted Phone Systems

At December 28, 2015 at 12:26 AM , Blogger Kyle said...

Its really confusing and mind bugling too! But still i got the point. haha!

Heard about Digium Switchvox Mobile

At January 6, 2016 at 7:52 PM , Blogger Zaritah Pierre said...

they have very big difference

Heard about Seattle Assisted Living website

At January 9, 2016 at 4:44 PM , Blogger HannahWinslett said...

yeah i agree they have both lots of difference

Heard about Top Austin Towing

At January 30, 2016 at 2:49 AM , Blogger Zachariah Teh said...

that was really fascinating difference

You won't believe this Westport Fishing Charter Captain Don Davenport

At February 1, 2016 at 1:19 AM , Blogger sai mangune said...

Text and semantic entities is an important goal. - yes indeed
You won't believe this Allied Locksmiths

At March 2, 2016 at 2:58 AM , Blogger Hermione Mora said...

that was so fascinating i will share it

Cherry Hill Campus Security


Post a Comment

Subscribe to Post Comments [Atom]

<< Home