Sunday, July 3, 2011

Semantic gap between text and semantic markup

No matter how advanced our Semantic Web technology becomes, we still have an inherent problem, namely, the semantic gap between simple, plain text and our semantic markup. How do we correlate textual representations with semantically marked-up representations?
 
At the most basic level, we need to be able to correlate semantic entities with textual references to them. Sometimes that can be a simple text lookup, but often there are multiple semantic entities that have similar if not identical textual representations, especially when the textual representations are frequently shorthand notations rather than full, detailed entity references.
 
Lookups are complicated by the fact that some entities have names that are raw natural language prose, so they cannot be unambiguously distinguished from simple prose. Examples include the names of bands, songs, plays, books, movies, parks, etc. As an even more complex example, a movie based on a book may have the same name as the book.
 
Even for references to people, people use nicknames and some people have the same name. For example, "Krupansky, J." may be a reference to me in the bibliography of a technical paper, or it may be a reference in a legal document to one of two court judges. This particular example suggests that context can aid in the identification process, but with the two judges even context can be problematic. A human can tell the two judges apart since one was at the state level and the other at the federal level, but both were in Ohio. They in fact were brother and sister, but with no apparent relation to me. How a computer would differentiate those two or even all three of us without significant guidance or hand-coded "intelligence" is an open question.
 
One simple identification issue is the use of articles in entity names. Technically, the Beatles are really "The Beatles" and "The" is quite significant when referring to "The Office." A lot of traditional text processing algorithms like to ignore punctuation, articles, and so-called "stop words", but increasingly these ephemera are becoming more significant. Yahoo is really "Yahoo!". And then there is the musician formerly known as "The Artist Formerly Known as Prince" with a non-textual symbol as his formal entity name. The point is that casual and even somewhat formal textual references to entities can be quite far from the pure, true, formal, literal entity identifier.
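To make the point about articles and punctuation concrete, here is a minimal Python sketch. The function names and the tiny stop-word list are my own illustrative assumptions, not any particular library's behavior. It contrasts an aggressive normalizer of the traditional kind with a lighter one that only folds case and whitespace.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration.
STOP_WORDS = {"the", "a", "an"}


def naive_normalize(text):
    """Traditional-style normalization: drop punctuation and stop words."""
    stripped = text.translate(str.maketrans("", "", string.punctuation))
    words = [w for w in stripped.lower().split() if w not in STOP_WORDS]
    return " ".join(words)


def light_normalize(text):
    """Lighter normalization: fold case and collapse whitespace only."""
    return " ".join(text.casefold().split())


# The aggressive form cannot tell the entity from the ordinary word.
print(naive_normalize("The Office") == naive_normalize("office"))  # True
print(light_normalize("The Office") == light_normalize("office"))  # False
print(naive_normalize("Yahoo!"), light_normalize("Yahoo!"))
```

The lighter form keeps "The" and "!" available as evidence, at the cost of missing casual references that omit them, so a real system would probably want both keys.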
 
References to the works of an entity or to characteristics of an entity can be similarly problematic in raw text representations. Ultimately there may be a single, hard URI for the referenced entity, but getting from raw text to URI can be a real challenge.
 
In some cases, even our best computational efforts may still result in ambiguous references. Then we have a really tough choice, either to pick the "best" reference by some measure or heuristic, or to simply represent a list of possible references. The latter works semi-well for display for a human user, such as the results from a search engine, but is somewhat problematic when a computer program is processing the results and expecting a singular result.
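One way to avoid forcing that choice too early is to return both forms at once. The sketch below is only an assumption about how that might look: the `Resolution` type, the `resolve` function, and the `margin` threshold are hypothetical names of mine. It commits to a single "best" reference only when the top score clearly beats the runner-up, and otherwise hands back the ranked list.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Resolution:
    best: Optional[str]                  # single answer, or None if ambiguous
    candidates: List[Tuple[str, float]]  # all candidates, highest score first


def resolve(scored, margin=0.2):
    """Pick one reference only if it beats the runner-up by `margin`."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if not ranked:
        return Resolution(None, [])
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= margin:
        return Resolution(ranked[0][0], ranked)
    return Resolution(None, ranked)


clear = resolve([("entity-a", 0.9), ("entity-b", 0.3)])
unclear = resolve([("entity-a", 0.5), ("entity-b", 0.45)])
print(clear.best, unclear.best)
```

A program expecting a singular result can check `best`, while a display for a human user can fall back to `candidates`.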
 
The good news is that in many cases just a little context can go a long way. If someone is querying about computers and software, I would have a higher probability of being a match than the judges. If someone is querying about legal cases, then Krupansky the judge(s) could be selected, although even in that case we still have an ambiguity.
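As a rough sketch of how a little context might be applied, the Python below scores each candidate by how many query terms overlap with a set of keywords attached to it. The identifiers and keyword sets are made up for illustration, and simple overlap counting is just one possible heuristic.

```python
# Hypothetical candidate identifiers and context keywords.
CANDIDATES = {
    "example:krupansky-software": {"software", "computers", "programming"},
    "example:krupansky-state-judge": {"court", "legal", "judge", "state", "ohio"},
    "example:krupansky-federal-judge": {"court", "legal", "judge", "federal", "ohio"},
}


def rank_by_context(query_terms, candidates=CANDIDATES):
    """Rank candidates by the number of context keywords the query shares."""
    terms = {t.lower() for t in query_terms}
    scores = {uri: len(terms & keywords) for uri, keywords in candidates.items()}
    # Highest score first; ties are broken alphabetically so output is stable.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


print(rank_by_context(["software", "bug"]))
print(rank_by_context(["legal", "court"]))
```

The second query shows the remaining ambiguity: the two judges tie, and only a term like "federal" or "state" would separate them.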
 
Correlating bands and songs is at least superficially a slam dunk since the mapping between bands and songs tends to be relatively sparse, but there are no guarantees and the state of the art for automated software is that some form of guarantee is needed.
 
Misspelling of entity names is also a problem. If you know the category of the entity, such as that it is a band or a song, then traditional spell-checking algorithms may be sufficient, but if you are just looking at a fragment of raw text with no context or category, the problem becomes much harder. A mis-phrased song or book title can look a lot like ordinary prose. Still, traditional phrase matching algorithms may do reasonably well telling you if a fragment of text happens to match up with one or more entity names, but you could also get a lot of false positives when the user is simply making a casual statement rather than intentionally referring to a named entity. Even so, alerting the user to the possible entity reference can have at least some value even if it may not be 100% relevant. The harder problem is if there are a very large number of partial matches; then the user could well be overwhelmed rather than aided.
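For the easier case, where the category is known, a sketch using Python's standard `difflib` module might look like this. The `ENTITY_NAMES` table and the `suggest` function are illustrative assumptions, and `difflib`'s similarity ratio stands in for whatever spell-checking algorithm a real system would use.

```python
import difflib

# Hypothetical per-category name lists for illustration.
ENTITY_NAMES = {
    "band": ["The Beatles", "The Rolling Stones"],
    "song": ["Yesterday", "Let It Be"],
}


def suggest(text, category, cutoff=0.8):
    """Return known names in `category` that closely match `text`."""
    names = ENTITY_NAMES.get(category, [])
    by_lower = {name.lower(): name for name in names}
    hits = difflib.get_close_matches(
        text.lower(), list(by_lower), n=3, cutoff=cutoff
    )
    return [by_lower[h] for h in hits]


print(suggest("The Beetles", "band"))
print(suggest("The Beetles", "song"))
```

Restricting the search to one category keeps the candidate list small; with no category, every list would have to be searched and the false positives described above become much more likely.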
 
A simple solution is faceting where the user is told not the list of all possible matches, but the categories of matches. This can dramatically reduce the amount of information to be presented to the user. The user can then drill down for more detail. Still, even this approach may result in information overload.
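Faceting in this sense can be sketched in a few lines of Python. The match list below is placeholder data and the function names are mine. The user first sees only the category counts and then asks for the detail of one category.

```python
from collections import Counter

# Placeholder (name, category) matches for some ambiguous fragment of text.
MATCHES = [
    ("match-1", "song"),
    ("match-2", "song"),
    ("match-3", "song"),
    ("match-4", "movie"),
    ("match-5", "book"),
]


def facet_counts(matches):
    """Summarize matches as a count per category."""
    return Counter(category for _, category in matches)


def drill_down(matches, category):
    """List the individual matches within one category."""
    return [name for name, cat in matches if cat == category]


print(dict(facet_counts(MATCHES)))
print(drill_down(MATCHES, "movie"))
```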
 
Another tool is a user-generated dictionary that fills in the particular user's preference for a partial or ambiguous entity reference the first time it needs to be resolved. Not that any user would necessarily need to manually create such a dictionary. In fact, a collection of such resolution dictionaries may be automatically supplied with just a little context about the user and their tasks. One source is to find other users with similar characteristics and then offer those users' dictionaries as a starting point. Maybe the user could supply a list of people they "think like" or are interested in following, and that can be used to seed the user's resolution dictionary collection.
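A resolution dictionary of this kind could be sketched as follows. The class and its method names are hypothetical, and the `ask` callback stands in for however the user would actually be prompted. The optional `seed` is where another user's dictionary could be offered as a starting point.

```python
class ResolutionDictionary:
    """Remembers how a user resolved each ambiguous reference."""

    def __init__(self, seed=None):
        # `seed` may come from similar users' dictionaries.
        self._choices = dict(seed or {})

    def resolve(self, text, candidates, ask):
        """Return the remembered choice, or ask once and remember it."""
        key = text.casefold()
        remembered = self._choices.get(key)
        if remembered in candidates:
            return remembered
        choice = ask(text, candidates)
        self._choices[key] = choice
        return choice


asked = []


def ask_user(text, candidates):
    # Stand-in for a real prompt: record the question, pick the first option.
    asked.append(text)
    return candidates[0]


d = ResolutionDictionary()
options = ["example:judge", "example:software-developer"]
first = d.resolve("Krupansky, J.", options, ask_user)
second = d.resolve("krupansky, j.", options, ask_user)
print(first, second, len(asked))
```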
 
In summary, matching textual entity references embedded in raw text is an open problem. Yes, there are a lot of tools readily available that may address the problem, but more work in this area may be quite helpful. And, most importantly, bridging the semantic gap between the worlds of text and semantic entities is an important goal.
 

Friday, July 1, 2011

What color is an apple?

I have been trying to think about how to encode even relatively simple human knowledge in simple RDF triples and what issues arise. What could be simpler than... an apple and its color? Sure, some things are simpler, but so much is much more complicated.
 
In a simple, toy system I might define a class of objects called "fruit", a sub-class called "apple", and have instances of the apple class. Simple enough. I might have a "color" property. Simple enough. Hmmm... but what are the values of color? A literal string like "red"? A numeric set of RGB values? Shades of primary colors? Add another object of class "color", and push out the same questions to the properties of that class? So, in my simple, toy system, I would now have objects of class "apple" each of which has an associated "color" object. Although, I am not so comfortable saying that basic properties must be promoted to the level of objects even if having all values be objects may be a better system architecture. This is starting to be a lot of complexity for simple things.
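To see the two options side by side, here is a toy sketch that models triples as plain Python tuples rather than using a real RDF library. The `ex:` names, the `Literal` wrapper, and the RGB numbers are all made up for illustration. One apple carries its color as a literal string, the other points at a separate color object.

```python
from collections import namedtuple

# Wrapper to distinguish literal values from references to other objects.
Literal = namedtuple("Literal", "value")

TRIPLES = [
    ("ex:Apple", "rdfs:subClassOf", "ex:Fruit"),
    # Option 1: the color is a literal string.
    ("ex:apple1", "rdf:type", "ex:Apple"),
    ("ex:apple1", "ex:color", Literal("red")),
    # Option 2: the color is promoted to an object with its own properties.
    ("ex:apple2", "rdf:type", "ex:Apple"),
    ("ex:apple2", "ex:color", "ex:color1"),
    ("ex:color1", "rdf:type", "ex:Color"),
    ("ex:color1", "ex:name", Literal("red")),
    ("ex:color1", "ex:rgb", Literal((200, 30, 40))),
]


def objects(subject, predicate, triples=TRIPLES):
    """All objects of triples matching the subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def color_name(apple, triples=TRIPLES):
    """Get a color name whichever way the color was encoded."""
    for value in objects(apple, "ex:color", triples):
        if isinstance(value, Literal):
            return value.value
        names = objects(value, "ex:name", triples)
        if names:
            return names[0].value
    return None


print(color_name("ex:apple1"), color_name("ex:apple2"))
```

Even in this toy, any code that reads a color has to know about both encodings, which is a small taste of the complexity.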
 
One question: Does each object have its own color? Sure, that makes sense. But, shouldn't I be able to ask questions about objects of that class in general? Sure. Now, there are two ways to go about that: 1) do a statistical analysis of all instances of the class and then summarize the results, possibly as a histogram or something like that, or 2) contrive an abstract rule that generally describes what the population of the class would be, even if that is only an approximation. I might simply want to know "generally", or "typically", or "as a common case" what color are apples. Some are yellow, some green, but many are red, so how can I represent that information as a "rule" rather than have to do a massive data collection effort and sift through the results?
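The statistical route, at least, is easy to sketch. The Python below counts instance colors and derives a class-level "typical color" statement when one color reaches a chosen share. The observations, the `ex:typicalColor` property, and the 0.5 threshold are assumptions for illustration; the harder question of where such a rule comes from without the data is left untouched.

```python
from collections import Counter


def typical(colors, minimum_share=0.5):
    """Return (color, share) for the most common color, if common enough."""
    if not colors:
        return None
    name, count = Counter(colors).most_common(1)[0]
    share = count / len(colors)
    return (name, share) if share >= minimum_share else None


def rule_for(class_name, colors):
    """Express the typical color as a class-level triple, if there is one."""
    result = typical(colors)
    if result is None:
        return None
    return (class_name, "ex:typicalColor", result[0])


# Made-up observations of ten apples.
observed = ["red"] * 6 + ["green"] * 3 + ["yellow"]
print(Counter(observed))
print(typical(observed))
print(rule_for("ex:Apple", observed))
```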
 
But even for apples that are "red" or "green" or whatever, rarely would they be exactly and perfectly one single color. They may be "reddish" or "greenish" or "mostly red" or "mostly green."
 
In fact, rather than an individual apple having a specific color, you can come up with a wide range of colors by sampling various points on the object's surface, each of which can have its own color. Once again, we can do a massive amount of data collection and then look at distributions of values and highlight dominant values or average values. And that is just for one instance of the class and we would like to do similar analysis across the population of the class instances. And this is just for a relatively simple object. Once again, we would like to come up with relatively simple rules to describe the class and its instances.
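For a single apple, the sampling idea might be sketched like this. The sample values are invented, and the `coarse_name` classifier (whichever of red or green is the strongest channel) is a crude assumption of mine, far simpler than any real color model.

```python
from collections import Counter

VAGUE = {"red": "reddish", "green": "greenish"}


def average_rgb(samples):
    """Average each RGB channel over all sampled points."""
    n = len(samples)
    return tuple(round(sum(s[i] for s in samples) / n) for i in range(3))


def coarse_name(rgb):
    """Crude classifier: name the strictly strongest of red and green."""
    r, g, b = rgb
    if r > g and r > b:
        return "red"
    if g > r and g > b:
        return "green"
    return "other"


def describe(samples):
    """Summarize the samples as 'red', 'mostly red', 'reddish', or 'mixed'."""
    name, count = Counter(coarse_name(s) for s in samples).most_common(1)[0]
    if name == "other":
        return "mixed"
    share = count / len(samples)
    if share == 1.0:
        return name
    if share >= 0.75:
        return "mostly " + name
    return VAGUE[name]


# Four invented surface samples from one apple.
samples = [(200, 30, 40), (210, 40, 50), (190, 60, 40), (120, 162, 62)]
print(average_rgb(samples), describe(samples))
```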
 
The general problem here is not simply how to encode information about objects and classes, but in what forms do users want to examine that information? What is the user's context? Do they want a simple, general answer? Do they want a simple, general, and specific answer that may not be technically accurate for individual objects (e.g., "Most apples are red."), but nonetheless be generally useful? Maybe they want that statistical summary. Maybe they want that vaguer but more accurate simple answer, "reddish." Maybe they want that full, sampled image with its discrete values for their own analysis. Or maybe they have a reference color and they simply want to know if it is an "acceptable" or even "normal" (not "rotten") value for an apple.
 
Maybe there is a range of generic "query formats" for fetching object and class properties that can be shared across all objects and classes or at least over broad classes of objects. So, one of these generic formats could be chosen by the user and the actual property value(s) would be transformed to match the requested format.
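As a sketch of what such generic formats might look like, the Python below registers a few format functions by name and transforms the same raw property values into whichever one the user requests. The format names and functions are hypothetical; nothing here is specific to apples or colors.

```python
from collections import Counter


def as_general(values):
    """Single most common value: the simple, general answer."""
    return Counter(values).most_common(1)[0][0]


def as_distribution(values):
    """Share of each value: the statistical summary."""
    total = len(values)
    return {v: n / total for v, n in Counter(values).items()}


def as_range(values):
    """Every distinct value seen, sorted."""
    return sorted(set(values))


# Hypothetical registry of generic query formats.
FORMATS = {
    "general": as_general,
    "distribution": as_distribution,
    "range": as_range,
}


def query_property(values, fmt="general"):
    """Transform raw property values into the requested format."""
    return FORMATS[fmt](values)


apple_colors = ["red", "red", "green", "yellow"]
for fmt in FORMATS:
    print(fmt, query_property(apple_colors, fmt))
```

Because the formats only see a list of values, the same registry could in principle serve any property of any class whose values can be counted.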
 
And maybe they just want to start somewhere with a basic answer and then examine it and drill down for more detail or more abstraction on their own as they see fit.
 
Maybe the user can specify a "degree of specificity" in their request and that would guide what specific form the returned property value would take.
 
And now we go on to other classes of objects and their "color" and we wonder to what extent we can compare or use colors across those disparate classes. Is a particular car the same color as objects of the class apple?
 
I am probably only scratching the surface, but these are some of the issues with trying to represent human-level knowledge in a technology world where representing even only basic information is still a real challenge.
 
Hmmm... I wonder if developing functionally complete ontologies for apples and colors may be even more of a challenge than even a dozen doctoral dissertations? Toy ontologies, no problem; human-level knowledge, that's a harder nut to crack.