Friday, May 29, 2009

Is good enough the enemy of vaguely better?

There is no question that with Semantic Web technologies we could produce a "better" Knowledge Web. The open question is how much better it would be. If a real user were to query a Semantic Web database, how much better would the query results really be? The answer is unknown because it would depend on the nature of the query processing infrastructure, the forms of inference and "reasoning" that are implemented, and how the database is structured. All of those aspects are continuing to evolve, and as of today none of that infrastructure is in place in a form that can perform queries comparable to even a simple Google query. Sure, we believe the results would be better, but that is about the strongest statement we can make today.
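
For concreteness, here is a minimal sketch of what such a structured query looks like, using the Python rdflib library (the basic Graph and SPARQL API shown here is real); the namespace and facts are invented for illustration, and nothing here settles the quality question:

```python
from rdflib import Graph, Literal, Namespace

# Hypothetical namespace and facts, invented purely for illustration.
EX = Namespace("http://example.org/")

g = Graph()
g.add((EX.Boston, EX.isCapitalOf, EX.Massachusetts))
g.add((EX.Boston, EX.population, Literal(600000)))

# Unlike a keyword search, the question is posed against explicit structure.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?city WHERE { ?city ex:isCapitalOf ex:Massachusetts . }
""")
for row in results:
    print(row.city)  # -> http://example.org/Boston
```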

Just this morning I was following a discussion on a Semantic Web email list and David Huynh made the statement:

It's a case of "good enough is the enemy of vaguely better", unfortunately.

So, yes, we know that queries to the Semantic Web will be better, but just how much better remains vague. In the face of a radically different approach that is completely unproven, it is not uncommon for "good enough" to win by default.

Interestingly, Microsoft's new Bing "decision engine" might have to deal with this same issue. Even if its results for many queries are actually "better," the question is whether its results are so clearly better overall that Google's "good enough" will not carry the day. A big unknown, but that is the nature of introducing innovation.

I myself often use a similar analogy for the greater success of Microsoft-based PCs compared to Apple Macs: maybe the Mac is better, but if the PC is "good enough," what does it matter if the Mac is somewhat "better"?

The real goal of a true Knowledge Web is that intelligent agents can do a lot more of our tasks for us. The risk today is that even if we succeeded in building such a knowledge web, its actual and perceived benefits, relative to the costs and the radical shift in mindset required to use such a web and its agents, might leave it only "vaguely better" than existing "good enough" approaches.

-- Jack Krupansky

Wednesday, May 27, 2009

What is the difference between a URI and a URL?

Anybody who has browsed the Web knows that a URL is the web address of a web page on a web site. Meanwhile, the Semantic Web is based on the URI. So, what is a URI, and how do the two differ? The short answer is that all URLs are by definition URIs, and in the context of the Semantic Web the preferred term is URI.

Part of the answer is historical: URL (Uniform Resource Locator) is the original term for a web address, the location of a web resource or web page on a web site, but technically we should be using the newer term URI (Uniform Resource Identifier).

Going further back in history, at one stage URI meant Universal Resource Identifier, but that usage has been superseded by Uniform Resource Identifier.

There is a little bit more to it. While all URLs are in fact URIs, some URIs are not URLs, in particular the subset known as URNs (Uniform Resource Names). An example of a URN might be the ISBN of a book, such as "URN:ISBN:0-062-51587-X".

So:

  • A URI is either a URL or a URN.
  • Every URL is a URI.
  • Every URN is a URI.
  • A URN is never a URL.
  • A URL is never a URN.

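To see the distinction in practice, here is a minimal sketch using Python's standard urllib.parse module, which happily parses both forms (it identifies the scheme but does not itself classify a URI as a URL or a URN):

```python
from urllib.parse import urlparse

# A URL: the scheme, host, and path tell you where to fetch the resource.
url = urlparse("http://example.com/index.html")
print(url.scheme, url.netloc, url.path)   # http example.com /index.html

# A URN: it names the resource (here by ISBN) without saying where it lives.
urn = urlparse("urn:isbn:0-062-51587-X")
print(urn.scheme, urn.path)               # urn isbn:0-062-51587-X
```
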
For HTML Web pages, it still makes sense to refer to the URL of a web page, even though URI is now the technically more precise term, since an HTML Web page URI is in fact always a URL (and vice versa).

For RDF statements, the subject and predicate of an RDF triple are by definition URIs, and the object is either a URI or a literal value. Those URIs may at times in fact be URLs that refer to resources such as files on Web servers, but that is not required.
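
As a minimal illustration using the Python rdflib library, here is a triple whose subject URI is a URN and whose predicate URI is also a URL (the Dublin Core title term is a real vocabulary term; the triple itself is contrived):

```python
from rdflib import Graph, Literal, URIRef

g = Graph()
book = URIRef("urn:isbn:0-062-51587-X")                  # a URI that is a URN, not a URL
title = URIRef("http://purl.org/dc/elements/1.1/title")  # a URI that is also a URL
g.add((book, title, Literal("Weaving the Web")))         # the object here is a literal

print(g.serialize(format="turtle"))
```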

If you really want to get technical, there is a discussion in IETF RFC 3305 entitled "Report from the Joint W3C/IETF URI Planning Interest Group: Uniform Resource Identifiers (URIs), URLs, and Uniform Resource Names (URNs): Clarifications and Recommendations".

-- Jack Krupansky

Tuesday, May 26, 2009

Semantic Drift

Semantic drift refers to the change in the meaning of a term or concept over time to the members of a community.

Obviously, it would be advantageous if the meaning of a term or concept did not vary over time, but reality is a force to be reckoned with.

The semantics of a term or concept can change because:

  • Changes in the real world, including people, technology, and the physical world, require updating of the meanings of terms and concepts.
  • What was considered important may no longer be considered important.
  • What was considered unimportant may no longer be considered unimportant.
  • New members of the community may have different values and requirements and need or choose to de-emphasize some aspects of the existing meaning and emphasize or add new aspects.
  • Existing members of the community may drop out and their influence on the importance of various aspects of the meanings of terms and concepts may wane. Some terms may become more strict, others looser.
  • New or significantly different domains may emerge and borrow or modify the existing meanings of terms.
  • Communities can split or splinter and the new sub-communities could diverge in their interests and emphasis on the essential meanings of terms and concepts.
  • Communities can merge or overlap, so that disjoint collections of terms and concepts need to be merged and conflicting meanings for the same syntactic terms need to be resolved.
  • Bugs or other deficiencies may be discovered and "fixed."

Strictly speaking, there are some categories of semantic mapping that are not semantic drift per se, but may still informally be considered as such:

  • Distinct communities may have distinct meanings for superficially identical terms or even concepts. Bridging between communities is needed.
  • Proprietary communities within a single industry or interest area may contrive their own meanings for what appear to be superficially identical terms or even concepts. Standards are needed.
  • Personal and place names may collide across distinct geographic areas. Disambiguation is needed.

The whole point is that we need a semantic infrastructure which acknowledges and helps us cope with semantic drift and all other forms of semantic mapping.
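
To make that concrete, here is a hypothetical sketch of one small piece of such an infrastructure: recording each sense of a term along with the period during which a community used it, and resolving terms relative to a point in time. The term, senses, and dates are all invented for illustration:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Sense:
    meaning: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the sense is still current

# One term, two senses over time (meanings and dates invented).
vocabulary = {
    "tablet": [
        Sense("a slab bearing writing or an inscription", date(1900, 1, 1), date(2001, 1, 1)),
        Sense("a flat touch-screen computer", date(2001, 1, 1)),
    ],
}

def resolve(term: str, as_of: date) -> Optional[str]:
    """Return the meaning the community attached to a term on a given date."""
    for sense in vocabulary.get(term, []):
        if sense.valid_from <= as_of and (sense.valid_to is None or as_of < sense.valid_to):
            return sense.meaning
    return None

print(resolve("tablet", date(1950, 6, 1)))   # a slab bearing writing or an inscription
print(resolve("tablet", date(2009, 5, 26)))  # a flat touch-screen computer
```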

-- Jack Krupansky

Monday, May 25, 2009

Tim Berners-Lee's dream for the Web

Just for future reference, I have reproduced here Tim Berners-Lee's brief statement of his dream for the Web, including the Semantic Web and intelligent agents, from his book Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web. He starts out Chapter 12, Mind to Mind, by saying:

I have a dream for the Web... and it has two parts.

In the first part, the Web becomes a much more powerful means for collaboration between people. I have always imagined the information space as something to which everyone has immediate and intuitive access, and not just to browse, but to create. The initial WorldWideWeb program opened with an almost blank page, ready for the jottings of the user. Robert Cailliau and I had a great time with it, not because we were looking for a lot of stuff, but because we were writing and sharing our ideas. Furthermore, the dream of people-to-people communication through shared knowledge must be possible for groups of all sizes, interacting electronically with as much ease as they do now in person.

In the second part of the dream, collaborations extend to computers. Machines become capable of analyzing all the data on the Web -- the content, links, and transactions between people and computers. A "Semantic Web," which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy, and our daily lives will be handled by machines talking to machines, leaving humans to provide the inspiration and intuition. The intelligent "agents" people have touted for ages will finally materialize. This machine-understandable Web will come about through the implementation of a series of technical advances and social agreements that are now beginning (and which I describe in the next chapter).

Once the two-part dream is reached, the Web will be a place where the whim of a human being and the reasoning of a machine coexist in an ideal, powerful mixture.

Realizing the dream will require a lot of nitty-gritty work. The Web is far from "done." It is in only a jumbled state of construction, and no matter how grand the dream, it has to be engineered piece by piece, with many of the pieces far from glamorous.

In short, he envisioned a Semantic Web of machines talking to machines comprising a machine-understandable Web.

It is also important to recognize that the Semantic Web is part of the overall Web dream and is not intended to be completely separate from the first, human-collaboration part of that dream.

Even a decade later, the book is just as relevant to the Web of today, and the future.

-- Jack Krupansky

Conceptual distance

One of the big canyons in the Semantic Abyss is how to compare concepts and sense their similarity or differences as well as their relations to other concepts. Sometimes a user can be laser-precise as to what concept is desired, but even then the user may not be aware that other concepts may be quite similar or related in some way. Sometimes it is desirable to treat very similar concepts as virtually identical, while other times it may be desirable merely to offer the user alternatives that might meet the desired objective. In any case, the starting point is to quantify the conceptual distance between concepts. As might be expected, that is likely to be much easier said than done.

Much of the existing research relates to determining the conceptual distance of documents from query terms, also known as document relevance. Here, the objective is instead to compare the terms or concepts themselves to determine how close they are and which are closest.

It is not clear if any absolute conceptual distance can be determined. Usually, a relative conceptual distance for a set of concepts is all that is needed, or maybe all that is possible.

Some of the purposes of comparing conceptual distances are to determine:

  • similarity
  • relatedness
  • equivalence
  • equality (say, in a social sense)
  • sameness ("same as")
  • comparability
  • synonymy

Any given application, or even a given user of an application, may have different criteria for how close the conceptual distance must be to satisfy their needs. Control over the looseness or tightness of the fit is probably also desirable, as in the sketch below.
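
As a rough illustration of what such control might look like, here is a hypothetical sketch that measures relative conceptual distance as shortest-path length in a tiny invented concept graph, with a user-tunable threshold for how loose a "match" may be; real systems would use far richer measures:

```python
from collections import deque

# A tiny invented taxonomy: parent -> children.
edges = {
    "vehicle": ["car", "bicycle"],
    "car": ["sedan", "truck"],
}

# Build an undirected adjacency map from the taxonomy above.
graph = {}
for parent, children in edges.items():
    for child in children:
        graph.setdefault(parent, set()).add(child)
        graph.setdefault(child, set()).add(parent)

def distance(a, b):
    """Shortest-path length between two concepts (infinity if unconnected)."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        if node == b:
            return d
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return float("inf")

def matches(a, b, threshold=2):
    """Treat two concepts as 'close enough' under a user-chosen threshold."""
    return distance(a, b) <= threshold

print(distance("sedan", "bicycle"))  # 3
print(matches("sedan", "truck"))     # True  (distance 2)
print(matches("sedan", "bicycle"))   # False (distance 3 exceeds the threshold)
```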

A big challenge for the Semantic Web is that different developers and communities have different conceptions of the meanings of concepts. Sometimes seemingly different terms are used to refer to what are logically similar or even identical concepts. This means that we need a sophisticated level of concept matching that can transparently bridge superficial semantic gaps, as well as alert the user where semantic gaps exist that cannot be automatically bridged but that the user may choose to accept manually, as if they had been bridged automatically.

Another problem is that superficially identical concepts may in fact be quite distinct at a deeper semantic level, so that concept matching should reject them as matches. Alternatively, the user can be alerted to these false matches and perhaps redefine the set of concepts to effectively bridge the perceived semantic gaps so that matching is more semantically correct.

In any case, the ability of the software to give the user excellent feedback on conceptual distance is a very important tool.

-- Jack Krupansky

Thursday, May 14, 2009

Mereology - the study of the relations between integral objects and portions of stuff

I was reading a post by Steffen Staab on the Semantic Web email list and ran across a link to a paper on mereology, which is basically the study of the relations between complete or integral objects and the component parts that make up the whole, as well as the relations between the parts themselves.

The Wikipedia article on Mereology tells us:

In philosophy, mereology (from the Greek μερος meros part and the ending -logy study, discussion, science) is a collection of axiomatic first-order theories dealing with parts and their respective wholes. In contrast to set theory, which takes the set-member relationship as fundamental, the core notion of mereology is the part-whole relationship. Mereology is both an application of predicate logic and a branch of formal ontology.

The Stanford Encyclopedia of Philosophy article on Mereology tells us:

Mereology (from the Greek μερος, 'part') is the theory of parthood relations: of the relations of part to whole and the relations of part to part within a whole. Its roots can be traced back to the early days of philosophy, beginning with the Presocratics and continuing throughout the writings of Plato (especially the Parmenides and the Theaetetus), Aristotle (especially the Metaphysics, but also the Physics, the Topics, and De partibus animalium), and Boethius (especially De Divisione and In Ciceronis Topica). Mereology occupies a prominent role also in the writings of medieval ontologists and scholastic philosophers such as Garland the Computist, Peter Abelard, Thomas Aquinas, Raymond Lull, Walter Burley, and Albert of Saxony, as well as in Jungius's Logica Hamburgensis (1638), Leibniz's Dissertatio de arte combinatoria (1666) and Monadology (1714), and Kant's early writings (the Gedanken of 1747 and the Monadologia physica of 1756). As a formal theory of parthood relations, however, mereology made its way into our times mainly through the work of Franz Brentano and of his pupils, especially Husserl's third Logical Investigation (1901). The latter may rightly be considered the first attempt at a thorough formulation of a theory, though in a format that makes it difficult to disentangle the analysis of mereological concepts from that of other ontologically relevant notions (such as the relation of ontological dependence). It is not until Leśniewski's Foundations of a General Theory of Manifolds (1916, in Polish) that a pure theory of part-relations was given an exact formulation. And because Leśniewski's work was largely inaccessible to non-speakers of Polish, it is only with the publication of Leonard and Goodman's The Calculus of Individuals (1940) that mereology has become a chapter of central interest for modern ontologists and metaphysicians.

This is quite heavy-duty stuff, but it does show the deepening intersection of computer science and philosophy, especially as we get further into the Semantic Web.

The original link pointed to the abstract for a paper entitled A Temporal Mereology for Distinguishing between Integral Objects and Portions of Stuff by Thomas Bittner and Maureen Donnelly. It discusses three categories of "stuff":

  • Integral objects, such as a car or computer.
  • Structured stuff, such as blood or the tissue of an organ.
  • Unstructured stuff, such as air and water, which is homogeneous.

They give the example of the distinction between the liver as an integral object and liver tissue as the structured stuff that comprises the liver. The two are obviously related, but they need to be treated distinctly depending on your intentions and purposes.

In the case of blood, we can refer to human blood in general, the blood of a particular human, a sample or portion of the blood of that particular human, and the "structured stuff" within that portion as it might be processed and separated into the components of red and white cells, platelets, and plasma.
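
As a minimal illustration, here is a hypothetical sketch of the parthood relation behind that example: direct part-of facts plus a transitive query. The identifiers are invented, and a real mereology would also have to distinguish integral objects from portions of stuff, which this sketch does not attempt:

```python
# Direct part-of facts, loosely following the blood example above.
# This sketch assumes each part belongs to a single whole.
part_of = {
    "plasma": "blood_portion",
    "red_cells": "blood_portion",
    "blood_portion": "patients_blood",
    "patients_blood": "patient",
    "liver_tissue": "liver",
    "liver": "patient",
}

def is_part_of(part, whole):
    """True if `part` is a direct or indirect (transitive) part of `whole`."""
    current = part
    while current in part_of:
        current = part_of[current]
        if current == whole:
            return True
    return False

print(is_part_of("plasma", "patient"))              # True, via blood_portion
print(is_part_of("liver_tissue", "blood_portion"))  # False
```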

This post is primarily intended as more of a bookmark for later reference, so my apologies for not giving a more concise or more detailed account of mereology.

-- Jack Krupansky

Sunday, May 10, 2009

Truth should not be hard-coded but somehow emergent

I ran across an interesting statement relating to truth in the context of the Semantic Web in a post on the W3C Semantic Web email list by Jeremy J. Carroll, Chief Product Architect of TopQuadrant, in reply to a post from John Sowa:

... truth should not be hard-coded but somehow emergent.

I would add that there may be many competing truths on any given issue and that the user will have to choose between competing value systems that each arrived at their own versions of the truth of an issue.

Each user may have their own preferred authorities and sources from which they may choose to select the appropriate value system.

All of this may evolve over time. Authorities and sources can change their minds. Underlying data can change. Calculations can change. Rules can change. Theories can change. Experiments can be re-evaluated or examined in a new light. The world can change. Authorities and sources can come and go and their influence can wax and wane. User preferences can change.

So, you cannot capture truth at one moment and hold it forever. You need to re-execute your query to determine the truth of an assertion at the time you need it. Of course, even the Semantic Web cannot give you the real truth, but merely the modeled truth, as it emerges and continues to evolve, with time being a variable required for determining the truth of a proposition.
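
As a minimal illustration of truth as a query rather than a constant, here is a hypothetical sketch in which assertions carry a source and a timestamp, and "truth" is re-derived at query time from whichever sources the user currently trusts. The encoding and source names are my own invention; the Pluto dates are the familiar milestones:

```python
from datetime import date

# (subject, claim, source, asserted_on)
assertions = [
    ("pluto", "is a planet", "lowell_observatory", date(1930, 3, 13)),
    ("pluto", "is a dwarf planet", "iau", date(2006, 8, 24)),
]

def truth(subject, trusted_sources, as_of):
    """Most recent claim about `subject` from a trusted source as of a date."""
    relevant = [a for a in assertions
                if a[0] == subject and a[2] in trusted_sources and a[3] <= as_of]
    return max(relevant, key=lambda a: a[3])[1] if relevant else None

trusted = {"lowell_observatory", "iau"}
print(truth("pluto", trusted, date(2000, 1, 1)))   # is a planet
print(truth("pluto", trusted, date(2009, 5, 10)))  # is a dwarf planet
```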

-- Jack Krupansky