Sunday, March 6, 2011

Linked lists for consumer-generated content for the Semantic Web

RDF and other Semantic Web technologies are powerful tools for hard-core information professionals to publish data for the Semantic Web, but are hardly usable for mere mortals such as consumers and other average users who wish to make their own content available on the Semantic Web. I propose what I call linked lists as a possible approach to publishing consumer-generated content for the Semantic Web. I am not using the term in the sense of traditional computer science (the linked list data structure), but more as a derivative of Linked Data and the Linked Open Data (LOD) movement. I started by noting that people like to keep and reference lists: lists of things to do, lists of people, lists of places, lists of songs, lists of movies, lists of restaurants, and even lists of lists. Lists tend to have a simple structure, easily processed by computer programs, and much of the data on the lists can relatively easily be translated into RDF-style URIs, at least in theory, and assuming that a sufficient library of the underlying concepts is developed, which is of course the segue into the world of Linked Open Data.

It is not the purpose or intent of this post to go into technical details, but simply to raise awareness of the basic concept of using consumer-generated lists as a way to introduce average users into being not just consumers of the Semantic Web, but generators of Semantic Web content as well.

Some lists are simple single-column lists of named entities. Simple enough, but the names may be nicknames, incomplete partial names, misspelled names, ambiguous names, etc. That raises the point about the importance of entity name resolution for "entry" into the world of the Semantic Web. I see this as a solvable problem, but it does illustrate just how yawning the chasm is between the world of real people and the Semantic Web itself. One opportunity here is that the multiple items on the list itself can provide a form of context that can help identify the category to be used for the list. Do the items look like names of places, names of things, names of people, movies, songs, bands, etc.? Once the category is identified, entity name resolution is substantially simpler. In some cases automated methods can complete 100% of the resolution, in some cases the user can be presented with a single likely match for confirmation, and in other cases a list of possible matches can be offered.
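To make the idea a bit more concrete, here is a minimal sketch in Python of category guessing followed by name resolution. Everything in it is an assumption for illustration: the tiny CATALOG stands in for a real library of LOD entities, the function names are hypothetical, and fuzzy string matching from the standard difflib module is just one possible matching technique.

```python
from difflib import get_close_matches

# Hypothetical mini-catalog standing in for a library of LOD entities.
CATALOG = {
    "movie": ["Casablanca", "The Godfather", "Blade Runner"],
    "band": ["The Beatles", "Pink Floyd", "Radiohead"],
}


def guess_category(items):
    """Pick the category whose names fuzzy-match the most list items."""
    scores = {
        category: sum(
            bool(get_close_matches(item, names, n=1, cutoff=0.6))
            for item in items
        )
        for category, names in CATALOG.items()
    }
    return max(scores, key=scores.get)


def resolve(item, category):
    """Return (status, candidates) for one item on the user's list."""
    names = CATALOG[category]
    if item in names:
        return "resolved", [item]        # fully automatic
    matches = get_close_matches(item, names, n=3, cutoff=0.6)
    if len(matches) == 1:
        return "confirm", matches        # one likely match to confirm
    if matches:
        return "choose", matches         # several possible matches
    return "unknown", []


user_list = ["Casablanca", "Godfather", "Blade Runer"]
category = guess_category(user_list)
results = {item: resolve(item, category) for item in user_list}
```

The three statuses mirror the three cases above: full automatic resolution, a single likely match for confirmation, and a list of possible matches.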

Multi-column lists would seem to be a harder problem, but the columns provide context. A name column may not be unique, but address or phone number may provide enough disambiguation. A song name may not be unique to a performer or spelled out properly, but adding a band or album name column might be plenty to disambiguate. The song name and performer names might both be incomplete or partially wrong, but combined they may actually be sufficient for disambiguation or to at least dramatically reduce the number of likely options.
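As a rough sketch of how a second column narrows the options, the Python below filters a small catalog by partial song and performer names. The SONGS table and the candidates function are hypothetical stand-ins, and simple substring matching is assumed purely for illustration.

```python
# Hypothetical catalog of (song, performer) pairs; real data would come
# from Linked Open Data sources.
SONGS = [
    ("Yesterday", "The Beatles"),
    ("Yesterday", "Boyz II Men"),
    ("Hurt", "Nine Inch Nails"),
    ("Hurt", "Johnny Cash"),
]


def candidates(song_hint, performer_hint=""):
    """Return catalog rows whose fields contain the (possibly partial) hints."""
    song_hint = song_hint.lower()
    performer_hint = performer_hint.lower()
    return [
        (song, performer)
        for song, performer in SONGS
        if song_hint in song.lower() and performer_hint in performer.lower()
    ]


song_only = candidates("hurt")             # ambiguous: two performers
with_performer = candidates("hurt", "cash")  # second column disambiguates
```

Neither hint is a complete name, yet the two together leave a single candidate.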

Multiple columns may be unnecessary other than as memory aids and for disambiguation. After all, the LOD cloud should have all of the public data for any entity. So, the user can simply maintain their own stripped-down representation for any entity and then let the SemWeb itself supply any additional desired information. As long as enough info is supplied to identify the entity (or even plural entities), there is no need for the user to keep more detailed info in their own list. So, maybe the user can conceptually think of their lists as having two sides or parts: 1) their own raw list in their own preferred format (e.g., simple text file or spreadsheet), and 2) their preferred representation of the actual referenced LOD entities. Note that the SemWeb representation might be in a non-list format such as a graphical map or other structured format or even a full spreadsheet or database layout, if that is what the user has chosen. Of course, the user could choose any number of formats.

There will likely be some interest in templates for multi-column lists, but I don't see them as a requirement since the rows of the list provide disambiguating context. In fact, generally, the category of most lists will be quite obvious to even relatively simple automated analyzers, presuming there are enough rows. This does highlight the importance of being able to identify the category of SemWeb entities.

The user could of course author and maintain their lists in their favorite local editing tool such as a text editor or spreadsheet, but it is likely that keeping lists online would be preferable. Presumably sites would spring up which specialize in maintaining and publishing SemWeb lists. Of course there would be privacy controls so that private lists remain completely private or only shared as the user decides, but it should be dirt-simple easy to quickly publish a user-generated list. And once a user-generated list gets published to the Semantic Web, presto, it is now a candidate for getting linked into the LOD cloud.

Linking of user lists can occur in two ways: 1) A simple, direct link, such as a user-generated "list of favorite lists", or 2) creating a derivative list based on one or more existing published lists. Besides creating their own list from scratch or by wholesale copying of an existing published list, the user could reference an existing list and tell the software that the user wants to "start with" the existing list and then supplement it, adding some items and deleting others. The user might even request that multiple lists be combined. Or maybe include only some columns of data. A common usage would be for a user to identify a trendsetter (maybe just a friend) and supplement that list with their own personal interests. The key is to maintain a dynamic reference to the base list, so that the user's full, published list will change as any base lists change.
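One way to picture a derivative list with a dynamic reference is sketched below in Python. The UserList class and its field names are hypothetical. The only point is that the published view is computed on demand from the base lists plus the user's own additions and deletions, rather than copied once.

```python
from dataclasses import dataclass, field


@dataclass
class UserList:
    """A list that may 'start with' other lists (hypothetical structure)."""
    name: str
    own_items: list = field(default_factory=list)   # items the user added
    bases: list = field(default_factory=list)       # referenced UserList objects
    removed: set = field(default_factory=set)       # base items the user deleted

    def items(self):
        """Compute the published view on demand so base-list changes show up."""
        seen, result = set(), []
        sources = [base.items() for base in self.bases] + [self.own_items]
        for source in sources:
            for item in source:
                if item not in self.removed and item not in seen:
                    seen.add(item)
                    result.append(item)
        return result


friend = UserList("friend's favorites", ["Casablanca", "Blade Runner"])
mine = UserList("my favorites", own_items=["The Godfather"],
                bases=[friend], removed={"Blade Runner"})

before = mine.items()
friend.own_items.append("Vertigo")   # the base list changes later
after = mine.items()                 # the derived list follows automatically
```

Combining multiple lists is just a matter of putting more than one entry in bases. Selecting only some columns is left out of this sketch.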

The user's lists would be as the user creates and maintains them and completely devoid of formal URIs or other arcane SemWeb concepts. The published version would of course be in hard-core RDF, but with the clear-text source as well. The user would also have the option of automatically "cleaning up" their list to correct spelling errors, complete names, etc.
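For a sense of what the published side might look like, here is a small Python sketch that turns a resolved list into RDF in the N-Triples syntax. The example.org URIs are placeholders, the function names are hypothetical, and the choice of rdfs:member and rdfs:label to carry the list membership and the user's clear-text entry is an assumption made for illustration, not a proposed vocabulary.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def escape(text):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def to_ntriples(list_uri, entries):
    """entries: (clear_text, resolved_uri) pairs; returns N-Triples text."""
    lines = []
    for text, uri in entries:
        # The list links to the resolved entity ...
        lines.append(f"<{list_uri}> <{RDFS}member> <{uri}> .")
        # ... and the user's own wording is kept alongside as a label.
        lines.append(f'<{uri}> <{RDFS}label> "{escape(text)}" .')
    return "\n".join(lines)


# Placeholder URIs; a real publisher would use resolved LOD identifiers.
published = to_ntriples(
    "http://example.org/lists/my-favorites",
    [("Godfather", "http://example.org/movie/the-godfather")],
)
```

The user never sees any of this. They keep editing the plain list, and the software regenerates the RDF.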

Linked lists provide an opportunity for dramatically increasing the scope of the Semantic Web and also provide an opportunity to escape from the current paradigm of web sites such as Facebook and LinkedIn being walled gardens holding user data captive.

The issue of exactly where online user lists would be published and stored is open, but the simple answer is: anywhere. In some sense user lists would be similar to blogs in that a user might have their own domain or choose a hosting site that caters to their personal skills and interests. The real point is that it truly does not matter where linked lists reside once they are identified or registered as being part of the Semantic Web. That raises the question of how to register new lists, but I am sure there will be plenty of sites and users ready and willing to fill that void.

-- Jack Krupansky

The semantic gap between bits and knowledge

We have a wide-ranging spectrum of levels of abstraction for representing information in computers, none of which is particularly well adapted to representing human knowledge in a form that is readily comprehended by computer programs. At the low end of the spectrum we have bits, bytes, characters, text, databases, XML, and even RDF for the Semantic Web. We have specialized abstractions for specialized applications as well. Somewhere in the middle of the spectrum we have various so-called knowledge representation languages which purport to be able to represent knowledge, but only in a host of well-defined, limited, constrained forms that are still not representative of true human knowledge and are not directly recognizable and usable by mere mortals. Sad to say it, but text for natural language is the closest form we have in computers to something that is recognizable and usable by mere mortals. Unfortunately, free-form text is not readily and easily recognizable and usable to computer programs (as a surrogate for human knowledge). So, we have a vast semantic gap between the bits of computers and the knowledge of humans.

I wish I had some graphic ability so that I could draw a fancy diagram of this spectrum of information and knowledge representation, but I don't, so I'll present the spectrum as a simple list, starting at the low end:

  1. Bits - zero and one, on and off.
  2. Bytes
  3. Characters and numbers
  4. Strings - sequences of characters representing individual words or identifiers
  5. Text - free-form sequences of strings or words, possibly even natural language prose
  6. Structured text - tabular lists (e.g., CSV)
  7. Databases
  8. Application-specific data formats
  9. XML
  10. RDF
  11. Big Gap #1
  12. Knowledge representation languages
  13. Big Gap #2
  14. Human knowledge and human language

RDF is a knowledge representation language of sorts, but is more specialized and adapted to representing raw information than to representing more humanly recognizable knowledge.

It is worth noting that there is a distinction between knowledge and communication, but that is beyond the scope of the main point here about bits vs. human knowledge. One distinction is the concept of tacit knowledge, which is knowledge that defies straightforward communication or representation in language.

This information/knowledge spectrum layout immediately raises the question of the sub-spectrum of knowledge representation languages, a topic worthy of attention, but that is beyond the scope of the immediate issue.

One of the most notable causes of the vast semantic gap between current knowledge representation languages and human knowledge is the issue of vocabulary definition. Computer-based systems strive to minimize and eliminate ambiguity while human knowledge and language embrace and thrive on ambiguity. Coping with ambiguity may be the ultimate abyss to be hurdled before computers can have ready access to human knowledge.

One major problem with knowledge representation languages is that there are a lot of them, a virtual Tower of Babel of them, so that we do not have a common knowledge language that can be leveraged across all forms of knowledge and all application domains. Leverage is a very powerful tool for solving problems with computers, but lack of leverage is one of the most serious obstacles to solving problems with computers. Leverage can rapidly accelerate the adoption of new technology, but lack of leverage seriously retards adoption. XML and RDF were big leaps forward in leverage, but still nowhere near enough.

One open question is whether a rich-enough knowledge representation language can be built using RDF as its lower level, or whether something richer and more flexible than RDF is needed. This may hinge on what stance you take on ambiguity.

-- Jack Krupansky