Sunday, March 6, 2011

Linked lists for consumer-generated content for the Semantic Web

RDF and other Semantic Web technologies are powerful tools for hard-core information professionals to publish data for the Semantic Web, but are hardly usable for mere mortals such as consumers and other average users who wish to make their own content available on the Semantic Web. I propose what I call linked lists as a possible approach to publishing consumer-generated content for the Semantic Web. I am not using the term in the sense of traditional computer science (the linked list data structure), but more as a derivative of Linked Data and the Linked Open Data (LOD) movement. I started by noting that people like to keep and reference lists: lists of things to do, lists of people, lists of places, lists of songs, lists of movies, lists of restaurants, and even lists of lists. Lists tend to have a simple structure, easily processed by computer programs, and much of the data on the lists can relatively easily be translated into RDF-style URIs, at least in theory, and assuming that a sufficient library of the underlying concepts is developed, which is of course the segue into the world of Linked Open Data.

It is not the purpose or intent of this post to go into technical details, but simply to raise awareness of the basic concept of using consumer-generated lists as a way to introduce average users into being not just consumers of the Semantic Web, but generators of Semantic Web content as well.

Some lists are simple single-column lists of named entities. Simple enough, but the names may be nick names, incomplete partial names, misspelled names, ambiguous names, etc. That raise the point about the importance of entity name resolution for "entry" into the world of the Semantic Web. I see this as a solvable problem, but it does illustrate just how yawning is the chasm between the world of real people and the Semantic Web itself. One opportunity here is that the multiple items on the list itself can provide a form of context that can help identify the category to be used for the list. Do the items look like names of places, names of things, names of people, movies, songs, bands, etc.? Once the category is identified, entity name resolution is substantially simpler. In some cases automated methods can complete 100% of the resolution, in some cases the user can be presented with a single likely match for confirmation, and in other cases a list of possible matches can be offered.

Multi-column lists would seem to be a harder problem, but the columns provide context. A name column may not be unique, but address or phone number may provide enough disambiguation. A song name may not be unique to a performer and spelled out properly, but adding a band or album name column might be plenty to disambiguate. The song name and performer names might both be incomplete or partially wrong, but combined they may actually be sufficient for disambiguation or to at least dramatically reduce the possible likely options.

Multiple columns may be unnecessary other than as memory aids and for disambiguation. After all, the LOD cloud should have all of the public data for an any entity. So, the user can simply maintain their own stripped-down representation for any entity and then let the SemWeb itself supply any additional desired information. As long as enough info is supplied to identify the entity (or even plural entities), there is no need for the user to keep more detailed info in their own list. So, maybe the user can conceptually think of their lists as having two sides or parts: 1) their own raw list in their own preferred format (e.g., simple text file or spreadsheet), and 2) their preferred representation of the actual referenced LOD entities. Note that the SemWeb representation might be in a non-list format such as a graphical map or other structured format or even a full spreadsheet or database layout, if that is what the user has chosen. Of course, the user could choose any number of formats.

There will likely be some interest in templates for multi-column lists, but I don't see them as a requirement since the rows of the list provide disambiguating context. In fact, generally, the category of most lists will be quite obvious to even relatively simple automated analyzers, presuming there are enough rows. This does highlight the importance of being able to identify the category of SemWeb entities.

The user could of course author and maintain their lists in their favorite local editing tool such as a text editor or spreadsheet, but it is likely that keeping lists online would be preferable. Presumably sites would spring up which specialize in maintaining and publishing SemWeb lists. Of course there would be privacy controls so that private lists remain completely private or only shared as the user decides, but it should be dirt-simple easy to quickly publish a user-generated list. And once a user-generated list gets published to the Semantic Web, presto, it is now a candidate for getting linked into the LOD cloud.

Linking of user lists can occur in two ways: 1) A simple, direct link, such as a user-generated "list of favorite lists", or 2) creating a derivative list based on one or more existing published lists. Besides creating their own list from scratch or by wholesale copying of an existing published list the user could reference an existing list and tell the software that the user wants to "start with" the existing list and then supplement it, adding some items and deleting others. The user might even request that multiple lists be combined. Or maybe include only some columns of data. A common usage would be for a user to identify a trendsetter (maybe just a friend) and supplement that list with their own personal interests. The key is to maintain is a dynamic reference to the base list and the user's full, published list will change as any base lists change.

The user's lists would be as the user creates and maintains them and completely devoid of formal URIs or other arcane SemWeb concepts. The published version would of course be in hard-core RDF, but with the clear-text source as well. The user would also have the option of automatically "cleaning up" their list to correct spelling errors, complete names, etc.

Linked lists provide an opportunity for dramatically increasing the scope of the Sematic Web and also provide an opportunity to escape from the current paradigm of web sites such as Facebook and LinkedIn being walled gardens holding user data captive.

The issue of exactly where online user lists would be published and store is open, but the simple answer is: anywhere. In some sense user lists would be similar to blogs in that a user might have their own domain or chose a hosting site that caters to their personal skills and interests. The real point is that it truly does not matter where linked lists reside once they are identified or registered as being part of the Semantic Web. That raises the question of how to register new lists, but I am sure there will be plenty of sites and users ready and willing to fill that void.

-- Jack Krupansky

2 Comments:

At April 2, 2011 at 10:55 PM , Blogger Asfand Mudassir said...

This comment has been removed by the author.

 
At April 5, 2011 at 11:14 AM , Blogger Rohan said...

Isn't twitter like a linked list as well. Just add a semantic analysis similar to what you'd do in case of user generated lists.

 

Post a Comment

Subscribe to Post Comments [Atom]

<< Home