Thursday, April 30, 2009

Need for a Casual Semantic Web

Current Semantic Web technologies are difficult to use, even for highly skilled professionals. Unlike basic HTML, where interesting Web pages and blogs can be assembled and hyperlinked with very little effort and grow at a "viral" rate, the Semantic Web is growing at a snail's pace. A high level of sophistication is needed to develop even basic Semantic Web content. It should not be that way. What is needed is a Casual Semantic Web, where even naive consumers and low-skilled workers can rapidly put together interesting Semantic Web content.

Existing Semantic Web technologies are extremely flexible and enable very complex information structures, but consumers and low-skilled workers do not need all or even any of that complexity. They need simple constructs.

They need little more than "elements" for concepts such as the following (a rough sketch of how such elements might be used appears after the list):

  • Names
  • Places
  • Addresses
  • Phone numbers
  • Email addresses
  • IM IDs
  • Social networking IDs
  • Dates
  • Ages
  • Activities
  • Interests
  • Preferences
  • Opinions
  • Polls
  • Rankings
  • Ratings
  • Friends
  • Colleagues
  • Businesses
  • Governmental agencies
  • Non-profit institutions
  • Hospitals
  • Doctors
  • Employers
  • Employees
  • Teams
  • Team members
  • Groups
  • Associations
  • Membership
  • Travel plans
  • Children
  • Parents
  • Relatives
  • Roles
  • Lists
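
Purely as an illustrative sketch (in Python), here is how a few of these elements might be captured and then mechanically lowered to simple statements, without the user ever seeing any Semantic Web syntax. The casual: prefix, the element names, and all of the data are hypothetical, invented just for this example:

    # Hypothetical "casual" elements for one person's public profile.
    profile = {
        "casual:name": "Alice Example",
        "casual:place": "Seattle, WA",
        "casual:email": "alice@example.com",
        "casual:interest": ["hiking", "photography"],
        "casual:friend": ["http://example.com/people/bob"],
    }

    def to_statements(subject, profile):
        # Each (element, value) pair lowers to one simple statement
        # about the profile owner.
        for element, value in profile.items():
            values = value if isinstance(value, list) else [value]
            for v in values:
                yield (subject, element, v)

    for s in to_statements("http://example.com/people/alice", profile):
        print(s)

The point of the sketch is that the consumer only ever fills in named elements; any lowering to formal Semantic Web statements would happen behind the scenes.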

Of course they need convenient methods to publish their personal Semantic Web.

They need convenient Semantic Web browsing tools, although that capability may simply fold right into the traditional Web browser.

Traditional search engine and blog "crawling" technology would be sufficient to aggregate data and enable queries that correlate users, groups, organizations, interests, and so on. There would also be plenty of opportunity for specialized aggregators or mirroring or caching services to evolve, but none would have a monopoly or be able to act as gatekeepers to innovation, since the underlying data would always be freely available to all.
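
To illustrate the kind of correlation such an aggregator could support, here is a toy Python query over crawled statements; the data and the predicate name are made up:

    from collections import defaultdict

    # Toy crawled data: (user, predicate, value) statements.
    crawled = [
        ("alice", "interest", "hiking"),
        ("bob", "interest", "hiking"),
        ("bob", "interest", "chess"),
    ]

    # Correlate users by shared interest.
    by_interest = defaultdict(list)
    for user, predicate, value in crawled:
        if predicate == "interest":
            by_interest[value].append(user)

    print(by_interest["hiking"])   # ['alice', 'bob']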

Client apps (including for the iPhone and other mobile devices) could provide the kind of user-friendly access UI that people have come to expect from current social networks, but the "open" nature of the "networks" would provide greater flexibility and opportunity for innovation.

Users also need access control for privacy.

They also need a mechanism to manage their identity.

Elsewhere I have suggested the utility of a Data Union for storage of personal data.

The Casual Semantic Web would in fact be a step in the direction of open garden social networking in which the users are in control rather than being under the thumb of the "keepers" of current walled-garden social networks.

Users would be capable of introducing innovative social networks themselves rather than being dependent on others to provide (and control) them.

Overall, the main starting point is an extremely user-friendly vocabulary that does not require a computer science degree or advanced training just to publish relatively basic information.

-- Jack Krupansky

The Semantic Web swamp

Swamps are interesting places, but not if you are looking to make rapid progress. They are an unfortunate hybrid between dry land and open water. A land vehicle will get mired in the muck. Ditto with a water vehicle. Sure, there may be patches of dry earth here and there or pools of water here and there, but not enough of either in a connected fashion to exploit either. This is how a lot of the current Semantic Web feels to me. There is simply too much technological "muck" that slows progress.

Just yesterday (and into today) I was following an email thread on the OWL list about the relatively simple concepts of subclass and superclass, but the discussion simply went on and on because there is no clarity in the specifications. Maybe if somebody points you to precisely the right passage it will all become clear (or maybe not), but that should not be required.

Sure, there are books and tutorials and seminars and consultants, but none of that should be required, at least for the level that the technology is at today.

It is an open question whether tools or additional layers can be built on top of the current Semantic Web technologies that are sufficient to hide the "muck" of the "swamp." I am hopeful that is the case, but there are no guarantees.

Can we "flood" the swamp to turn it into a navigable lake or sea? Maybe.

Can we "fill" in the swamp to create solid, traversable dry land with the underlying swamp as an "aquifer"? Maybe.

Besides the concerns about the usability of current Semantic Web technologies, there is the larger question of whether the technology is already so complex that even seasoned professionals may be unable to verify that Semantic Web constructions are technically correct, valid for their intended applications, not too fragile, and readily maintainable by people other than their original developers. Five or ten years from now, could we end up with a knowledge crisis analogous to the current banking crisis, simply because we do not know the location or magnitude of the risks?

-- Jack Krupansky

Wednesday, April 29, 2009

The Quest for Computable Knowledge

Check out Stephen Wolfram's thoughts on "The Quest for Computable Knowledge" on the new Wolfram|Alpha blog. He acknowledges Leibniz's role in the collection of knowledge, reasoning, and computation:

I've always been particularly struck by Gottfried Leibniz's role. He really had pretty much the whole idea of Wolfram|Alpha--300 years ago.

At the end of the 1600s he came to believe that somehow there must be a way to mechanize the resolution of all human arguments.

He imagined that one could represent human discourse using logic and mathematics. Then he imagined that one could use a machine to work out answers from this--and in fact he even built some small mechanical calculators himself.

He also realized that to provide raw material for his mechanization it would be necessary to assemble lots of knowledge. So he worked hard to get libraries constructed, and to invent systems for organizing them.

Of course there were some elements missing. But Leibniz really had the right basic idea.

-- Jack Krupansky

Tuesday, April 28, 2009

Levels of language for knowledge

Although I am still not convinced that the current Semantic Web technologies, based on RDF, are in fact the optimal foundation for a true knowledge web, I will continue to proceed on the assumption that RDF is a reasonable starting point. That said, there is a real question of what exactly we can model in RDF. Maybe, in theory, we can model anything and everything in RDF, but is it really an efficient and effective "language" for the higher levels of knowledge? Even if it "works" in theory, is it really practical?

In computer programming languages we have "levels" of language:

  1. Machine code. The actual bits for the "instructions" executed by the hardware (or interpreter.)
  2. Assembly language. Mnemonic opcodes, symbolic names, macros, and other convenient features, but there is still a one-to-one relationship with machine code instructions.
  3. High-level languages. A compiler or interpreter translates declarations, expressions, "statements", functions, and classes into machine language instructions. These tend to be "procedural" languages.
  4. "4GL". User-oriented "query" languages that allow the user to interact in terms closer to the real world. These tend to be "declarative" languages -- the user says "here is what I want" and the computer figures out how to do it. Maybe even a little natural language or a structured subset.
  5. "5GL". Use of artificial intelligence, such as to infer what the user really wants. Deeper and broader support for natural language.

It may in fact be rather dangerous and counterproductive to assert that a knowledge web can be built and used based on such a hierarchy of languages, but for now it at least seems to be a reasonable conjecture to contemplate, at least until there is some clear and convincing evidence that it is a bad idea.

Using this programming language level model, RDF seems to "fit" as the assembly language level for knowledge. Names, in the form of namespaces and URIs, may be rather cryptic, but they are certainly symbolic, at least to some degree. Triples have a nice, fixed format, with three "fields" (subject, predicate, object), much like machine/assembly language "instructions."
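
To make the analogy concrete, here is a toy Python sketch in which a fixed-format triple plays the role of the "assembly instruction" and a hypothetical "compiler" lowers a friendlier statement down to it. FOAF is a real vocabulary, but the compile step, its tiny predicate mapping, and the example namespace are invented for illustration:

    FOAF = "http://xmlns.com/foaf/0.1/"   # real, widely used vocabulary
    EX = "http://example.com/people/"     # made-up namespace

    def compile_statement(subject, verb, obj):
        # Lower a simple high-level statement to one raw triple, the way
        # a compiler lowers a statement to machine instructions.
        predicates = {"knows": FOAF + "knows"}   # tiny hypothetical mapping
        return (EX + subject, predicates[verb], EX + obj)

    # High-level "alice knows bob" becomes an assembly-level triple.
    print(compile_statement("alice", "knows", "bob"))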

Most significantly, an assembly language is a great tool for advanced, leading edge professionals, but an exceedingly poor tool for "users" such as subject matter experts who know their domain but not necessarily the nuances of the Semantic Web technologies such as RDF.

Clearly there is a need for higher-level knowledge languages. I do not have any detailed answers here and now, but this is obviously an area to think about.

I would close here by noting that we should be careful not to confuse languages and tools. Graphic interactive tools and environments will certainly be as useful in working with knowledge webs as they are in traditional computer programming, but it is still important to be clear about what level of language is being modeled directly behind the fancy graphical images. Putting a pretty GUI frontend on an RDF editor does not magically give the user the ability to converse in a 4GL. In short, a GUI frontend will be appropriate for each level of knowledge language, and the GUI may be radically different for different language levels.

-- Jack Krupansky

Tuesday, April 21, 2009

What is the unit of knowledge?

There is a lot of talk about knowledge, but what exactly is the unit of knowledge?

Computers have bits, bytes, words, integers, floating point, and strings, but how do we even talk about the units of knowledge?

Before continuing, I would note one interesting answer that I stumbled upon using Google. According to Dan Markovitz of TimeBack Management, a common saying around Toyota is that:

The basic unit of knowledge is a question.

That may have some utility, but it begs the question as to what the unit of a question might be, leaving us with little more than we started with.

A variation of that adage might be an axiom about units of knowledge:

The basic unit of knowledge is the most narrow and focused question that we can formulate about knowledge.

A corollary of that axiom would be:

The basic unit of knowledge is the response to the most narrow and focused question that we can formulate about knowledge.

But, I am not so sure that such an axiom must necessarily be true. A question is like a tool, a measuring and manipulation device, used to access knowledge. But in the real world it seems as if matter has an even finer structure than the finest tools we can construct for measuring and manipulating matter. On the other hand, maybe that merely means that we simply are not yet smart enough to envision such tools. In some cases, such as with subatomic particles, we use indirect tools such as particle accelerators to smash particles apart so we can observe the results. So, maybe my axiom is not so far off, for now.

WikiAnswers.com has an interesting answer:

Q: What is the smallest unit of knowledge?

A: The adjective.

That is along the lines of a thought I had, that attributes of objects may be the smallest units of knowledge.

I mostly think of knowledge as collections of statements about objects, phenomena, or beliefs.

We could say that the statement is the "unit" of knowledge, but to me a statement is more a form of knowledge, a container rather than the contents of the container. We are more interested in the units of the contents of statement "containers."

Operationally, nouns, pronouns, adjectives, verbs, adverbs, prepositions, interjections, and conjunctions (the eight parts of speech) are the basic natural language units for knowledge. Or you could say that words are the units of natural language knowledge. This is certainly true, but seems to sidestep the issue of true "knowledge" in the sense that an assembly of words can suddenly conjure up a meaning that is quite distinct from the meanings of the individual words.

A dictionary might contain all of the words used in a novel, but the real question is what is the unit of storytelling that makes a novel what it is rather than just a sequence of statements.

An operational definition from the world of the Semantic Web is the RDF statement or RDF triple, which consists of a subject, predicate (or property), and object. An RDF statement can be somewhat analogous to an adjective. At least in the context of the Semantic Web, RDF triples are clearly the unit of "knowledge." But, that begs the question of whether the Semantic Web as currently envisioned is comprehensive enough to represent all knowledge.

For now, I am comfortable using the statement as the unit of basic knowledge. For example (a sketch of how such statements might be encoded follows the list):

  • The apple is red.
  • Some apples are red.
  • Not all apples are red.
  • The apple is on the table.
  • There is no apple on the table.
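
As a sketch, the first and fourth statements reduce naturally to plain (subject, predicate, object) triples, shown here in Python with made-up identifiers; note that the negative and quantified statements do not fit a single bare triple, which hints at the limits of the triple as the unit:

    # Made-up identifiers; each entry is one (subject, predicate, object).
    statements = [
        ("apple-1", "hasColor", "red"),    # The apple is red.
        ("apple-1", "isOn", "table-1"),    # The apple is on the table.
    ]
    # "Not all apples are red" and "There is no apple on the table"
    # involve negation and quantification, which a single bare triple
    # cannot express directly.
    for s in statements:
        print(s)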

Next, there are various forms of statements:

  • Existence. The fact that some object, phenomenon, or belief does or does not exist.
  • Attributes. Such as the color or location or size of an object.
  • Relationships to other objects (or phenomena or beliefs). How do the objects in the world interact.

We can also refer to such simple statements as facts. There is some appeal to suggesting that facts are the units of knowledge. Whether facts and statements are the same or dissimilar in some way is left for further consideration in the future.

An immediate question is the status of questions relative to statements. My current thesis is that questions are simply another form of statement, a kind of mirror reflection of statements:

  • Is the apple red?
  • Are all apples red?
  • Is there an apple on the table?
  • Where is the apple?

We could presume that the form of the answer or response to any question is the unit of knowledge.

Next, there is the issue of compositional structuring of statements, collections of statements that are related somehow. This is where things get interesting, since such collections of statements may be the unit for storytelling, for constructing elaborate stories, including novels. These collections of statements may represent a unit of meaning that is far richer than the level of simple, factual statements. So, we have the issue of whether facts or story-level meaning should be our unit of knowledge.

Google has a project called knol which is billed as "a unit of knowledge". A knol is in fact a full-blown paper or essay or article, comparable to a Wikipedia article. That is a rather different usage of the term "unit." One could propose that a "unit" of knowledge is an interesting and usable package of knowledge, including books, web pages, PDF documents, magazines, movies, podcasts, blogs, blog posts, Twitter "tweets", etc. Fair enough.

Maybe my final thought, for now, is that a unit of knowledge is any form of knowledge that is usable, as is. Even a passage of text clipped out of the middle of a paragraph might be a usable unit of knowledge.

I have not answered the initial question precisely, but I think there is enough foundation to proceed without having a precise definition, for now.

-- Jack Krupansky

Cultivating knowledge vs. garbage in, garbage out

One day we will have a sufficiently rich and robust infrastructure capable of supporting the development of a true knowledge web, but will we be ready for it? Even with the proper tools in our hands, will we know how to use them effectively?

What is needed is some sense of how to cultivate knowledge so that we do not end up with vast mountain ranges of crap that suffer from GIGO (garbage in, garbage out.)

At a simplistic level we need tools, methods, and discipline for knowledge curation, but that is much easier said than done.

Further, we need a culture of knowledge that is compatible with and accepted by average consumers so that we can in fact build a vast consumer-centric knowledge web that does not depend on vast legions of human knowledge curators just to accumulate relatively simple tidbits of knowledge that consumers produce on a daily basis.

In short, we need a whole science of consumer-centric knowledge cultivation. Otherwise, we could end up producing a knowledge web that is not terribly useful relative to its promise.

-- Jack Krupansky

Monday, April 20, 2009

Software agents for virtual browsing and virtual presence

With so many places to go and so many things to see and do on the Web, it is getting almost impossible to keep up with the proliferation of interesting information out there. We need some help. A hefty productivity boost is simply not good enough. We need a lot of help. Browser add-ons, better search engines, and filtering tools are simply not enough. Unfortunately, the next few years hold more of the same.

But, longer term, we should finally start to see credible advances in software agent technology that help extend our own minds, so that we can engage in virtual browsing and maintain a virtual presence on the Web, effectively reaching and touching a far broader, deeper, and richer lode of information than we can through personal browsing and personal presence alone.

Twitter asks us what we are doing right now, but our online activity and presence with the aid of software agents will be a thousand or ten thousand or even a million or ten million times greater than we can personally achieve today. What are each of us interested in? How about everything?! Why not?

The gradual evolution of the W3C conception of the Semantic Web will eventually reach a critical mass where even relatively dumb software agents can finally appear to behave in a relatively intelligent manner that begins to approximate our own personal activity and personal presence on the Web.

It may take another five to ten years, but the long march in that direction is well underway.

The biggest obstacle right now is not the intelligence of an individual software agent per se, but the need to encode a rich enough density of information in the Semantic Web so that we can realistically develop intelligent software agents that can work with that data. We will also need an infrastructure that mediates between the actual data and the agents.

-- Jack Krupansky

Monday, April 13, 2009

Using Data Unions as repositories of personal data

In order to facilitate the development of open garden social networks it is necessary to have a safe place for consumers to place their personal data, not just where it can be stored and accessed, but also to control access and to provide a reliable digital identity. Many years ago I thought up a scheme I called a data union, kind of a cross between a data bank and a credit union, which would provide exactly that form of reliable and safe storage for a consumer's personal data. I finally wrote up a rough, summary description back in 2005, but I have not yet pursued the concept any further.

The intention is not so much to store a consumer's bulk data such as documents, photos, media, etc., but simply to store and control the attribute information that might be needed for online transactions and promotion of products and services, such as name, address, phone numbers, Social Security number, age and birth date, gender, interests, and whatever. The intention is to give the consumer great control over exactly what personal information is available to whom.

It would be a natural extension to have a data union safety deposit box, which would be a modest amount of digital storage, maybe in the megabytes or a "few" gigabytes, sufficient for documents, valuable images, etc., but not intended for full-blown personal storage.

A data union would be an ideal repository for online digital identity credentials, or at least as a digital identity validation service. For example, the consumer could approve an entity with which they are willing to transact and then the consumer could provide a transaction code to that entity which the data union could verify.
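
Here is a minimal sketch of that approve-then-verify flow, in Python, with every name and the code scheme invented purely for illustration:

    import uuid

    class DataUnion:
        def __init__(self):
            self._codes = {}   # transaction code -> (consumer, entity)

        def approve(self, consumer, entity):
            # Consumer approves an entity; the data union issues a
            # one-time code that the consumer hands to that entity.
            code = uuid.uuid4().hex
            self._codes[code] = (consumer, entity)
            return code

        def verify(self, entity, code):
            # The entity presents the code; the data union confirms the
            # consumer actually approved this particular entity.
            return self._codes.get(code, (None, None))[1] == entity

    union = DataUnion()
    code = union.approve("alice", "shop.example.com")
    print(union.verify("shop.example.com", code))   # True
    print(union.verify("evil.example.com", code))   # False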

A data union would enable the consumer to be as open and visible and transparent or as closed and hidden and secretive as they wish.

-- Jack Krupansky

Sunday, April 12, 2009

Open garden social networking vs. walled gardens

I am truly tired of social networking sites that are walled gardens, requiring some form of registration and holding my personal data hostage behind their walls. What is the alternative? Is there an alternative? No, there is no alternative currently, but in the longer term we can hope that developers and entrepreneurs will recognize that open garden networks have distinct advantages over walled gardens.

The essence of an open garden social network is that users maintain their data wherever they want, as long as it can be crawled by whatever sites wish to aggregate that data. Since the data is maintained publicly, it can easily be shared by more than one social networking aggregator.

The immediate technical obstacles are that: 1) the average consumer has no obvious public location to store their data and 2) we do not have a technology and public infrastructure in place for consumers to "sign" their personal data to associate it with their digital identity.

Who knows, maybe open garden social networking will take off in another five or ten years.

One of the key benefits of open garden personal data is that it will open up vast new opportunities for innovation in open garden social media, since each innovator can piggyback on the (future) public open garden infrastructure rather than going through the time and expense of reinventing the wheel for each new social networking aggregator site.

-- Jack Krupansky

Monday, April 6, 2009

Semantic input for consumers

As yet there are no perfect methods for consumers to enter semantic data. Entering free text is certainly convenient, but we just don't have "perfect" natural language processing software, yet.

The common forms of consumer data for which semantic data are desirable include:

  • email messages
  • email address books
  • blog posts
  • Twitter "tweets" and other forms of micro-blogging
  • IM instant messages
  • cell phone calls
  • text messages
  • digital camera pictures
  • transaction data, including credit card transactions and online ecommerce forms

Unless the consumer is an "English geek", it is unlikely that they will be willing to create structured sentence diagrams to express the meaning of even simple statements.

The full range of methods for semantic input include:

  1. Natural language processing (NLP) for text and audio.
  2. Controlled vocabularies (e.g., Structured English).
  3. Text mining.
  4. Full semantic map editing (e.g., a la sentence diagrams).
  5. Detection of object references in free text (e.g., proper names and nicknames for people, places, and things), possibly based on customizable dictionaries. (A minimal sketch of this method appears after the list.)
  6. Form-based input, including drop-down lists for direct selection of semantics.
  7. Transaction device data (e.g., GPS location, date and time, etc.).
  8. Transaction information (e.g., online ecommerce data).
  9. Tagging.
  10. Background review and re-entry by a trained "semantic coder" (e.g., in an "offshore" market).
  11. Feedback and enhance: mine consumer input for apparent concepts and ambiguity, annotate the original, and allow the consumer to approve and select between alternatives and "hints".
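
Here is the promised minimal sketch of method 5, dictionary-based detection of object references, in Python; the dictionary contents are invented:

    import re

    # Customizable dictionary: names and nicknames -> object identifiers.
    dictionary = {
        "mike": "http://example.com/people/michael",
        "seattle": "http://example.com/places/seattle",
    }

    def detect_references(text):
        # Return (word, identifier) pairs for dictionary hits in the text.
        words = re.findall(r"[a-z']+", text.lower())
        return [(w, dictionary[w]) for w in words if w in dictionary]

    print(detect_references("Met Mike for coffee in Seattle."))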

I am sure that there are a variety of other methods, existing, proposed, or not yet imagined, but these are a starting point for discussion, as well as an illustration of how much more research and innovation are needed.

I have been trying to avoid a reliance on full-bore NLP, but the simple truth is that it may in fact be the best foundation.

-- Jack Krupansky

URI-based resource location

I have never been happy with the Semantic Web concept of associating resources with specific Web locations using URLs that specify a server location such as a domain name. The main issues:

  1. Makes it difficult to move a resource to another domain.
  2. Increases the likelihood that a server might become a performance bottleneck, especially as popularity grows and the Semantic Web begins to scale up in size dramatically (so-called "exponential growth.") Wiring in a server location simply does not scale up.
  3. Encourages ad-hoc caching. Worse, as the Semantic Web scales up it requires a dependence on ad-hoc caching.

Although some form of caching is clearly part of the solution, the main component of a solution is to switch from URL-based resource locating to URI-based resource locating.

Rather than specifying a single URL and then depending on the existing, non-Semantic-Web Domain Name System (DNS) to look up the actual path to "the" server, we need a non-DNS lookup mechanism that takes one or more URIs and does more of a "keyword" lookup, treating each URI as the Semantic Web analog of a keyword, and then redirects through a caching infrastructure that is designed to meet the needs of caching resources for the Semantic Web.
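
As a toy Python sketch of such a lookup, a registry maps a set of URIs (the "keywords") to any number of current locations, so a resource can move or be mirrored without changing the URIs that agents have wired in. Everything in the registry is hypothetical:

    # Hypothetical registry: URI set -> candidate locations.
    registry = {
        frozenset({"urn:example:vocab:contact"}): [
            "http://mirror-a.example.com/contact.rdf",
            "http://mirror-b.example.com/contact.rdf",
        ],
    }

    def locate(uris):
        # Resolve a URI list to candidate locations; a real service
        # would rank them by load, freshness, and proximity.
        return registry.get(frozenset(uris), [])

    print(locate(["urn:example:vocab:contact"]))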

A Semantic Web resource URI list might also be supplemented with various attributes, such as version number or version requirements and other attributes needed to constrain and control resource access.

The SW resource infrastructure should be able to manage multiple versions for a resource and efficient and controlled propagation of changes.

One use of multiple URIs is to control the degree of specialization of a generic resource name. A single URI would be the most general resource reference and provide the most adaptability, while adding on specialization URIs would provide access to resources that meet additional requirements. This is analogous to base and derived classes in object-oriented programming, but it is not necessarily required.

One key attribute of such a resource infrastructure, besides scalable performance itself, would be that even a very small, under-powered web site could be the source host for even extremely popular Semantic Web resources, and migrating such resources to another host should be completely transparent to the "users" (user agents, UAs, or software agents) that have the URI list for the resource "wired" into their "code."

Alas, I am not optimistic that such an architecture will soon, or even ever, be made available for the Semantic Web as we know it today. The change may have to wait for whatever follows the Semantic Web, or maybe even for Ray Kurzweil's Singularity.

Still, it is useful to contemplate what a proper solution might look like.

-- Jack Krupansky

Geopolitical reference data from FAO

A key issue with building a semantic web is having robust base reference data, including geopolitical data such as the countries of the world. The Food and Agriculture Organization of the United Nations (FAO) has produced an ontology for geopolitical information that:

... manages information about territories and groups, such as, the different names in English, French, Spanish, Chinese and Arabic; associated classification codes, like, UN code -- M49, ISO-3166 Alpha-2 and Alpha-3, UNDP code, GAUL code, FAOSTAT, etc; historical changes; and specific relations like "x has border with y" and "x is member of group z".

That sounds like a great start, although I have not examined the details.
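
To make the kinds of facts in that excerpt concrete, here is a guess, in Python, at what one territory's entry might boil down to. The field names are invented and are not the FAO ontology's actual terms, though the data itself is accurate:

    # Invented field names; the data is accurate for Italy.
    italy = {
        "nameEn": "Italy",
        "nameFr": "Italie",
        "codeISO3166Alpha2": "IT",
        "hasBorderWith": ["France", "Switzerland", "Austria", "Slovenia"],
        "isMemberOf": ["EU"],
    }

    print(italy["hasBorderWith"])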

-- Jack Krupansky