Wednesday, January 28, 2009

Semantic Web challenges 2009

Here are my current thoughts about the challenges facing the Semantic Web, vintage January 2009:

  • Mind the Gap
    • Thesis: There is a dramatic semantic gap between how users think and communicate about knowledge and the mechanisms that the Semantic Web supports for organizing knowledge.
      • Superset of the semantic data mining problem
    • How to jump from comfort with natural language to comfort with the Semantic Web
    • Extent to which the user "sees" the Semantic Web, as opposed to the Semantic Web simply providing more power "under the hood" in a completely transparent manner
  • Mind the Gap II
    • How do we map and transition between natural language and the Semantic Web
      • How to represent natural language in the Semantic Web
        • Concepts, statements, reasoning, processes, prose passages, stories, outlines
  • Semantic search engines
    • Not just raw text, semantic inferences as well
    • No single best database form; open access is needed so that specialized databases can be created
  • Inference Broker
    • Need for inference brokers to mediate between creators of knowledge and users of knowledge, due to:
      • Desire for privacy
      • Protection of intellectual property
      • Massive scalability requirements - divide and conquer
      • Division of labor, factoring large problems into smaller problems
  • Social structure of knowledge
    • Individuals have only some of the puzzle pieces of knowledge
    • Propositions of uncertain classification
    • Social groups aggregate and classify knowledge
  • A medium for intelligent agents
    • Software agents can act more intelligently with a richer, knowledge-centric information stream
  • Statements that are not strict, objective facts
    • Personal facts, opinions, speculation, gossip, questions
    • "Creations" - text, graphics, images, audio, video [? Separate challenge?]
    • False statements
      • May be outright lies, deceptions, misunderstandings, misstatements, changed information
  • Medical record difficulties
    • Pen on paper still most convenient for input
    • Quick human scan of paper still most convenient for browsing
    • Input decision process still far too intrusive
    • Semi/un-structured data still far too inconvenient
  • Distributed resource storage
    • Extremely diversified storage to ensure timely and efficient access
    • Needs to be part of net infrastructure that is automatic and not subject to human whim and error
  • Robust personal and organization identity, as well as roles and interests
  • Authority and provenance identification and tracking
  • The cost of knowledge engineering, especially maintenance and testing
    • Who can really afford it?
  • Semantic matching challenges (see the sketch after this list)
    • Apparent differences that are easily bridged by a human
    • Subtle or apparently insignificant distinctions that a human would say are too significant for a match
      • For example, improper reuse of a resource for a different "meaning"
    • Incomplete matches due to cultural differences
    • Concept matches but with differences in contractual commitments (for services)
  • Distributed semantic matching/mapping services
    • Manual creation of libraries of semantic "logic" services for bridging semantic gaps - hide the details of how to get from "A" to "B"
  • Support for time and version dimensions of information
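
As a toy illustration of the semantic matching challenge, consider how a naive string-similarity heuristic behaves. Here is a minimal Python sketch (the function names and logic are purely hypothetical, not any established matching algorithm): it easily bridges superficial differences, but it also happily "matches" labels that a human would say differ in meaning, which is exactly the trap noted above.

```python
import difflib
import re

def normalize(label: str) -> str:
    """Lowercase a label, strip punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", label.lower())
    return re.sub(r"\s+", " ", cleaned).strip()

def match_score(label_a: str, label_b: str) -> float:
    """Crude surface similarity in [0, 1] between two normalized labels."""
    return difflib.SequenceMatcher(
        None, normalize(label_a), normalize(label_b)
    ).ratio()

# Apparent differences that a human bridges easily score high:
print(match_score("ZIP Code", "zip-code"))                 # 1.0
# But subtle, significant distinctions can also score high:
print(match_score("billing address", "shipping address"))  # high, yet wrong
```

A real matching service would need far more than surface similarity: context, provenance, and the contractual commitments mentioned above.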

-- Jack Krupansky

Monday, January 26, 2009

More lies!

In addition to my previous list of types of false statements, also include:

  • Scams
  • Confidence games
  • Honest disagreements
  • Ideological disagreements - each side firmly believes that the other is false
  • Definitional disagreements - how terms are interpreted by different parties
  • Temporary or transient truth - a statement may in general be false but happen to be true at the moment, or in general be true but happen to be false at the moment, or its truth may simply be volatile with or without any discernible pattern
  • Time delay (latency) since truth was validated - or multiple agents get different truth values due to differences in validation latencies
  • Information overload - too many statements to verify with available resources
  • Placeholder - a temporary statement of dubious truth but with the intent of replacing it with a proper statement in the future
  • Inadvertent - author has no reason to challenge the veracity of the statement, but may simply have failed to validate the statement
  • Accidental mistake - author knew the truth but entered the statement incorrectly
  • Rumor
  • Gossip
  • Innuendo
  • Passthrough or cascaded or misguided/naive transitivity - author obtained truth from another party and passed the statement along as if true without further validation.
  • Mismeasurement - no intention to mislead, but measurement of source data was faulty

-- Jack Krupansky

That is a lie!

The heart of semantics is truth: the ability to examine a proposition and determine whether it is true or false. Sometimes we may not have enough information to determine whether a given statement or network of statements is true, but sometimes claims may simply not be true in an objective sense. False claims may be unintentional or intentional. Regardless, any semantic system or semantic agent needs to be able to make judgments as to the truth of statements and propositions.

Some of the ways in which even simple statements can be false are:

  • Outright lies
  • Deceptions that hide behind some legalism
  • Misleading by artful presentation of mostly truthful information
  • Honest mistakes
  • Simple misstatements
  • Subjective truth
  • Misunderstandings
  • Confusion
  • Misinterpretation
  • Incomplete information
  • Fuzzy statistical data
  • Changed information
  • Different points of view
  • Wishful thinking
  • Conjecture and speculation based on a weak foundation
  • Semantic mismatches, contextual mismatches - true in one system of reasoning, but not necessarily true in a different system of reasoning
  • Jokes and pranks
  • Hoaxes
  • Fraud
  • Madness of crowds
  • Emperor's New Clothes syndrome
  • Folklore
  • Political dogma
  • Exaggeration
  • Paradoxes
  • Poor estimation
  • Works of fiction
  • Dramatization
  • Hypotheticals
  • News reports - it may have been "said", or reported to have been said, but is it true?

Semantic data mining in particular needs to be able to classify statements as to their truth content, not simply whether a statement is believed to be true, but what form of untruth it might be.

Semantic agents need to be able to validate the veracity of the claims that they encounter.

How to do all of this? Overall, unknown at the present time, but there are lots of special cases and plenty of room for heuristics.
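
How might such heuristics look in practice? Here is a minimal Python sketch, using deliberately crude cue-word rules of my own invention (purely hypothetical, not a real classifier), that assigns a coarse truth-content class to a statement:

```python
import re

# Hypothetical cue-word heuristics suggesting that a statement is
# something other than a plain assertion of fact.
HEURISTICS = [
    (re.compile(r"\?\s*$"), "question"),
    (re.compile(r"\b(might|could|perhaps|possibly|allegedly)\b", re.I), "speculation"),
    (re.compile(r"\b(i think|in my opinion|i believe)\b", re.I), "opinion"),
    (re.compile(r"\b(reportedly|rumou?r has it|they say)\b", re.I), "rumor"),
]

def classify(statement: str) -> str:
    """Return a coarse truth-content class for a single statement."""
    for pattern, label in HEURISTICS:
        if pattern.search(statement):
            return label
    return "asserted-fact"  # default: the author claims objective truth

print(classify("Rumor has it that the merger is off."))  # rumor
print(classify("Is the merger off?"))                    # question
```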

Maybe even a heuristic could be considered a "lie" to some extent.

-- Jack Krupansky

Saturday, January 17, 2009

Semantic Web training videos

Marco Neumann of the New York Semantic Web Meetup suggests VideoLectures.net as a good source for Semantic Web training videos. In fact, here is a link to the results of searching that site for "Semantic Web": VideoLectures.net for "Semantic Web". The videos range from event recordings and invited talks to keynote presentations, lectures, panels, and tutorials.

I have not watched any of these videos in enough detail to recommend them, but the one entitled Introduction and Overview to the Semantic Web by James A. Hendler sounds like a good starting point, or at least a place to get the viewpoint of one of the original "biggies" of the Semantic Web.

There is also A short Tutorial on Semantic Web by York Sure and Invited Tutorial: An Introduction to the Semantic Web by Fabio Ciravegna.

I can't wait to watch some of these videos.

-- Jack Krupansky

Semantic Web books

I recently asked Marco Neumann of the New York Semantic Web Meetup to suggest a path for learning about the Semantic Web. In addition to recommending attending the New York Semantic Web Meetup, he suggested several books:

I personally have not checked out these books yet, other than looking at the blurbs on Amazon, but if Marco recommends them, they must be good.

Note: I do get a tiny commission from Amazon if you buy any of these books after clicking on any of the cover images or links above that redirect to Amazon. Thanks!

-- Jack Krupansky

Semantic Web training seminars

I asked Marco Neumann of the New York Semantic Web Meetup to suggest a path for learning about the Semantic Web. He has suggested several training seminars:

  • TopQuadrant (with Jim Hendler): Getting Ready for the Semantic Web with TopBraid Suite. Price: $1,795.
  • Stanford Protege team: Protege-OWL Short Course - provides an introduction to ontology development in OWL, both from a theoretical standpoint and from a practical standpoint through hands-on use of the Protege platform. The course also emphasizes how to use OWL ontologies, and other semantic technologies like SWRL, to build semantic applications with examples from real-world use cases. Price: $1,500.
  • Wilshire Conferences: Designing and Building Business Ontologies, led by Dave McComb and Simon Robe - an intensive four-day seminar, with workshops and demonstrations, on semantically enabling the enterprise. An ontology is a formal description of the meaning of the information stored in a system. It resembles a conceptual model, but goes well beyond one in that its formal definitions allow the system to infer class membership based on properties. Additionally, inference engines, running on ontologies, allow users to extract and integrate information stored in distributed systems.
    This workshop, which will contain a number of live demos, will cover practical issues in employing ontologies. Price: $2,495.

-- Jack Krupansky

Semantic Web Meetup in New York City

Even though I moved to New York City back in May 2008, I only recently bothered to check for any Meetup group for the Semantic Web. It turns out that there are 14 Semantic Web Meetup groups around the world, in 12 cities in 4 countries with 1,897 members. The Semantic Web Meetup Web page is here and lets you click on a city or search by name.

The New York Semantic Web Meetup Web page is here and is run by Marco Neumann who is the principal of KONA. The meetup's charter is:

Meet local people interested in the Semantic Web, an initiative by the W3C [http://www.w3c.org] to make the web "one giant database": The Data Web. We address technologies such as RDF, RDFS, OWL and applications that help to develop or that use ontologies, controlled vocabularies and rules systems in the enterprise and on the World Wide Web.

The Meetup has 578 members and is meeting actively, with two meetups this past week and another scheduled in two weeks (SHER: A Scalable Highly Expressive Reasoner & Semantic Web at NYU at 6:30 p.m. on Thursday, January 29, 2009.)

The New York Semantic Web Meetup also has a wiki Web site.

Alas, the recent and coming meetups conflict with my schedule. The good news is that presentation slides and blog posts are available. And, of course, the meetup has an email list that you can join, as I did.

-- Jack Krupansky

Facts, opinions, secrets, gossip, speculation, and questions in the Semantic Web

Whether one is mining text for embedded semantics or offering a structured interface for directly entering semantics, a user's information needs to be properly classified if it is to be used properly in the Semantic Web. Assuming one considers user input to be a sequence or collection or graph of statements, each statement would need to be classified as one or more of:

  • Fact. A statement that is believed to be true in some objective sense.
  • Opinion. A statement that the speaker believes is likely to be true, at least for themselves, regardless of the opinions of others.
  • Secret. A personal statement that is not intended to be shared with others, except possibly on a very selective basis.
  • Gossip. A statement about others that is intended to be shared to some extent, probably without attribution as to its originator.
  • Speculation. A statement that the speaker believes might or could hypothetically be true. It is not assumed to be true, but neither is it assumed to be false. The intention is to incite at least a subtle bias in the conjectural thinking of others.
  • Question. A purely interrogatory statement, a proposition whose truth or answer is essentially unknown, but whose answer is desired by the speaker.

It seems quite clear that a useful semantic mining tool would need to be able to classify its input stream according to these qualities.

On the other hand, a tool may simply categorize to the extent that it can, and correlation between similar statements from multiple sources might reveal or suggest the proper, likely, or possible classification.

There should probably be an unknown category as well.
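
One way a tool might record such a classification is as metadata about the statement itself. Here is a minimal sketch using RDF reification via the Python rdflib library (the ex: vocabulary and its classification terms are hypothetical, not any standard):

```python
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

EX = Namespace("http://example.org/vocab#")  # hypothetical vocabulary

g = Graph()
stmt = URIRef("http://example.org/statements/1")

# Reify the statement "Alice works for Acme" so that we can say things
# about the statement itself, not merely assert it.
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.Alice))
g.add((stmt, RDF.predicate, EX.worksFor))
g.add((stmt, RDF.object, EX.Acme))

# Attach the user's (or the tool's) classification as ordinary metadata.
g.add((stmt, EX.classification, EX.Gossip))

print(g.serialize(format="turtle"))
```

Named graphs or quad stores would be alternative ways to attach the same kind of metadata.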

My original motivation in coming up with this classification scheme was to think about how a user interface might assist even average users in capturing at least some aspects of the semantics of their personal information at the time it is captured. For example, to offer the user some category headings that they can click on.

An interface tool could also show the user how other users have classified the same statement. That could be the default unless the user overrides it with a desired classification.

-- Jack Krupansky

Exploring New Interaction Designs Made Possible by the Semantic Web

The Journal of Web Semantics has issued a call for papers for a special issue on the topic of "Exploring New Interaction Designs Made Possible by the Semantic Web." They tell us that they:

... seek papers that look at the challenges and innovate possible solutions for everyday computer users to be able to produce, publish, integrate, represent and share, on demand, information from and to heterogeneous data sources. Challenges touch on interface designs to support end-user programming for discovery and manipulation of such sources, visualization and navigation approaches for capturing, gathering and displaying and annotating data from multiple sources, and user-oriented tools to support both data publication and data exchange. The common thread among accepted papers will be their focus on such user interaction designs/solutions oriented linked web of data challenges. Papers are expected to be motivated by a user focus and methods evaluated in terms of usability to support approaches pursued.

Offering some background, they inform us that:

The current personal computing paradigm of single applications with their associated data silos may finally be on its last legs as increasing numbers move their computing off the desktop and onto the Web. In this transition, we have a significant opportunity – and requirement – to reconsider how we design interactions that take advantage of this highly linked data system. Context of when, where, what, and whom, for instance, is increasingly available from mobile networked devices and is regularly if not automatically published to social information collectors like Facebook, LinkedIn, and Twitter. Intriguingly, little of the current rich sources of information are being harvested and integrated. The opportunities such information affords, however, as sources for compelling new applications would seem to be a goldmine of possibility.

Imagine applications that, by looking at one's calendar on the net, and with awareness of whom one is with and where they are, can either confirm that a scheduled meeting is taking place, or log the current meeting as a new entry for reference later. Likewise, documents shared by these participants could automatically be retrieved and available in the background for rapid access. Furthermore, on the social side, mapping current location and shared interests between participants may also recommend a new nearby location for coffee or an art exhibition that may otherwise have been missed. Larger social applications may enable not only the movement of seasonal ills like colds or flus to be tracked, but more serious outbreaks to be isolated.

The above examples may be considered opportunities for more proactive personal information management applications that, by awareness of context information, can better automatically support a person's goals. In an increasingly data rich environment, the tasks may themselves change. We have seen how mashups have made everything from house hunting to understanding correlations between location and government funding more rapidly accessible. If, rather than being dependent upon interested programmers to create these interactive representations, we simply had access to the semantic data from a variety of publishers, and the widgets to represent the data, then we could create our own on-demand mashups to explore heterogeneous data in any way we chose.

For each of these types of applications, interaction with information -- be it personal, social or public -- provides richer, faster, and potentially lighter-touch ways to build knowledge than our current interaction metaphors allow.

Finally, they pose their crucial question:

What is the bottleneck to achieving these enriched forms of interaction?

For which they propose the answer:

Fundamentally, we see the main bottleneck as a lack of tools for easy data capture, publication, representation and manipulation.

They provide a list of challenges to be addressed in the issue, including but not restricted to:

  • approaches to support integrating data that is readily published, such as RSS feeds that are only lightly structured.
  • approaches to apply behaviors to these data sources.
  • approaches to make it as easy for someone to create and to publish structured data as it is to publish a blog.
  • approaches to support easy selection of items within resources for export into structured semantic forms like RDF.
  • facilities to support the pulling in of multiple sources; for instance, a person may wish to pull together data from three organizations. Where will they gather this data? What tools will be available to explore the various sources, align them where necessary and enable multiple visualizations to be explored?
  • methods to support fluidity and acceleration for each of the above: lowering the interaction cost for gathering data sources, exploring them and presenting them; designing lightweight and rapid techniques.
  • novel input mechanisms: most structured data capture requires the use of forms. The cost of form input can inhibit that data from being captured or shared. How can we reduce the barrier to data capture?
  • evaluation methods: how do we evaluate the degree to which these new approaches are effective, useful or empowering for knowledge builders?
  • user analysis and design methods: how do we understand context and goals at every stage of the design process? What is different about designing for a highly personal, contextual, and linked environment?
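
To make the first of these challenges concrete, here is a minimal sketch of lifting a lightly structured RSS feed into RDF, using the Python feedparser and rdflib libraries (the ex: vocabulary is hypothetical, and a real mapping would be considerably richer):

```python
import feedparser
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

EX = Namespace("http://example.org/feed#")  # hypothetical vocabulary

def lift_feed(url: str) -> Graph:
    """Lift the lightly structured entries of an RSS/Atom feed into RDF."""
    g = Graph()
    for entry in feedparser.parse(url).entries:
        item = URIRef(entry.link)
        g.add((item, RDF.type, EX.FeedItem))
        g.add((item, DCTERMS.title, Literal(entry.title)))
        if "published" in entry:
            g.add((item, DCTERMS.date, Literal(entry.published)))
    return g

# Once lifted, the feed can be merged with other sources and queried
# with SPARQL like any other RDF data.
```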

In addition to traditional, full-length papers, they are also soliciting shorter papers as well as short (one to two page), forward-looking, more speculative papers addressing the challenges outlined above. I am tempted to submit one of the latter, possibly based on my proposal for The Consumer-Centric Knowledge Web - A Vision of Consumer Applications of Software Agent Technology - Enabling Consumer-Centric Knowledge-Based Computing. Or maybe a stripped-down version of that vision that is more in line with the "reach" of the current, RDF-based vision of the Semantic Web.

-- Jack Krupansky

Wednesday, January 14, 2009

Text for the Semantic Web

I am having second thoughts as to whether the text for terms, definitions, descriptions, and the like belongs in RDF or should be stored externally. I am not convinced one way or the other.

On the one hand, storing text externally could make it more manageable with traditional text tools, and even searchable using traditional text-oriented search engines.

On the other hand, storing all of that text separately increases the number of resources and may be less manageable than embedding the text directly in RDF.

One hybrid approach would be to store the "source" for the text in traditional text documents or simpler XML files, with labels, and then have a processing step that takes an intermediate form of RDF that has the labels and substitutes the associated text. This processing might in fact simply be done using XSLT.
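
As a minimal sketch of that processing step (XSLT would work just as well), here is hypothetical Python that substitutes externally stored text into an intermediate RDF/XML document; the textLabel attribute and the TEXT_SOURCE mapping are inventions for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from labels to externally stored text,
# e.g. loaded from plain text documents or simple XML files.
TEXT_SOURCE = {
    "term.agent.definition": "An agent is a program that acts on a user's behalf.",
}

def substitute_labels(rdf_xml: str) -> str:
    """Replace placeholder labels in an intermediate RDF/XML document
    with the externally stored text that they refer to."""
    root = ET.fromstring(rdf_xml)
    for elem in root.iter():
        label = elem.get("textLabel")  # hypothetical placeholder attribute
        if label is not None:
            elem.text = TEXT_SOURCE[label]
            del elem.attrib["textLabel"]
    return ET.tostring(root, encoding="unicode")
```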

Ultimately, I might simply prefer the "simplest" approach, but sometimes simplicity is not the cheapest or most flexible and maintainable approach.

The Semantic Web is still in its infancy and its techniques and tools are still evolving, so the approaches in vogue today may not be the preferred approaches in the future.

-- Jack Krupansky