Monday, August 31, 2009

More thoughts on the book: Wired for Thought by Jeffrey Stibel

Previously, I gave a rather lackluster mini-review of the new book Wired for Thought: How the Brain Is Shaping the Future of the Internet by Jeffrey M. Stibel, which claims that "The Internet is more than just a series of interconnected computer networks: it's the first real replication of the human brain outside the human body." Since then I have had a few more thoughts, in particular related to the concept of a "collective consciousness."

My main regret is that I failed to note that the World Wide Web as a whole does to a fair extent represent a dynamic snapshot of the collective consciousness of the millions of people who use the Web. Blog posts and Twitter streams do in fact give a reasonably accurate sense of the topics that are at the front of our collective minds and the tip of our collective tongues.

The Web itself does not sense or have consciousness, but users using the Web as a wall to write on and read from can convey their thoughts and reactions through the Web.

But I think that is about as far as I feel comfortable going with this idea of the Web being analogous to the human brain.

After all, this collective consciousness is not really a consciousness per se in the way the human brain has a consciousness. There is no single voice of the collective. There is no I. There is no sense of self.

We cannot have a true dialogue with the collective.

We cannot ask a question and get an answer.

The collective does not have a personality.

You cannot have a one-to-one or one-on-one interaction with the collective.

The collective never makes a decision.

The collective does not have a responsibility. Nor does it have any obligations.

The collective does not exhibit common sense.

Nonetheless, the book does contain some interesting insights and is well worth a browse even if you do not purchase it.

-- Jack Krupansky

Monday, August 24, 2009

Sentiment vs. facts

There was an interesting article in The New York Times by Alex Wright entitled Mining the Web for Feelings, Not Facts about how companies are beginning to "mine" online social media such as blogs and social networks for consumer attitudes toward companies and their policies, products, and services. The emerging field of sentiment analysis aims at translating vague or not-so-vague opinions into hard data. The key point is that companies are much more interested in how consumers feel about them and their policies, products, and services than in traditional hard, factual data.
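
To make the idea of translating opinions into hard data concrete, here is a deliberately naive sketch in Python of lexicon-based scoring, the crudest form of sentiment analysis. The word lists and function are my own invention for illustration; real commercial systems are far more sophisticated.

POSITIVE = {"love", "great", "excellent", "recommend", "happy"}
NEGATIVE = {"hate", "terrible", "awful", "broken", "disappointed"}

def sentiment_score(text):
    # Score an opinion between -1 (all negative) and +1 (all positive).
    words = [w.strip(".,!?") for w in text.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

print(sentiment_score("I love this phone, great battery"))    # 1.0
print(sentiment_score("Terrible service, I hate waiting"))    # -1.0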

In addition to how people feel, companies are also interested in identifying the more influential opinion holders.

Organizing and presenting all of this data is also a key challenge.

The one point I would make is that this is all fine and dandy for companies, but I think that consumers would like to access similar data and analyses.

There is obviously a lot in common between what a company and a consumer would like to do in terms of understanding sentiment toward companies and their policies, products, and services, but there are differences. In some sense, consumers may have an even more intense need and desire to seek out and be at the bleeding edge of consumer trends. After all, it is consumers who have an intense passion both for being part of the latest trends and for setting them.

The obvious difference is that consumers won't be paying an arm and a leg for expensive software and services for sentiment analysis.

Consumers already have some familiarity with sentiment analysis, given the wealth of lists of top topics, hit topics, most-read stories, top keywords, and rankings and sharing of preferences for web pages.

My hunch is that plenty of consumers have a keener sense of sentiment on the Internet than your average corporate suit in traditional market intelligence.

In any case, consumers need ever-greater tools and capabilities for recording and monitoring sentiment, both as producers and as consumers of sentiment.

As we evolve an infrastructure for a true knowledge web, representation of and access to sentiment knowledge and data need to be a key focus.

-- Jack Krupansky

Book: Wired for Thought by Jeffrey Stibel

Yesterday I was browsing through the new book table at Barnes & Noble near Lincoln Center and found an interesting book entitled Wired for Thought: How the Brain Is Shaping the Future of the Internet by Jeffrey M. Stibel that informs us that "The Internet is more than just a series of interconnected computer networks: it's the first real replication of the human brain outside the human body" and that a "collective consciousness" is being created. Sounds fascinating. The Amazon blurb tells us that:

In this age of hyper competition, the Internet constitutes a powerful tool for inventing radical new business models that will leave your rivals scrambling. But as brain scientist and entrepreneur Jeffrey Stibel explains in "Wired for Thought", you have to understand its true nature. The Internet is more than just a series of interconnected computer networks: it's the first real replication of the human brain outside the human body. To leverage its power, you first need to understand how the Internet has evolved to take on similarities to the brain. This engaging and provocative book provides the answer. Stibel shows how exceptional companies are using their understanding of the Internet's brain like powers to create competitive advantage - such as building more effective Web sites, predicting consumer behavior, leveraging social media, and creating a collective consciousness.

The promise sounded truly compelling, but after five minutes of leafing through the book I was not able to isolate more than a few stray details that had any bearing on fulfilling the promise. There was too much "pop puff," which may thrill the average reader ignorant of the relevant technology, but I simply was unable to find any substantive justification for the central thesis of the book. It may in fact be there, since I did not read the book cover to cover, but if it is so compelling and presumably pervasive, how could I have missed it?

Nonetheless, this book may have a solid position simply as a statement of "the state of the art", telling us not how close we are to real success, but simply where we happen to be today. Yes, we are getting closer to the mountain, but that does not automatically translate into closeness to the peak.

There is a lot that we do not yet deeply comprehend about the human brain, mind, consciousness, and intellect, so I am not sure how much mileage we can get out of comparing the Internet to the human brain. In fact, I have a hunch it might be an exercise in futility at this stage. Sure, we can paint a broad-brush picture and draw lots of fuzzy analogies, but none of that will necessarily result in true enlightenment.

By all means, browse the book yourself and make up your own mind whether it meshes with your own expertise and interest levels. The book does have a web site with chapter excerpts.

I put down the book pondering the question, "Where's the beef?"

Oddly, Amazon does not have a picture of the book cover, but I was able to find it on the Harvard Business School Press web site since they are the publisher. Note: I get a small commission if you buy the book by clicking on any of my links to the book on Amazon.

-- Jack Krupansky

How reliable are questions?

In a recent post I commented on our dependence on the reliability of knowledge. Now I'll extend that inquiry to the reliability of questions themselves. You might wonder how a question could be unreliable. How could a question be false? How could it be misleading? Good questions. That is the point.

At a foundational level, a question is really an implied statement: a characterization of a quantity of information or knowledge that is desired, coupled with a request or demand or command that the requested information be provided.

So, how can a question be unreliable?

  • The person asking the question may not really need the information. In that sense, the implied statement "I need X" may be a lie.
  • The person may already have the information so that supplying the information may not be necessary. They may merely be seeking confirmation or maybe testing the other party.
  • The question may be overly broad due to poor phrasing.
  • The question may be overly narrow due to poor phrasing.
  • The question may refer to knowledge that simply does not exist.
  • The timing of the question may be inappropriate, either too early or too late to get a reasonable answer.
  • The questioned party may not be a reasonable source for the answer.
  • The question may be an imposition on the questioned party, or unfair, disrespectful, or discourteous.
  • The tone of the question may be inappropriate.
  • The represented need for an answer may not be appropriate.
  • The implied statement may be offensive.
  • The question may be illegal or a violation of the questioned party's rights.
  • The complexity of the answer may be far out of proportion to the expectation of the questioner.
  • The two parties may not agree on compatible interpretations of the terms used within the question.
  • The questioner may be using private interpretations of some terms without disclosing those interpretations.
  • The context may not be explicit in the question.
  • The context may be incomplete or ambiguous in the question.
  • The question may be ambiguous. Even simple English words can be ambiguous.
  • The question may really be simply a statement in the form of a question with no intent that an answer is expected.
  • The question may be rhetorical. No "answer" is expected, but the question is intended to "hang" over the interaction.
  • The question may seem simple, but have underlying complexity that the questioner or the questioned party may be unaware of.
  • The answer to a question may have a radically different context than the questioner was prepared for and that seemed implied by the question.
  • The question might be loaded so that any answer might be misleading.
  • The question might be leading so that the answer might be improperly biased.
  • The question might be designed so that the legitimate answer might indirectly mislead a third party monitoring the interaction.
  • The question might be worded in such a way that the answer might be misleading if viewed by itself without the full context of the question.
  • The statement implied by the question might be internally inconsistent.
  • The question might be intended as a distraction rather than a genuine query for information.
  • The questioner may be asking the right question but the wrong person.
  • The questioner may not have done sufficient due diligence to identify a source that can reasonably be considered reliable.

At heart, the issue is whether the questioned party or a computational system being questioned can reasonably be expected to respond with an acceptable answer. And even if the response is considered "acceptable," was the question itself reliable enough for the questioner to be able to depend on the answer (assuming the answer itself is also reliable)?

Context is essential. The questioned party may be able to infer all or part of the questioner's context, but assumptions about context can be somewhat risky and unreliable, possibly leading to an unreliable question despite the best of intentions on the part of the questioner.

Answering questions reliably certainly requires very careful attention to detail, but there is still plenty of craft if not outright art that needs to go into constructing reliable questions. There is an old saying in the data processing world, "GIGO - Garbage In, Garbage Out."

All too often, people have a false sense of confidence in the reliability of their questions which can lead to a false sense of confidence that the answers are valid for the questions they thought they were asking.

The only tried and true method I know of to even come close to assuring the reliability of a question is to ask multiple parties the same question and to ask multiple corollary questions so that the multiple answers can be examined to reinforce the most reliable answer. This doesn't even come close to avoiding all of the reliability factors I listed above, but at least it is a start.
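
As a rough sketch of that corroboration idea, suppose we have collected answers to the same question from several sources; the function name and the toy data below are hypothetical, but the gist is simply to favor the answer with the most independent agreement.

from collections import Counter

def most_corroborated(answers):
    # answers: mapping of source name -> answer text.
    # Returns the most-agreed-upon answer and the fraction of sources giving it.
    counts = Counter(answers.values())
    best, votes = counts.most_common(1)[0]
    return best, votes / len(answers)

answers = {
    "almanac": "8,849 m",
    "encyclopedia": "8,849 m",
    "old textbook": "8,848 m",
}
print(most_corroborated(answers))   # ('8,849 m', 0.666...)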

-- Jack Krupansky

Sunday, August 23, 2009

How reliable is knowledge?

We depend on knowledge in our daily lives. We presume that what we consider "knowledge" is true or at least highly likely to be true. But, how reliable is any of what we call "knowledge"? This raises some questions:

  • How can we know that any purported knowledge is in fact true?
  • How can we verify that any knowledge is true?
  • How can we determine how to verify any knowledge?
  • How can we have any confidence in our belief in any knowledge?
  • What can we really do when we are unsure whether any knowledge is really true?
  • What can we confidently say about the truth of any knowledge that we believe in?
  • What statements can we safely make about the reliability of any knowledge?
  • What disclaimers should we give regarding the reliability of any knowledge?
  • How certain do we need to be before we can assert that a statement or claim is in fact knowledge?

Ultimately, we need to be able to point to a piece of knowledge and ask and get the answer to one simple question: How reliable is this knowledge?

This implies that there needs to be some record of the history of asking and answering these questions for each and every bit of knowledge.

But, even with such a historical record, how reliable is any of that history and how can we even believe that any of it is reliable?

Maybe the bottom line is that every bit of knowledge is of dubious reliability, even if we do not quite express or acknowledge it.

Nonetheless, we need to have some sense of the reliability of every bit of knowledge.

Trust probably has a role. To wit, if we know who believes a bit of knowledge, we can then judge that person or institution's credibility for having good reason to believe in that knowledge.

Ultimately, we do depend on our own judgment of the veracity of any knowledge, but at least some of us know better than to trust our own judgment too far.

There are also two quite different statements any of us can make about knowledge:

  1. Do we believe and accept that a given bit of knowledge is valid?
  2. Do we have good reason for that belief?

Maybe a simple statement about why we believe in the validity of any knowledge is good enough or maybe even as good as it gets.

It would be nice for a knowledge web to have links for each bit of knowledge that say who believes it and who or what they can reference as to why they believe it. That is clearly not enough to judge the ultimate reliability of a bit of knowledge, but it is surely a great start.
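
A minimal sketch of what such links might look like, using Python with the rdflib library; the believedBy and believesBecause properties and all of the URIs are invented for illustration, not part of any existing vocabulary.

from rdflib import Graph, Namespace, Literal, URIRef

EX = Namespace("http://example.org/knowledge/")   # hypothetical vocabulary

g = Graph()
claim = URIRef("http://example.org/claims/42")    # "a bit of knowledge"

# Who believes the claim, and what they point to as their reason.
g.add((claim, EX.believedBy, URIRef("http://example.org/people/alice")))
g.add((claim, EX.believesBecause, URIRef("http://example.org/sources/lab-report-7")))
g.add((claim, EX.believesBecause, Literal("Replicated the measurement independently")))

print(g.serialize(format="turtle"))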

-- Jack Krupansky

Wednesday, August 5, 2009

What is the future of the English language, especially for the Semantic Web?

Despite all of the myriad technological advances in computer hardware and software and all of the wonderful specialized computer languages, it is amazing that natural language, in particular English, is still such a dominant force in the world. This is where we are today, but what about the future?

Computer language designers and application developers are quite busily at work, incrementally chipping away at ever deeper and broader niches where computer languages can supplant natural language as the preferred "tongue." Still, progress is very slow. Natural language is still the choice for expressiveness, flexibility, and ease of use. That seems unlikely to change any time soon.

Low birth rates in "English-speaking" countries make it increasingly likely that fewer and fewer people will consider English to be their native tongue in the decades to come. Still, somehow, English continues to have value to "open doors" across cultures, especially in business, government, science, and engineering, and especially computer software.

The Web makes it very easy for people to use their local language, which is fine when the intended audience is local, but many Web sites either use English or have an alternate set of Web pages in English to cater to a global audience.

Then we come to the Semantic Web. In some sense the Semantic Web is a direct parallel to the traditional non-semantic Web. It is difficult to say whether data in the Semantic Web is any more global than the old Web. Maybe initially more of the efforts are for a global audience, just as they were with the traditional Web in the early days, but over time we should hope that very specialized databases and applications will be tailored heavily for local audiences. Rest assured that the Semantic Web technologies are designed for internationalization and localization.

But since globalized Semantic Web applications and code libraries by definition know something about the data they are processing, this "knowledge" about the data needs to be in a language-independent form.

To be truly useful, software agents, especially intelligent agents, need to access the meaning of data and to access it globally. This means, once again, that knowledge about data needs to be represented in a form that is not hidden by localized natural languages.
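
One common way of doing that in RDF is to hang language-tagged labels off a language-neutral concept URI, so the knowledge lives in the URI and its relationships while the natural-language text is only a localized veneer. A small sketch using Python with rdflib, with made-up URIs:

from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/concepts/")    # hypothetical namespace

g = Graph()
# The concept itself is language-neutral; only its labels are localized.
g.add((EX.Shoe, RDF.type, RDFS.Class))
g.add((EX.Shoe, RDFS.label, Literal("shoe", lang="en")))
g.add((EX.Shoe, RDFS.label, Literal("chaussure", lang="fr")))
g.add((EX.Shoe, RDFS.label, Literal("zapato", lang="es")))

# A software agent reasons over EX.Shoe itself; a user interface simply
# picks whichever label matches the user's preferred language.
for label in g.objects(EX.Shoe, RDFS.label):
    if label.language == "fr":
        print(label)    # chaussure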

As things stand today, the three main tools for globalizing applications are:

  • Use of English as the "core" language.
  • Maintaining data in conceptual, language-neutral form.
  • Tools for localizing data to the native or preferred tongue of the user.

Automatic language translation is still fairly primitive and unlikely to be "solved" in the near future.

As technologies, especially the Semantic Web, are under development and evolving rapidly, it makes sense to focus on a single core natural language to assure that information is communicated as rapidly and widely as possible.

But as the technology matures (maybe in another ten to twenty years), the need for such broad communication will rapidly diminish. Sure, the elite will still communicate globally, but the average practitioner will likely serve a local audience. All important documents and specifications will have been translated into all the significant natural languages. In that kind of environment the need to "work" in English will effectively vanish, much as we see today with local Web sites, blogs, and other social media. Twitter is the latest to experience this localization phenomenon.

That still leaves the case of software agents. An unsolved problem. Sure, they can be heavily localized as well, but that is not a solution per se. Maybe initially a new development in agent technology might be English-only or English-centric, but as that technology matures, it is only natural that developers seek to refine the technology to exploit localized intelligence. That may mean that such an agent is less usable at a global level, but it may not be as important at that stage. Also, agents can be programmed with split personalities so that they can still operate at a global level albeit at a somewhat lower level of capability than the more specialized localized intelligence. That also requires greater effort and discipline on the part of developers. That is less than optimal.

There is also the underlying issue that besides superficial language issues, there are also cultural differences between countries, peoples, and regions of the world. Initial Semantic Web efforts may tend to be at a fairly high level where such cultural differences are rather muted, but as Semantic Web applications become deeper and more localized, cultural differences will start moving to the forefront.

Academic and high-end commercial developers have a need and interest to present their work globally, including marketing, journal papers, and conferences. English is the norm. Semantic Web content that is not in English will tend to not be preferred in such venues. Besides, high-end developers will tend to prefer to develop internationalized content that can be localized as needed.

Global communities of developers are also becoming a new norm. This includes open source community projects where, by definition, the initial and current contributors have no idea what country or culture future contributors will be from. This once again argues against doing development in anything other than English.

All of this leads me to believe that English will continue to be the dominant natural language for advanced and emerging computer software, especially the Semantic Web, for some time to come. Nonetheless, the issue of how to fully support and exploit local natural languages will remain and increasingly become a very thorny issue.

One lingering issue with the Semantic Web is the language to be used for constructing predicate and property names. They tend to be English so far, which is okay since the end user should never see them, but there is no requirement that they be English, and some developers with less global interests may begin to use their local language for predicate and property labels. This introduces a whole new level of mapping and matchmaking complexity. It is solvable, but it is also unsolved and a potential problem lurking around the corner. Motivated developers can manually add the necessary mappings, but that extra effort tends to reduce the extent to which serendipity just works without manual intervention by developers.
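
To illustrate the kind of mapping that would be needed, here is a small sketch (Python with rdflib, invented URIs) that declares an English-named property and a French-named property to be equivalent via owl:equivalentProperty, so that an OWL-aware reasoner can treat data expressed with either one interchangeably:

from rdflib import Graph, Namespace
from rdflib.namespace import OWL

ACME = Namespace("http://acme.example/schema/")       # hypothetical English schema
BOUTIQUE = Namespace("http://boutique.example/fr/")   # hypothetical French schema

g = Graph()
# The manual mapping a motivated developer would have to add by hand.
g.add((ACME.hasPrice, OWL.equivalentProperty, BOUTIQUE.aPourPrix))

print(g.serialize(format="turtle"))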

All of this raises the question of how the English language itself might evolve over the coming decades. One interesting possibility is that at some point the technology "users" of English, including Semantic Web content developers, could become so much greater a force than the native speakers that these users will realize they are effectively in control of the English language and can evolve it as they see fit. It will be interesting to see how that post-English language evolves.

-- Jack Krupansky

Tuesday, August 4, 2009

OWL examples in Manchester Syntax

There is a small collection of OWL ontology reasoning examples from Manchester University in the UK. They are written in the Manchester OWL syntax, which is a more compact and easier to read frame-based syntax than the usual RDF/XML triple/axiom format of the Semantic Web. There are four example ontologies, People, Pets, Pizzas, and Sports Teams.

Here is an example of a class in the Manchester OWL syntax:

Class: man
    Annotations:
        rdfs:label "man"
    EquivalentTo:
        adult
        and male
        and person
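
For comparison, here is roughly what the same class looks like when it is built up triple by triple; this is a sketch using Python with rdflib and a placeholder namespace, just to show how much more verbose the raw triple/axiom form is than the frame-based Manchester syntax above.

from rdflib import Graph, Namespace, Literal, BNode
from rdflib.collection import Collection
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.org/people#")   # placeholder namespace

g = Graph()
equivalent = BNode()
members = BNode()
g.add((EX.man, RDF.type, OWL.Class))
g.add((EX.man, RDFS.label, Literal("man")))
g.add((EX.man, OWL.equivalentClass, equivalent))
g.add((equivalent, RDF.type, OWL.Class))
g.add((equivalent, OWL.intersectionOf, members))
Collection(g, members, [EX.adult, EX.male, EX.person])

print(g.serialize(format="xml"))    # the verbose RDF/XML triple form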

-- Jack Krupansky

Timestamping knowledge in the Semantic Web

Context is an important element of knowledge. Time is an important element of context. If we really want to understand a piece of knowledge, we need to know its context and its timing in the flow of events. Data and knowledge need to be timestamped.

There is no single time or timestamp for any piece of knowledge. Various timings include:

  • When an observation was made, when the raw observation data was captured. This may be from a hardware sensor monitoring the real physical world, a process monitoring some data stream, or even a user interface.
  • When the raw observation data was analyzed to derive the nominal observation, the nominal knowledge.
  • When the knowledge was stored.
  • When the knowledge was validated.
  • When the knowledge was published or otherwise made available.
  • When the knowledge was calculated from other knowledge.
  • When the validity of the observation is expected to expire.

In some cases, the raw observation data might be preserved and re-analyzed at a later date with "improved" analytic capabilities and the nominal knowledge regenerated. In such cases there would then be multiple pieces of knowledge for each observation, each qualified by the time of analysis or re-analysis.

In some cases there may be latency between the raw sensor capture of the data and the reading of that raw data from the sensor device by the computational device that will record that sensor data. Typically that latency will be too small to matter, but for high-speed capture sequences it may be significant. Two separate timestamps may be needed. Or, a discrete timestamp for each processing step along the way.
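
As a rough sketch of how a couple of these timestamps might be attached to a single observation in RDF, here is a fragment using Python with rdflib; the observation vocabulary (value, capturedAt, recordedAt) is invented for illustration, not a standard one.

from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import XSD

OBS = Namespace("http://example.org/observation/")   # hypothetical vocabulary

g = Graph()
reading = URIRef("http://example.org/readings/temp-001")

# Separate timestamps for capture by the sensor and for read-out by the
# recording device, per the latency discussion above.
g.add((reading, OBS.value, Literal(21.4)))
g.add((reading, OBS.capturedAt,
       Literal("2009-08-04T14:31:02Z", datatype=XSD.dateTime)))
g.add((reading, OBS.recordedAt,
       Literal("2009-08-04T14:31:05Z", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))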

A piece of knowledge may have been captured from multiple sources, so we need to represent the distinct sources and their distinct timings as well. Collectively they may still represent a single logical observation. An example might be a 3-D camera which is really multiple cameras.

One could also link a number of discrete but simultaneous observations, such as all cameras in a given area, so that collectively they can be considered a single super observation. That overall super observation can have its own timestamps, but there also needs to be a way to drill down to get all of the component timestamps.

The timing of capture by multiple sources may be close enough to be considered the same time, or maybe enough time had elapsed to suggest that they were different observations. Actually, they are different observations in any case, but the issue is whether they are equivalent, or more precisely equivalent in some particular sense. This concept of sense of equivalence needs to be explored more fully.

Each observation station may have its own timepiece and they may not be synchronized. One solution might be to suggest that timepiece synchronization should be a standard protocol when two or more devices are exchanging information that is time-sensitive. Maybe the local time is recorded and then a delta time is recorded for any data that is transferred between two devices.

Calculated data is especially problematic because each of the elements of data used in the calculation may have its own timestamps. The implication is that each piece of calculated data should have an element trail that references each of those pieces of knowledge used in the calculation so that they can be examined later if the data needs to be audited.

Now, how all of these timestamps would be represented and stored in the Semantic Web is another matter entirely and left for further contemplation.

-- Jack Krupansky

Monday, August 3, 2009

Reference data for the Semantic Web

If you want to construct a database of any serious data you quickly realize that you need to share as much common data as possible. A key concept is reference data, which is simply common data that is needed in a number of places within the database schema. Reference data is a tool to help you cope with complexity as well as interoperability. It allows you to leverage the extra effort spent on defining and refining the reference data so that the rest of the database can depend on the quality and detail of that reference data without having to reinvent the wheel every time a bit of the common data is needed.

Examples of reference data include names (and other info) of countries, states, and cities, named entities such as businesses or vendors, names of products and services, codings for colors, shoe sizes, any forms of units, classes of service, types of foods, types of meals, forms of payment, medical conditions and treatments, names of bones, names of animals and plants, etc. In general, any of these pieces of reference data would have at least a natural language name and description, but the most important thing is that each item has an ID or identifier that can be used in the body of the database rather than storing the natural language text repeatedly all over the database.

The generic concept at work here is factoring, where one or more models are compared, common elements are identified and extracted, and then those common elements are referenced indirectly and managed and controlled separately.

In the context of the Semantic Web, reference data includes global information which is common to many SemWeb applications. A developer may be constructing an application-specific database, but they should be able to leverage the work of others by referencing ontologies and reference data that have already been developed by the global community, such as in the context of the Linked Data Web.

The ID or identifier for a piece of reference data on the Semantic Web is of course represented as a URI, an RDF URI reference.
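
A minimal sketch of the idea, again using Python with rdflib and invented URIs: the country France is defined once as reference data, and application data simply points at its URI instead of repeating the name as text.

from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDFS

REF = Namespace("http://example.org/ref/country/")   # hypothetical reference data
APP = Namespace("http://example.org/app/")            # hypothetical application schema

g = Graph()
# Reference data, defined and refined once.
g.add((REF.FR, RDFS.label, Literal("France", lang="en")))
g.add((REF.FR, RDFS.label, Literal("France", lang="fr")))

# Application data just references the URI; no country names are repeated.
order = URIRef("http://example.org/app/orders/1001")
g.add((order, APP.shipsToCountry, REF.FR))

print(g.serialize(format="turtle"))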

A broad array of reference data is a prerequisite for a solid foundation on which application developers can develop domain-specific and application-specific databases.

Reference data is also a key to being able to match disparate databases which were developed at different times and places. Gradually, databases will begin to share reference data, but at least in the short-run databases can be merged or meshed by figuring out how their common data meshes through the mechanism of reference data.

There can be many levels of reference data. Some data is truly global and readily shared across virtually all other databases. At the other end of the spectrum, there might be a family of applications or a niche domain, such that reference data is a useful data structuring tool, but the impact is much more limited compared to the entire global Semantic Web.

In the traditional database world there is also the concept of master reference data. In that conception, reference data would simply be common within a single database, while master reference data would be common across multiple databases. Both are useful concepts. Individual databases can certainly be structured better using factoring and data can be globally more interoperable when factoring is done on a more global basis. In the context of the Semantic Web, I will continue to use the simpler term reference data, primarily to refer to global data factoring, but not intending to exclude factoring within individual databases. After all, some of the best global reference data might well originate within a single application before people eventually realize the global benefits.

-- Jack Krupansky