Thursday, October 30, 2008

WordNet - a lexical database for the English language

Given my interest in glossaries and concept dictionaries, I am intrigued by WordNet from the Princeton Cognitive Science Laboratory, which bills itself as "a lexical database for the English language." The web site says:

WordNet(R) is a large lexical database of English, developed under the direction of George A. Miller. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet's structure makes it a useful tool for computational linguistics and natural language processing.

I have not dug into this enough to get a handle on whether or how they utilize RDF, but the effort does certainly seem quite interesting.

-- Jack Krupansky

Monday, October 20, 2008

RDFa - W3C recommendation for adding RDF annotations to HTML documents

The folks over at W3C have been working on a new scheme that allows RDF-like annotations to be added to HTML Web pages. W3C just announced that RDFa is now a full-fledged "Recommendation" (W3C standard) and has an updated Primer. Actually, the annotations are for XHTML documents. According to the RDFa Primer:

Today's web is built predominantly for human consumption. Even as machine-readable data begins to appear on the web, it is typically distributed in a separate file, with a separate format, and very limited correspondence between the human and machine versions. As a result, web browsers can provide only minimal assistance to humans in parsing and processing web data: browsers only see presentation information. We introduce RDFa, which provides a set of XHTML attributes to augment visual data with machine-readable hints. We show how to express simple and more complex datasets using RDFa, and in particular how to turn the existing human-visible text and links into machine-readable data without repeating content.

...

The web is a rich, distributed repository of interconnected information organized primarily for human consumption. On a typical web page, an XHTML author might specify a headline, then a smaller sub-headline, a block of italicized text, a few paragraphs of average-size text, and, finally, a few single-word links. Web browsers will follow these presentation instructions faithfully. However, only the human mind understands that the headline is, in fact, the blog post title, the sub-headline indicates the author, the italicized text is the article's publication date, and the single-word links are categorization labels. The gap between what programs and humans understand is large.

What if the browser received information on the meaning of a web page's visual elements? A dinner party announced on a blog could be easily copied to the user's calendar, an author's complete contact information to the user's address book. Users could automatically recall previously browsed articles according to categorization labels (often called tags). A photo copied and pasted from a web site to a school report would carry with it a link back to the photographer, giving her proper credit. When web data meant for humans is augmented with hints meant for computer programs, these programs become significantly more helpful, because they begin to understand the data's structure.

RDFa allows XHTML authors to do just that. Using a few simple XHTML attributes, authors can mark up human-readable data with machine-readable indicators for browsers and other programs to interpret. A web page can include markup for items as simple as the title of an article, or as complex as a user's complete social network.

RDFa benefits from the extensive power of RDF [RDF], the W3C's standard for interoperable machine-readable data. However, readers of this document are not expected to understand RDF. Readers are expected to understand at least a basic level of XHTML.

I personally have not studied RDFa yet, but I have a strong suspicion that it may be relevant to my interests.

OTOH, it may simply represent a steppingstone on the path to better things.

-- Jack Krupansky

Saturday, October 11, 2008

Hypocorism - pet name or term of endearment (or lack thereof)

Courtesy of the Merriam-Webster Word of the Day, I just learned that hypocorism is a linguist's term for pet names, including "baby talk", and endearing terms. I would include nicknames and terms of lack of endearment. Traditionally that is in the personal sense, but pet names and nicknames are all too common in computing technology and other fields. They are common enough that glossaries and term definitions should include them in order to more completely capture all references to an entity, concept, term, or topic. I would include jargon, techspeak, euphemisms, and "hacker slang" as well. A key criterion is that the term has relatively widespread usage, as opposed to being used only by a very small and unknown group or by a single individual in their local environment.

Some common examples from computing:

  • Big Blue - IBM
  • Redmond - Microsoft
  • Mr. Softie - Microsoft
  • Microsloth - Microsoft
  • Windoze - Microsoft Windows
  • net - Internet and sometimes World Wide Web
  • web - World Wide Web
  • PC - personal computer that primarily runs the Windows operating system
  • Mac - personal computer that primarily runs the Apple Macintosh operating system
  • app - software application
  • Googleplex - the headquarters of Google
  • bare metal - a computer without an operating system
  • bit bucket - mythical destination of and euphemism for data that has been lost and destroyed

In fact, it might be interesting or amusing to have a glossary consisting only of hypocorisms in computing. There is already something called The Jargon Lexicon, but it is a narrower collection of terms used primarily by "hackers."

I am not yet comfortable with using this relatively unknown 10-gallon term for such a simple concept. For now, I may stick with nickname as my preferred sobriquet for hypocorism.

Note that there should be a semantic distinction between alternate names, synonyms, acronyms, and nicknames. Another key aspect of a nickname is that its usage is quite informal.

-- Jack Krupansky

Wednesday, October 8, 2008

Simple acronyms in SKOS Turtle and RDF

It wasn't hard at all to convert my simple acronym experiment to SKOS Turtle and then use the rdf:about Validator and Converter to generate the equivalent RDF. I spent more effort formatting the "code" for this blog post!

For example, my pure XML acronym for Agent-Based Computing (ABC) was:

<Acronym>
  <Term>ABC</Term>
  <CompoundTerms>
    <CompoundTerm>Agent-Based Computing</CompoundTerm>
  </CompoundTerms>
</Acronym>

And in Turtle that is:

ac:agent_based_computing rdf:type skos:Concept;
  skos:prefLabel "Agent-Based Computing"@en;
  skos:altLabel "ABC"@en.

And the XML/RDF created by the validator is:

<skos:Concept rdf:about="http://agtivity.com/xml/agent_based_computing">
  <skos:prefLabel xml:lang="en">Agent-Based Computing</skos:prefLabel>
  <skos:altLabel xml:lang="en">ABC</skos:altLabel>
</skos:Concept>

Alas, each of the three RSS definitions is a distinct SKOS concept, but that does make some sense since each of the three meanings is somewhat distinct even though they are all under the same umbrella concept. I will have to think about what it might mean to have the acronym itself be a distinct concept. Actually, there was some discussion of semantic relationships and acronyms in the primer.

I have it online at http://agtivity.com/xml/acronym5.txt:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix skos: <http://www.w3.org/2008/05/skos#>.
@prefix ac: <http://agtivity.com/xml/>.
 
ac:agent_based_computing rdf:type skos:Concept;
skos:prefLabel "Agent-Based Computing"@en;
skos:altLabel "ABC"@en.
ac:resource_description_framework rdf:type skos:Concept;
skos:prefLabel "Resource Description Framework"@en;
skos:altLabel "RDF"@en.
ac:really_simple_syndication rdf:type skos:Concept;
skos:prefLabel "Really Simple Syndication"@en;
skos:altLabel "RSS"@en.
ac:rich_site_summary rdf:type skos:Concept;
skos:prefLabel "Rich Site Summary"@en;
skos:altLabel "RSS"@en.
ac:rdf_site_summary rdf:type skos:Concept;
skos:prefLabel "RDF Site Summary"@en;
skos:altLabel "RSS"@en.

The XML/RDF generated by the validator is online at http://agtivity.com/xml/acronym5.rdf:

<?xml version="1.0"?>
<rdf:RDF
xmlns:ac="http://agtivity.com/xml/"
xmlns:skos="http://www.w3.org/2008/05/skos#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <skos:Concept rdf:about="http://agtivity.com/xml/agent_based_computing">
    <skos:prefLabel xml:lang="en">Agent-Based Computing</skos:prefLabel>
    <skos:altLabel xml:lang="en">ABC</skos:altLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://agtivity.com/xml/resource_description_framework">
    <skos:prefLabel xml:lang="en">Resource Description Framework</skos:prefLabel>
    <skos:altLabel xml:lang="en">RDF</skos:altLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://agtivity.com/xml/really_simple_syndication">
    <skos:prefLabel xml:lang="en">Really Simple Syndication</skos:prefLabel>
    <skos:altLabel xml:lang="en">RSS</skos:altLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://agtivity.com/xml/rich_site_summary">
    <skos:prefLabel xml:lang="en">Rich Site Summary</skos:prefLabel>
    <skos:altLabel xml:lang="en">RSS</skos:altLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://agtivity.com/xml/rdf_site_summary">
    <skos:prefLabel xml:lang="en">RDF Site Summary</skos:prefLabel>
    <skos:altLabel xml:lang="en">RSS</skos:altLabel>
  </skos:Concept>
</rdf:RDF>
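
Since "RSS" ends up as the alternate label of three distinct concepts, a consumer of this file has to group concepts by label to recover the acronym's meanings. Here is a minimal sketch of that grouping, assuming only Python's standard xml.etree.ElementTree module. The function name expansions_by_acronym is my own invention for illustration, and the XML is a copy of the listing above without the XML declaration and the unused ac: namespace.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SKOS_NS = "http://www.w3.org/2008/05/skos#"

# Copy of the validator output above (XML declaration and unused ac: namespace omitted)
RDF_XML = """<rdf:RDF
 xmlns:skos="http://www.w3.org/2008/05/skos#"
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <skos:Concept rdf:about="http://agtivity.com/xml/agent_based_computing">
    <skos:prefLabel xml:lang="en">Agent-Based Computing</skos:prefLabel>
    <skos:altLabel xml:lang="en">ABC</skos:altLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://agtivity.com/xml/resource_description_framework">
    <skos:prefLabel xml:lang="en">Resource Description Framework</skos:prefLabel>
    <skos:altLabel xml:lang="en">RDF</skos:altLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://agtivity.com/xml/really_simple_syndication">
    <skos:prefLabel xml:lang="en">Really Simple Syndication</skos:prefLabel>
    <skos:altLabel xml:lang="en">RSS</skos:altLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://agtivity.com/xml/rich_site_summary">
    <skos:prefLabel xml:lang="en">Rich Site Summary</skos:prefLabel>
    <skos:altLabel xml:lang="en">RSS</skos:altLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://agtivity.com/xml/rdf_site_summary">
    <skos:prefLabel xml:lang="en">RDF Site Summary</skos:prefLabel>
    <skos:altLabel xml:lang="en">RSS</skos:altLabel>
  </skos:Concept>
</rdf:RDF>"""


def expansions_by_acronym(rdf_xml):
    """Map each skos:altLabel to the prefLabels of every concept that carries it."""
    root = ET.fromstring(rdf_xml)
    index = defaultdict(list)
    for concept in root.findall(f"{{{SKOS_NS}}}Concept"):
        pref = concept.findtext(f"{{{SKOS_NS}}}prefLabel")
        for alt in concept.findall(f"{{{SKOS_NS}}}altLabel"):
            index[alt.text].append(pref)
    return dict(index)


if __name__ == "__main__":
    for acronym, meanings in expansions_by_acronym(RDF_XML).items():
        print(acronym, "->", meanings)
```

This only handles the flat shape the validator happened to produce (typed skos:Concept elements directly under rdf:RDF). A real RDF parser would be needed for RDF/XML in general.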

-- Jack Krupansky

Converting Turtle to RDF/XML

I wanted to experiment with converting my little test acronym file to Turtle or RDF and looked around and found a tool, RDF Validator and Converter on the rdf:about Web site run by Joshua Tauberer, that can in fact convert Turtle to raw XML/RDF. I tried it out with that simple Turtle example from my last blog post:

ex:animals rdf:type skos:Concept;
skos:prefLabel "animals".

But the validator complained that the "ex:" prefix was undefined. I went back to the SKOS Primer, and found that you define the prefixes with "@prefix", and you need that for the "ex:", "rdf:", and "skos:" prefixes, like so:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix skos: <http://www.w3.org/2008/05/skos#>.
@prefix ex:<http://www.example.com/>.


ex:animals rdf:type skos:Concept;
skos:prefLabel "animals".

The validator accepted that and informs me that the equivalent XML/RDF is:

<?xml version="1.0"?>
<rdf:RDF
xmlns:skos="http://www.w3.org/2008/05/skos#"
xmlns:ex="http://www.example.com/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
 <skos:Concept rdf:about="http://www.example.com/animals">
  <skos:prefLabel>animals</skos:prefLabel>
 </skos:Concept>
</rdf:RDF>

So, the core XML/RDF for that one SKOS concept is:

 <skos:Concept rdf:about="http://www.example.com/animals">
  <skos:prefLabel>animals</skos:prefLabel>
 </skos:Concept>

The validator also tells me that there are two underlying triples here, one for the "Concept" and one for the "prefLabel":

<http://www.example.com/animals>
  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
  <http://www.w3.org/2008/05/skos#Concept> .
<http://www.example.com/animals>
  <http://www.w3.org/2008/05/skos#prefLabel>
  "animals" .

So, it actually is not so difficult, once somebody gives you a bunch of the clues.
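
To convince myself of what the "@prefix" lines are doing, here is a minimal sketch of prefix expansion in Python. It is not how the validator works internally (I have no idea how it is implemented). It simply takes the one Turtle statement, written out by hand as subject/predicate/object tuples, and expands each prefixed name into a full URI. The names expand and to_ntriples are my own, and the sketch handles only plain string literals.

```python
# Prefix table equivalent to the three @prefix lines above
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2008/05/skos#",
    "ex": "http://www.example.com/",
}


def expand(name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'ex:animals' into a full URI."""
    prefix, _, local = name.partition(":")
    if prefix not in prefixes:
        # Mirrors the validator's complaint about an undefined prefix
        raise KeyError(f"undefined prefix: {prefix}:")
    return prefixes[prefix] + local


# The Turtle statement, split by hand into its two triples
STATEMENT = [
    ("ex:animals", "rdf:type", "skos:Concept"),
    ("ex:animals", "skos:prefLabel", '"animals"'),
]


def to_ntriples(triples):
    """Render triples one per line with full URIs; quoted literals pass through unchanged."""
    lines = []
    for subject, predicate, obj in triples:
        rendered = obj if obj.startswith('"') else f"<{expand(obj)}>"
        lines.append(f"<{expand(subject)}> <{expand(predicate)}> {rendered} .")
    return lines


if __name__ == "__main__":
    print("\n".join(to_ntriples(STATEMENT)))
```

Remove the "ex" entry from the table and expand raises a KeyError, which is essentially the complaint I got from the validator the first time around.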

Now on to converting my raw XML acronym test into Turtle.

-- Jack Krupansky

Reading the Primer for Simple Knowledge Organization System (SKOS)

As background research for my little project to represent glossaries of terms and acronyms, I have started reading (studying) the primer for the W3C Simple Knowledge Organization System (SKOS). The primer introduces SKOS by saying:

SKOS -- Simple Knowledge Organization System -- provides a model for expressing the basic structure and content of concept schemes such as thesauri, classification schemes, subject heading lists, taxonomies, folksonomies, and other similar types of controlled vocabulary. As an application of the Resource Description Framework (RDF), SKOS allows concepts to be composed and published on the World Wide Web, linked with data on the Web and integrated into other concept schemes.

This document is a user guide for those who would like to represent their concept scheme using SKOS.

In basic SKOS, conceptual resources (concepts) are identified with URIs, labelled with strings in one or more natural languages, documented with various types of note, semantically related to each other in informal hierarchies and association networks, and aggregated into concept schemes.

In advanced SKOS, conceptual resources can be mapped across concept schemes and grouped into labelled or ordered collections. Relationships between concept labels can be specified. Finally, the SKOS vocabulary itself can be extended to suit the needs of particular communities of practice or combined with other modelling vocabularies.

This document is a companion to the SKOS Reference, which gives the normative reference on SKOS.

In W3C terminology, a document is "normative" when it describes features in terms of what is absolutely required, what is optional, and what absolutely must not be done, while a document is "nonnormative" (or merely "descriptive") when it more casually describes or explains or gives examples for features without necessarily giving the absolute requirements and full details.

The primer tells us that:

The Simple Knowledge Organization System (SKOS) is an RDF vocabulary for representing semi-formal knowledge organization systems (KOS), such as thesauri, taxonomies, classification schemes and subject heading lists. Because SKOS is based on the Resource Description Framework (RDF) [RDF-PRIMER] these representations are machine-readable and can be exchanged between software applications and published on the World Wide Web.

SKOS has been designed to provide a low-cost migration path for porting existing organization systems to the Semantic Web. SKOS also provides a lightweight, intuitive conceptual modeling language for developing and sharing new knowledge organization systems (KOSs). It can be used on its own, or in combination with more formal languages like the Web Ontology Language (OWL) [OWL]. SKOS can also be seen as a bridging technology, providing the missing link between the rigorous logical formalism of ontology languages such as OWL and the chaotic, informal and weakly-structured world of Web-based collaboration tools, as exemplified by social tagging applications.

The aim of SKOS is not to replace original conceptual vocabularies in their initial context of use, but to allow them to be ported to a shared space, based on a simplified model, enabling wider re-use and better interoperability.

In SKOS terminology, my little acronym project would be a concept scheme. A glossary is an example of a concept scheme. An individual acronym or term would be a concept in SKOS.

The starting point of SKOS is the concept which can have one or more labels as well as documentary notes. Then you can add semantic relationships between concepts.

Just to get started, here is a simple SKOS concept written in what is called Turtle notation (not pure XML):

ex:animals rdf:type skos:Concept;
  skos:prefLabel "animals".

This defines a concept named animals that has a preferred label of "animals." The "ex:" is simply a shorthand prefix that stands for a namespace URI, so ex:animals abbreviates the full URI that identifies the concept.

Extending that concept a little to allow for various natural languages for the same concept:

ex:animals rdf:type skos:Concept;
  skos:prefLabel "animals"@en;
  skos:prefLabel "animaux"@fr.

Extending a little more to allow synonyms:

ex:animals rdf:type skos:Concept;
  skos:prefLabel "animals"@en;
  skos:altLabel "creatures"@en;
  skos:prefLabel "animaux"@fr;
  skos:altLabel "créatures"@fr.

SKOS is also designed to support "near-synonyms", abbreviations, and acronyms. For example:

ex:fao rdf:type skos:Concept;
  skos:prefLabel "Food and Agriculture Organization"@en;
  skos:altLabel "FAO"@en.

I am not completely happy with the idea that SKOS does not distinguish between the various forms of alternate labels, and in particular with representing acronyms.

SKOS supports notes for adding documentation, for example:

ex:documentation skos:definition
  "the process of storing and retrieving information in all fields of knowledge"@en.

A better example of a definition for a concept in SKOS:

ex:pineapples rdf:type skos:Concept;
  skos:prefLabel "pineapples"@en;
  skos:prefLabel "ananas"@fr;
  skos:definition "The fruit of plants of the family Bromeliaceae"@en;
  skos:definition "Le fruit de la plante herbacée de la famille des broméliacées"@fr.

A simple example of defining a group of concepts as a concept scheme:

ex:animalThesaurus rdf:type skos:ConceptScheme;
  dc:title "Simple animal thesaurus";
  dc:creator ex:antoineIsaac.

ex:mammals rdf:type skos:Concept;
  skos:inScheme ex:animalThesaurus.

ex:cows rdf:type skos:Concept;
  skos:broader ex:mammals;
  skos:inScheme ex:animalThesaurus.

ex:fish rdf:type skos:Concept;
  skos:inScheme ex:animalThesaurus.

That example also illustrates how a concept such as ex:cows can be linked to a broader concept such as ex:mammals.
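
To see what a program could do with those skos:broader and skos:inScheme statements, here is a minimal sketch in Python. The data is copied by hand from the thesaurus example. The function names narrower_index and top_concepts are hypothetical names of my own, not anything defined by SKOS.

```python
from collections import defaultdict

# (concept, broader concept) pairs from the animal thesaurus example
BROADER = [("ex:cows", "ex:mammals")]

# concept -> concept scheme, from the skos:inScheme statements
IN_SCHEME = {
    "ex:mammals": "ex:animalThesaurus",
    "ex:cows": "ex:animalThesaurus",
    "ex:fish": "ex:animalThesaurus",
}


def narrower_index(broader_pairs):
    """Invert the broader links: broader concept -> list of narrower concepts."""
    index = defaultdict(list)
    for concept, broader in broader_pairs:
        index[broader].append(concept)
    return dict(index)


def top_concepts(in_scheme, broader_pairs, scheme):
    """Concepts in the scheme that have no broader concept, in sorted order."""
    has_broader = {concept for concept, _ in broader_pairs}
    return sorted(
        concept
        for concept, member_of in in_scheme.items()
        if member_of == scheme and concept not in has_broader
    )


if __name__ == "__main__":
    print(narrower_index(BROADER))
    print(top_concepts(IN_SCHEME, BROADER, "ex:animalThesaurus"))
```

With only one broader link the result is tiny, but the same inversion would let a glossary browser walk down from ex:mammals to everything filed beneath it.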

-- Jack Krupansky

Friday, October 3, 2008

uBio - Universal Biological Indexer and Organizer

I just heard about a biology taxonomy called uBio:

About uBio project

uBio is an initiative within the science library community to join international efforts to create and utilize a comprehensive and collaborative catalog of known names of all living (and once-living) organisms. The Taxonomic Name Server (TNS) catalogs names and classifications to enable tools that can help users find information on living things using any of the names that may be related to an organism.

I have not investigated it closely to determine whether it is simply a Web Service (SOAP interface) or whether it also is available on the Semantic Web (RDF).

More about what uBio does:

Information about organisms is often linked to a name.

This can create problems in information retrieval because:

uBio is working on tools for providers of biological information that address these problems.

The uBio Taxonomic Name Server acts as a name thesaurus.

Names have many different classes of relationships that can be used to organize and retrieve information that is annotated with names. These classes are divided into two inter-connected services.

NameBank is a repository of millions of recorded biological names and facts that link those names together. [more]

ClassificationBank stores multiple classifications and taxonomic concepts that are the result of expert opinions. It extends the functionality of NameBank. [more]

All data within these components are linked to mechanisms that provide credit and attribution to experts who provide name and linkage information within the TNS. [more]

Lastly, NameBank promotes the emergence of a layered biological informatics infrastructure that allows different expert systems to share common information. This conserves scarce resources and enhances the means to support continued expert work.

A foundation for collaboration

We are currently pursuing funding to separate the two logical components of the Taxonomic Name Server into separate services.

NameBank will become a biological name server focused on serving factual nomenclatural metadata. The ClassificationBank component will derive taxonomic concepts from cached NameBank records. Formalizing this division into discrete components provides us with increased collaborative opportunity by facilitating multiple taxonomic models atop a common core set of factual metadata.

Different taxonomic systems can share common facts

A common nomenclatural resource allows different information systems to address different taxonomic issues, scopes, or user communities while sharing common reference data. Collaboration eliminates duplication, increases accountable attribution of work, and provides a common interchange core. A contributor to NameBank ca

We seek to establish that such an approach is technically sound and can reduce inefficient duplication and derivation of established facts while promoting a more effective attribution pathway that can increase the reach of the taxonomic profession without compromising quality. NameBank can also enhance interoperability between different infrastructures by providing a common address space.

The difference between a vocabulary and a taxonomy is that the latter organizes the terms in a hierarchical relationship.

-- Jack Krupansky

Defining terms and glossaries for domains and projects

I am going to leap ahead and start to conceptualize some semantic building blocks that are useful for acronyms, terms, glossaries and other foundation concepts.

My starting set of core concepts is roughly:

  • Term - a single word or phrase that has a particular meaning. This may be a single term (one word, or two or more hyphen-separated words) or a compound term (two or more words or hyphen-separated words).
  • Glossary - a collection of terms relevant to a particular project or domain.
  • Domain - a field of study or area of interest. Multiple domains may overlap or intersect (à la a Venn diagram).
  • Project - a collection of subsets of domains that are under study for some purpose.
  • Abbreviation - a shortened or shorthand form of a term.
  • Acronym - a stylized abbreviation for one or more compound terms that utilizes a sequence of the initial letters or abbreviations of the individual terms of which it is composed.

A glossary may contain actual term definitions or references to terms that are defined in another XML resource document, either by themselves or within another glossary. In its simplest form, a glossary would simply be a list of term references or pointers to externally-defined terms.

Glossaries should also be nestable so that a glossary can incorporate existing glossaries in their entirety.

A glossary would be associated with zero or more domains that may or may not overlap or intersect. For example, my software_agent glossary might list the domains software_agent_technology, computing, software, and distributed_processing.

One open question is the relationship between a glossary and a vocabulary. My current thinking is that a glossary is simply a subset of terms of interest in some project or a sub-field of a domain or even of interest to an individual. A vocabulary for a domain would consist of the universe of terms that are relevant for that domain, regardless of whether those terms are collected into one or more glossaries or exist as discrete XML resources not contained in glossaries. A vocabulary may be more of a computed collection whereas a glossary might be hand-crafted for its intended specific project.

One important nuance is that sometimes an existing term is mostly relevant but needs some modification to be more completely relevant for the domain of a glossary or to simplify and customize it to be more relevant to a project or domain. In such cases we want a local override definition for the term plus a reference to one or more existing term definitions upon which the term is based or from which it is synthesized.
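
As a thought experiment, the nesting and local-override ideas can be sketched with a couple of Python data classes. This is only a rough model under my own assumptions: the class and field names (Term, Glossary, based_on, included) are hypothetical, the "computing#agent" reference syntax is made up, and the definitions are placeholders rather than real glossary entries.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Term:
    name: str
    definition: str
    # References to existing term definitions this one is based on (hypothetical syntax)
    based_on: List[str] = field(default_factory=list)


@dataclass
class Glossary:
    name: str
    domains: List[str] = field(default_factory=list)
    terms: Dict[str, Term] = field(default_factory=dict)      # local definitions and overrides
    included: List["Glossary"] = field(default_factory=list)  # nested glossaries

    def lookup(self, name: str) -> Optional[Term]:
        """Return the local definition if present, else search nested glossaries in order."""
        if name in self.terms:
            return self.terms[name]
        for glossary in self.included:
            found = glossary.lookup(name)
            if found is not None:
                return found
        return None


# Placeholder content, purely for illustration
computing = Glossary(
    name="computing",
    domains=["computing"],
    terms={
        "agent": Term("agent", "placeholder general definition"),
        "software": Term("software", "placeholder definition"),
    },
)
software_agent = Glossary(
    name="software_agent",
    domains=["software_agent_technology", "computing", "software", "distributed_processing"],
    terms={
        "agent": Term("agent", "placeholder project-specific definition",
                      based_on=["computing#agent"]),
    },
    included=[computing],
)

if __name__ == "__main__":
    print(software_agent.lookup("agent"))     # the local override wins
    print(software_agent.lookup("software"))  # found in the nested glossary
```

The design choice here is that a local definition always shadows an included one, while based_on keeps the pointer back to the definition it was derived from.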

-- Jack Krupansky

Thursday, October 2, 2008

Defining compound terms for acronyms

So far in my little acronym experiment, I defined a compound term simply as a string which happened to be a sequence of words. I actually started a separate experiment to look into defining a mini-dictionary or glossary of words and then use URI references to those XML resources in the definition of a compound term, but I ran into some issues that I was unable to resolve, so far. I may come back to that side experiment later, but it may become moot since I think the real solution is that each compound term should itself be a discrete XML resource and the acronym resource should simply tie the acronym term to the XML resource for the compound term.

I have not figured out all of the details yet, but rather than the acronym term "ABC" being defined as the string "Agent-Based Computing" or even as the sequence of references to the XML resources for the individual terms "Agent-Based" and "Computing", the definition would be a single reference to a distinct XML resource for agent-based_computing.

Similarly, the definition for the acronym term "RSS" would be a collection of three references to distinct XML resources for really_simple_syndication, rich_site_summary, and rdf_site_summary.

I have not yet worked out the details, but I think I need to construct a standalone XML schema for a compound term, or maybe have the concept of a compound term glossary which is a list of compound terms relevant to a particular domain or subdomain. So, some compound terms could be represented as a single compound term in a single XML document, or a project could collect all of its compound terms into a glossary. There are pros and cons to both approaches.

The only problem here is that it introduces a separation between the abstract compound term for an acronym and the text of the words from which the individual letters of the acronym are derived.

One solution is to include both the text definition and the XML resource reference. Or, if the text of the compound term is included in the XML resource definition for the compound term then it can be obtained indirectly.
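
Here is a minimal sketch of the second option, where the acronym holds only references and the text is obtained indirectly from the compound term resource. Plain Python dictionaries stand in for the XML resources, and the function name expansions is my own. The resource identifiers are the ones mentioned above.

```python
# Stand-in for the compound term resources: identifier -> text of the compound term
COMPOUND_TERMS = {
    "agent-based_computing": "Agent-Based Computing",
    "really_simple_syndication": "Really Simple Syndication",
    "rich_site_summary": "Rich Site Summary",
    "rdf_site_summary": "RDF Site Summary",
}

# Stand-in for the acronym resources: acronym term -> references to compound terms
ACRONYMS = {
    "ABC": ["agent-based_computing"],
    "RSS": ["really_simple_syndication", "rich_site_summary", "rdf_site_summary"],
}


def expansions(acronym):
    """Follow each reference to fetch the compound term text indirectly."""
    return [COMPOUND_TERMS[ref] for ref in ACRONYMS.get(acronym, [])]


if __name__ == "__main__":
    print(expansions("RSS"))
```

The acronym entry never stores the words themselves, so a correction to a compound term's text only has to be made in one place.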

Or maybe the process by which the text of the acronym was derived is simply historical and is not strictly needed to operate at the purely semantic level.

Another approach is to actually decompose the words of the compound term and represent them in a structure that is organized by the sequence of letters of the acronym term. This structure would be kept with the acronym even though there is also a direct reference from the acronym resource to the XML resource for the compound term.
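
A rough sketch of that decomposition, assuming the simplest possible rule: split the compound term on spaces and hyphens and pair each word with one letter of the acronym. The function name decompose is hypothetical. The sketch deliberately gives up (raises ValueError) on acronyms that skip words or take more than one letter from a word.

```python
import re


def decompose(acronym, compound_term):
    """Pair each letter of the acronym with the word it was taken from."""
    # Split on runs of whitespace and hyphens, so "Agent-Based" yields two words
    words = [word for word in re.split(r"[\s-]+", compound_term) if word]
    if len(words) != len(acronym):
        raise ValueError(f"{acronym!r} does not line up one letter per word")
    pairs = []
    for letter, word in zip(acronym, words):
        if word[0].upper() != letter.upper():
            raise ValueError(f"{letter!r} is not the initial letter of {word!r}")
        pairs.append((letter, word))
    return pairs


if __name__ == "__main__":
    print(decompose("ABC", "Agent-Based Computing"))
    print(decompose("RSS", "RDF Site Summary"))
```

Something like "FAO" for "Food and Agriculture Organization" fails this simple rule because "and" contributes no letter, so a real structure would need to allow words that are skipped.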

Incidentally, I already have a lot of resources on the Web for compound terms and acronyms, but they are in text in HTML documents rather than in XML. I will give some thought as to how I might want to organize those compound terms and how to split the existing HTML into raw XML and presentation HTML that feeds off of that XML. There are links between many of my compound terms, which would mean XML resource references in the XML as well as synthesized HTML links for the presentation of the compound terms.

OTOH, it was not my intention to dive into how to solve the problem of representing full-blown term and compound term definitions at this time, but rather to tackle the simpler problem of acronyms. I need to figure out which portion of the problem to carve off to continue work on acronyms.

-- Jack Krupansky