Tuesday, September 30, 2008

Simple Knowledge Organisation Systems (SKOS)

I will continue to pursue my little acronym project, but it is possible that it will eventually simply map into a Simple Knowledge Organisation Systems (SKOS) language. Oversimplifying, SKOS provides capabilities for defining controlled vocabularies.

From the Wikipedia:

SKOS Core [16] defines the classes and properties sufficient to represent the common features found in a standard thesaurus. It is based on a concept-centric view of the vocabulary, where primitive objects are not terms, but abstract concepts represented by terms. Each SKOS concept is defined as an RDF resource. Each concept can have RDF properties attached, including:

  • one or more preferred index terms (at most one in each natural language)
  • alternative terms or synonyms
  • definitions and notes, with specification of their language.

Concepts can be organized in hierarchies using broader-narrower relationships, or linked by non-hierarchical (associative) relationships. Concepts can be gathered in concept schemes, to provide consistent and structured sets of concepts, representing whole or part of a controlled vocabulary.

These features represent the stable part of SKOS Core. Other elements of the vocabulary are still considered unstable.

Acronyms might fit in as "alternative terms", but I am not so sure about that. To me, alternative terms should be full-fledged, first-class terms and not abbreviations or acronyms.

In any case, SKOS will be on my future reading list. For now, I want to keep the acronym project as simple as possible.

-- Jack Krupansky

Friday, September 26, 2008

Adding multiple definitions for an acronym

In the real world, there may be multiple definitions of the same acronym. Sometimes they are from distinct domains and unrelated but sometimes they have evolved over time within a single domain, possibly for variations in usage or different audiences. For example, RSS is commonly accepted to stand for Really Simple Syndication, but it technically stands for Rich Site Summary or even RDF Site Summary.

There are three ways to give multiple definitions for a single acronym:

  1. Define the acronym in multiple XML documents.
  2. Place multiple acronym definitions in a single XML document.
  3. Extend the schema definition for acronym to allow multiple definitions.

Ultimately, #1 is probably best and represents the distributed nature of the Web and Semantic Web and supports definitions within distinct domains. #2 can make sense when there is some obvious connection between the definitions such as for my RSS example. #3 is a tighter way of doing #2 and also ties the multiple meanings together.

I have created a sample XML document, acronym2a.xml, that illustrates placing multiple definitions of the same acronym term in a single XML document. Here is the fragment of that document for RSS:

<Acronym>
  <Term>RSS</Term>
  <CompoundTerm>Really Simple Syndication</CompoundTerm>
</Acronym>
<Acronym>
  <Term>RSS</Term>
  <CompoundTerm>Rich Site Summary</CompoundTerm>
</Acronym>
<Acronym>
  <Term>RSS</Term>
  <CompoundTerm>RDF Site Summary</CompoundTerm>
</Acronym>

This sample document uses the same schema as my second example, acronym2.xsd.

This approach basically works, but does nothing to suggest that these "meanings" are related and requires excessive verbiage.
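One way to see that the repeated entries belong together is to group them by term after parsing. The following is only a minimal sketch, assuming Python and its standard xml.etree.ElementTree module, neither of which is part of the acronym project itself. The SAMPLE string is an inline copy of the RSS fragment wrapped in an Acronyms root element, and group_definitions is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Inline copy of the RSS fragment, wrapped in an <Acronyms> root so it parses.
SAMPLE = """<Acronyms>
  <Acronym><Term>RSS</Term><CompoundTerm>Really Simple Syndication</CompoundTerm></Acronym>
  <Acronym><Term>RSS</Term><CompoundTerm>Rich Site Summary</CompoundTerm></Acronym>
  <Acronym><Term>RSS</Term><CompoundTerm>RDF Site Summary</CompoundTerm></Acronym>
</Acronyms>"""


def group_definitions(xml_text):
    """Map each Term to the list of its CompoundTerm strings, in document order."""
    grouped = defaultdict(list)
    for acronym in ET.fromstring(xml_text).findall("Acronym"):
        grouped[acronym.findtext("Term")].append(acronym.findtext("CompoundTerm"))
    return dict(grouped)


definitions = group_definitions(SAMPLE)
print(definitions)
```

The grouping happens entirely in the consuming program, so the XML itself still says nothing about the three meanings being related.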

Next, I modified the schema to allow an arbitrary list of compound term definitions for each acronym. Unfortunately, I have not yet been able to figure out how to design such a schema that does not require an extra level of XML element to represent the list. The new scheme does work, but is a bit more wordy than I would prefer.

So, using the old schema we wrote:

<Acronym>
  <Term>RDF</Term>
  <CompoundTerm>Resource Description Framework</CompoundTerm>
</Acronym>

But with the new schema that same exact definition becomes:

<Acronym>
  <Term>RDF</Term>
  <CompoundTerms>
    <CompoundTerm>Resource Description Framework</CompoundTerm>
  </CompoundTerms>
</Acronym>

I am still hoping that I can find a way to design the schema to make that extra level of XML element grouping optional, but for now at least this approach is functional.
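Existing data in the old form could be moved to the new form mechanically. Here is a rough sketch of that conversion, assuming Python's standard xml.etree.ElementTree module. The OLD string is an inline copy of the RDF example, and wrap_compound_terms is a hypothetical helper, not something from the project.

```python
import xml.etree.ElementTree as ET

# Old-style entry: CompoundTerm is a direct child of Acronym.
OLD = """<Acronyms>
  <Acronym><Term>RDF</Term><CompoundTerm>Resource Description Framework</CompoundTerm></Acronym>
</Acronyms>"""


def wrap_compound_terms(xml_text):
    """Move every CompoundTerm child of an Acronym under a new CompoundTerms element."""
    root = ET.fromstring(xml_text)
    for acronym in root.findall("Acronym"):
        terms = acronym.findall("CompoundTerm")  # direct children only
        wrapper = ET.SubElement(acronym, "CompoundTerms")
        for term in terms:
            acronym.remove(term)
            wrapper.append(term)
    return root


converted = wrap_compound_terms(OLD)
print(ET.tostring(converted, encoding="unicode"))
```

This only rearranges elements; it does not check the result against either schema.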

Anyway, the XML that combines the three RSS definitions for one acronym now becomes:

<Acronym>
  <Term>RSS</Term>
  <CompoundTerms>
    <CompoundTerm>Really Simple Syndication</CompoundTerm>
    <CompoundTerm>Rich Site Summary</CompoundTerm>
    <CompoundTerm>RDF Site Summary</CompoundTerm>
  </CompoundTerms>
</Acronym>

This is now finally starting to look somewhat useful for structuring information, albeit at a very simple level.

One thing that immediately stands out for future work is that rather than "RDF" simply being a string, it would be preferable to actually link that first word of the third definition of RSS to the synonym definition for RDF. That would then start to have the feel of more of a "semantic" Web.
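Until the schema supports real links, a program can at least detect where such a link would go. The sketch below is an illustration only, assuming Python's standard xml.etree.ElementTree module. It scans each compound term for words that are themselves defined acronyms in the same document. The SAMPLE string is a trimmed inline copy of the sample data, and find_links is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed inline copy of the sample data using the CompoundTerms wrapper.
SAMPLE = """<Acronyms>
  <Acronym><Term>RDF</Term><CompoundTerms>
    <CompoundTerm>Resource Description Framework</CompoundTerm>
  </CompoundTerms></Acronym>
  <Acronym><Term>RSS</Term><CompoundTerms>
    <CompoundTerm>Really Simple Syndication</CompoundTerm>
    <CompoundTerm>Rich Site Summary</CompoundTerm>
    <CompoundTerm>RDF Site Summary</CompoundTerm>
  </CompoundTerms></Acronym>
</Acronyms>"""


def find_links(xml_text):
    """Return (acronym, compound term, word) triples where word is another defined acronym."""
    root = ET.fromstring(xml_text)
    known = {a.findtext("Term") for a in root.findall("Acronym")}
    links = []
    for acronym in root.findall("Acronym"):
        term = acronym.findtext("Term")
        for compound in acronym.findall("CompoundTerms/CompoundTerm"):
            for word in compound.text.split():
                if word in known and word != term:
                    links.append((term, compound.text, word))
    return links


links = find_links(SAMPLE)
print(links)
```

Matching on plain strings like this is exactly the weakness the paragraph above describes: the connection is inferred by the program rather than stated in the data.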

The full sample XML, acronym3.xml, is also available online:

<?xml version="1.0" encoding="utf-8"?>
<!-- Created with Liquid XML Studio 6.1.17.0 - FREE Community Edition (http://www.liquid-technologies.com) -->
<Acronyms xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="http://agtivity.com/xsd/acronym3.xsd">
  <Acronym>
    <Term>ABC</Term>
    <CompoundTerms>
      <CompoundTerm>Agent-Based Computing</CompoundTerm>
    </CompoundTerms>
  </Acronym>
  <Acronym>
    <Term>RDF</Term>
    <CompoundTerms>
      <CompoundTerm>Resource Description Framework</CompoundTerm>
    </CompoundTerms>
  </Acronym>
  <Acronym>
    <Term>RSS</Term>
    <CompoundTerms>
      <CompoundTerm>Really Simple Syndication</CompoundTerm>
      <CompoundTerm>Rich Site Summary</CompoundTerm>
      <CompoundTerm>RDF Site Summary</CompoundTerm>
    </CompoundTerms>
  </Acronym>
</Acronyms>

The full schema, acronym3.xsd, is starting to get a little verbose, but still fairly manageable:

<?xml version="1.0" encoding="utf-8" ?>
<!--Created with Liquid XML Studio 6.1.17.0 - FREE Community Edition (http://www.liquid-technologies.com)-->
<xs:schema elementFormDefault="qualified"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Acronyms" type="AcronymList" />
  <xs:complexType name="Acronym">
    <xs:all>
      <xs:element name="Term" type="xs:string" />
      <xs:element name="CompoundTerms" type="CompoundTermList" />
    </xs:all>
  </xs:complexType>
  <xs:complexType name="AcronymList">
    <xs:sequence minOccurs="0" maxOccurs="unbounded">
      <xs:element name="Acronym" type="Acronym" />
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="CompoundTermList">
    <xs:sequence minOccurs="0" maxOccurs="unbounded">
      <xs:element name="CompoundTerm" type="CompoundTerm" />
    </xs:sequence>
  </xs:complexType>
  <xs:simpleType name="CompoundTerm">
    <xs:restriction base="xs:string" />
  </xs:simpleType>
</xs:schema>

Basically, I added two defined types: a complex type named CompoundTermList that is a container for the arbitrary list of acronym definitions, and a simple type named CompoundTerm that represents a single compound term. The other change was that the second element of Acronym is now a reference to a CompoundTermList rather than being a simple string. I could have stayed with simple strings for the elements of a CompoundTermList, but I have thoughts about wanting to allow for more structure within a compound term in the future, such as "RDF" being a URI reference to the RDF synonym.

Once again, do not despair if a lot of this seems like total gibberish -- because it is! The goal at this stage is simply to get a flavor of XML, schemas, and Semantic Web Technologies so we have some footing before diving off the deep end.

The next thing I am thinking about is to produce rudimentary term and phrase schemas so that an acronym can refer to a term as a full-fledged XML resource and so that a compound term would be a sequence of references to term resources rather than literal string values.

-- Jack Krupansky

XML tutorial info

One online source of XML tutorial material that I have found useful is W3Schools.com, especially the XML Schema Tutorial.

-- Jack Krupansky

Wednesday, September 24, 2008

Dirt simple XML schema for acronyms

Although it was not my original intent to dive into XML "code" so soon, I was feeling more than a little disoriented and felt a need to get at least some footing before delving into all of the conceptual angles. In particular, I figured that by trying out an interactive XML schema design tool I could very quickly get a small schema running without the need to master all of the nuances of XML Schema. The process did not go quite as smoothly as I had expected, but several hours later I do have two small test schemas for acronyms, as well as two test XML files based on those schemas.

Without any further ado I will present the two test XML files, but I do not intend to offer a tutorial on all of the XML angles at this time. Some stuff is obvious and some stuff may not be explainable even in a series of blog posts. Focus on what is obvious and ignore the rest, for now. One might wonder why I do not present the schemas first, but the simple fact is that XML schemas are somewhat cryptic and it is much simpler to have visualized some sample XML text in your head before trying to make sense of the schemas. You may also be wondering why I have two schemas, but that will be clear in a moment.

All of my XML-related files will be kept on my Software Agent Web site, Agtivity.com.

The tool that I used to create the XML schemas and XML test files is Liquid XML Studio 6.1.17.0 - Free Community Edition from Liquid Technologies Limited.

So, here it is, my first test XML file for acronyms, acronym1.xml:

<?xml version="1.0" encoding="utf-8"?>
<!-- Created with Liquid XML Studio 6.1.17.0 - FREE Community Edition (http://www.liquid-technologies.com) -->
<Acronyms xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="http://agtivity.com/xsd/acronym1.xsd">
  <Acronym Term="ABC" CompoundTerm="Agent-Based Computing" />
  <Acronym Term="RDF" CompoundTerm="Resource Description Framework" />
</Acronyms>

It only has two acronyms, but it should be fairly obvious how to add more. They are completely expressed by these two lines:

  <Acronym Term="ABC" CompoundTerm="Agent-Based Computing" />
  <Acronym Term="RDF" CompoundTerm="Resource Description Framework" />

Each acronym has a term and the equivalent compound term. Pretty simple stuff, or so it would seem. In XML parlance Term and CompoundTerm are known as attributes. In this schema, each acronym has two attributes, a Term, and a CompoundTerm.
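To make the idea of attributes concrete, here is a small sketch of how a program might read them, assuming Python's standard xml.etree.ElementTree module (my choice for illustration, not a tool used in these posts). The SAMPLE string is an inline copy of the two acronym lines inside an Acronyms root element.

```python
import xml.etree.ElementTree as ET

# Inline copy of the attribute-based sample, without the schema header.
SAMPLE = """<Acronyms>
  <Acronym Term="ABC" CompoundTerm="Agent-Based Computing" />
  <Acronym Term="RDF" CompoundTerm="Resource Description Framework" />
</Acronyms>"""

root = ET.fromstring(SAMPLE)

# Each attribute is looked up by name on its element.
pairs = [(a.get("Term"), a.get("CompoundTerm")) for a in root.findall("Acronym")]

for term, compound in pairs:
    print(f"{term} = {compound}")
```

An attribute holds a single string value, which is why this form suits only flat, unstructured data.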

With this image of what the XML data actually looks like, it will be easier to make sense of the XML schema.

So, here it is, my first XML Schema for acronyms, acronym1.xsd:

<?xml version="1.0" encoding="utf-8" ?>
<!--Created with Liquid XML Studio 6.1.17.0 - FREE Community Edition (http://www.liquid-technologies.com)-->
<xs:schema elementFormDefault="qualified"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Acronyms" type="AcronymList" />
  <xs:complexType name="Acronym">
    <xs:attribute name="Term" type="xs:string" />
    <xs:attribute name="CompoundTerm" type="xs:string" />
  </xs:complexType>
  <xs:complexType name="AcronymList">
    <xs:sequence>
      <xs:element minOccurs="0" maxOccurs="unbounded"
          name="Acronym" type="Acronym" />
    </xs:sequence>
  </xs:complexType>
</xs:schema>

There is plenty of gibberish there, but the essence is this: the schema defines a list of acronyms using a complexType named AcronymList, which consists of zero or more elements of the type Acronym. Acronym is also a complexType and consists simply of two string attributes, one called Term and the other called CompoundTerm.

Back in acronym1.xml, you can see that the xsi:noNamespaceSchemaLocation attribute gives the URL of the schema file, acronym1.xsd.
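A program can read that schema URL like any other attribute, as long as it uses the full xsi namespace name. The sketch below assumes Python's standard xml.etree.ElementTree module and an inline, shortened copy of acronym1.xml. Note that attribute values must be quoted for the parser to accept the document, and that ElementTree only reports the URL; it does not fetch the schema or validate against it.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the xsi prefix in the sample files.
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Shortened inline copy of acronym1.xml.
SAMPLE = """<Acronyms xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="http://agtivity.com/xsd/acronym1.xsd">
  <Acronym Term="ABC" CompoundTerm="Agent-Based Computing" />
</Acronyms>"""

root = ET.fromstring(SAMPLE)

# ElementTree names a namespaced attribute as "{namespace-uri}local-name".
schema_url = root.get(f"{{{XSI}}}noNamespaceSchemaLocation")
print(schema_url)
```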

If you can make sense out of all of this, that is great, but at least you have been exposed to what it takes to do even something very simple in XML. Actually, it is not too bad, but it is a bit more like looking at the components and wiring inside your computer rather than simply figuring out how to use it.

But wait... we are only halfway done. I said that there were two distinct approaches to the schema and test file. The first schema defined an acronym in terms of two attributes, which is fine for very simple, unstructured data, but is too limiting for structured data. The second approach to the schema uses elements rather than attributes.

So, here it is, my second test XML file for acronyms, acronym2.xml, using elements, rather than attributes:

<?xml version="1.0" encoding="utf-8"?>
<!-- Created with Liquid XML Studio 6.1.17.0 - FREE Community Edition (http://www.liquid-technologies.com) -->
<Acronyms xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="http://agtivity.com/xsd/acronym2.xsd">
  <Acronym>
    <Term>ABC</Term>
    <CompoundTerm>Agent-Based Computing</CompoundTerm>
  </Acronym>
  <Acronym>
    <CompoundTerm>Resource Description Framework</CompoundTerm>
    <Term>RDF</Term>
  </Acronym>
</Acronyms>

The header is almost identical but points to the second schema. The main difference is that each acronym takes four lines rather than a single line. My simple acronym example does not (yet) need the power of structured (nested) elements, but I hope you can see how it might be used. Future blog posts will explore the matter further. Anyway, a single acronym has four lines:

  <Acronym>
    <Term>ABC</Term>
    <CompoundTerm>Agent-Based Computing</CompoundTerm>
  </Acronym>

The first line is the same as the acronym line in the first test file, but without the attributes. The last line marks the "end" of the acronym and the elements of the acronym are in between. It is fairly obvious how the values of the Term and CompoundTerm elements are expressed.

Now, here is my second XML Schema for acronyms, acronym2.xsd, using elements rather than attributes:

<?xml version="1.0" encoding="utf-8" ?>
<!--Created with Liquid XML Studio 6.1.17.0 - FREE Community Edition (http://www.liquid-technologies.com)-->
<xs:schema elementFormDefault="qualified"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Acronyms" type="AcronymList" />
  <xs:complexType name="Acronym">
    <xs:all>
      <xs:element name="Term" type="xs:string" />
      <xs:element name="CompoundTerm" type="xs:string" />
    </xs:all>
  </xs:complexType>
  <xs:complexType name="AcronymList">
    <xs:sequence>
      <xs:element minOccurs="0" maxOccurs="unbounded"
          name="Acronym" type="Acronym" />
    </xs:sequence>
  </xs:complexType>
</xs:schema>

The AcronymList complex type is the same as in the first schema. The essential difference is that the Acronym complex type now consists of a group of elements (xs:all), all of which must appear in the XML data, though in any order, and those elements are simple, unstructured, scalar types.
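Because an xs:all group does not fix the order of its children, a reader of this data should look elements up by name rather than by position. The following is a minimal sketch of that, assuming Python's standard xml.etree.ElementTree module. The SAMPLE string is an inline copy of the two entries from acronym2.xml, and read_acronyms is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Inline copy of the acronym2.xml entries; the RDF entry lists CompoundTerm first.
SAMPLE = """<Acronyms>
  <Acronym>
    <Term>ABC</Term>
    <CompoundTerm>Agent-Based Computing</CompoundTerm>
  </Acronym>
  <Acronym>
    <CompoundTerm>Resource Description Framework</CompoundTerm>
    <Term>RDF</Term>
  </Acronym>
</Acronyms>"""


def read_acronyms(xml_text):
    """Return {Term: CompoundTerm}, looking up child elements by name, not position."""
    root = ET.fromstring(xml_text)
    return {a.findtext("Term"): a.findtext("CompoundTerm") for a in root.findall("Acronym")}


acronyms = read_acronyms(SAMPLE)
print(acronyms)
```

ElementTree does not validate against the schema, so this shows only that the two orderings carry the same data once parsed.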

Once again, if you can make sense out of all of this, that is great, but at least you have been exposed to what it takes to do even something very simple in XML.

The good news is that now that we have a lot of the basic stuff out of the way, we can incrementally build on it.

Note that this is still not a true Semantic Web since it does not use RDF, but it does show how Semantic Web Technologies can be used. At some point down the road I will convert the XML Schema to a full-blown OWL ontology and start using RDF triples.

-- Jack Krupansky