Tuesday, January 26, 2010

The Perl RDF project

I just stumbled across The Perl RDF project. As the web site says:

The Perl RDF project hopes to address these issues:

  1. Publish an official API for storage, parsing and serializing modules.
  2. Produce a set of base classes for representing common RDF objects such as statements and nodes (resources, literals, blank nodes).
  3. Produce patches to existing RDF tools to support these APIs, subclassing where appropriate.
  4. Produce a test suite for storage, parsing, serializing, statement and node classes.

The project includes SPARQL support, a triple store, a web crawler for RDF resources, and parsing of RDFa.

And, there is a mailing list: perlrdf mailing-list.

Sounds like an effort to keep an eye on. No matter what your preferred heavyweight RDF Semantic Web toolset is, lightweight Perl hacking is clearly a useful adjunct.

-- Jack Krupansky


Doing a little Semantic Web programming with RDF2Go

I was actually doing a little (very little) Semantic Web programming yesterday. I did not even realize it until I was done. I was tracking down a nasty time stamp issue with some client code that uses the file/web crawling features of Aperture (1.4), which uses RDF2Go under the hood for storing file names and time stamps. Normally that is all transparent and problem-free, but I was doing something tricky (if people are paying me to do something, you can bet that there is something out of the ordinary involved). To track down the problem I needed to verify the exact file names that Aperture was tracking. To do that, I needed to access and dump the Aperture repository.

Ultimately, I solved my problem fairly easily, but seeing and understanding what was in the repository was a big help.

I won't go into all of the gory details, but some of the concepts are worth noting.

What is Aperture? According to the Aperture home page on SourceForge:

Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems (e.g. file systems, web sites, mail boxes) and the file formats (e.g. documents, images) occurring in these systems.

As I said, Aperture keeps track of those information sources using a repository based on RDF2Go. According to the RDF2Go home page:

RDF2Go is an abstraction over triple (and quad) stores. It allows developers to program against rdf2go interfaces and choose or change the implementation later easily.

Each RDF graph is stored as a model in RDF2Go. Each RDF2Go model has a context. Essentially the context is the name for the named graph that is stored as a model.

An RDF2Go repository contains one or more models, also referred to as a model set. In other words, a repository can hold multiple named RDF graphs.

And finally, an RDF2Go model consists of any number of statements, which are the actual RDF statements that make up the named RDF graph. Each RDF statement is a triple with three parts: the subject, the predicate, and the object (S, P, O). The subject is a URI or a blank node, the predicate is a URI, and the object can be a URI, a blank node, or a literal value. My errant file names were stored in the subject field and the time stamps in the object field. My root path for my Aperture crawl was stored as the context or model name. Ultimately, Aperture stored two statements for each file (one a date, the other the time stamp). Iterating through the models in the model set gave me a list of the context names, that is, my root file paths (sometimes file system paths, sometimes Web URLs).
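To make the model set / model / statement layering concrete, here is a rough sketch in plain Python. It does not use the RDF2Go or Aperture APIs at all; the dictionary layout, the function name dump_model_set, the ex:date and ex:timestamp predicates, and the sample path and values are all made up for illustration.

```python
from typing import Dict, List, Tuple

# A statement is modeled here as a (subject, predicate, object) tuple of strings.
Statement = Tuple[str, str, str]

# A model set maps each context (the graph name) to the statements in that model.
ModelSet = Dict[str, List[Statement]]


def dump_model_set(model_set: ModelSet) -> List[str]:
    """Return one line per context, then one indented line per statement."""
    lines = []
    for context in sorted(model_set):
        lines.append(f"context: {context}")
        for subject, predicate, obj in model_set[context]:
            lines.append(f"  {subject} {predicate} {obj}")
    return lines


# Hypothetical data: one crawl root holding two statements for a single file.
example: ModelSet = {
    "file:/data/docs/": [
        ("file:/data/docs/report.txt", "ex:date", "2010-01-25"),
        ("file:/data/docs/report.txt", "ex:timestamp", "1264400000000"),
    ],
}

for line in dump_model_set(example):
    print(line)
```

Walking the contexts first and the statements second mirrors the dump described above: the outer loop yields the crawl roots, the inner loop yields the file names (subjects) and their time stamps (objects).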

RDF2Go is not really a data repository itself, but an abstraction that can work with a variety of repositories, or so-called stores.

The difference between a quad store and a triple store is that a triple store by itself represents an unnamed graph, while a quad store is capable of representing named graphs, with that fourth piece of information being the context or graph name. In practice, a lot of people use the terms interchangeably and we tend to implicitly forgive people who refer to quad stores as triple stores.
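The triple versus quad distinction can be sketched in a few lines of plain Python. This is only an illustration with made-up names (group_by_context, merge_to_triples) and made-up sample data, not how RDF2Go or any particular store is implemented.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
Quad = Tuple[str, str, str, str]  # subject, predicate, object, context


def group_by_context(quads: List[Quad]) -> Dict[str, List[Triple]]:
    """Quad-store view: keep the triples of each named graph separate."""
    graphs: Dict[str, List[Triple]] = defaultdict(list)
    for subject, predicate, obj, context in quads:
        graphs[context].append((subject, predicate, obj))
    return dict(graphs)


def merge_to_triples(quads: List[Quad]) -> Set[Triple]:
    """Triple-store view: drop the context, leaving one unnamed graph."""
    return {(s, p, o) for s, p, o, _context in quads}


# Hypothetical data: the same statement recorded under two different contexts.
quads = [
    ("ex:fileA", "ex:timestamp", "100", "file:/root1/"),
    ("ex:fileA", "ex:timestamp", "100", "http://example.org/root2/"),
]

named_graphs = group_by_context(quads)
unnamed_graph = merge_to_triples(quads)
print(len(named_graphs), len(unnamed_graph))
```

With the context kept, the two statements stay in two separate named graphs; once the context is dropped, they collapse into a single triple.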

-- Jack Krupansky


Saturday, January 23, 2010

Applying Semantic Web technologies to extract intelligence from Twitter data

Morton Swimmer, Senior Threat Researcher with Trend Micro, Inc., has an interesting slide presentation about the use of Semantic Web technologies to analyze Twitter data for "intelligence", particularly to identify malware threats. See "Twarfing: Gathering Intelligence from Twitter Data." The slides were for a recent presentation at the New York Semantic Web Meetup.

Twitter tweets are analyzed, mapped into RDF, stored in an RDF quadstore database, and then queried via SPARQL. His approach makes use of the SIOC, FOAF (Friend Of A Friend), GeoOWL, and Dublin Core ontologies.

Currently, JSON and CouchDB are used in the processing of tweets.

He mentions "probable" use of Lucene in future work. A "cocktail napkin" block diagram identifies Lucene, but it is not clear whether that is in the current architecture or a future design.

The presentation includes a couple of SPARQL examples of "patterns" to identify both users who are promoting malware sites and the sites themselves, based on past references to sites that have been identified as malware sites.

He also mentions the use of "text signatures" to identify similar references across a wide range of tweets.

-- Jack Krupansky