cx:rdfa

Name

cx:rdfa — Extract RDF triples from RDFa encoded documents.

Synopsis

<p:declare-step type="cx:rdfa" xmlns:cx="http://xmlcalabash.com/ns/extensions">
     <p:input port="source"/>
     <p:output port="result" sequence="true"/>
     <p:option name="max-triples-per-document" select="100"/>
</p:declare-step>

Description

This step uses the Semargl project libraries to extract RDF triples from RDFa encoded documents. The results are returned in a sequence of XML documents that encode the triples directly. If there are no triples in the source document, an empty sequence of documents is produced.

If there are triples, they will be encoded in one or more sem:triples documents.

Consider this example:

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="1.0">
   <p:output port="result" sequence="true"/>
   <p:serialization port="result" indent="true"/>

   <p:declare-step type="cx:rdfa">
     <p:input port="source"/>
     <p:output port="result" sequence="true"/>
     <p:option name="max-triples-per-document" select="100"/>
   </p:declare-step>

   <cx:rdfa max-triples-per-document="100">
     <p:input port="source">
       <p:document href="http://examples.tobyinkster.co.uk/hcard"/>
     </p:input>
   </cx:rdfa>

</p:declare-step>

On 12 October 2013, using the Semargl 0.6.1 libraries, this pipeline extracted the following triples^[2]:

<sem:triples xmlns:sem="http://marklogic.com/semantics">
   <sem:triple>
      <sem:subject>http://examples.tobyinkster.co.uk/hcard</sem:subject>
      <sem:predicate>http://purl.org/dc/terms/abstract</sem:predicate>
      <sem:object xml:lang="en">This page is intended to be a demonstration of
                the use of RDFa (including FOAF, Dublin Core and W3C PIM vocabularies) in
                conjunction with Microformats (including hCard and rel-tag).</sem:object>
   </sem:triple>
   <sem:triple>
      <sem:subject>http://examples.tobyinkster.co.uk/hcard#jack</sem:subject>
      <sem:predicate>http://www.w3.org/2006/vcard/ns#category</sem:predicate>
      <sem:object xml:lang="en">Counter-Terrorist Unit</sem:object>
   </sem:triple>
   <sem:triple>
      <sem:subject>http://examples.tobyinkster.co.uk/hcard#jack</sem:subject>
      <sem:predicate>http://xmlns.com/foaf/0.1/plan</sem:predicate>
      <sem:object xml:lang="en">I will kick your terrorist ass!</sem:object>
   </sem:triple>
</sem:triples>

The format of sem:triples files is straightforward: each file contains a set of one or more sem:triple elements. Each sem:triple in turn contains a sem:subject, a sem:predicate, and a sem:object.

The subject and predicate are always IRIs; the object is either an IRI or a literal value. The object is an IRI unless it has a datatype or xml:lang attribute, in which case it is a literal.

If any IRI begins with “http://marklogic.com/semantics/blank/”, it represents a blank node.
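The rules above can be sketched in a few lines of Python. This is only an illustration of how a consumer might read a sem:triples document, not part of the step itself. The function names (read_triples, classify_iri) and the second triple in the sample are hypothetical, and the sketch assumes the datatype attribute is unqualified.

```python
import xml.etree.ElementTree as ET

SEM = "{http://marklogic.com/semantics}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
BLANK_PREFIX = "http://marklogic.com/semantics/blank/"


def classify_iri(value):
    """Return 'blank' for blank-node IRIs, otherwise 'iri'."""
    return "blank" if value.startswith(BLANK_PREFIX) else "iri"


def read_triples(xml_text):
    """Yield (subject, predicate, object, object_kind) for each sem:triple."""
    root = ET.fromstring(xml_text)
    for triple in root.iter(SEM + "triple"):
        subject = triple.findtext(SEM + "subject", default="").strip()
        predicate = triple.findtext(SEM + "predicate", default="").strip()
        obj = triple.find(SEM + "object")
        text = obj.text or ""
        # An object is a literal only if it carries datatype or xml:lang
        # (assumption: the datatype attribute has no namespace).
        if "datatype" in obj.attrib or XML_LANG in obj.attrib:
            kind = "literal"
        else:
            text = text.strip()
            kind = classify_iri(text)
        yield subject, predicate, text, kind


# The first triple comes from the example output above; the second is
# made up to show an IRI object and a blank-node subject.
SAMPLE = """<sem:triples xmlns:sem="http://marklogic.com/semantics">
  <sem:triple>
    <sem:subject>http://examples.tobyinkster.co.uk/hcard#jack</sem:subject>
    <sem:predicate>http://www.w3.org/2006/vcard/ns#category</sem:predicate>
    <sem:object xml:lang="en">Counter-Terrorist Unit</sem:object>
  </sem:triple>
  <sem:triple>
    <sem:subject>http://marklogic.com/semantics/blank/123</sem:subject>
    <sem:predicate>http://xmlns.com/foaf/0.1/knows</sem:predicate>
    <sem:object>http://examples.tobyinkster.co.uk/hcard#jack</sem:object>
  </sem:triple>
</sem:triples>"""

if __name__ == "__main__":
    for s, p, o, kind in read_triples(SAMPLE):
        print(classify_iri(s), s, p, kind, o)
```

Literal text is left untrimmed on purpose, since whitespace inside a literal may be significant; IRIs are trimmed.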

What the heck is this format?

This format is a serialization of the internal format that MarkLogic uses to represent semantics data. It's convenient for me and easy to convert into other formats. Eventually, I'll add serialization options to produce more common formats.
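As a rough illustration of how such a conversion might look, here is a minimal Python sketch that turns a sem:triples document into N-Triples lines. It is not something the step provides. The helper names (nt_resource, nt_literal, to_ntriples) are hypothetical, the datatype attribute is assumed to be unqualified, and blank-node labels are derived by stripping non-alphanumeric characters from the IRI suffix, which could in principle make two labels collide.

```python
import re
import xml.etree.ElementTree as ET

SEM = "{http://marklogic.com/semantics}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
BLANK_PREFIX = "http://marklogic.com/semantics/blank/"

# Characters that must be escaped inside an N-Triples string literal.
_ESCAPES = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def nt_resource(iri):
    """Render an IRI, mapping blank-node IRIs to _:b labels."""
    if iri.startswith(BLANK_PREFIX):
        suffix = iri[len(BLANK_PREFIX):]
        label = re.sub(r"[^A-Za-z0-9]", "", suffix) or "0"
        return "_:b" + label
    return "<" + iri + ">"


def nt_literal(text, lang=None, datatype=None):
    """Render a literal with an optional language tag or datatype."""
    escaped = "".join(_ESCAPES.get(ch, ch) for ch in text)
    if lang:
        return f'"{escaped}"@{lang}'
    if datatype:
        return f'"{escaped}"^^<{datatype}>'
    return f'"{escaped}"'


def to_ntriples(xml_text):
    """Convert one sem:triples document to a string of N-Triples lines."""
    lines = []
    for triple in ET.fromstring(xml_text).iter(SEM + "triple"):
        subject = triple.findtext(SEM + "subject", default="").strip()
        predicate = triple.findtext(SEM + "predicate", default="").strip()
        obj = triple.find(SEM + "object")
        lang = obj.get(XML_LANG)
        datatype = obj.get("datatype")  # assumption: unqualified attribute
        if lang is not None or datatype is not None:
            obj_term = nt_literal(obj.text or "", lang, datatype)
        else:
            obj_term = nt_resource((obj.text or "").strip())
        lines.append(f"{nt_resource(subject)} <{predicate}> {obj_term} .")
    return "\n".join(lines)
```

Calling to_ntriples on each document in the result sequence and concatenating the output would give a single N-Triples file.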

Implementation

This step is implemented by the xmlcalabash1-rdf module. The jar file from that project must be in the class path in order to use this step.

^[2]

Given the intended purpose of the page, I'm surprised more triples aren't found; perhaps the page is encoded in a way that the Semargl libraries don't recognize.

