Name
cx:rdfa — Extract RDF triples from RDFa encoded documents.
Synopsis
<p:declare-step type="cx:rdfa" xmlns:cx="http://xmlcalabash.com/ns/extensions">
  <p:input port="source"/>
  <p:output port="result" sequence="true"/>
  <p:option name="max-triples-per-document" select="100"/>  <!-- long -->
</p:declare-step>
Description
This step uses the Semargl project libraries to extract RDF triples from RDFa encoded documents. The results are returned in a sequence of XML documents that encode the triples directly. If there are no triples in the source document, an empty sequence of documents is produced.
If there are triples, they will be encoded in one or more sem:triples documents.
Consider this example:
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="1.0">
  <p:output port="result" sequence="true"/>
  <p:serialization port="result" indent="true"/>

  <p:declare-step type="cx:rdfa">
    <p:input port="source"/>
    <p:output port="result" sequence="true"/>
    <p:option name="max-triples-per-document" select="100"/>
  </p:declare-step>

  <cx:rdfa max-triples-per-document="100">
    <p:input port="source">
      <p:document href="http://examples.tobyinkster.co.uk/hcard"/>
    </p:input>
  </cx:rdfa>
</p:declare-step>
On 12 October 2013, using the Semargl 0.6.1 libraries, the following triples are extracted[2]:
<sem:triples xmlns:sem="http://marklogic.com/semantics">
  <sem:triple>
    <sem:subject>http://examples.tobyinkster.co.uk/hcard</sem:subject>
    <sem:predicate>http://purl.org/dc/terms/abstract</sem:predicate>
    <sem:object xml:lang="en">This page is intended to be a demonstration of
the use of RDFa (including FOAF, Dublin Core and W3C PIM vocabularies) in
conjunction with Microformats (including hCard and rel-tag).</sem:object>
  </sem:triple>
  <sem:triple>
    <sem:subject>http://examples.tobyinkster.co.uk/hcard#jack</sem:subject>
    <sem:predicate>http://www.w3.org/2006/vcard/ns#category</sem:predicate>
    <sem:object xml:lang="en">Counter-Terrorist Unit</sem:object>
  </sem:triple>
  <sem:triple>
    <sem:subject>http://examples.tobyinkster.co.uk/hcard#jack</sem:subject>
    <sem:predicate>http://xmlns.com/foaf/0.1/plan</sem:predicate>
    <sem:object xml:lang="en">I will kick your terrorist ass!</sem:object>
  </sem:triple>
</sem:triples>
The format of sem:triples files is straightforward: each file contains a set of one or more sem:triple elements. Each sem:triple in turn contains a sem:subject, a sem:predicate, and a sem:object.
The subject and predicate are always IRIs; the object is either an IRI or a literal value. The object is an IRI unless it has a datatype or xml:lang attribute, in which case it is a literal.
If any IRI begins with “http://marklogic.com/semantics/blank/”, it represents a blank node.
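The rules above can be sketched in a few lines of Python. This is only an illustration and not part of XML Calabash: the function names are made up for this example, the datatype attribute is assumed to be unqualified (the description above does not say which namespace, if any, it is in), and the second triple in the sample, including its blank node identifier, is invented to show the blank node rule.

```python
import xml.etree.ElementTree as ET

SEM = "{http://marklogic.com/semantics}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
BLANK_PREFIX = "http://marklogic.com/semantics/blank/"

# First triple is taken from the example output; the second is invented.
SAMPLE = """<sem:triples xmlns:sem="http://marklogic.com/semantics">
  <sem:triple>
    <sem:subject>http://examples.tobyinkster.co.uk/hcard#jack</sem:subject>
    <sem:predicate>http://www.w3.org/2006/vcard/ns#category</sem:predicate>
    <sem:object xml:lang="en">Counter-Terrorist Unit</sem:object>
  </sem:triple>
  <sem:triple>
    <sem:subject>http://marklogic.com/semantics/blank/123</sem:subject>
    <sem:predicate>http://xmlns.com/foaf/0.1/knows</sem:predicate>
    <sem:object>http://examples.tobyinkster.co.uk/hcard#jack</sem:object>
  </sem:triple>
</sem:triples>"""


def iri_kind(value):
    """Classify an IRI as a blank node or an ordinary IRI."""
    return "blank" if value.startswith(BLANK_PREFIX) else "iri"


def object_kind(obj):
    """An object is a literal if it carries a datatype or xml:lang attribute."""
    # Assumption: the datatype attribute is in no namespace.
    if obj.get("datatype") is not None or obj.get(XML_LANG) is not None:
        return "literal"
    return iri_kind((obj.text or "").strip())


def read_triples(xml_text):
    """Return (subject_kind, subject, predicate, object_kind, object) tuples."""
    rows = []
    for triple in ET.fromstring(xml_text).iter(SEM + "triple"):
        subject = (triple.find(SEM + "subject").text or "").strip()
        predicate = (triple.find(SEM + "predicate").text or "").strip()
        obj = triple.find(SEM + "object")
        kind = object_kind(obj)
        # Literal text is kept verbatim; IRIs are trimmed.
        value = (obj.text or "") if kind == "literal" else (obj.text or "").strip()
        rows.append((iri_kind(subject), subject, predicate, kind, value))
    return rows


if __name__ == "__main__":
    for row in read_triples(SAMPLE):
        print(row)
```

Checking the attributes before looking at the text mirrors the order of the rules: an object that carries xml:lang or datatype is treated as a literal even if its text happens to look like an IRI.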
What the heck is this format?
This format is a serialization of the internal format that MarkLogic uses to represent semantics data. It's convenient for me and easy to convert into other formats. Eventually, I'll add serialization options to produce more common formats.
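As a rough illustration of how such a conversion might look, the following Python sketch turns a sem:triples document into N-Triples lines. It is not something the step provides. The function names are hypothetical, the datatype attribute is assumed to be unqualified and to hold the datatype IRI, the blank node labels are derived in a simplified way, and only the most common string escapes are handled.

```python
import re
import xml.etree.ElementTree as ET

SEM = "{http://marklogic.com/semantics}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
BLANK_PREFIX = "http://marklogic.com/semantics/blank/"

# One triple copied from the example output above.
SAMPLE = """<sem:triples xmlns:sem="http://marklogic.com/semantics">
  <sem:triple>
    <sem:subject>http://examples.tobyinkster.co.uk/hcard#jack</sem:subject>
    <sem:predicate>http://xmlns.com/foaf/0.1/plan</sem:predicate>
    <sem:object xml:lang="en">I will kick your terrorist ass!</sem:object>
  </sem:triple>
</sem:triples>"""


def iri_term(value):
    """Write an IRI as <...>, or as a blank node label if it has the blank prefix."""
    value = (value or "").strip()
    if value.startswith(BLANK_PREFIX):
        # Simplification: keep only ASCII letters and digits in the label,
        # so two different suffixes could in principle collide.
        return "_:b" + re.sub(r"[^A-Za-z0-9]", "", value[len(BLANK_PREFIX):])
    return "<" + value + ">"


def literal_term(text, lang, datatype):
    """Write a literal with a language tag or datatype (minimal escaping only)."""
    escaped = (text.replace("\\", "\\\\").replace('"', '\\"')
                   .replace("\n", "\\n").replace("\r", "\\r"))
    if lang:
        return '"%s"@%s' % (escaped, lang)
    if datatype:
        return '"%s"^^<%s>' % (escaped, datatype)
    return '"%s"' % escaped


def to_ntriples(xml_text):
    """Convert every sem:triple in the document to one N-Triples line."""
    lines = []
    for triple in ET.fromstring(xml_text).iter(SEM + "triple"):
        subject = iri_term(triple.find(SEM + "subject").text)
        predicate = iri_term(triple.find(SEM + "predicate").text)
        obj = triple.find(SEM + "object")
        lang, datatype = obj.get(XML_LANG), obj.get("datatype")
        if lang is not None or datatype is not None:
            obj_term = literal_term(obj.text or "", lang, datatype)
        else:
            obj_term = iri_term(obj.text)
        lines.append("%s %s %s ." % (subject, predicate, obj_term))
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_ntriples(SAMPLE))
```

For real data, an RDF library would be a safer choice than hand-written escaping; the sketch only shows that the mapping from this format to a triple-per-line syntax is direct.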
Implementation
This step is implemented by the xmlcalabash1-rdf module. The jar file from that project must be in the class path in order to use this step.
[2] Given the intended purpose of the page, I'm surprised more triples aren't found; perhaps the page is encoded in a way that the Semargl libraries don't recognize.