Name

cx:rdf-load — Load RDF triples from semantic web data sources.

Synopsis

<p:declare-step type="cx:rdf-load" xmlns:cx="http://xmlcalabash.com/ns/extensions">
     <p:input port="source" sequence="true"/>
     <p:output port="result" sequence="true"/>
     <p:option name="href" required="true"/>                       <!-- anyURI -->
     <p:option name="language"/>                                   <!-- string -->
     <p:option name="graph"/>                                      <!-- string -->
     <p:option name="max-triples-per-document" select="100"/>      <!-- long -->
</p:declare-step>

Description

This step uses the Jena project libraries to extract RDF triples from semantic web data sources. The results are returned in a sequence of XML documents that encode the triples directly.
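As a rough illustration, a pipeline that calls the step might look like the sketch below. It repeats the declaration from the synopsis so that it does not depend on any particular library import. The file name data.rdf is hypothetical, and binding the source port to p:empty assumes that the triples of interest come only from href; this page does not say how the source port is used.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="1.0">
  <p:output port="result" sequence="true"/>

  <!-- Atomic step declaration, copied from the synopsis above. -->
  <p:declare-step type="cx:rdf-load">
    <p:input port="source" sequence="true"/>
    <p:output port="result" sequence="true"/>
    <p:option name="href" required="true"/>
    <p:option name="language"/>
    <p:option name="graph"/>
    <p:option name="max-triples-per-document" select="100"/>
  </p:declare-step>

  <!-- Hypothetical input file; only the required href option is set. -->
  <cx:rdf-load href="data.rdf">
    <p:input port="source">
      <p:empty/>
    </p:input>
  </cx:rdf-load>
</p:declare-step>
```

The language and graph options are left unset here because this page does not list the values they accept.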

The format of sem:triples files is straightforward: each file contains one or more sem:triple elements. Each sem:triple in turn contains a sem:subject, a sem:predicate, and a sem:object.

The subject and predicate are always IRIs; the object is either an IRI or a literal value. The object is an IRI unless it has a datatype or xml:lang attribute, in which case it is a literal.

If any IRI begins with “http://marklogic.com/semantics/blank/”, it represents a blank node.
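A small hand-written document shows how these rules play out. This is only a sketch: the example.com IRIs, the blank node identifier, and the literal values are made up, and the sem namespace binding shown is the one MarkLogic uses for semantics data, which this page does not state explicitly.

```xml
<sem:triples xmlns:sem="http://marklogic.com/semantics">
  <!-- Object is a plain IRI: no datatype or xml:lang attribute. -->
  <sem:triple>
    <sem:subject>http://example.com/people/alice</sem:subject>
    <sem:predicate>http://xmlns.com/foaf/0.1/homepage</sem:predicate>
    <sem:object>http://example.com/alice/</sem:object>
  </sem:triple>
  <!-- Object is a literal: it carries an xml:lang attribute. -->
  <sem:triple>
    <sem:subject>http://example.com/people/alice</sem:subject>
    <sem:predicate>http://xmlns.com/foaf/0.1/name</sem:predicate>
    <sem:object xml:lang="en">Alice</sem:object>
  </sem:triple>
  <!-- Object is a literal: it carries a datatype attribute. -->
  <sem:triple>
    <sem:subject>http://example.com/people/alice</sem:subject>
    <sem:predicate>http://xmlns.com/foaf/0.1/age</sem:predicate>
    <sem:object datatype="http://www.w3.org/2001/XMLSchema#integer">42</sem:object>
  </sem:triple>
  <!-- Object is a blank node: the IRI starts with the blank node prefix. -->
  <sem:triple>
    <sem:subject>http://example.com/people/alice</sem:subject>
    <sem:predicate>http://xmlns.com/foaf/0.1/knows</sem:predicate>
    <sem:object>http://marklogic.com/semantics/blank/1234</sem:object>
  </sem:triple>
</sem:triples>
```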

What the heck is this format?

This format is a serialization of the internal format that MarkLogic uses to represent semantics data. It's convenient for me and easy to convert into other formats. Eventually, I'll add serialization options to produce more common formats.

Implementation

This step is implemented by the xmlcalabash1-rdf module. The jar file from that project must be on the classpath to use this step.