Of records and RDF

This ended up a lot more rambling than I’d hoped. I can do a tl;dr later

We use and talk about records a lot in libraries; we use records to manage metadata and to create logical chunks of information, often related by being about the same thing. Examples include authorities, bibliographic records and borrower information.

Even though it’s a prevalent metaphor in database technology, the record as a base unit pre-dates the computer age and belongs to the era of the catalogue card. In common parlance, a record is a (relatively) complete written account of something, and this fits well with both the library and the database usage.

A record is assumed to include everything that is necessary to know about the thing it describes. In the library sense, we would like to think that our formats include all of the data relevant to the use cases we have for it (FRBR’s FISO tasks); in the closed world of the database, the record includes everything that is known.

Given an open-world assumption, the record is a construct that we will struggle with; the record can be one of two things: the set of all known things or the set of known things that appear to be true according to certain criteria.

In the latter case, the criteria used to derive the set can be many things, but they might typically be said to be things that can be logically inferred, using the capabilities of RDF or external constraints. Using RDF constraints means accepting that the set of statements might not be complete; however, one might, in a non-linked-data reality, assume a closed world for the sake of the constraints. I’m quite sure that this is not an aim of most people working with RDF in libraries, as the values embodied in linked data appear more important than the use of RDF technology per se.

It might appeal to view the set of known things as one limited to “our things”. There is a clear risk that this constructs a closed world in which we view only certain sources as true; there is an even clearer risk that, because there is a single source of valid information (“us”), we create a schema simply by creating uniform representations of data: the processing algorithm or input forms create a uniform shape for the data.

It seems that the record is very hard to kill.

One of the things that has become apparent to me over time is that named graphs have something of the record about them: we create logical groups of assertions about something by putting them in the same named graph; we can then make statements about, or query, the named graph and know that this in some way restricts the scope of our query and description to a particular set of assertions. At the same time, this approach is informed by the open-world assumption, and we therefore avoid an intrinsic schema-ization of our data.
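This record-like scoping can be sketched in plain Python — not real RDF tooling, and the graph names and predicates (`dct:title` etc.) are purely illustrative. A store maps each named graph to a set of (subject, predicate, object) triples, and queries are restricted to one graph:

```python
# Minimal sketch: named graphs as record-like groupings of assertions.
# Names and predicates are illustrative, not a real vocabulary.

store = {}

def assert_triple(graph_name, s, p, o):
    """Add one assertion to a named graph, creating the graph if needed."""
    store.setdefault(graph_name, set()).add((s, p, o))

def triples_in(graph_name, s=None, p=None, o=None):
    """Query scoped to one named graph: only its assertions are visible."""
    for triple in store.get(graph_name, set()):
        if all(want is None or want == got
               for want, got in zip((s, p, o), triple)):
            yield triple

# Two assertions about the same book, grouped in one named graph ...
assert_triple("graph:book/1", "book/1", "dct:title", "Sult")
assert_triple("graph:book/1", "book/1", "dct:creator", "person/hamsun")
# ... and an unrelated assertion in another graph.
assert_triple("graph:book/2", "book/2", "dct:title", "Pan")

# Scoping to graph:book/1 restricts the description, record-like,
# to that particular set of assertions.
print(sorted(triples_in("graph:book/1")))
```

The point of the sketch is that the grouping carries no schema: a graph holds whatever assertions were put there, nothing more is implied by its shape.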

I note that a lot of approaches (cf. BIBFRAME) to getting library data into linked data are clearly schema-oriented, providing methods of (re-)structuring records according to standardized templates (called different things, but seemingly designed for portable, bijective mapping between systems). I’m sure there are arguments that the open world is maintained, because having a minimal set of data terms and shapes that are defined as canonical does not exclude the inclusion of other things as well, but that, I think, is missing the point: the defined set of possibilities sloughs off everything else, the data has limited scope for assertion types, and we have basically defined records in a specific syntax.

In broaching this problem, I’m currently at a place where I view the assertion as the basic unit of description, and the assertions in our local knowledge base form the extent of our knowledge. Making this work in an open way is difficult, but it pans out in a way that I think is interesting.

In terms of cataloguing (our current issue), I’m aware that there is a desire to catalogue by record: an entry form with lots of fields that forms the basic template for the description of items. Hit the button, send all the assertions we have made, creating a full record and validating it according to our standards on the way. Viewing the assertion as the basic unit entails that we need to treat each assertion atomically; a confusing prospect in a world governed by concepts like rules for complete cataloguing, and difficult to reconcile with a world where systems expect data to have an explicit, known structure.

It’s especially difficult in cases where a single action in a form results in more or fewer than one assertion.
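A hypothetical sketch of that mismatch — field names and predicates are made up for illustration. One form field may yield exactly one triple, a structured field may yield several, and an empty field yields none at all:

```python
# Sketch: mapping one form submission to a variable number of assertions.
# Predicates and field names are illustrative, not a real schema.

def triples_from_form(subject, form):
    """Translate one submitted form into zero or more (s, p, o) assertions."""
    triples = []
    title = form.get("title")
    if title:  # one field, one assertion
        triples.append((subject, "dct:title", title))
    creator = form.get("creator")
    if creator:  # one field, two assertions: an intermediate creator node
        node = subject + "#creator"
        triples.append((subject, "dct:creator", node))
        triples.append((node, "foaf:name", creator))
    return triples

print(triples_from_form("book/1", {"title": "Sult", "creator": "Knut Hamsun"}))
# An empty form yields no assertions at all:
print(triples_from_form("book/1", {}))
```

Treating the form as a translator into individual assertions, rather than as the record itself, is one way to keep the assertion atomic while still giving cataloguers a familiar interface.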

The technical difficulties are, however, surmountable; what seems less surmountable is the challenge for cataloguing: if I submit a single triple, is the thing I’m describing catalogued? It’s very easy to fall into this kind of handwringing anguish; overcoming the problem should be an exercise in pragmatism.

In order for a thing that has been described to work with the systems we’re developing, it would be good if certain things were present (creator, title, date of creation, etc.), but if they aren’t, we accept that and give what we have. We don’t assume things; we simply give what we have. (In this way, the absence of an author means simply that there is no data known to us about the author, which differs from an explicit assertion that there is no author.)
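The distinction between absence and explicit negation can be made concrete in a small sketch — the `ex:hasNoCreator` predicate is invented here for illustration; it is not a real vocabulary term:

```python
# Sketch of the open-world distinction drawn above: no creator triple means
# "nothing known to us", while an explicit (made-up) negative assertion says
# "this work has no creator". A display layer should render them differently.

def describe_creator(triples, subject):
    creators = [o for s, p, o in triples
                if s == subject and p == "dct:creator"]
    if creators:
        return ", ".join(creators)
    # A deliberate negative assertion, not mere absence:
    if (subject, "ex:hasNoCreator", "true") in triples:
        return "no creator"
    return "creator unknown to us"  # open world: we simply don't know

data = [("book/1", "dct:title", "Sult")]
print(describe_creator(data, "book/1"))   # absence: unknown, not "none"
data.append(("book/1", "ex:hasNoCreator", "true"))
print(describe_creator(data, "book/1"))   # explicit negative assertion
```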

The burn comes when one considers not the viewing of the data, but the input. Conceptually, we simply allow parts to be added, but make suggestions about what should be added. This weak formulation agrees, I think, with the experience of cataloguers: a rule is subject to many exceptions.

Similarly, I fully expect validation to be atomic; each assertion should be tested as it is added, but the integrity of all the data about a thing in our knowledge base as a whole should be subject only to the open-world assumption (i.e. not validated).
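A sketch of what atomic validation might look like — the per-predicate rules and predicates here are illustrative, not a real schema. Each assertion is accepted or rejected on its own terms as it arrives; there is deliberately no check that the description as a whole is "complete":

```python
# Sketch: atomic, per-assertion validation. Each triple is checked against
# a rule for its predicate (if any) when added; the description as a whole
# is never validated. Rules and predicates are made up for illustration.
import re

RULES = {
    "dct:title":   lambda o: isinstance(o, str) and o.strip() != "",
    "dct:created": lambda o: bool(re.fullmatch(r"\d{4}", o)),  # a plain year
}

accepted = []

def add_assertion(s, p, o):
    """Accept or reject one triple on its own; never validate the whole."""
    rule = RULES.get(p)
    if rule is not None and not rule(o):
        return False          # this assertion fails on its own terms
    accepted.append((s, p, o))
    return True               # unknown predicates pass: open world

print(add_assertion("book/1", "dct:title", "Sult"))       # accepted
print(add_assertion("book/1", "dct:created", "last year"))  # rejected
print(add_assertion("book/1", "ex:anything", "goes"))     # accepted
```

Note that unknown predicates are let through rather than rejected; under the open-world assumption, the absence of a rule cannot mean the assertion is wrong.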

Why do I think this will work? Because I know cataloguers are more likely to go the extra mile than to shirk responsibility; I also know of no way of formulating rules that will work for everything, and we need not even try, because we can create systems that can eat what they get.

2 comments on “Of records and RDF”
  1. If I don’t misread you, then I don’t think your proposed atomic per-triple validation will work very well.

    First, if you only take a single triple into account, it requires validation rules to be implicit. For example, they need to be encoded in code instead of schema-level constraints or RDF data shapes. I think it’s better to describe validation rules as data. This way, the rules are explicitly described and can be included in the validation (e.g., your 1 triple + a named graph describing validation rules).

    Second, I believe the most interesting data errors arise when you combine multiple assertions. When you narrow the validation scope to a single assertion, it won’t detect these errors. Sure, if you adopt OWA, then you limit errors caused by multiple assertions to contradictions, which might not be very useful. In my experience, it’s usually better to adopt CWA and unique name assumption for data inside of your application and use OWA when combining your data with external data.

    If you have opinions about RDF validation, you can voice them at https://lists.w3.org/Archives/Public/public-rdf-shapes/.

  2. […] while back, I wrote about records and linked data based on our experience at Deichmanske bibliotek; I presented something of this work at SWIB15. […]
