Serializations wandering into my data model

Your serialization isn’t my data model; in fact, it has nothing to do with it.

We learned this with MARC, right? Using the serialization format we used for data exchange as a data model worked out quite badly in the end, making our systems difficult to work with and prone to “interpretation of the standard”, which in turn made the entire bibliographic data ecosystem a sad place to be.

You might claim that this isn’t the case in system X, but it pretty much is. It’s difficult to quantify the difference between what is done to support cataloguing rules and what is done in the name of using MARC as a data model, because the two are now thoroughly intertwined.

Enter new ways of thinking. Or rather, enter moving our data from one standard for data exchange to another.

Quite unfazed by the issues of using MARC as a data model, we now have several competing standards that aim largely at backwards compatibility with the tradition of using a data exchange format as a data model. Even when we’ve thrown out the existing standards, the new product aims to support the same data workflows as the old ones, which inevitably leads to the same issues as the ill-thought-through idea of using a data exchange format as a data model. Sure, you’re building a library system, and sure, it needs to model a way of working; but that way of working developed because we had MARC at the core of every process.

There’s a movement towards JSON in all of IT, as if it were a panacea for all the issues we have. I’m skeptical; we seem to be rediscovering things we knew before. There’s convenience, and then there’s experience not gained.

JSON…is an open standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs. It is used primarily to transmit data between a server and web application, as an alternative to XML.

https://en.wikipedia.org/wiki/JSON

Aside from the “human readable” part (which is very much moot from my point of view), this sounds familiar, if generic, to anyone who knows ISO 2709. But let’s not dwell on the fact that we’re talking about data interchange, not a data model.

Looking at the format itself, there are a number of issues, mostly boiling down to the fact that JSON is a string. Sure, there are a few basic datatypes that can notionally be represented, but there is no support for even the most basic of Web-oriented things (say, URLs). Support for these things lives in the supposed contract of the API you’re using (i.e. nowhere), or in non-mainstream subtypes of JSON that merit no discussion because they’re simply not in use. Because JSON is JSON, right? At least, it is going to get parsed by a standard library that ignores whatever the weird specification “adds” to JSON.
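To make this concrete, here is a minimal sketch in Python (the keys and values are invented for illustration): once a standard parser has done its work, nothing in the data distinguishes an identifier from any other string.

```python
import json

# JSON has no datatype for URLs: after parsing, an identifier and a
# label are indistinguishable -- both come back as plain strings.
doc = json.loads('{"id": "https://example.org/thing/1", "label": "a thing"}')

print(type(doc["id"]))     # <class 'str'>
print(type(doc["label"]))  # <class 'str'>
# Whether "id" is meant as an IRI lives in the API contract, not in the data.
```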

It gets no better with BSON, JSON’s binary offshoot:

BSON…is a computer data interchange format used mainly as a data storage and network transfer format in the MongoDB database. It is a binary form for representing simple data structures and associative arrays (called objects or documents in MongoDB). The name “BSON” is based on the term JSON and stands for “Binary JSON”.

https://en.wikipedia.org/wiki/BSON

I can’t imagine a situation where I’d use either of these formats for storing my data: they can only indirectly represent the things I’m likely to want to do now, and I suspect I’d regret the choice very quickly later on. People I like and respect claim some pragmatism in the choice of these formats, but I don’t see it. There’s a reason for this that I’ll return to in a moment.

There’s no problem providing good APIs for data in JSON; in fact, I think that’s essential. But JSON and its ilk have nothing to do with your core model.

There are numerous database-like technologies that use JSON at their core; common to all of these is a focus on the kinds of functionality we desire in indexes (quick, scalable searching among documents), where we’re not doing database things like joins and lookups. These technologies excel in this domain, but they’re weak on things that traditionally aren’t important in indexing. I’m sure that there are ways of overcoming these shortcomings, but one thing remains: the document.
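As a sketch of that weakness, here is roughly what a “join” becomes against a document store: two collections stitched together in application code. The collections and fields below are invented for illustration.

```python
# Two "collections" of documents, as a document store would hold them.
books = [
    {"_id": "b1", "title": "Example Title", "author_id": "a1"},
]
authors = [
    {"_id": "a1", "name": "An Author"},
]

# The join the database won't do for us, done by hand in the application.
authors_by_id = {a["_id"]: a for a in authors}
for book in books:
    author = authors_by_id.get(book["author_id"], {})
    print(book["title"], "by", author.get("name"))
```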

I’m not a fan of data documents because I have had the misfortune of a career working with them. MARC data documents have limited how we express things for the longest time. In fact, moving from the relational system of index cards to simple records that baked five-in-one cards into one large document seems like a big mistake; but that is the benefit of hindsight, right? The boon of being able to search that stuff on a machine seems like a pragmatic win, but perhaps we can learn from the mistake.

For me, there’s pragmatism in accepting that data is something amorphous in memory until it needs to be serialized, say when the data is persisted to storage or presented for transfer to third parties. The pragmatism here is not spending time designing a data model, not doing clever things with your domain model; rather, doing what needs to be done to get the job done. Getting the job done requires some data tied closely to what you’re actually trying to achieve, and it should be refactorable as the use cases change (and they will).

This is in clear opposition to the majority of approaches to data: the massive domain models and ontologies that result in unusable data and unpragmatic choices. As a case in point, when does a core data model ever need backwards compatibility with a previous format? Perhaps an exchange format needs this, but keep that dirt outside your data model.

If you’re working with RDF, there’s a very simple route to designing data models: don’t. Build the data model as, when and how you need it. There is no requirement of consistency, and there need not be, unless a use case dictates it (it might, but your initial assumptions shouldn’t include it). A caveat here: some data does indeed have a basic, consistent structure that is inherent to the data; this tends to be simple data (for example, numeric series) that is more at home in traditional database technologies.
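In practice that can be as simple as the following sketch with rdflib (the IRIs and properties here are invented for illustration): assert what you know, and add to it when a new use case turns up.

```python
from rdflib import Graph, Literal, URIRef

g = Graph()
thing = URIRef("https://example.org/thing/1")

# Assert only what we know today.
g.add((thing, URIRef("https://example.org/p/label"), Literal("a thing")))

# A new use case later needs another property? Just add it: no schema
# migration, no document structure to rework.
g.add((thing, URIRef("https://example.org/p/seeAlso"),
       URIRef("https://example.org/thing/2")))

print(g.serialize(format="turtle"))
```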

Not having a design means the core model can be adapted to use quickly and simply; imagine a mass of shapeless data that does the things you need it to do. Now imagine forcing that into a data-document corset. Actually, this is exactly what you do when you serialize your data for your APIs, and here use cases again form the core: why do we repurpose the data for the API? (You’re not just spewing all the data out there, right?)
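That corset moment might look like the following sketch, continuing with rdflib and invented names: the use case decides which slice of the shapeless graph becomes a document, and the rest stays out of the serialization.

```python
import json
from rdflib import Graph, Literal, URIRef

g = Graph()
thing = URIRef("https://example.org/thing/1")
g.add((thing, URIRef("https://example.org/p/label"), Literal("a thing")))
g.add((thing, URIRef("https://example.org/p/seeAlso"),
       URIRef("https://example.org/thing/2")))

def api_view(graph, subject):
    # Only what this use case needs goes into the document; note that
    # the IRI is demoted to a plain string on the way out.
    label = graph.value(subject, URIRef("https://example.org/p/label"))
    return {"id": str(subject), "label": str(label)}

print(json.dumps(api_view(g, thing)))
```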

The more I work with JSON, the more I regret that we are forced to use it as a technology. It forces views of our data that don’t belong there. It creates a way of thinking that isn’t reconcilable with the amorphous view of data where the edges of the data remain undefined. The document is all. A sort of PowerPoint for data, if you will.

No serialization is a good representation, because we believe the serialization’s truth: a document-oriented truth. Some serializations do a good job of making it explicit that there is something beyond the document, that the document is a pinhole view of a bigger whole. These are formats that show us the transitions between documents (yes, documents). Turtle is a case in point; it’s clear that an IRI is an IRI, and that forms a transition. In JSON, it’s hard to tell a string from an IRI, because they are the same.
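A small sketch of the difference, again with rdflib and invented data: a Turtle parser hands back typed terms, while a JSON parser hands back strings.

```python
import json
from rdflib import Graph

# In Turtle the syntax itself marks the transition: <...> is an IRI,
# "..." is a literal.
turtle = """
@prefix ex: <https://example.org/p/> .
<https://example.org/thing/1> ex:seeAlso <https://example.org/thing/2> ;
    ex:label "a thing" .
"""
g = Graph()
g.parse(data=turtle, format="turtle")
for _, _, o in g:
    print(type(o).__name__, o)   # URIRef vs Literal: the parser knows

# The same data in JSON parses to indistinguishable strings.
doc = json.loads('{"seeAlso": "https://example.org/thing/2", "label": "a thing"}')
for v in doc.values():
    print(type(v).__name__, v)   # str, str
```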

Developing exchange formats for patching our data, which is inherently built of transitions represented by IRIs, was a task I felt had to be done in JSON, because any other choice would mean custom parsers and costs; pragmatism is king, but JSON stops at the level of exchange. The design is unpretty, and it is easier to understand in yet other serializations, but not for the machines.
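For flavour only, a hypothetical shape such a patch message might take (this is not the actual format alluded to above): the IRIs are unavoidably plain strings, so the contract, not the serialization, has to say which strings are transitions.

```python
import json

# A hypothetical triple-patch message; every IRI is just a string here.
patch = {
    "op": "add",
    "s": "https://example.org/thing/1",
    "p": "https://example.org/p/label",
    "o": {"value": "a thing", "type": "literal"},
}
print(json.dumps(patch, indent=2))
```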

Don’t let my ranting stop you though; revisit that vomit.
