There’s no reason for a librarian to understand RDF

Given a properly functioning workflow, there’s no reason for a librarian to understand RDF.

In the same way, given a properly functioning workflow, there’s no reason for a librarian to understand MARC — or any other data serialisation.

In cases where we don’t have a functioning workflow, it is absolutely essential to understand these things. In the case of MARC, the serialisation has been so embedded in/as the workflow — along with other annotations that are typically done from memory — that I’d dare to say that the serialisation is the workflow†.

You might argue that any expert system has this kind of oddity, but you’d be wrong — bad system design introduces this kind of oddity. Good system design abstracts these things so that they are applied uniformly by all expert operators. You might think that I’m saying that such interfaces are uniformly awful — and harmful to data quality — and you’d be right.

In the case of cataloguing interfaces, too much focus is placed on the expert nature of understanding annotation details, and too little on knowing what a thing is and what is important for a user to know about that thing in order to do what they want to do.

A linked-data-based system should have a workflow that is abstracted from the technology; there is no reason why, for example, you can’t have basically a MARC-alike interface with linked data under the hood. I don’t say that this is a good idea, but it isn’t a problem (as long as it is assumed that you only need to represent what you represent in MARC).

The current system we’re working on is a radical departure from traditional systems, but for all the radicalness of the linked data (not very radical in terms of systems generally), the really radical part is the data-entry interface (again, not very radical in terms of systems generally, but a radical departure from most of the other library systems we’ve looked at).

This isn’t something we could have come up with without a lot of help from an interaction designer, and I’m beginning to understand why: we were blinded by tradition. Field-by-field entry is a common methodology in metadata (cf. every metadata interface in every image-editing package).

Further, the belief that the fields that form an RDF description are a record is also problematic. I’d say that it has become clear to us that the record is very much a workflow concept; the analogue to this in the data model is the RDF description, but the two aren’t really related. I’ll get back to this in the next post.

So, given an actual, functional workflow, the motivation for understanding RDF is akin to the motivation for understanding the logical model/physical model of your LMS — specialist knowledge for those librarians doing database tinkering. And it really should be this way.

But, what about data I/O? Well, if that isn’t part of the workflow in a linked-data-based system, I’m going to have to say that that isn’t a linked-data-based system.

†I really didn’t realise that people out there use external MARC-editing tools as their workflow; editing records outside the system workflow wasn’t a thing in Norway…until Alma came along. But even so, in the workflows of all of the systems I have been exposed to, understanding MARC is still a thing (kudos here to Koha, where the explicit in-situ documentation of MARC is really good), even when it doesn’t need to be (looking at you BIBSYS Bibliotekssystem, where meaningful field names were eschewed in favour of MARC field codes).

 

 


Creating functional linked-data solutions

I was talking with a new-to-linked-data colleague who’d been asked by another colleague on a different project about how we dealt with the performance problems when using RDF. He said he’d never experienced any.

There are a few reasons for this — all of them deliberate choices. I have noted a few of them below.

Dereference linked data

It should be obvious, but dereferencing linked data in the application is the only way to do things. Why get handy with SPARQL when you already have a functional REST-API? If you’re not dereferencing, consider whether you need a document store rather than an RDF store.
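
To make the point concrete, here is a minimal sketch of what dereferencing looks like from the application side, assuming a hypothetical resource URI and a server that does content negotiation; the URI and media type are illustrative, not taken from any real system.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class Dereference {
    public static void main(String[] args) throws Exception {
        // Hypothetical HTTP-URI identifying a single resource.
        URI resource = URI.create("https://example.org/work/123");

        HttpRequest request = HttpRequest.newBuilder(resource)
                // Content negotiation: ask for an RDF serialisation.
                .header("Accept", "text/turtle")
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The application gets the description of this one resource;
        // no SPARQL endpoint is involved.
        System.out.println(response.body());
    }
}
```

The point is simply that a plain HTTP GET against the resource’s own URI replaces the ad-hoc SPARQL query you would otherwise have written.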

Use an index to search discrete values

Use the right technology. Searching requires a search index — irrespective of technology (conversely, storage doesn’t require a search index, but that is another rant). Indexing RDF has never been easier; even if you want to stay platform independent, there are many good choices and patterns.
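
As a sketch of the pattern — not a recommendation of any particular product — this is roughly what feeding a search index looks like: flatten the description of one resource into a plain JSON document and push it over HTTP. The Elasticsearch endpoint, index name and fields below are assumptions made up for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class IndexResource {
    public static void main(String[] args) throws Exception {
        // A flat JSON document derived from the RDF description of one
        // resource; the fields and values are made up for illustration.
        String doc = "{"
                + "\"uri\": \"https://example.org/work/123\","
                + "\"title\": \"Sult\","
                + "\"creator\": \"Knut Hamsun\""
                + "}";

        // Hypothetical local Elasticsearch instance; any search index
        // with an HTTP API follows the same pattern.
        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("http://localhost:9200/works/_doc/123"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(doc))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}
```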

In the absence of portable ways of creating a CBD, use SPARQL CONSTRUCT

Concise bounded descriptions (CBDs) are a great way of making sure that all of the data that needs to be delivered together over your REST-API is delivered. Since there’s no platform-independent way of doing this, use SPARQL CONSTRUCT to mimic the functionality in your REST-API. Doing this will also mean that you’re less likely to want to do silly things with SPARQL later.
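
The post doesn’t give the query it uses, but a minimal SPARQL CONSTRUCT along these lines (with a made-up resource URI) approximates a CBD: the resource’s own statements plus the statements about any blank nodes it points to, one level deep.

```sparql
# Approximate CBD for one (hypothetical) resource: its own statements plus
# statements about any blank nodes it points to, one level deep.
CONSTRUCT {
  ?s ?p ?o .
  ?o ?p2 ?o2 .
}
WHERE {
  VALUES ?s { <https://example.org/work/123> }   # hypothetical URI
  ?s ?p ?o .
  OPTIONAL {
    ?o ?p2 ?o2 .
    FILTER ( isBlank(?o) )
  }
}
```

Models that nest blank nodes more deeply need the pattern repeated another level — which is itself a hint about the model.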

Model data as you go

An eternal hindrance to RDF take-up is the tendency of hard-thinkers to make a mess of things by creating a bad, disconnected conceptual model that then goes directly into production as the physical model. Model things minimally and as they are needed; expect to refactor the model.

Look for obvious code smells

Overambitious queries murder performance irrespective of technology. They are also a code smell. If you need to create the kind of query that takes n seconds to run, then you need to look at a) your model and b) your architecture. Sometimes you can fix things simply by creating addressable objects that are what you want; other times you simply need tabular data and RDF isn’t going to cut it. And there is no shame in that.

Wrapping up

There are a lot of other things that I could say, but I think these simple principles keep the likelihood of snappy performance and functional solutions very high. Graph databases are very good at certain things, and knowing what technology to deploy where is the major part of an architect’s job. Just because everything can be done in RDF doesn’t mean it should be.


It’s hard finding a job in library technology

It’s pretty safe to say that libraries don’t do technology. Sure, some libraries do technology, but those that do technology in a structured, sustainable way are few and far between — the rest resort to bodging and temporary measures to be replaced at some point in an unknown, but presumably not-too-distant future. Libraries’ resources are tied up in other things and technology simply isn’t part of the agenda — the core values of providing information and service are largely covered by acquisition of technology. Unless you want to work with acquisition and strategic planning around technology, as a library technologist you’re a bit stuck. And if you have radical ideas, you’re going to want to acquire equally radical solutions…and that’s an issue.

Service centres sometimes do technology; other times they just do implementation — or even simple consortial acquisition. The strategic plans of service centres have to be realistic in the eyes of the executive and are formulated around extremely conservative safe bets. Even in cases where development is radical, the outcome is generally oddly dependent on conservative choices — keeping certain parts of the architecture locked to existing workflows and thereby inadvertently extending the lifespan of inadequate systems. This kind of conservative approach may be necessary for non-technological reasons, but consequently service centres are also blighted by far slower, procedural progress than most radical library technologists want to see.

Library systems vendors do technology. You could probably argue that library vendors and service centres are a good place to look, but from where the majority of people like myself are standing, the interesting place to be at a library systems vendor is in product R&D. It doesn’t seem to be the case that work in R&D is available to non-company people. I’m going to make a very brave stand here and say that this is a problem. The echo chambers need new blood. Sometimes, maybe, the business plan does too — but I suspect the compelling argument for money people in either respect is lacking because, well, libraries are still interested in the product vendors have been selling for the last decade. I understand this. But it makes one less place for the radical ideas you want to promote.

It’s obvious that these places don’t need staff with too-radical ideas; that’s fair enough, since they have a job to do. But what to do? I know of two solutions: become a consultant (like myself) or start a library services company.

Magnus Enger did the latter with Libriotech, providing services that are a radical departure from anything previously offered in Scandinavia. His radical approach shouldn’t be ignored because, as a library technologist, he’s doing something very right. What’s very radical about Magnus? He up-ended the status quo and created a livelihood for himself, at the same time providing the kind of very ethical support for libraries that I really admire.

Why don’t we all start similar enterprises? I have a few ideas for services that I think would be popular in the academic library domain — but finding a model for providing these in the total absence of business acumen is the problem. I guess systems vendors’ positions are safe — for now.


The self-taught coder

Recently, I have been reflecting on the fact that I’m a self-taught coder†, and not only that, I also worked alone in a cupboard for many years. I realised that I have a tip for others in the same boat: there’s a broader horizon.

It’s great being a solitary coder: there’s safety in not being part of other people’s things; safety in knowing that the critique your code receives isn’t unjust; safety in knowing that you-and-you-alone in your team-of-one know the best way of coding.

And hell, running code is good code. Mostly.

But that’s code, and there’s so much more to software development than code: the development process, creating maintainable, useful software without unforeseen errors and doing this in accordance with what was requested.

Of course, no-one willingly writes code that doesn’t do something useful — unless they’re trying to prove a point. But there are many ways to achieve running code. There are bloody pragmatists who smash their way through the code, creating unmaintainable spaghetti on the way. There are code perfectionists who never finish a project, getting lost at once in academic exercises.

I have fallen into both of the categories I outlined above, and without pretending that I have any great insight, I’m pretty sure that I’m in a new category right now and will be wiser and in another category not long from now.

The reason I’m thinking about this is that I was talking to some folks at a seminar the other day, and I realised in the course of our discussion how differently I think and work these days. My coding is better, but it’s not better because I’ve spent more time coding alone; it’s better because I’ve been coding with other people, being exposed to different ideas and understandings — being forced to question my own understandings and ideas and to argue for or abandon them as necessary.

There’s a lot to be said for being open to criticism (no, really!) and accepting it; there’s a lot to be said for encountering other opinions and escaping the tyranny of your own. It’s no secret that arrogance is the currency of the lone wolf — and crappy software the price we pay.

At the same time, I’m now pretty sure that there are no hard problems in software beyond people, and adding more people to a situation can make for exponentially more problems. Maybe “Personality is great, but it might be better if you didn’t have one” is a thing after all! Never to have experienced working closely with people who have profoundly different personalities and ways of doing things leaves you unable to cope with the at-odds stuff people do and say — and consequently you get out of practice arguing coherently for your own ideas.

Sometimes teams just work. I wish I had a pattern for the kind of personality you need to work well in a team and produce good code, but I’m afraid it isn’t one personality; it’s rather the combination of personalities in the team. And maybe having worked in different teams has made it easier to be one of the necessary personalities in a team. Yes, you can be different in different roles.

I know great programmers who have shown themselves to be awful in teams of equally strong programmers, and replacing strong players with on-paper weaker players with less forceful personalities has proven highly successful. There’s more success when people don’t trample people with different ideas; when a stronger personality doesn’t trump better ideas.

It turns out having feelers out for the situation — a difficult task for the programmer who boasts of their anti-social behaviours — makes you maybe not a better programmer, but certainly a more useful one. This isn’t a thing you can teach yourself in a cupboard on your own, which is why staying in the cupboard dooms you to never leaving it.

What brought on this hand-wringing anguish and introspection? A question on Quora, something akin to “At what point in your career do you know you’re a good programmer?” I suspect I will never be a good programmer, but I want to be one — which is maybe an attitude I want to see in those around me. I know some really excellent programmers, but I also know more useful team members. I rather hope that I’m beginning to fall into the latter category.

† A slight modification here: I have attended many courses in computer science and worked with many intelligent, willing teachers.


Making a case for standard identifiers

I’m talking in Trondheim on Monday about the necessity of stable identifiers provided by the Norwegian national library for modern cataloguing. This sounds like a no-brainer because surely everyone understands that this is a good idea. Well, yes, but also no.

The national library has long been involved in rolling out national infrastructures and they’ve made great progress in many areas. One of the areas that they’ve been looking at recently is national authorities for agents and works.

It’s obvious that creating a usable dataset for either of these things is of great use to anyone cataloguing, and providing a centralised resource also helps organise data across the domain. It’s essential to some of the other work that the national library is doing, like providing a national search interface for end users.

On the other hand, providing the framework that the data is delivered in is a slightly more complex problem. There are numerous standards and APIs to be considered; of the existing options, the URN:NBN seems to be widely implemented for document identification and it also seems to be something that has a lot of traction in the national library sector.

While the URN:NBN acts as an identifier, it can also be resolved. This sounds great, but there’s a rather big catch: the URN scheme is not directly dereferenceable, and maintaining an (inter-)national infrastructure for this is hard in both conception and implementation. It’s also a mistake, because a widely implemented, corresponding, parallel infrastructure already exists that provides dereferenceable URIs: the Web.

It’s here that the linked data concept comes into play: using HTTP-URIs as identifiers and providing a method of dereferencing these directly. The architecture is simple and already available. The job that remains is to convince national libraries to use linked data as the permanent solution to bibliographic data identification and distribution, as opposed to less mainstream, but certainly more library-centric, solutions.

The biggest issue in introducing such an infrastructure is that the data is largely consumed by systems that do not understand this method of delivery. Additionally, the systems are designed and maintained by people who do not understand — or, worse, do not believe in — distributed systems of this kind. Adding in this kind of functionality to existing systems is highly problematic.

What needs to be done, then, is not simply to provide the service and hope that everyone is happy; there needs to be some direction in how library systems are developed. A key ingredient is how users understand the distribution and storage of centralised data: the mandate is not to download, store and re-use, but simply to re-use in situ.

In cases where direct in situ use is not possible, local caching with an invalidation check is necessary; this need not be more complex than header retrieval (ETag) combined with a graph property. So the technology for this is actually in place already, and where it isn’t, there is a clear plan for implementation.
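
As a sketch of how lightweight this check can be — with a hypothetical authority URI and ETag value, not any real national library endpoint — a conditional GET is all the caching client needs:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CachedAuthority {
    public static void main(String[] args) throws Exception {
        // Hypothetical authority URI and a previously cached ETag value.
        URI authority = URI.create("https://example.org/authority/person/123");
        String cachedEtag = "\"6d82cbb0\"";

        HttpRequest request = HttpRequest.newBuilder(authority)
                .header("Accept", "text/turtle")
                // Ask the server whether our cached copy is still current.
                .header("If-None-Match", cachedEtag)
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        if (response.statusCode() == 304) {
            // Not Modified: keep using the locally cached graph.
            System.out.println("cache still valid");
        } else {
            // Otherwise refresh the cached copy with the new description.
            System.out.println("refresh cache: " + response.body());
        }
    }
}
```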

In sum, there is simply no reason to not use linked data directly in this application.

 


So I created workable JSON-LD

TL;DR: I made something work that I have previously said is a bit wonky. I was wrong. Comprehensive round-tripping of linked data via HTML is possible.

 

One of the big bugbears I have had is that there’s no real way to do proper round-tripping of linked data in HTML; there have been a few attempts (we remember Fresnel) at doing things that make it possible, but it hasn’t really happened†.

Then there have been attempts at making RDF more Javascript-friendly by representing it as JSON (as if XML wasn’t parseable in Javascript). There’s an obvious point to making things accessible to Javascript; web pages use it. The problem is that serialising RDF as JSON means that you get the worst of all possible worlds: no typing, and verbose representation of triples. Javascript doesn’t have a type for URI (you have to make your own — or rather “pretend”).

And in comes JSON-LD, the JSON format for linked data. I can’t deny it, the silly way it was presented — those words irritated me — made me think “nonsense”. We already had RDF/JSON and it worked. Well, it worked for the few who actually understand RDF. But, it didn’t work well for people un-used to RDF and it only had support in the application as long as you put it there.

I’m using the past tense because RDF/JSON’s W3C page has a splash that reads “Use JSON-LD”. So I did. And it has been utterly miserable. I couldn’t give two figs about serialisation beyond the fact that I can get data from here to there and then use it. JSON, as I have said, is a good fit with Javascript. I maintain, however, that serialisation is otherwise irrelevant.

When using JSON-LD, I have had little control over serialisation — I’m used to RDF just working, but producing “pretty” JSON-LD (i.e. the kind of thing that doesn’t make your ears bleed when you’re trying to parse a document) seemed nigh-on impossible. I have always known what the problem was: my not understanding framing, nor having tools that worked with it.

With a few small hints from Markus Lanthaler (who does more good for JSON-LD-take-up than any other person involved), I finally got it to work. JSON-LD Java has good support for framing and after a bit of tweaking in the JSON-LD playground, my data was ready to fly.

This all sounds like delight and mirth (as a bad Norwegian translator might have it), but let’s analyse what I was doing: putting linked resources into arrays. I also wanted a heavy context object that got rid of any silly namespace nonsense (and aliased the JSON-LD “@” artefacts). I also wanted to suppress “@id” inside the object as well as banish “@graph”. Actually, what I wanted was simple, recognisable JSON. And that is what I now have with the help of JSON-LD, framing and Gson (I have to chew the JSON-LD a bit because framing doesn’t support suppression of objects).
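
The post doesn’t reproduce the actual frame or context, so purely as an illustration, a frame along these lines (against a made-up vocabulary and the hypothetical terms Work and contributor) is the kind of thing that gets you there: the @context aliases away namespaces and the “@” artefacts, and framing by type embeds referenced resources as nested objects or arrays rather than leaving them as bare identifiers.

```json
{
  "@context": {
    "@vocab": "http://example.org/ontology#",
    "id": "@id",
    "contributors": {
      "@id": "http://example.org/ontology#contributor",
      "@container": "@set"
    }
  },
  "@type": "Work"
}
```

In jsonld-java this is a single call to JsonLdProcessor.frame(input, frame, options); the remaining clean-up (dropping “@graph” and the unwanted “@id”s) is the chewing with Gson mentioned above.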

What this has bought is control. JSON objects can now be produced that can be consumed by simple JSON-oriented services (like indexes or parsers). It has become apparent to me that lightweight, focussed data structures are necessary outside the comfort of in-memory objects. Sometimes what is simple and lightweight has surprised me (like finding that the simple, generic format that allows multi-linguality to slip through the parser unchallenged didn’t involve enumerating languages as keys…), while at other times I was thoroughly unsurprised that common data-modelling from RDBMSes works really well.

The upshot of all this is that the JSON that is created works well because it is more JSON and less LD. Now, you’d find that a shame if you were the kind of person who was trying to round-trip this data via HTML. And it is here that I can relate a minor epiphany: you don’t need to use the same data structure in as is used out. In fact, there are really good reasons why you ought not to‡.

The HTTP-PATCH format being used to generate plain-old RDF works well and continues to surprise — it turned out that it does indeed support blank nodes (with a minor tweak) and all is delight and mirth (again). Combined with now-easier-to-work-with JSON for consumption, I’m pretty sure that the elusive round-trip is in place.

† Feel free to disagree; I know a lot of folk have XML workflows that work, but these don’t rock my boat.

‡ I ought to write about this in another piece.


What is the innovation space in libraries?

I won’t dwell: most of what has gone before in library systems is rather outdated. Building on the ideas behind these systems has been a problem, but we’ve done it anyway. Why? While the ideas aren’t good ones, they are simple. I contend that the reason for the “success” of the current crop of library systems is that they are exactly this: simple.

But the simple we’re talking about here isn’t a lack of complexity (good simple); it’s a bad kind of simple, characterised by quick-and-dirty solutions to complicated problems. These result in increasing complexity at a granular level, but an apparently amorphous — and thereby “simple” — lump at the higher level.

And then there’s this thing of thinking that more of the same must be better; actually, more of the same is really the only way to go when you’ve exploited every other avenue of a particularly limited concept.

Unfortunately, increasing the amount of bad simple in your system is not a good idea; it’s very hard to make the bad kind of simple scale. Or at least scale in a way that is meaningful.

I’m not a fan of weak ideas and bad system architecture; these things seem omnipresent in this innovation space. What we need to do to work against this, I’ll come back to later.

I have never been critical of the fact that system vendors are not charities; they provide a service within a profit-driven model. This is no secret. What I have criticised, though, is the lack of imagination shown when developing functionality: I see glitter, not substance.

It is ultimately the responsibility of system vendors to provide compelling reasons to use their system, but lack of differentiation between products — and I really see no difference between any of the current crop of systems, commercial or otherwise — means I see no reason to swap one system for another. I see an argument for onsite/offsite, but it has to do with user privacy, not my convenience as administrator.

The current compelling argument is price or a move to open source (the computational equivalent of generic medicines). When price and convenience are the major issues, there is not going to be much actual innovation.

And here’s the bit where I’m going to claim that the community defines what the innovation space is. I really believe this, but most libraries (and especially academic libraries) are very comfortable not innovating — or at least only innovating in ways that don’t affect anything that might be touched by a real user.

There are ways to work with commercial vendors and open-source communities that lead to innovation, but I’m not seeing anyone really doing this. There is some good work going on, but it seems that it’s also stuck in the rut of doing the same again in a different way.

Innovation is much more than technology.
