RDF, it’s difficult, nasty, horrible and I hate it

When something isn’t easy, people are apt to give up. Because it’s easier to just do the same thing again, we don’t move forward. A case in point: this appeared in my Twitter stream (HT @InspektorHicks):

The Nepomuk project started as a research project in the European Union. The goal was to explore the use of relations between data for finding what you are looking for. It was built completely on top of RDF. While RDF is great from a theoretical point of view, it is not the simplest tool to understand or optimize. The databases which currently exist for RDF are not suited for desktop use.

The Nepomuk developers have tried very hard over the last years to optimize the indexing and searching infrastructure, and they have now come to the conclusion that Nepomuk cannot be further optimized without migrating away from RDF.

RDF also heavily relied on ontologies. These ontologies are a way to describe how the data should be stored and represented. They used the ontologies from the original EU research project – Shared Desktop Ontologies. These ontologies were designed at a time when it was not very clear how they would work, and they have sub-optimal performance and ease of use. They are quite vague in certain areas and often duplicate information. This leads to scenarios where it takes forever to figure out how the data should be stored. Additionally, since all the data needs to be stored in RDF, one cannot optimize for one specific data type.

Given these shortcomings and the many lessons learned over the last years the Nepomuk developers decided to drop RDF and rechristen the project under the name of Baloo. You can find more technical background and info on its architecture here.

Source: Baloo

In my own story, I wasted a few years labouring under the mistaken belief that it was important to search data in the RDF store natively; I wasted seconds of users’ time producing messy SPARQL queries that approximated indexing. I wasted more time looking for an RDF store with native indexing that actually worked in a satisfactory way; I wasted so much time because I was trying to do things in the way I had always done things. Then I got wise.

I realized that the point of RDF (and by RDF I mean RDF with HTTP-URIs like the RDF in the documentation from W3C, rather than some occult rubbish from the late 1990s involving non-HTTP-URIs) is that it’s part of the web (call it Linked Data); the point is that it’s there on the web to be indexed. The point: ON THE WEB. That’s important, it’s the useful part of RDF/Linked Data.

RDF is good as a data modelling framework because it is schemaless. That is its only real benefit. It makes no claims about the data, it simply provides an apparatus for describing data structures. Any data structure. There are views of data that are difficult to describe for exactly this reason; lists have no place because they imply an arbitrary structure (indeed, a “view” of the data).

Now, as long as you have a fixed view of what you want to know (title, author, creation date,…), then RDF is largely pointless; please use whichever row/column-oriented tool you choose. As soon as you’re unsure about the structure of your data, use RDF. Unfortunately, to use RDF, you need to understand it. And this is where most people fail.

To understand data, you need to understand that the view you have of the data in a table isn’t the data, but a view of the data. The data is nearer the combination of the values, their relations and their context. It’s difficult to understand these things from a spreadsheet; a database is closer, but not for most programmers. An indexing tool is mostly right out.

Looking at RDF from a point of view of what is good, it means you never have to see a database table like this (yes, really, people do this) because it is schemaless:

CREATE TABLE books
(
TabelId int,
Creator_first varchar(255),
Creator_last varchar(255),
Title varchar(1024),
f1 varchar(1024),
f2 varchar(1024),
f3 varchar(1024),
f4 varchar(1024),
f5 varchar(1024),
f6 varchar(1024)
);
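For contrast, here is a minimal sketch of the same kind of record held as plain subject/predicate/object statements, using nothing but Python tuples. The `example.org` URIs, the `shelfmark` property, the `describe` helper and all the values are made up for illustration; only the two Dublin Core property URIs are real vocabulary terms. It is not a real RDF library, just the shape of the idea: a new kind of fact is one more statement, not one more spare column.

```python
# Real vocabulary namespace (Dublin Core terms) and a hypothetical one.
DCT = "http://purl.org/dc/terms/"
EX = "http://example.org/"  # assumption: made-up namespace for this sketch

# A graph is simply a set of (subject, predicate, object) statements.
graph = {
    (EX + "book/1", DCT + "title", "An Example Title"),
    (EX + "book/1", DCT + "creator", "A. Author"),
}


def describe(graph, subject):
    """Collect every property of one subject into a dict of sorted value lists."""
    description = {}
    for s, p, o in graph:
        if s == subject:
            description.setdefault(p, []).append(o)
    return {p: sorted(values) for p, values in description.items()}


# A new kind of fact needs no ALTER TABLE and no f1..f6 column
# (the shelfmark property is hypothetical).
graph.add((EX + "book/1", EX + "vocab/shelfmark", "EXAMPLE-123"))

# Repeating a property is fine too, for instance a second creator.
graph.add((EX + "book/2", DCT + "creator", "A. Author"))
graph.add((EX + "book/2", DCT + "creator", "B. Author"))

print(describe(graph, EX + "book/1"))
print(describe(graph, EX + "book/2"))
```

Nothing here knows in advance which properties a book may have; that is the whole of what "schemaless" means in this context.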

What is bad is that it isn’t table-oriented and has nothing like an RDBMS. But it just works on the web. I suppose my real breakthrough was when I realized that what I was doing was providing a RESTful API for my data and that I needed to do other things to achieve what I wanted to do. I had to transform this data to make a useful HTML representation, I needed to provide simpler (transformed) APIs so that my indexer could do a good job, I had to take my complex RDF structures and rework them. And RDF tools make this really easy.
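To make the "simpler (transformed) API for the indexer" idea concrete, here is a rough sketch of flattening a small graph into the kind of flat document an indexing tool would accept. It is an illustration under assumptions, not the code I actually ran: the `example.org` URIs, the `FIELDS` mapping and the `follow` and `to_index_doc` helpers are all hypothetical, and the graph is again just a set of Python tuples. The Dublin Core and FOAF property URIs are real.

```python
import json

DCT = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"
EX = "http://example.org/"  # assumption: made-up namespace for this sketch

# The "complex" structure: the creator is a resource of its own, not a string.
graph = {
    (EX + "book/1", DCT + "title", "An Example Title"),
    (EX + "book/1", DCT + "creator", EX + "person/1"),
    (EX + "person/1", FOAF + "name", "A. Author"),
}

# Hypothetical mapping: index field name -> chain of predicates to follow.
FIELDS = {
    "title": [DCT + "title"],
    "author": [DCT + "creator", FOAF + "name"],
}


def follow(graph, start, path):
    """Follow a chain of predicates from start and return the sorted end values."""
    nodes = {start}
    for predicate in path:
        nodes = {o for s, p, o in graph if s in nodes and p == predicate}
    return sorted(nodes)


def to_index_doc(graph, subject):
    """Build one flat document (a view of the data) for the indexer."""
    doc = {"id": subject}
    for field, path in FIELDS.items():
        values = follow(graph, subject, path)
        if values:
            doc[field] = values
    return doc


doc = to_index_doc(graph, EX + "book/1")
print(json.dumps(doc, indent=2))
```

The flat document is a view, chosen for the indexer; the graph stays the data. Change your mind about what the index needs and you change `FIELDS`, not the store.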

This is the point that’s lost in projects that never really get to grips with RDF. It’s the point that’s lost when we try to be Semantic Web purists. It’s the point that’s lost when we forget that this is a Web technology. It’s the point that’s lost the second we use OWL (and disappear up our own backsides into ontolowanking). RDF is great at what it is great at, but it isn’t a tool like a database. You need a database for that. (You may actually need RDF for what it is that you’re trying to do, but YOU need a database for it.)

This is the thing with Baloo: others have already done what Nepomuk was trying to do. Adobe’s XMP does this; their tools use RDF. It works fine. XMP is supported by all kinds of indexing software. It is RDF. It just works. The thinking in RDF isn’t simple until you understand it, and then you have learnt from your mistakes. Unfortunately, we have an arrogant tendency to hate what we don’t know and to understand from prior experience. RDF is possibly best learnt without prior experience of data structures from databases.

A footnote: The EU funding we waste on nonsense related to IT could be better put to use feeding the poor. Please stop using this money on these pointless projects that serve only to raise the profiles of those few lucky participants.

It has been pointed out to me that “disappear up our own backsides into ontolowanking” is offensive to ontologists; I apologize for any offense caused, but caution the reader to see this not as an attack on ontologists, but rather on a way of doing ontology.
6 comments on “RDF, it’s difficult, nasty, horrible and I hate it”
  1. When talking about the benefit of schemaless data one could mention NoSQL as well: Instead of having a fixed database table one can add, omit, repeat and subdivide rows without any constraints. With NoSQL databases this is even more flexible than with RDF, but with RDF one can more easily merge data from multiple sources. JSON-LD (despite its actual complexity) combines good parts of RDF and NoSQL.

  2. Nepomuk was a “Semantic Desktop”, it was inspired by artificial intelligence research, and had not much in common with folksonomy tagging systems.

    What KDE developers and others want is a tagging tool for the masses, and a smooth, deliberate search on a human’s desktop, not a semantic engine for experts, researchers and scientists.

    Maybe KDE, like Wikipedia, is also dominated by hackers who are bound to PHP and MySQL. They will not get very far prejudicing computer science progress.

    What’s worse, nothing at all is related in any way to RDF. RDF is simply an abstract model for logical assertions, with formal semantics, rooted in deductive logic, closely related to logic programming, also to SIMULA, LISP, Scala, Clojure etc.

    Nobody is seriously claiming RDF a data model or a data standard. So I conclude it’s a misconception that RDF must be hated. It’s just some crappy tools that got aged and fail, and that’s all.

  3. […] posted a ranty post about RDF; based on some of the discussion on the back of that with friends and colleagues, I […]

  4. RDF(a) is pretty abstract and theoretical and therefore difficult to use in general. I guess people get tagging their articles and maintaining a list of tags, but you expect people to create an ontology and / or link their work to other ontologies.

    But isn’t it the tooling that makes a method usable? It’s just that for databases there’s good tooling to help people use them. Okay, and databases come more intuitively I guess since tables are a common thing (although even just tables can still be a bit abstract). And when a method (like databases) is used much, it becomes useful to create indexes based on this method, and use it for gathering data and search.

    In this sense, what’s the difference between RDFa and JSON-LD or NoSQL? They’re still all things no-one gets. But which is actually better and worthy of developing good tooling for?

    As I learned, the whole point of RDF is that anyone can create their own ontologies and even link their own ontologies to others, with for instance RDF, RDFS, OWL, SKOS. This is what makes RDF so great in my point of view, and not at all rigid or limited. There’s nothing about RDF keeping you from attaching metadata to each little list or sentence in your document and linking it.

    And here in 2016, Google and Bing are still hanging on to RDFa, which should mean something.

    I realize I’m probably missing some of your points. I would love to hear if I do. Very nice article though, it made me think.
