An unusually sensible post about RDF

I posted a ranty post about RDF; based on some of the discussion on the back of that with friends and colleagues, I thought I’d follow up with something a lot less sweary and hopefully a bit more helpful.

As a user of RDF, I’m rather inclined to think of it as a good thing; I also use other technologies for data work: column- and row-oriented stores, (No)SQL, etc. I’m a firm believer in using the right technology for the job, and I observe many people trying to do everything with RDF when the RDF stack isn’t suited to this.

RDF is good for data models, it’s good for data structures and transformations (HT @jindrichmynarz); it isn’t good for working with values. In several contexts I have worked with values at large scale, and there I have inevitably moved the values away from the semantic technology and into more robust systems that can crunch numbers and access values quickly and simply. That isn’t to say that RDF doesn’t play a role; it does, it plays important roles in helping users and systems understand and gather data.

I pointed out that the major benefit of RDF over other technologies is that it is schemaless; there are other schemaless technologies, but none are self-documenting and none are delivered with an open, standardized API. Schemalessness is important because it allows abstract and complex structures to be represented alongside simple ones; it means that you never have to worry as a developer about designing a clean and effective model. This can come later, if at all.
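To make the schemalessness point concrete, here is a minimal sketch that uses plain Python tuples rather than any real RDF library. The `ex:` names, the data and the helper functions are all invented for illustration; the only thing it tries to show is that simple and complex descriptions can sit side by side without a schema being declared first.

```python
# Hypothetical data: triples as plain (subject, predicate, object) tuples.
# No RDF library is used; the ex: names are made up for this sketch.
graph = set()

# A simple description needs no schema declared up front.
graph.add(("ex:book1", "ex:title", "Moby Dick"))

# A more complex structure sits alongside it without any migration step.
graph.add(("ex:book1", "ex:creator", "ex:person1"))
graph.add(("ex:person1", "ex:name", "Herman Melville"))
graph.add(("ex:person1", "ex:bornIn", "ex:place1"))


def describe(subject, triples):
    """Return a predicate -> sorted objects mapping for one subject."""
    description = {}
    for s, p, o in triples:
        if s == subject:
            description.setdefault(p, []).append(o)
    return {p: sorted(objs) for p, objs in description.items()}


def predicates_in_use(triples):
    """List the predicates actually present in the data."""
    return sorted({p for _, p, _ in triples})
```

Calling `predicates_in_use(graph)` is a crude stand-in for the self-documenting quality mentioned above: the structure can be read off the data itself, and a cleaner model can be imposed later, if at all.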

In saying that RDF’s only benefit is schemalessness, I’m talking about RDF as a modelling framework; the many vocabularies, ontologies and tools that make up the linked data stack aren’t “RDF”, but they are facilitated by it. These are indispensable in working with data on the Web in a distributed, Web-like fashion, but they don’t need RDF to work.

As a technology, RDF has some other benefits, but they’re not of clear value to a developer; things like how triples are efficient in storage and how the syntax is compact in some obscure sense can only be viewed as at a tangent to the pressing concern of writing working code.

Unfortunately, RDF’s power as a data tool is lost because people wanted another database. We need to get over that. I represent my data and work with my data structures in RDF, but I manipulate the values in JSON with tools that can give quick responses to questions about textual content, geographical and temporal information.

In working with values, I’m not working with data structures, I’m asking questions about instances, about the outermost edges of my graphs. I’m literally not interested in anything other than literals and I don’t need RDF data structures to do this. Simple tools for these simple jobs.

Nevertheless, to build these simple tools, I can perform much better if I have a knowledge base that provides tools to tell me if I have any geolocation data, what its structure is and what it is actually geolocating. Without this, I just have some literals.
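A rough sketch of that division of labour, again with plain Python tuples standing in for a real triple store. The `ex:` vocabulary, the coordinates and the function names are assumptions made up for this example: the graph is used to find out which things are geolocated and what they are, and only then are the literals flattened into JSON for a value-oriented tool.

```python
import json

# Hypothetical triples; the ex: vocabulary is invented for this sketch.
triples = [
    ("ex:place1", "ex:type", "ex:Library"),
    ("ex:place1", "ex:label", "Main branch"),
    ("ex:place1", "ex:lat", 63.4305),
    ("ex:place1", "ex:long", 10.3951),
    ("ex:book1", "ex:type", "ex:Book"),
    ("ex:book1", "ex:label", "Moby Dick"),
]

GEO_PREDICATES = {"ex:lat", "ex:long"}


def geolocated(triples):
    """Use the graph structure to find subjects carrying both coordinates."""
    found = {}
    for s, p, o in triples:
        if p in GEO_PREDICATES:
            found.setdefault(s, {})[p] = o
    return {s: c for s, c in found.items() if set(c) == GEO_PREDICATES}


def to_value_records(triples):
    """Flatten geolocated things into plain records for a value-oriented tool."""
    types = {s: o for s, p, o in triples if p == "ex:type"}
    records = []
    for s, coords in sorted(geolocated(triples).items()):
        records.append({
            "id": s,
            "type": types.get(s),
            "lat": coords["ex:lat"],
            "long": coords["ex:long"],
        })
    return records


print(json.dumps(to_value_records(triples), indent=2))
```

Without the `ex:type` and coordinate predicates, the two numbers would just be literals; with them, the JSON record says what is being geolocated.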

In the library world, we bang on about “things not strings” because we were inundated with strings as well as powerful tools for manipulating strings; the recent BIBFRAME initiative attempts to formalize “things not strings” in a way that is compatible with historical practice. I’m not a believer in this approach, but the nature of RDF, its hitherto-noted modularity and extensibility, means that it doesn’t matter — something better can come later.

RDF’s problems are many, but they largely boil down to a couple of things; firstly, as @xbib pointed out in a comment on the previous post, people don’t want a data model, they want values. I’d respond to this by firstly agreeing and secondly thinking that this is a shame. There are plenty of examples why choices that boil data on the Web down to the simplest route to values aren’t necessarily good choices, but Sarah Mei’s thorough treatment of using the JSON document store, MongoDB, might open a few eyes.

The second big problem RDF has is that it isn’t what people expected. I wrote this in a flippant way previously, but it needs to be said properly. The RDF stack doesn’t provide a replacement for a database. The expectations that one has because of the omnipresence of databases include: easy schema-based data entry forms, easy value querying and sorting, and knowing that the data you got out was the data. Enter RDF. Aside from causing triples to come into existence directly, getting data into RDF is difficult; there is no roundtrip from HTML. Or any other form. I will take a moment to point out that some people in the past did data entry in Protégé…alas. This makes RDF largely useless to anyone who thinks database-wise; it also directs attention away from the lack of data management in RDF…but that is for another time. I’ve already said RDF isn’t good for values, but I didn’t mention that it’s open world, which means that the data you have is just some statements that may or may not be all of the statements and/or be true. This is difficult for most people to accept.
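The open-world point can be illustrated with a very small sketch. It is a toy, not how any RDF store or reasoner actually answers queries, and the names and data are invented: the only difference shown is what a missing statement is taken to mean.

```python
# Hypothetical triples; names are invented for this sketch.
known = {
    ("ex:book1", "ex:creator", "ex:person1"),
}


def closed_world_ask(triple, triples):
    """Database-style answer: anything not stored is treated as false."""
    return triple in triples


def open_world_ask(triple, triples):
    """Open-world answer: a missing statement is unknown, not false.

    A present statement is only reported as "stated", since the data
    says nothing about whether it is actually true.
    """
    return "stated" if triple in triples else "unknown"


missing = ("ex:book1", "ex:creator", "ex:person2")
print(closed_world_ask(missing, known))
print(open_world_ask(missing, known))
```

Someone thinking database-wise expects the first answer; RDF only ever promises the second.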

RDF then should largely be left to applications where you need a logical data model that will allow you to create data structures and manipulate these as graphs. If you don’t need this, you don’t need RDF. I’d argue that you often do, especially if you work with data and large volumes of complex data. I’d argue that a logical, modular representation can be useful for generating views of data and providing new insights into data structures and transformations of these. RDF can be a good starting point for producing smart data, but it isn’t the endpoint; the endpoints are ephemeral and provide the here-and-now of the data. Pushing all the data power into technologies that focus on the here and the now doesn’t seem like a long-term strategy.

As a footnote, JSON-LD has been mooted as an alternative to RDF; why not? It’s a serialization for linked data (not a very good one, because I can’t parse JSON as easily as I can parse XML, I can’t read it as easily as I can read Turtle and I can’t easily chunk it like I can N-Triples). It also imposes a whole load of conventions that make my data more difficult to work with when I need simple access to simple values. Will I use JSON-LD? I already do. Is it the panacea it’s being toasted as? Certainly not, it’s another route to not having the tools to work with data and it certainly smacks of being of the here and now.

3 comments on “An unusually sensible post about RDF”
  1. Yes, RDF is just a tool with strengths and weaknesses. I just uploaded the broad categorization of tools for data structuring and description from my PhD. In this categorization RDF is primarily a tool for structuring data. In particular, RDF is good for graph structures. Comparable structuring tools include CSV (good for tabular data) and XML (good for hierarchical data). Graph structures are very flexible (I think this is what you meant by schemaless) but it’s difficult to get simple answers from a graph, compared to hierarchical or tabular data.

    The RDF ecosystem also puts emphasis on schemas and rules (OWL ontologies, inference rules etc.) and conceptual models (data models in a more informal sense). But conceptual modeling is hard, no matter what technology is used and people tend to avoid explicit data modeling (they prefer to do it implicitly in their heads). In fact modeling is deeply connected to the second hard problem in computer science as mentioned in the quote by Phil Karlton in the linked posting by Sarah Mei.

    • brinxmat says:

      I’ll summarize the points from the post:

      1) Don’t use RDF for every job.
      2) RDF is good for data models.
      3) RDF isn’t good for value-intensive operations.
      4) Schemalessness is good, whether this is derived from a graph or a KV/whatever approach (cf. http://martinfowler.com/articles/schemaless/)
      5) People have issues using RDF because they are used to value-oriented, schema-driven technologies.
      6) RDF is difficult to work with because it is unfamiliar and logically complex.
      7) Most people don’t want a data tool, they want a database/KV-store.
      8) Using an RDF model is smart for an application developer.
      9) RDF needs to be used in conjunction with other technologies to usefully do value-oriented things.

      As you can see, we hardly disagree.

      Your point about the RDF ecosystem putting emphasis on OWL & co.; I disagree, proponents of OWL place emphasis on OWL ecosystem stuff. That’s their choice. Many others certainly do not cf. BIBFRAME. Conceptual modelling is hard, which is why I suggest that the many existing ontologies and vocabularies are indispensable…

  2. I used OWL as an example of an RDF schema language. RDF can be used schemaless, and not having to choose one single schema is a major benefit of RDF, but in practice RDF is used with schemas, even if the schema is just a list of classes and properties. Maybe RDF is more schemaless just because it’s more difficult to spot implicit schemas in a graph than in tabular data. I agree about the minor role of values in RDF: major parts of information are encoded in URIs and RDF triples, but less in RDF literals. This can also apply to XML to some degree with tags instead of character content. Point 7 also applies to other technologies: people think that they just want to “store” data but data must also be expressed, referred to, structured, constrained, and described. Expecting one technology to do all of it is wrong.
