I read a couple of things about semantic web stuff that reminded me of lessons we have really known for quite a long while now:

  • use appropriate technology for appropriate applications
  • consumption is far harder than publication
  • some data simply isn’t very re-purposable

Having worked with production systems using RDF since 2006 in the print industry, the energy industry and libraries, I’m aware that semantic technology has a clear place in the software engineering toolbox. It irritates me a bit when people claim that RDF isn’t widely deployed and that it’s an academic field; neither claim is true, it simply means YOU haven’t had experience with the fields where it is present.

Just because it is possible doesn’t mean we should…

Anyone who has worked with RDF will also have experienced the woo-thinking that surrounds the semantic stack. Typically, there has been a desire to create semantic-only solutions (ignoring other technologies), as well as attempts to solve difficult and interesting problems instead of focussing on core problems.

I don’t want to dwell on this topic, but I have noted a number of projects where non-semantic technologies have been largely ignored in favour of purely semantic solutions. Sometimes, a simple SQL database solution is much more appropriate than something in RDF — think lots of values and calculations thereupon — and one should always consider the relationship between indexing and search (“but SPARQL can be used for search…srsly”).
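To make the point concrete, here is a minimal sketch of the “lots of values and calculations thereupon” case; the table, column names and figures are invented for illustration, but this is the kind of workload where a relational store makes the calculation a one-liner:

```python
import sqlite3

# Hypothetical sensor readings: many numeric values per entity.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("s1", 1.5), ("s1", 2.5), ("s2", 10.0)],
)

# Aggregation is what SQL is built for; the equivalent in SPARQL is
# possible but rarely the natural choice for this shape of data.
averages = list(conn.execute(
    "SELECT sensor, AVG(value) FROM readings GROUP BY sensor ORDER BY sensor"
))
# averages == [("s1", 2.0), ("s2", 10.0)]
```

The point isn’t that RDF can’t do this, only that when the data is dominated by values and arithmetic over them, the relational model is the appropriate tool.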

It’s easier to throw up than create a delicate sushi meal

Emptying the content of your database onto the web is an easy thing, whereas creating a re-usable dataset is really hard. One of the things that helps you understand this is creating an application on top of other people’s data. Comparing apples with apples in your own system is sometimes hard enough; comparing my entity apples with your string oranges is harder. Even when we can work out how the data is related, that doesn’t mean it can be used directly.
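A toy sketch of the “entity apples versus string oranges” problem, with entirely made-up identifiers and labels: my dataset keys things by identifier, yours only carries display strings, so any join has to go through a normalised label, and whatever doesn’t normalise cleanly silently falls through:

```python
# My data: keyed by (hypothetical) URIs, with structured properties.
mine = {
    "http://example.org/fruit/apple": {"label": "Apple", "kind": "pome"},
    "http://example.org/fruit/orange": {"label": "Orange", "kind": "citrus"},
}

# Your data: bare strings, no identifiers.
yours = ["apples", "Orange", "blood orange"]

def normalise(s):
    # Crude normalisation: lowercase, strip a trailing plural 's'.
    s = s.strip().lower()
    return s[:-1] if s.endswith("s") else s

by_label = {normalise(v["label"]): k for k, v in mine.items()}
matched = {s: by_label.get(normalise(s)) for s in yours}
# "apples" and "Orange" resolve to URIs; "blood orange" matches nothing.
```

The lossiness is the point: the unmatched strings don’t raise errors, they just quietly fail to link, which is exactly what makes consuming other people’s data so much harder than publishing your own.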

It’s even tempting to say that putting data out there first was a mistake; yes, data availability has increased, but has data use increased anywhere near proportionally? Some datasets see heavy linking because they are good, comprehensive and domain-specific (geonames really stands out here), whereas DBPedia stands out for its own unique reasons. In the latter case, really using the data in good ways is still a challenge because, while it is voluminous, the quality and coverage are not equally good from subject to subject. Which brings us to weirdness…

Metacrap is another way of saying I don’t understand domains

Cory Doctorow’s Metacrap has been cited many times as proof that the whole metadata enterprise is doomed. Nonsense; Doctorow is arguing against universalism in metadata, whether he understands it or not. Within any domain, the usefulness of terminology is balanced against the difficulty of having to learn the lingo; when you’re searching for a thing within a specific field in your own language, a highly subjective term is going to help you. Similarly, tagging a thing oriented towards kids with Latin terms would be useless (unless your kids attend that kind of school). Working in and between these domains is the work of search.

Providing links between domains is hard — there are some simple datasets that do this, but the real work in this area is done not by simple facet comparison and entity relations but with probabilistic tools. The job of relating semi-overlapping terminology is, however, trivial when compared to creating a system that can use the data in a useful way. Perhaps there is something in this: that the limits of our data’s linkability (a subjective concept?) are directly proportional to its domain specificity (measurable?).
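As a sketch of what “probabilistic tools” rather than exact facet comparison means here, consider scoring candidate matches between two vocabularies with a string-similarity ratio; the terms and the threshold below are invented, and real systems use far richer models than surface similarity:

```python
from difflib import SequenceMatcher

# Two hypothetical, semi-overlapping domain vocabularies.
domain_a = ["heart attack", "high blood pressure"]
domain_b = ["myocardial infarction", "hypertension", "heart attacks"]

def best_match(term, candidates, threshold=0.8):
    # Score every candidate and keep the best one, if it clears the bar.
    scored = [(SequenceMatcher(None, term, c).ratio(), c) for c in candidates]
    score, match = max(scored)
    return match if score >= threshold else None

links = {term: best_match(term, domain_b) for term in domain_a}
# Surface similarity links "heart attack" to "heart attacks", but it can
# never discover that "heart attack" means "myocardial infarction".
```

Which is the deeper point: lexical similarity gets you the easy links, while the semantically equivalent but lexically unrelated terms are exactly where the real inter-domain work lies.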

…and then there is simple bad practice

One of the interesting paradoxes that has been raised is that data should outlive its application — the fine wine analogy springs to mind — whereas data is often created for a specific application. Perhaps RDF’s biggest boon is that it makes it possible to create data models that are abstracted from the application in a way that a physical data model never can be. I guess that this is also what makes RDF quite hard for most people.

I’d say it is a common anti-pattern not to consider data at a very abstract, conceptual level, but to plough down to low-level implementation pretty much straight away. This can be seen in the way some folks will disregard any abstract discussion of data models at all. It’s worth pointing out that there’s a difference between an engineer and a mechanic.

