Semantic technologies like RDF and OWL are often touted as a cure-all for whatever data ills you may have, but that’s only half the story. There are many areas where semantic technologies can add real value over and above what you’re doing today, and there are many areas where you really shouldn’t entertain them at all, because doing so will make things worse than they need to be.
I have spent quite a few years struggling with content creation in a string-heavy domain; round-tripping HTML and RDF is a proper pain in the neck. It struck me a while ago that there’s really no need to do this. In this particular domain there is no need to provide a semantically enabled set-up for people doing data entry — in fact, the reverse is what’s needed. People tend to assume that data-entry operatives in this domain need to know something about semantic technology, whereas this is almost certainly not the case. What the operatives need to know is the organization’s aims in doing data entry, so that they can outline a workflow that in turn provides data to a semantic machine that does the work with the data.
Because we’re talking about semantic technologies, the semantic stack can also feed data back to the workflow in appropriate non-semantic formats, to be keyed into the semantic representation of the data that is entered. This seems like a simple idea, and yet I struggled with it for a long time.
The realization should, of course, have come earlier, as I had worked with time-series data in a semantic context, and I certainly wouldn’t have considered using semantic technology for data creation there. The simple devices that provide rapid storage for corrected time-series values certainly cannot be improved upon; once the data has been collected, however, it can be presented within a semantic framework that adds value (for example, providing links between entries and annotations of the content).
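To make that division of labour concrete, here is a minimal sketch: raw readings live in a plain, fast, non-semantic store, and triples are only derived afterwards, adding the links and annotations the raw store lacks. All the URIs and data below are hypothetical illustrations, not anything from a real system.

```python
from datetime import date

# Raw readings from a fast non-semantic store: plain (timestamp, value) pairs.
series = [
    (date(2015, 1, 1), 4.2),
    (date(2015, 1, 2), 4.7),
]

BASE = "http://example.org/ts/"  # hypothetical namespace


def as_triples(series, series_id="flow"):
    """Present collected readings as triples, adding value the raw store
    lacks: links between consecutive entries, plus an annotation."""
    triples = []
    prev = None
    for ts, value in series:
        entry = f"{BASE}{series_id}/{ts.isoformat()}"
        triples.append((entry, f"{BASE}value", value))
        if prev:  # link each entry back to its predecessor
            triples.append((entry, f"{BASE}previousEntry", prev))
        prev = entry
    # an annotation on the series as a whole
    triples.append((f"{BASE}{series_id}", f"{BASE}comment",
                    "values corrected at source"))
    return triples


triples = as_triples(series)
```

The point is the direction of flow: the semantic layer is a read-side presentation of the data, not the medium in which it is captured.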
I have asked myself why I did not come to this realization earlier; I suspect it is a combination of my close connection to this domain — I have worked with semantic technologies here since 2009 — and the fact that I have had some success. I certainly thought for a long time that the basic idea I had for data creation was a good one: a generic loom for creating semantic data. Unfortunately, this was in essence an HTML/linked-data replacement for ontology creation tools like Protégé and TopBraid Composer, which can largely provide the tools you need to create data directly, given a sufficiently strict training regime. Users typically outgrow these tools anyway, as this kind of data creation is generally easier and ends up tidier when done in a text editor.
And here’s the realization: you need to decouple data from semantic data. Your intent shouldn’t be to create that “kind of data”, as if semantic data itself were your ultimate aim; your aim is to create the data necessary for your application, and this should be independent of how the data is realized in the application. This has largely been obscured for me by the struggle towards semantic applications — just because I can do something semantically doesn’t mean I should. This is a bitter pill to swallow for semantically geared developers.
As for real-world data creation, on the whole I have moved away from a purely semantic approach. Document- and table-oriented approaches offer immediate benefits; in the latter case — which is my currently preferred route — I can manipulate data as RDF objects while maintaining metadata in-store, which allows me to create a simple interface for entry and simultaneously perform CRUD operations on the resultant semantic representation quickly and easily. This doesn’t keep me from wanting to expose the full possibilities of the semantic web directly in the data entry interface — but the workflow needs to be more streamlined than that. Again, a bitter pill for the semantic developer.
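The table-oriented route described above can be sketched as follows: data entry happens against flat rows, column-to-predicate metadata is kept alongside the data, and the semantic representation is derived per row, which keeps CRUD trivial. All the names and URIs here are hypothetical.

```python
BASE = "http://example.org/"  # hypothetical namespace

COLUMN_TO_PREDICATE = {       # table metadata kept "in-store"
    "name": BASE + "name",
    "born": BASE + "birthYear",
}


def row_to_triples(row_id, row):
    """Derive the semantic view of one flat row."""
    subject = BASE + row_id
    return [(subject, COLUMN_TO_PREDICATE[col], value)
            for col, value in row.items() if col in COLUMN_TO_PREDICATE]


class TableBackedStore:
    """CRUD on plain rows; the triple view is regenerated from the rows,
    so creates, updates and deletes never touch triples directly."""

    def __init__(self):
        self.rows = {}

    def upsert(self, row_id, row):   # create / update
        self.rows[row_id] = row

    def delete(self, row_id):
        self.rows.pop(row_id, None)

    def triples(self):               # read: the full semantic view
        out = []
        for row_id, row in self.rows.items():
            out.extend(row_to_triples(row_id, row))
        return out


store = TableBackedStore()
store.upsert("p1", {"name": "Ada", "born": "1815"})
```

Because the rows are the source of truth, the entry interface never needs to know RDF exists; the triples are a cheap, regenerable projection.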
There are other areas where the semantic technologies we use can provide support for things that we want to do; a case in point is value-oriented (textual, geographical) searches. It’s possible to use SPARQL to do rudimentary textual searches with regular expressions, but it isn’t a good idea: it’s expensive, and there are patently better ways of doing it. There are semantic add-ons — including one I wrote in 2011 — that have poor performance and, again, better-equipped counterpart solutions outside the semantic domain.
Indexing software like Elasticsearch, for example, provides a way of getting the values out and linking back into the semantic layer. A number of semantic software packages do provide built-in support for indexing, and while these are worth considering, I have the feeling that they make it all too easy to rely on this as part of a “holistic” semantic stack. Indeed, I was burned by this when I relied too heavily on Talis’s excellent Lucene integration, giving myself a vendor tie-in of my own creation that screwed my application when Talis stopped hosting data.
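The division of labour this suggests is simple: keep full-text search outside the triple store, and have the index hand back subject URIs that link into the semantic layer. In the sketch below, a dict-based inverted index stands in for an external engine like Elasticsearch; the data is illustrative.

```python
# Hypothetical label triples.
triples = [
    ("http://example.org/a", "http://example.org/label", "Harbour Bridge"),
    ("http://example.org/b", "http://example.org/label", "Town Hall"),
]


def build_index(triples):
    """Index literal values once; map each token to the subject URIs
    it occurs under, so hits can be joined back into the semantic layer."""
    index = {}
    for s, p, o in triples:
        for token in o.lower().split():
            index.setdefault(token, set()).add(s)
    return index


def search(index, term):
    """A lookup instead of a scan: O(1) per term, not O(n) per query."""
    return index.get(term.lower(), set())


index = build_index(triples)
hits = search(index, "bridge")  # subject URIs, ready for a SPARQL join
```

The returned URIs are the bridge back: they can be dropped straight into a `VALUES` clause of a SPARQL query, keeping the text engine and the triple store loosely coupled and independently replaceable.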
It is certainly the case that a better understanding of these things has helped me focus on the real strengths of semantic web technology, things that don’t have counterpart solutions outside the semantic stack: providing programmable logic in a declarative framework; providing simple, self-documenting data representation; providing a uniform API to unstructured data. The case for using semantic technology shouldn’t blind us to the possibility that there are technologies that are not semantic that nevertheless support our needs.