Documenting CSV with the csv vocabulary

At NTNU, we developed the csv vocabulary as a simple means of documenting CSV files in a way that wasn’t previously possible. It allows humans and computers to manage and interact with list-based data resources in new ways by providing the information that is typically stored only in the head of the creator of the resource, out of reach of the person who downloads the spreadsheet.

The reason we developed the vocabulary was that we suffer from spreadsheetitis (wondering about this? Watch Felienne Hermans’ entertaining presentation “Spreadsheets: the ununderstood dark matter of IT” from the 2013 Strata conference). Our solution is to provide tools that first understand the data and then provide a way of supplying the information needed to transform that data into other formats.

Using RDF for this application is a no-brainer: RDF has the tools needed to document things explicitly and expressively for machines and humans. It’s simple (really): the csv vocabulary provides two classes (CsvDocument and Column) and seven properties (hasColumn, hasIndex, hasCharacterEncoding, encodesLinebreaksAs, hasHeaderLine, mapsTo and hasMultivalueSeparator). The simplicity belies the power of this vocabulary; we can document files quickly and efficiently, drawing on other widely used vocabularies (such as rdfs and dcterms). We originally had terms for describing individual cells, but these proved largely meaningless, as it is simpler to provide a new representation of the CSV in some other format.

It’s interesting that Jeni Tennison of the Open Data Institute declared 2014 to be “the year of CSV”, and it seems that many now recognize that CSV isn’t going to go away. At least we have some tools for documenting it without having to mangle it into new formats. This pragmatic approach perhaps ignores some of the purism of the semantic web, but that is what makes it usable.

So, how is it used? Simply declare that a resource is a CsvDocument, that the document has columns and annotate those columns. An example:

<...> a csv:CsvDocument ;
    rdfs:comment "A document about fish" ;
    dcterms:title "Fish" ;
    csv:hasHeaderLine "false"^^xsd:boolean ;
    csv:hasCharacterEncoding "UTF-8" ;
    csv:encodesLinebreaksAs "LF" ;
    csv:hasColumn :column1 ;
    csv:hasColumn :column2 ;
    csv:hasColumn :column3 .

:column1 a csv:Column ;
    rdfs:label "Name" ;
    csv:hasIndex "1" ;
    csv:mapsTo rdfs:label .

:column2 a csv:Column ;
    rdfs:label "Latin name" ;
    csv:hasIndex "2" ;
    csv:mapsTo rdfs:label .

:column3 a csv:Column ;
    rdfs:label "Suppliers" ;
    csv:hasIndex "3" ;
    csv:mapsTo ex:hasSupplier .
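
To give a feel for how a tool might consume such a description, here is a minimal Python sketch. It is not our implementation: the `description` dictionary is a hypothetical in-memory mirror of the RDF above (a real tool would read the values from the graph), the `rows_to_statements` function and the `rowN` subject names are invented for illustration, the sample rows are made up, and hasIndex is assumed to be 1-based.

```python
import csv
import io

# Hypothetical mirror of the RDF description; a real tool would load
# these values from the RDF graph rather than hard-code them.
description = {
    "hasHeaderLine": False,
    "hasCharacterEncoding": "UTF-8",
    "columns": [
        {"hasIndex": 1, "label": "Name", "mapsTo": "rdfs:label"},
        {"hasIndex": 2, "label": "Latin name", "mapsTo": "rdfs:label"},
        {"hasIndex": 3, "label": "Suppliers", "mapsTo": "ex:hasSupplier"},
    ],
}


def rows_to_statements(raw_bytes, description):
    """Turn CSV bytes into (subject, property, value) tuples."""
    # Decode using the documented character encoding.
    text = raw_bytes.decode(description["hasCharacterEncoding"])
    rows = list(csv.reader(io.StringIO(text, newline="")))
    # Skip the first row only if the description says it is a header.
    if description["hasHeaderLine"]:
        rows = rows[1:]
    statements = []
    for row_number, row in enumerate(rows, start=1):
        subject = f"row{row_number}"  # hypothetical subject naming
        for column in description["columns"]:
            index = column["hasIndex"] - 1  # assumes 1-based hasIndex
            if index < len(row):
                statements.append((subject, column["mapsTo"], row[index]))
    return statements


# Made-up sample data matching the three documented columns.
sample = (
    "Cod,Gadus morhua,Supplier A\n"
    "Haddock,Melanogrammus aeglefinus,Supplier B\n"
).encode("utf-8")

statements = rows_to_statements(sample, description)
for statement in statements:
    print(statement)
```

The point of the sketch is that nothing about the file has to be guessed: the encoding, the presence of a header line and the meaning of each column all come from the description.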

Note that you can add other terms to add value to this description. If, for example, column3 contains formatted text, you could annotate the column to say that each field contains such text, and if the text is structured, you could provide information about how to parse it. Simplicity in description lies in the use of simple terms from the csv vocabulary in conjunction with other vocabularies. You’ll note that users are free to describe information to the extent that they feel is appropriate for their use; because it’s RDF, you can always go back later and add more information to the description.
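
As one small illustration of parsing information, suppose column3 were additionally annotated with csv:hasMultivalueSeparator. The Python sketch below assumes that this property names the character separating several values inside a single field; that reading, the ";" separator, the `split_multivalue` helper, the `row1` subject and the sample field are all assumptions made for the example.

```python
def split_multivalue(field, separator):
    """Split one CSV field into trimmed, non-empty values."""
    return [part.strip() for part in field.split(separator) if part.strip()]


# Hypothetical annotation for column3, mirroring
# csv:hasMultivalueSeparator ";" in the RDF description.
column3 = {"mapsTo": "ex:hasSupplier", "hasMultivalueSeparator": ";"}

# Made-up field value holding several suppliers in one cell.
field = "Supplier A; Supplier B;Supplier C"

suppliers = split_multivalue(field, column3["hasMultivalueSeparator"])

# One statement per supplier rather than one per cell.
statements = [("row1", column3["mapsTo"], supplier) for supplier in suppliers]
for statement in statements:
    print(statement)
```

With the separator documented, a single cell can yield several statements without the consuming tool having to guess at the cell's internal structure.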
