Doing digitization

This is a longer piece about digitization, particularly digitization in libraries and archives. This is a personal account and does not constitute the view of my current or any previous employer. I’m writing it largely on a promise to some Canadian or other — as is usually the case.

I have a small amount of experience with digitization workflows and systems, having worked with digitization for more than half of the years in the period 2004–2014. I was an apprenticed pre-press worker in 2001 and worked extensively digitizing and preparing raster image files for production contexts in analogue and digital print as well as web. Following this I worked for an electronics industry company creating a workflow for digitizing and documenting classification and certification documents. Furthermore, I worked for several years on a project related to digitizing so-called special collections (manuscripts and other unique materials that libraries hold). In this latter case, I also had the opportunity to work closely with archives and see the points of contact there.

The experience of working with digitized images combined with the reality of creating content that can be used in multiple contexts has provided me with a clear vision for what I think is “right”. What I think is right is often at odds with what people in libraries and archives think and do.

The first thing I do is ask why you are digitizing things. There can be several answers, but some of the ones I have come across include:

  • because they are unique and disappearing (falling apart)
  • because they are moving to an inaccessible/less accessible location (different institution or repository) or are not unique and are being burnt
  • to make them available to a wider audience
  • to monetize them

Of these, only the first one really means that digitization needs to happen: if something will be lost forever if it is not digitized now, then do it, and do it at as high a resolution as possible! In all other cases, the materials are not disappearing, so they can be scanned at a lower resolution, at greater speed and with less heavyweight equipment.

Let me make this very clear: high resolutions mean expensive equipment that will quickly be obsolete and that enforces ridiculous restrictions on how you can work. High resolutions also mean slower digitization and larger files, and larger files mean slower processing times. If you’re delivering files over the web anyway, you do not need extremely high resolutions; you need enough resolution. Think: scan twice at different times if you need higher resolutions. The current “high res” will be low-res two years down the line, and even if it isn’t, equipment prices and processing times will have come down.
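To put rough numbers on the file-size point, here is a minimal sketch of the arithmetic for uncompressed raster images. The page size (A4 in inches), the 24-bit RGB assumption and the function name are illustrative choices of mine, not figures from any particular project.

```python
def uncompressed_size_bytes(width_in, height_in, ppi, channels=3, bits_per_channel=8):
    """Size of an uncompressed raster scan, ignoring file headers."""
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    return width_px * height_px * channels * bits_per_channel // 8


# Assumption: an A4 page, roughly 8.27 x 11.69 inches, scanned as 24-bit RGB.
A4_INCHES = (8.27, 11.69)

for ppi in (300, 600, 1200):
    size = uncompressed_size_bytes(*A4_INCHES, ppi)
    print(f"{ppi:>5} ppi: {size / 1e6:8.1f} MB")
```

Because pixel count grows with the square of the resolution, doubling the resolution quadruples the file: under these assumptions an A4 page is roughly 26 MB at 300 ppi and roughly 104 MB at 600 ppi, before any compression.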

This brings me to the second point: do not buy any equipment now. Don’t buy any equipment at all until you have looked very hard at the following:

  • Do you have the right to digitize the materials (and provide them to third-parties)?
  • What use cases do you have for the digitized files?
  • What kinds of materials are you digitizing?

If you don’t have the requisite rights — and I mean, really, really make sure you do — forget it. I am aware of at least one project that spent hundreds of thousands of Norwegian Kroner on doing literally nothing because the ownership of the materials was not cleared before digitization began. This is a stupid mistake to make, so don’t make it.

If you want to make materials available to a wider audience, you’re going to be creating a website. The website isn’t part of the digitization project; it is a consumer of an API over your data stores. Let me repeat that: YOU ARE NOT CREATING A WEBSITE. Don’t ever, ever mix these two. The APIs for the data can take specification input from another project that is about websites, but let me mention two technologies right here: RDF and IIIF, and let the website project eat that cake.
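As an illustration of what "a user of an API" can mean on the image side, the IIIF Image API addresses every derivative through a URL of the form {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}. The sketch below only builds such URLs; the server address and the identifier are hypothetical, and the default size keyword "max" follows version 3 of the specification (version 2 servers expect "full").

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL (no network access involved)."""
    # Identifiers must be percent-encoded, including any slashes they contain.
    encoded_id = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{encoded_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Hypothetical server and identifier, for illustration only.
BASE = "https://images.example.org/iiif"

# The whole image, scaled to fit inside an 800 x 800 box.
print(iiif_image_url(BASE, "ms-0042/page-001", size="!800,800"))
```

The website project asks for whatever size and crop it needs, and the digitization project never has to pre-render a derivative for a particular page layout.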

If you’re aiming to monetize digitized collections, you might be entering murky waters legally. I’m not even sure that I want to aid and abet people making money from digitization. For sure, if you have the copyright-holder’s consent and you’re not publicly funded, go ahead, but a publicly funded organization making cash from doing the job it has already been paid to do is not OK. In fact, if the revenues are needed to do the project, then the project may die if the revenues cease, which makes a mockery of long-term preservation goals and of making the content generally accessible. Give up now. (While we’re on the topic, visible digital watermarks on things are stupid, don’t do that; put the rights stuff in metadata and embed it!)

Use cases for digitized files can be many and various; some of the ones I have come across include providing marketing images for advertising agencies, providing users with on-demand copies of information resources, long-term preservation of cultural heritage, and providing access to unique materials via the Web. There could be several reasons for digitizing content; bear in mind that these may conflict. Long-term preservation and on-demand delivery over the Web don’t sit easily together for reasons of speed.

It’s worth considering what the things you’re digitizing are — if they are archives then certain standards for documentation pertain, similarly special collections have their rules too. The formats of the materials you’re scanning also dictate things like the ways in which they will be digitized, how they will be documented and in the final instance presented to users. Knowing that a workflow will have to support different document formats is useful because it also means that you can find appropriate equipment for the purpose.

I mentioned the most important aspect of digitization in the previous paragraph: the workflow. The workflow must be planned very carefully. It doesn’t matter if you spend more time on this than you spend on actually digitizing things; you will learn more about your collections and how the digitized collections can be used. If you know your use cases and have planned your workflow, you can’t go wrong.

That is easy to say, but there are a few pointers about actually planning the workflow:

  • document the existing processes related to the materials
  • the workflow isn’t just about digitization, it includes the physical materials and the digitized materials. These two should be related via the metadata and synthetic keys (unique attributes that are manifested both in the metadata and on the physical materials). Use the synthetic keys to provide support for linking digitization with metadata.
  • imagine that the workflow is a specification of how metadata of an object to be scanned is interpreted so that it ends up in the right places in the right quality with the right metadata and the right display
  • metadata need not be created by hand, documenting automatic processing should be automatic
  • the workflow must be rigorous
  • the workflow needs to be easy to do right (otherwise operatives will work around it)
  • flexibility where possible, but there are no special cases (don’t distinguish things that don’t fit in, make sure you plan for extensibility to take care of new materials and use cases)
  • metadata is a tool for everything, not just description
  • leave nothing to chance, manual manipulation is a no-no
  • if operatives are photoshopping, you’re creating art and their name needs to be on the copyright (don’t ever do this)
  • scanning and processing must be automated
  • keep all the data, but make sure it is annotated so you know what it is
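A couple of the pointers above (synthetic keys linking the physical item to its files, and automatic documentation of automatic processing) can be sketched in a few lines. Everything here is hypothetical: the key format, the file-naming rule and the ScanRecord structure are my own assumptions for illustration, not a description of any particular system.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ScanRecord:
    synthetic_key: str  # the same key is attached to the physical item
    file_name: str      # derived from the key, never typed by hand
    sha256: str         # fixity value for later integrity checks
    events: list = field(default_factory=list)

    def log(self, step, **details):
        """Record a processing step with a UTC timestamp."""
        stamp = datetime.now(timezone.utc).isoformat()
        self.events.append({"step": step, "at": stamp, **details})


def register_scan(synthetic_key, sequence, data):
    """Create the record for one scanned image and log the scan event."""
    file_name = f"{synthetic_key}_{sequence:04d}.tif"
    record = ScanRecord(synthetic_key, file_name, hashlib.sha256(data).hexdigest())
    record.log("scanned", size_bytes=len(data))
    return record


# Stand-in bytes instead of a real image, so the sketch runs anywhere.
record = register_scan("MS0042", 1, b"not really a TIFF")
record.log("derivative-created", tool="hypothetical-resizer", max_px=2000)
print(json.dumps(asdict(record), indent=2))
```

The point of the design is that an operative never names a file or writes a processing note: both follow mechanically from the key on the physical object and from the steps the pipeline actually ran.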

The details of your workflow need to be hammered out locally. Consultancies can help you do this, but being part of the process and working closely with the operatives who will work on the processes is essential. Make sure that everyone understands why you’re doing what you’re doing in the ways you’re doing it, and they will help find ways of ironing out issues rather than simply working around them; the added value of the processes should be clear.

Technologywise, it is important to understand that some things should be ephemeral (equipment and software), while others have a rather longer lifespan (file formats). One of the recurring themes in digitization is to choose hardware that can be used with non-vendor software; a single-vendor hardware/software package often has very few benefits for you and many for the supplier. I have seen systems forced to run (vendor: “I see no problem”) on Windows XP in 2014; the same system also needed nineteen clicks to process a single document and an additional five clicks for each scan. State-of-the-art in 2002 at a huge premium is not state-of-the-art twelve years later; equipment that needs longevity to be cost effective is pointless in this context, as resolution, usability and speed are constantly moving targets. For goodness’ sake, spend more money on your server and storage/backup! All the image quality in the world is worth nothing if it is impossible to process or just disappears (yes, this happens).

Must-haves in terms of technology include:

  • don’t hesitate to automate (ImageMagick and bash can get you a long way)
  • don’t let anyone manipulate files manually, use an API
  • files belong in a file system
  • file-system-level version control (ZFS, or somesuch)
  • embed metadata (including rights metadata)
  • choose software that can be automated, and use tools like AutoIt even when it can’t
  • search is not your game, providing data for search is
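As a sketch of the automation point, the snippet below builds ImageMagick commands that turn master TIFFs into small presentation JPEGs. The directory names, the 2000-pixel limit and the quality setting are assumptions chosen for illustration. The command name "convert" is the ImageMagick 6 spelling; on ImageMagick 7 the entry point is "magick". By default nothing is executed, the commands are only printed.

```python
import shutil
import subprocess
from pathlib import Path


def derivative_command(master, out_dir, max_px=2000, quality=60):
    """Return the ImageMagick argument list for one presentation JPEG."""
    target = Path(out_dir) / (Path(master).stem + ".jpg")
    return [
        "convert", str(master),
        "-resize", f"{max_px}x{max_px}>",  # '>' means: only shrink, never enlarge
        "-quality", str(quality),
        str(target),
    ]


def make_derivatives(master_dir, out_dir, dry_run=True):
    """Build (and optionally run) one command per master TIFF."""
    commands = [derivative_command(m, out_dir)
                for m in sorted(Path(master_dir).glob("*.tif"))]
    for cmd in commands:
        if dry_run or shutil.which(cmd[0]) is None:
            print(" ".join(cmd))  # show what would be run
        else:
            subprocess.run(cmd, check=True)
    return commands


# Hypothetical file and directory names.
print(derivative_command("masters/MS0042_0001.tif", "web"))
```

Calling make_derivatives("masters", "web", dry_run=False) would run the conversions if ImageMagick is installed. Because the parameters live in one place, changing the presentation size later means re-running the script rather than re-opening files by hand.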

This topic is huge, and I have scratched the surface very lightly. I find this kind of work very exciting, so if you want to talk more, contact me.

4 comments on “Doing digitization”
  1. Ole Husby says:

    As I have commented elsewhere: I would like to see a similar guide for the purpose of personal digitization. I have thousands of photographs (several media types), and I have repeatedly claimed that I would start a digitization project when retired. Now I am retired, but the project is for some reason not started. The hardest part of the specs is not about technical things like resolution and hardware, but more like metadata and workflow. (Sorry about this sidetracking, I find your guide very sound).

  2. Do you have any recommendations concerning format? Do you differentiate between storing format and presentation format?

    • brinxmat says:

      I tend to say TIF, uncompressed for storage (banal, understandable format that isn’t a moving target) and compressed JPG for presentation (really, the file should be as small as possible — we did blinded ocular tests on aliasing which resulted in us choosing to compress the hell out of things). Using an IIIF image server helps matters on the presentation side and has repercussions for formats too.
