We’re familiar with discovery of literature in the library world; we facilitate it by producing bibliographic data and protocols for accessing and harvesting this data. What we don’t do is search. You might argue with this claim, but broadly speaking, search is delegated to third parties. While search in a local data repository isn’t an irrelevant aspect of our offering, the broad-strokes search that we typically term discovery isn’t really the thing libraries do.
Federated search was once seen as a solution to the problem of broad-strokes search, but that fizzled out as its realities became apparent (briefly: heterogeneous data formats, heterogeneous terminology in the data, heterogeneous interfaces to the data, and the workload involved in reconciling all of these).
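To make that reconciliation workload concrete, here is a toy sketch (the field names and records are invented for illustration, not drawn from any real system) of what mapping two heterogeneous source formats onto one internal shape involves:

```python
# Hypothetical records: the same book described in a MARC-flavoured
# format and a Dublin Core-flavoured format (invented example data).
marc_like = {"245a": "Linked Data for Libraries", "100a": "Smith, J."}
dc_like = {"dc:title": "Linked data for libraries", "dc:creator": "J. Smith"}

def normalize_marc(rec):
    """Map a MARC-flavoured record into a common internal shape."""
    return {"title": rec["245a"].lower(), "creator": rec["100a"]}

def normalize_dc(rec):
    """Map a Dublin Core-flavoured record into the same shape."""
    return {"title": rec["dc:title"].lower(), "creator": rec["dc:creator"]}

a, b = normalize_marc(marc_like), normalize_dc(dc_like)
# Lower-casing reconciles the titles, but the creator names still differ
# ("Smith, J." vs "J. Smith"): every such mismatch needs its own mapping
# rule, written and maintained per source system.
print(a["title"] == b["title"])     # True
print(a["creator"] == b["creator"])  # False
```

Multiply rules like these across every field, every vocabulary, and every source system, and the scale of the workload becomes clear.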
More recently, the trend is towards “discovery” services, where a provider offers search in a repository of “all the data”. In this case, “all the data” depends on who is doing the offering, but amongst other things, this includes the data your institution “owns”*. This sounds like a great deal, but it soon becomes obvious that the big-old repository approach largely means cramming all the old stuff into a bigger database, carrying with it all the weaknesses of ambiguity in data and format that we saw in federated search.
You’d like to imagine that this was a step up from federated search: where federated search is inconvenient (time, user experience), discovery services should excel, because we have lost the key marker of heterogeneity, things actually being in different systems. Of course, you’d be wrong, for a number of reasons.
When the data was created, it was created in a specific domain, with a specific model, and implemented in a specific system (the unloved triumvirate that enforces the dreaded local colour). Pulling that data out of that system without regard for these things is easy, but results in less useful data; this is why we see duplicates and ambiguity, the enemies of good search.
On the flipside, you could do data migration properly from each system**. But proper migration relies on understanding every modelling level of each and every source and mapping all of these to an entirely new set of models; and in any case it results in a new system carrying all of the issues the old systems had, ready to resurface the very next time you migrate to a new system (and yes, this will happen).
Maybe it’s the case that the local colour is there for a reason; maybe this is added value for the users (using the correct terminology for the field of high-energy physics or terminology accessible to children) and maybe the physical implementation of the interface results in a user experience that the users can live with.
But, don’t the users want a “Google-like” experience? Well, they already have one, it’s available for free at google.com***.
“It doesn’t contain our data!” Then you have issues with your system and it needs to be changed so that it is attractive to search engines (largely providing actual content and the metadata formats search engines use).
“We only want to search in relevant data!” So, you curate your collections carefully, to weed out the spurious articles written by algorithm or journals bought-in by corporations? You curate all of the free resources available on the web?
“We only want to search in our data!” If you’re not curating it yourself then it isn’t your data, and you mean “We only want to search amongst the things we have paid for!”. If you really only want to search in the data you curate, then you will as like as not also benefit from a locally curated search interface, since you’ve managed to define a domain-specific set of use cases.
The point: domain specificity isn’t a crime, it’s an essential part of creating usable services. Usable domain services will always be smaller systems; creating an overarching framework to provide search for these is extremely difficult without data marked up with genericity in mind (and here, linked data is definitely a candidate framework) and may even be doing the users a disservice by not giving them access to an interface tailored to finding things in a domain-related way.
Paying money for systems that provide access to just your data is paying for something worse than the free offering: you keep all of the commercial disadvantages (adverts? No, just back-room agreements about access) and lose the economy of scale that is, in fact, more-or-less everything. I’ll explain that last bit: sure, maybe your hit doesn’t come out on top, but someone else’s does; a user with access to relevance-ranked data in a massive framework, with both metadata and full text, is more likely to be put in touch with content reliably than a lazy searcher in a little repository built on your metadata alone. How the user gets to content, and which site they download it from, seems irrelevant to me as long as they get the content they need.
There are duplicates in overarching search services; there always will be. The user recognizes this, and it isn’t important.
If you are to have a new service, create instead a small, locally oriented service that can be consumed by others via web crawling and APIs; provide contexts that will inform the services consuming your data about what the terms you’re using mean (using schema.org and published vocabularies), and expect users to find your stuff elsewhere.
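A minimal sketch of what such a context might look like, as schema.org markup serialized to JSON-LD (the record values here are invented examples, not real data):

```python
import json

# A hypothetical local record described with schema.org terms.
# Embedded in a page as <script type="application/ld+json">...</script>,
# this tells a consuming service what "name" and "author" mean without
# any local guesswork or per-system mapping rules.
record = {
    "@context": "https://schema.org",
    "@type": "Book",
    "name": "An Example Title",
    "author": {"@type": "Person", "name": "A. N. Author"},
    "inLanguage": "en",
}

print(json.dumps(record, indent=2))
```

The point is that the semantics travel with the data: a crawler that knows schema.org can interpret the record correctly wherever it ends up.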
If you really must have a local overarching search engine, buy a local overarching search engine; set aside money for developers who will create the search architecture and for operatives who will ensure that ingested data is continually mapped, normalized and tested, that search algorithms are updated for your local users’ behaviours, and that the services comply with the user stories you have for your user base. Don’t leave this stuff to chance; it’s far too important to leave to an anonymous corporation that doesn’t know your users…obviously.
*Let’s just agree to ignore the preposterous aspects of the licensing agreements these offerings often carry (your data? No, it’s our data!).
**Let’s just agree to ignore the preposterous aspects of the licensing agreements these offerings often carry (your data? No, it’s our data!).
***Let’s just agree to ignore the preposterous aspects of the licensing agreements these offerings often carry (your data? No, it’s our data!).