It's a pretty widely accepted fact here at K-State Libraries that SFX, aka Get It, plays a huge role in the services we provide for our users. Inevitably, and partly because SFX is so heavily used, we encounter problems with how SFX directs people to resources. Such problems, for the most part, have one of three possible causes:
- The data we put into SFX is wrong. Once we know about them, we fix these errors pretty quickly.
- The data contained in an incoming OpenURL is wrong. Information providers have to fix these errors; depending on the provider a fix can take anywhere from a day to a month or even longer.
- The incoming OpenURL itself is wrong, i.e., it is malformed according to the rules laid out in the OpenURL standard.
You'll notice I didn't mention a fix for the third cause? Right. That's because there isn't a good fix just now. Libraries have no control over how an individual information provider chooses to form the OpenURLs destined for a library's link resolver. Information providers have little incentive, other than a wish to do the right thing, to conform their OpenURL practices to the published standard. This unfortunate situation is about to change.

Since the form of an OpenURL and the data descriptors it contains are prescribed by a published standard (ANSI/NISO Z39.88), you'd think malformed OpenURLs would almost never happen. After all, the point of an information standard is to give everyone using a technology the same set of rules. But badly formed OpenURLs arrive at libraries' link resolvers all the time and, until recently, there was very little we could do even to prove it was happening, because much of the evidence was anecdotal.
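To make "well-formed" concrete, here's a rough sketch of what a Z39.88 OpenURL in key/encoded-value (KEV) form looks like, parsed with Python's standard library. The resolver hostname and the citation values are invented for illustration; only the key names and the key=value&key=value shape come from the standard:

```python
from urllib.parse import parse_qs, urlsplit

# A hypothetical OpenURL 1.0 (Z39.88) in key/encoded-value (KEV) form.
# The base URL and citation values are made up; the key names are the
# standard's (url_ver, rft_val_fmt, rft.* for the referent).
openurl = ("https://resolver.example.edu/sfx?"
           "url_ver=Z39.88-2004"
           "&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal"
           "&rft.jtitle=Nature&rft.issn=0028-0836"
           "&rft.volume=462&rft.spage=307")

# parse_qs splits the query on "&" into the key=value pairs the
# standard prescribes, percent-decoding the values along the way.
fields = {key: values[0]
          for key, values in parse_qs(urlsplit(openurl).query).items()}

print(fields["rft.issn"])    # 0028-0836
print(fields["rft.jtitle"])  # Nature
```

A link resolver's whole job is to pull citation fields like these out of the query string and match them against its knowledge base, which is why a malformed query string derails everything downstream.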
Enter the OpenURL Quality Metrics project, run by Adam Chandler at Cornell University. The idea behind this project is to examine the incoming OpenURLs sent to link resolvers by information providers, identify OpenURLs that cannot result in delivery of the requested service (full text linking, library catalog lookup, etc.), and determine whether specific information providers routinely send malformed OpenURLs. The eventual goal of the work is to create an ongoing set of measurements that libraries can use to evaluate information providers' effectiveness as sources of OpenURL linking. Adam's blog, presentations about this project, and a link to the current reporting system are at http://openurlquality.blogspot.com.
K-State has contributed incoming OpenURL data to this project. I haven't yet begun to analyze it using the reporting system's tools; I'll report more about the data itself as soon as I do. For this post, let's go back to the phrase, "K-State has contributed...data." I'll tell you a story about the work behind those four innocent-sounding words.
Find an article in a citation-only database like Web of Science. Go on. I'll wait. Got an article? Now click the link resolver button (Get It, if you're at K-State). Right there. That click generated an incoming OpenURL. It's "incoming" because the information is coming from an external source into your institution's link resolver. From there, as you know, the information can go to a full text journal, library catalog, ILL software, or wherever, this time as an "outgoing OpenURL." Here's the thing - the quality of the incoming OpenURL directly affects the service you get from the subsequent outgoing OpenURL. As an example, no ISSN incoming probably means no full text outgoing. Garbage in, garbage out. The OpenURL Quality Metrics project is working to find that garbage and subject it to our scrutiny.
In order to contribute data, namely incoming OpenURLs, to the project, I needed to extract those incoming OpenURLs from SFX. As I mentioned, SFX logs incoming OpenURLs, the dates and times of their arrival, and other relevant information in a MySQL table. With assistance from the kind customer support people at Ex Libris (thanks, Michael!), I tracked down the relevant table and constructed a query to get the data into a tab delimited text file. Simple, right?
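For readers curious what that extraction looks like in miniature, here's a rough sketch using an in-memory SQLite database as a stand-in for SFX's MySQL instance. The table and column names below are my invention for illustration; SFX's real schema differs:

```python
import sqlite3

# In-memory SQLite standing in for SFX's MySQL database. The table and
# column names are hypothetical, not SFX's actual schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE request_log (request_date TEXT, openurl TEXT)")
db.execute("INSERT INTO request_log VALUES "
           "('2009-03-14', 'url_ver=Z39.88-2004&rft.issn=0028-0836')")

# The same general shape of query: pull the date and the raw incoming
# OpenURL for a date range, then write the rows out tab-delimited.
rows = db.execute(
    "SELECT request_date, openurl FROM request_log "
    "WHERE request_date BETWEEN '2009-01-01' AND '2009-12-31' "
    "ORDER BY request_date"
)
lines = ["\t".join(row) for row in rows]
print(lines[0])  # date and OpenURL separated by a tab
```

The real query ran against MySQL and wrote nearly half a million rows, but the principle is the same: one date-stamped OpenURL per line of a tab-delimited file.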
In hindsight, yes, it was simple. I'm lucky to have a nodding acquaintance with SQL so I can write and modify my own queries. But getting the data into a query-able table was a challenge. For performance reasons, SFX recommends archiving its stat tables often, which I do. As a result, our SFX production instance is pretty nimble but generally doesn't contain more than the most recent month of data in a query-able form. I wanted to send Adam OpenURLs from all of 2009. To be able to query that much data, I needed to un-archive the relevant files into SFX's MySQL tables in a way that wouldn't impact performance for our end users. I'm working on getting a separate MySQL database where I can park all 3.5 years (and counting) of our SFX data, unarchived, to be queried at will. At the time, though, I took advantage of our sandbox instance to un-archive all 2009 transactions without impacting SFX's performance in production.
Holy data, Batman. The 2009 transactions come to almost half a million rows of data. Each of those rows contains an incoming OpenURL. I extracted the OpenURL and date from each row, broke the resulting massive file into slightly less massive three-month sections, and sent the whole mess to Adam one quarter at a time. I use that word deliberately - it really is a mess. Of 452,429 rows, Adam's program was able to process only 333,230, or roughly 74% of our 2009 data.
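Splitting a year of date-stamped rows into quarterly chunks can be sketched like this. The tab-delimited layout matches what I described above, but the code itself is illustrative, not the script I actually ran:

```python
from collections import defaultdict

# Toy rows in the tab-delimited (date, OpenURL) shape described above.
rows = [
    "2009-02-01\turl_ver=Z39.88-2004&rft.issn=0028-0836",
    "2009-05-17\turl_ver=Z39.88-2004&rft.issn=1476-4687",
    "2009-11-30\turl_ver=Z39.88-2004&rft.issn=0036-8075",
]

quarters = defaultdict(list)
for row in rows:
    month = int(row[5:7])                    # MM of the YYYY-MM-DD stamp
    quarters[(month - 1) // 3 + 1].append(row)

for q in sorted(quarters):
    # In the real run, each quarter went out to Adam as its own file.
    print(f"2009-Q{q}: {len(quarters[q])} rows")
```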
This is no reflection on Adam's program, although I expect he'll be doing some tweaking as the project moves forward. The remaining 25% of our 2009 incoming OpenURLs are, for the most part, formatted so poorly that the program can't parse them (this result alone justifies the need for a measure of OpenURL quality). So far, the worst culprits seem to be OpenURLs that use semicolons as delimiters between data elements. The correct delimiter in an OpenURL is the ampersand. Ex Libris has confirmed for me that SFX doesn't rewrite incoming OpenURLs so presumably the semicolons aren't being inserted by SFX. One mystery I have to solve is where they are coming from. There's also the successfully processed data to examine. Even though I haven't yet waded in to examine the results, you can take a look at http://openurlquality.niso.org.
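The semicolon problem is easy to demonstrate. Recent versions of Python's `parse_qs` split query strings only on ampersands, the delimiter the OpenURL standard prescribes, so a semicolon-delimited query collapses into one unusable pair. A crude normalization (my own sketch, with obvious risks if a value legitimately contains a semicolon) recovers the fields:

```python
from urllib.parse import parse_qs

# A well-formed query string, and the malformed semicolon-delimited
# style that some providers apparently send.
good = "url_ver=Z39.88-2004&rft.issn=0028-0836&rft.spage=307"
bad = good.replace("&", ";")

# Splitting on "&" only, the semicolon version yields a single bogus
# pair whose value swallows the rest of the citation.
print(len(parse_qs(good)))  # 3
print(len(parse_qs(bad)))   # 1

# Crude repair: treat semicolons as ampersands before parsing. Risky if
# a value legitimately contains ";", but fine as a demonstration.
repaired = parse_qs(bad.replace(";", "&"))
print(repaired["rft.issn"])  # ['0028-0836']
```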