The paper in this post describes research on digital library applications. This paper has multiple authors, although I've credited by name only the author who presented the paper. The entire paper is available in the conference proceedings; let me know if you'd like to see it.
Large-Scale ETD Repositories - A Case Study of a Digital Library Application
Mark McFarland, et al.
Session 5
The presenter discussed three aspects of a specific digital library: the
digital library itself, a large collection maintained within the
digital library, and the system architecture that keeps both the
library and the collection functioning on a statewide level. The Texas
Digital Library is a statewide library with institutional members. One
of the largest and earliest collections within the TDL is the Texas ETD
Repository, a federated collection built on DSpace, with a Manakin
overlay. This repository project has a lot of layers, but the speaker
focused on issues of statewide implementation. Major issues encountered
included stakeholder participation, metadata challenges, and branding
of institutional portals. Specific stakeholder issues included the
technical infrastructure available at different institutions and
policies at those institutions and statewide that affected
participation. One specific metadata challenge was creating an
authoritative metadata schema and mapping it to the various schemas used
at different institutions. Once participation was established, the
project encountered problems with the document workflow. Two user
groups had two very different needs: submitters needed an interface
only once for submission, while administrators needed an interface for
daily interaction with the repository. The team built a workflow module
called Vireo that replaces the Manakin interface for staff users only.
To address problems of interoperability, the team added OAI harvesting,
ORE support, and a scheduler for content harvesting. The team is only
beginning to address issues of preservation; they are looking at ways
to maintain multiple copies of the repository and spread them over a
wide geographical area.
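To make the OAI harvesting step more concrete, here is a minimal sketch of one page of an OAI-PMH harvest, assuming the repository exposes Dublin Core records through the standard ListRecords verb. The base URL, identifiers, and sample response are made up for illustration and are not taken from the paper; the team's actual harvester, its ORE support, and its scheduler are not shown.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical response page; a real harvester would fetch this over HTTP.
SAMPLE_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:etd/1</identifier>
        <datestamp>2008-06-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Sample Dissertation</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, resumption_token=None):
    """Build the ListRecords request URL for one page of a harvest."""
    if resumption_token:
        # A resumption token is sent on its own, without metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (records, next_token) parsed from one ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
        })
    # An empty or missing token means the list is complete.
    token = root.findtext(".//oai:resumptionToken", namespaces=NS)
    return records, (token or None)
```

A scheduler like the one described would call `list_records_url` repeatedly, feeding each returned token back in until `parse_page` reports no further token.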
From their experiences, the team has
developed four models of system architecture that account for the needs
of almost any DSpace site:
1) Simple all-in-one
2) Cooperative all-in-one, with addition of Shibboleth or similar for institutional identification
3) Separated workflow: one repository dedicated to submission workflow; another dedicated to publishing the repository. This model adds scalability and security
4) Cooperative & Separated: one workflow repository for submissions; multiple institutional repositories for publishing
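One way to read the four models above is as a two-by-two grid: whether several institutions share the system, and whether submission workflow is split from publishing. The short sketch below encodes that reading; the function and parameter names are my own and the grid is an interpretation of the talk, not something the presenters stated in this form.

```python
def choose_model(multi_institution: bool, separate_workflow: bool) -> str:
    """Pick one of the four architecture models from two yes/no questions."""
    models = {
        (False, False): "Simple all-in-one",
        (True, False): "Cooperative all-in-one",
        (False, True): "Separated workflow",
        (True, True): "Cooperative & Separated",
    }
    return models[(multi_institution, separate_workflow)]
```

For example, a single campus that wants submissions kept off its public server would land on the "Separated workflow" model.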
The team has also developed three federation models to address the size and ability of each participating institution:
1) Metadata replication only
2) Metadata + references to content
3) Metadata + content replication
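The difference between the three federation models comes down to what the central repository stores for each item. The sketch below illustrates that under my own assumptions: the class names, fields, and the choice to drop the reference once content is replicated are all hypothetical and do not describe the team's software.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FederationModel(Enum):
    METADATA_ONLY = 1
    METADATA_WITH_REFERENCES = 2
    METADATA_AND_CONTENT = 3


@dataclass
class Item:
    """An item as held by the originating institution."""
    metadata: dict
    content_url: str
    content: bytes


@dataclass
class FederatedCopy:
    """What the central repository keeps for that item."""
    metadata: dict
    content_url: Optional[str] = None
    content: Optional[bytes] = None


def federate(item: Item, model: FederationModel) -> FederatedCopy:
    """Copy as much of the item as the chosen federation model allows."""
    copy = FederatedCopy(metadata=dict(item.metadata))  # always replicated
    if model is FederationModel.METADATA_WITH_REFERENCES:
        # The file stays at the home institution; only a pointer is kept.
        copy.content_url = item.content_url
    elif model is FederationModel.METADATA_AND_CONTENT:
        # The file itself is replicated (assumption: no pointer is needed).
        copy.content = item.content
    return copy
```

The lighter models ask less of a small institution's infrastructure, while full content replication costs more storage and transfer but leaves the central copy usable on its own.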
The team is planning to open source most of the software they have
developed for their particular projects, including the Vireo workflow
module and the OAI-ORE harvester. Lessons learned include:
*Stakeholder participation is vital
*Need a flexible architecture (ORE provided this for their project)
*Must consider scalability, especially in terms of server load and geographic distribution
*Integrate with existing technologies whenever possible, e.g., using Shibboleth for authentication