Project Summary

This proposal seeks $50,000 to support a user needs assessment to help the U.S. Federal government documents community migrate from a print-based to a digital collection program. In particular, we request funds to analyze the bibliographic structure of document types, the social workings of collection development, the legal requirements, and the economic viability of adapting the LOCKSS peer-to-peer distributed archiving technology to the Depository Program. We will use this information to assess what further technical research and development will be needed. This technical work will be the subject of a prospective e-government grant proposal to be submitted in October 2003.

LOCKSS is designed to preserve access to web-published documents such as e-journals. Currently in beta test for materials that are intended to be stable over time, LOCKSS allows individual libraries to take custody of stable content delivered via HTTP and safeguard their community's access. LOCKSS ensures that such locally held content maintains its integrity through a polling and reputation system. LOCKSS is designed to run on inexpensive, consumer-grade hardware and to require almost no technical administration. The software is distributed as open source on www.sourceforge.net. Funding of this proposal would allow Stanford, in concert with selected members of the government documents community, to explore the technical, economic, social and legal viability of various LOCKSS architecture models for the GPO depository program.

The nature and significance of its potential impact on the field
Decentralized retention of and access to electronic government documents would resolve a number of significant challenges ranging from the merely logistical (e.g., server bandwidth at centralized government servers) to the profound (e.g., enabling data-mining of large sets of this literature, assuring the persistence of the record of government information over time). LOCKSS has established proof of concept with a single type of literature (i.e., e-journals), and is well on the way to production of a unique archiving methodology. However, it has yet to branch to other types of literature and other communities. If convergence between the protection of government documents and the methods of LOCKSS can be shown to be viable, it would change the government document community's understanding of custodianship in the digital age and improve public and scholarly access to vital information, while also demonstrating techniques for LOCKSS-like distributed archiving of still other genres of electronic publishing.

Project Description

Introduction

LOCKSS is designed to preserve access to web-published documents such as e-journals. Currently in beta test for materials that are intended to be stable over time, LOCKSS allows individual libraries to take custody of stable content in all formats delivered via HTTP and safeguard their community's access. LOCKSS ensures that such locally held content maintains its integrity through a polling and reputation system. LOCKSS is designed to run on inexpensive, consumer-grade hardware and to require almost no technical administration. The software is distributed as open source on www.sourceforge.net. Funding of this proposal would allow the Stanford LOCKSS team and selected members of the government documents community to explore the technical, economic, social and legal viability of various LOCKSS architecture models for the GPO depository program.

Problem Statement

Congress established the Federal Depository Library Program (FDLP) in 1860 to ensure that the United States public has access to its government's information. Authorized by 44 US Code Section 1902, the program involves the acquisition, format conversion, and distribution of depository materials and the coordination of Federal depository libraries in the 50 states, the District of Columbia and U.S. territories. The mission of the FDLP is to disseminate information products from all three branches of the Government to more than 1,350 libraries nationwide. Libraries that have been designated as Federal depositories maintain these information products as part of their existing collections and are responsible for assuring that the public has free access to the material provided by the FDLP.

The program operates on a two-tiered structure. Fifty-three of the 1,350 libraries have "Regional" status: these libraries automatically receive every document distributed under the program and are expected to maintain access to the material in perpetuity. The remaining 1,300 libraries have "Selective" status. Selective libraries acquire a selection of materials that reflects the needs and interests of their local constituencies, and after five years have elapsed individual libraries may withdraw material once it has been offered to other selectives in the region. Selective libraries do not acquire documents at the individual title or piece level. Rather, they identify broad categories of material from specific agencies in which they are interested, as designated by "item numbers" listed in a GPO basic document, the List of Classes of US Government Publications Available for Selection by Depository Libraries. As a result, each of the 1,300 selective libraries in the program has a "profile" involving receipt of as little as 2 and as much as 99 percent of the available material.

The FDLP as described ensures long-term public access through a geographically dispersed network of repositories to government information on print, microform and tangible electronic media (e.g., CD-ROMs, VideoDiscs, magnetic tape). These repositories serve as guardians and trusted repositories for the content and the public's right to unfettered and efficient access to the content. Increasingly, government agencies are producing less paper and relying on digital versions of documents made available solely on government servers. At present, there is no program for the systematic distribution of these electronic documents through the depository program. Long-term public access to this increasingly significant body of content is shifting from the model of a distributed set of trusted repositories (including public, academic, and government agency libraries) to a centralized model. For example, rather than distributing materials to depository libraries the Government Printing Office instead provides access to a large electronic collection through its own server, GPO Access. No longer is content distributed among a number of autonomous, trusted repositories of content, but content is increasingly geographically concentrated, with a small number of Federal servers located largely in the Washington, DC area serving everyone via the Internet.

The shift from a highly organized system of distribution is further compounded by the fact that government departments, agencies, and in some cases individuals working within agencies maintain a considerable amount of autonomy in populating and managing the many websites that comprise the government Internet domains. As a consequence there is very little consistency across or even within government agencies with respect to the way in which materials are created, made available, and maintained. At the agency level activities such as content refreshment (replacing old with new), content organization, budget reductions, loss of interest, or shifting politics, have a significant impact on the public's ability to find and use government information.

An informal search on Google highlights the problems of access and persistence. Some government documents can be found through Google; others cannot. There seems to be no pattern as to which are indexed. Some government documents are available through academic institutions but seem to be no longer available, or at least not easily located, on U.S. government sites, even using such specialized search engines as GPO Access, Google Uncle Sam, Firstgov, US GovSearch, SearchGov.com and Fedworld.

To assure their communities retain long-term access to this important literature, depository libraries must identify and adapt an inexpensive, robust, and independent mechanism for securing government e-documents. Such a mechanism could establish for the digital age the advantages (both operational and philosophical) of the distributed model that served the paper document depository program successfully for so long. We believe that LOCKSS may well be that mechanism.

Proposed Work

The LOCKSS project has proven the concept that a distributed peer-to-peer archiving system can capture and maintain scientific e-journals. However, this application of the LOCKSS principles makes a number of assumptions that are not true in the realm of government documents.

The purpose of this planning grant is to determine whether or how well the LOCKSS technology can be adapted to the U.S. Federal Government Depository Program by conducting a yearlong user needs assessment. Members of the "govdocs" community will provide the assessment in consultation with the technical team. There will be two deliverables from this grant: a document specifying the needs of the community in the three areas outlined below, and a technical grant proposal based on these needs, to be submitted in October 2003. The three areas are:

I. Bibliographic Content Structure
Structure of documents; files; periodicity; volatility; databases vs. text; complexity

  • What technical changes are needed to the LOCKSS technology to make it work with Government Documents?
  • What additions to the LOCKSS software does the government documents community need?

II. Social and Economic Aspects of Collection Development
Processes and procedures for collection development selection and retention; Costs for selection, storage, and access.

  • What preservation architectures are appropriate for government documents?
  • Apart from preservation and access, are there other roles that the Depository Libraries could play in a system of web-based Government Documents?
  • Does LOCKSS, by allowing local institutions to hold the online content, increase use of this literature? We would work with a small number of institutions in depth to explore the issues of integrating local LOCKSS caches into university and other public search engines.

III. Legal Aspects of Collection Development
U.S. Federal Documents deposited through the paper U.S. Federal Document Program remain U.S. Federal Property.

  • LOCKSS is an append-only database; is it acceptable, when the Federal Government recalls documents, to turn off access? While the documents are not under copyright, they are government property; see http://www.arl.org/info/frn/gov/Susman.html for a summary of the issues.

Project Participants

Stanford University Libraries Staff
- Michael Keller, University Librarian and Principal Investigator
- Vicky Reich, Director LOCKSS Program
- Tom Robertson, Technical Manager LOCKSS Program
- Chuck Eckman, Principal Government Documents Librarian and Head of the Social Sciences Resources Group

Government Document Partners

This project will involve eight government documents partner institutions representing a range of community interests. Six of the eight partners are Federal depository libraries with significant print collections and a strong interest in long-term preservation and local management of content and access. These six institutions include two regional depository libraries (Colorado and Minnesota) and four selective depository libraries (North Texas, CSU San Bernardino, US National Agricultural Library and Stanford). Although not a formal depository, the California Digital Library (CDL) represents and supports a network of twelve selective depositories. The CDL is currently investigating technical models that would support long-term preservation of and access to government information. The National Agricultural Library (NAL) is one of four national libraries. NAL is managed by the US Department of Agriculture and plays a unique dual role as both an active "publisher" of government information and a "library". The US Government Printing Office (GPO) will serve in the critical role of agency sponsor for this project. GPO is the Federal agency designated by statute to manage the Federal Depository Library Program. The GPO has in recent years begun developing and managing electronic government documents on in-house servers, with access to this Electronic Collection provided via the GPO Access interface. As a result of these roles, GPO is uniquely positioned to explore legal and technical aspects of extending the LOCKSS model to support distributed archiving of electronic government documents. All parties will be active participants in the project, helping to define the technical requirements for adapting LOCKSS technology to government documents content as well as identifying the legal and community issues involved.

California Digital Library * Patricia Cruse

Stanford University Libraries * Chuck Eckman

US National Agricultural Library * Evelyn Frangakis

US Government Printing Office * George Barnum

University of Minnesota * Amy West

University of North Texas * Cathy Hartman

California State University San Bernardino * Jill Vassilakos-Long

University of Colorado * Tim Byrne

St. Charles City-County Library District * Anna Sylvan

Other Relevant Projects

Other relevant projects (CDL, other NSF, Mellon, CRL, OCLC/GPO). The Stanford Libraries staff are aware that other players are looking into development of solutions in the area of government documents. According to the 1 July 2002 "Proposal to review technologies for acquiring, assembling into meaningful research collections, and persistently managing the web-based documents of the US Federal and State Governments," submitted to the Andrew W. Mellon Foundation:

[The CDL project] will also be conducted in close co-ordination with a number of complementary initiatives including those led by the Center for Research Libraries, Stanford University, the San Diego Super Computer Center (SDSC) with the InterLib consortium led by Stanford computer scientists, OCLC, and the Library of Congress, respectively.

Ours is a parallel development, informed by and in communication with the CDL team and others, based on quite different technological models. We have initiated a dialogue (as of late July 2002) with CDL to explore areas of common interest and to share issues and challenges. Similarly, the Government Printing Office has announced a collaboration with the Online Computer Library Center (OCLC) to explore vehicles to manage electronic documents. Given the key players, SDSC and OCLC particularly, we can safely predict that their approaches will be oriented toward small numbers of large databases, centrally managed, as opposed to the LOCKSS model of many decentralized instances of smaller systems. Through LOCKSS's different philosophical, social, and technological models, we intend, at a minimum, to provide a different perspective on the problem set and to contribute to a broader, possibly more future-oriented, analysis of the problem space and the palette of available solutions to it.

About LOCKSS

The LOCKSS (Lots of Copies Keep Stuff Safe) project was initiated in October 1998. Based on Java[tm] technology and Linux, the LOCKSS system is an open-source, easy to use, distributed system, which runs on low-cost computers without central administration. Designed as an Internet appliance, the LOCKSS system preserves access to authoritative versions of web-published materials, applying contemporary automation to the old idea of preventing loss by multiplying copies. The PC runs an enhanced web cache that collects new issues of the e-journal and continually compares its contents with other caches on other participating computers. If files have been corrupted or altered, they can be repaired or replaced with intact copies from the publisher or from other caches.
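The compare-and-repair idea described above can be illustrated with a minimal sketch. It is not the LOCKSS polling protocol, which is more elaborate and is not reproduced here; it assumes a simple majority vote over SHA-256 digests, ignores the reputation system entirely, and uses hypothetical names (poll_and_repair, library_a, and so on) chosen only for illustration.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    """Return a SHA-256 hex digest so copies can be compared by hash."""
    return hashlib.sha256(content).hexdigest()


def poll_and_repair(caches: dict[str, bytes]) -> tuple[dict[str, bytes], list[str]]:
    """Compare each cache's copy of one document and repair the dissenters.

    `caches` maps a hypothetical cache name to its copy of the document.
    Returns the repaired caches and the names of the caches that were repaired.
    Assumption: a strict majority of caches hold the intact copy.
    """
    votes = Counter(digest(copy) for copy in caches.values())
    winning_digest, count = votes.most_common(1)[0]
    if count <= len(caches) / 2:
        # Without a strict majority this sketch refuses to guess.
        raise ValueError("no majority agreement among caches")

    # Any copy matching the winning digest can serve as the repair source.
    good_copy = next(c for c in caches.values() if digest(c) == winning_digest)
    repaired = [name for name, c in caches.items() if digest(c) != winning_digest]
    fixed = {name: good_copy for name in caches}
    return fixed, repaired


# Example: three caches hold one issue; one copy has been corrupted.
example_caches = {
    "library_a": b"Vol. 1, Issue 2: full text",
    "library_b": b"Vol. 1, Issue 2: full text",
    "library_c": b"Vol. 1, Issue 2: full t3xt",  # corrupted copy
}
example_fixed, example_repaired = poll_and_repair(example_caches)
```

In this sketch the corrupted copy at library_c is replaced with the copy held by the other two caches. Comparing digests rather than full content keeps the comparison cheap, which matters when many low-cost machines check one another continually.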

The LOCKSS program is currently in a worldwide beta test focused on integrity, usability, and software performance, including impact on network traffic. The beta software has been released as open source, and is available on www.sourceforge.net. More information about LOCKSS and the beta test can be found at http://lockss.stanford.edu

In summer 2002, the Andrew W. Mellon Foundation and the National Science Foundation (NSF) independently awarded two new, two-year grants totaling almost $3 million to the LOCKSS Program.

Currently, a total of 42 publishers and 56 libraries, including the Library of Congress, are testing the system to protect the integrity of, and maintain permanent access to, valuable electronic data. The LOCKSS system makes it feasible and affordable, even for smaller libraries, to preserve access to the e-journals to which they subscribe, and to safeguard their community's access to them. Individual libraries can also monitor the level of redundancy within the system.
