SCA: Going Digital: Less Process, More Content

Happy Saturday! Time for session notes from the first talk of the day. Let’s talk about digital materials and processing. Allons-y!

Moderator: Leilani Marshall (Sourisseau Academy for State and Local History, SJSU)

Speakers:
Paula Jabloner (Computer History Museum)
Russell Rader (Hoover Institution Archives, Stanford University)
Lisa Miller (Hoover Institution Archives, Stanford University)

Topic: Ways in which we can apply “More Product, Less Process” (MPLP) in the digital realm. The speakers will share two case studies.

Lisa Miller (Hoover Institution Archives)
Hoover is well funded compared to most archives, but still lacks some tools for digital processing and preservation. She shows a wish list: a digital repository system, dedicated IT staff, a computer programmer, a DAM system, tools for METS, PREMIS, etc.

Available resources: PC and Mac computers, floppy disks, server space, some staff time, and an eager researcher for the Katayev Collection (2007), which spurred the archives to create basic processing procedures. The collection mostly had Web 1.0 files (Word files, non-interactive media, etc.). Researchers just wanted content; they aren’t as concerned with authenticity issues (diplomatics) as archivists are.

Basic steps:

  1. Find computer media: look for media in collections via finding aids and the catalog. There is no standard way of indexing media, so serendipity plays a role in finding it.
  2. Get files off the disks. Scan for viruses.
  3. Use checksums for file integrity (MD5 checksums); these can also be used to de-duplicate the collection. Verify checksums whenever files are moved or duplicated. (See the checksum sketch after this list.)
  4. Preserve files with unaltered bits and with the author’s filenames intact. Sometimes convert to target formats (.txt, PDF, PDF/A, delimited text for spreadsheets and databases). Try to do as much batch processing as possible. Add a prefix to filenames to delineate converted files.
  5. Centralize files in one place on a server. Verify checksums regularly and do backups on tape. Manually initiate checksum verification each month.
  6. Document work with a “Read Me” text file. (Nice idea.) It explains the processing steps: unstructured metadata. (See the Read Me sketch after this list.)
  7. Use the creator’s semantic folder system. Researchers can use the files on site at the Hoover Institution; they are not available online because of copyright issues, etc.
  8. Describe the aggregate in the finding aid. Put the finding aid on OAC, even if it is just a stub record without the rest of the collection having been processed. Description is based on the creator’s file structure and naming conventions. Still trying to figure out meaningful ways to describe the content and extent.
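
To make steps 3 and 5 concrete, here’s a minimal sketch of how I imagine the checksum work could be scripted in Python; the manifest filename and CSV layout are my own illustration, not Hoover’s actual tooling.

```python
import csv
import hashlib
from pathlib import Path

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file, reading in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(collection_dir, manifest="checksums.csv"):
    """Record an MD5 for every file in the collection (step 3).
    'checksums.csv' is an illustrative name, not an actual convention."""
    collection_dir = Path(collection_dir)
    with open(collection_dir / manifest, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["relative_path", "md5"])
        for path in sorted(collection_dir.rglob("*")):
            if path.is_file() and path.name != manifest:
                writer.writerow([str(path.relative_to(collection_dir)), md5sum(path)])

def verify_manifest(collection_dir, manifest="checksums.csv"):
    """Re-check stored checksums (the monthly verification in step 5)."""
    collection_dir = Path(collection_dir)
    problems = []
    with open(collection_dir / manifest, newline="") as f:
        for row in csv.DictReader(f):
            path = collection_dir / row["relative_path"]
            if not path.exists():
                problems.append((row["relative_path"], "missing"))
            elif md5sum(path) != row["md5"]:
                problems.append((row["relative_path"], "checksum mismatch"))
    return problems
```

Duplicate MD5 values in the manifest also flag candidates for de-duplication, and re-running verify_manifest after a move or copy covers the “verify when you move or duplicate” part of step 3.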
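
And a sketch of the “Read Me” idea from step 6: just a plain-text file dropped into each processed folder. The fields and the “conv_” prefix are my guesses at the kind of unstructured metadata they described, not Hoover’s actual template.

```python
from datetime import date
from pathlib import Path

# Hypothetical template; real Read Me files would say whatever the
# processing archivist needs to explain.
README_TEMPLATE = """\
Read Me - {collection}
Processed: {processed_on} by {archivist}

Source media: {source_media}
Processing notes:
{notes}

Converted files carry the prefix "{prefix}"; files without the prefix
retain the creator's original filenames and unaltered bits.
"""

def write_readme(folder, collection, archivist, source_media, notes, prefix="conv_"):
    """Drop an unstructured 'Read Me' into the processed folder (step 6)."""
    body = README_TEMPLATE.format(
        collection=collection,
        processed_on=date.today().isoformat(),
        archivist=archivist,
        source_media=source_media,
        notes="\n".join(f"- {n}" for n in notes),
        prefix=prefix,
    )
    Path(folder, "ReadMe.txt").write_text(body, encoding="utf-8")
```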

Problems:

  • Viruses: they stop doing anything with the file.
  • Unformatted disks
  • File extensions are lacking
  • Filenames don’t have any meaning, a particular problem with digital camera photos.
  • Corrupted files.
  • Character encoding problems, especially with data from other countries.
  • Scalability: of unstructured metadata in Read Me files, of workflows for hundreds of media items, and of Web 2.0 formats (complex formats).

Ending thoughts: not an ideal process, but files are recovered and can be used by researchers. “Preservation is for five years or forever, whichever comes first.” In the future, they want to make this part of the regular collection processing workflow, create truly compliant PDF/A files, establish a quarantine station, find digital tools to facilitate/expand the workflow, and optimize file delivery for researchers.

Russell Rader (Hoover Institution Archives)
Digital projects exceed our reach, and Rader posits that we stopped asking the right questions. (What are the right questions?) He also believes that archivists are still afraid of “the digital.”

He talks about keeping workflows simple, which is a good idea. Using open source and free programs and tools is also a good idea. Archivists need to learn more technology.

Paula Jabloner: Welcome to Nerdvana
Started at the Computer History Museum in 2004 and needed to get stuff online; there are now over 80,000 records online. The museum has a “get it done” attitude. Everyone was for online access because if it isn’t online, it’s a “black hole.” They concentrated on the doable, not the perfect: one catalog for all artifact types (physical objects, software, A/V, and digital files) and a simple, seamless online experience (so you get an easy search process, though it may not be exhaustive or authoritative). The goal is broad-based access, not an interpretive catalog.

The idea behind quick-and-dirty processing is to make material available ASAP. They put a lot of trust in the audience, because the audience is highly technical, and expect it to understand the content of the records. They also used a lot of volunteers and interns for creating the catalog.

Implementing MPLP: a two-year processing experiment with one full-time processing archivist supervising interns and volunteers; 12,500 folder-level records were created by the end. Metadata entry was stripped down: they set it up so almost everything could be entered automatically, and records could be duplicated to speed up processing too. (See the sketch below.)
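
Here’s a rough sketch of what “almost everything entered automatically” might look like, assuming folder titles arrive in a spreadsheet and get merged with per-collection defaults; the field names and collection title are invented for illustration, not the museum’s actual catalog schema.

```python
import csv

# Hypothetical per-collection defaults; in practice these would mirror
# whatever fields the catalog requires.
TEMPLATE = {
    "collection": "Example Computing Papers",
    "level": "folder",
    "access": "By appointment",
    "processed_by": "intern/volunteer",
}

def folder_records(titles_csv):
    """Duplicate the template for each folder title, so only the title
    (and anything that differs) has to be entered by hand."""
    with open(titles_csv, newline="") as f:
        for row in csv.DictReader(f):
            record = dict(TEMPLATE)  # copy the shared defaults
            record["title"] = row["folder_title"]
            # Any extra columns in the spreadsheet override the defaults.
            record.update({k: v for k, v in row.items() if v and k != "folder_title"})
            yield record

if __name__ == "__main__":
    for rec in folder_records("folder_titles.csv"):
        print(rec)
```

The point is that the intern or volunteer only types what varies from folder to folder; everything else is duplicated from the template.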

Finding aids are available on the website and on OAC. There are no open hours at the museum; all research use of materials is by appointment. The finding aids are very stripped down, with not a lot of context and minimal access points. There is always a trade-off between speed of processing and description on one hand, and many access points with a contextual finding aid on the other. 70% of the collection is now available online via catalog records.

Successes: 16 finding aids online (the entire archival collection is in the catalog), 32,000 searchable catalog records, 575,000 page views in a year, and 450,000 catalog page views in a year. However, records can be confusing, searching could be more user-friendly, there are too many databases to manage, etc.

Take Home Message
Processing and preservation of digital materials is difficult. You can speed up processing, but you will lose extensive metadata creation and some ability to scale the process (for example, scaling text “Read Me” files). I’m conflicted about MPLP: I want more stuff online and available, but I don’t think there will be time to go back and reprocess, so will this minimal processing and metadata creation be a detriment in the future? Or does it not really matter, as “digital preservation is for five years or forever, whichever comes first”?