Forensics, Privacy, Security

Last session before closing keynote. On to the summaries of forensics, privacy, and security!

Questions posed: What is the proper boundary between public and private data? How far should archivists go in collecting what might be private data?

Session moderated by Elizabeth Churchill (Yahoo! Research)

Archival Applications of Digital Forensics Tools and Techniques or Why I started reading Forensic Cop Journal
Kam Woods (University of North Carolina)

Parallels between forensics and archiving: case files and archival packaging; exploiting private data to support criminal prosecution versus identifying private and legally encumbered data to redact/protect; etc.

Acquisition
Archives increasingly have to deal with ingesting heterogeneous fixed and removable media. Need to ensure reliable data extraction and reduce hidden risk. Need to know what you’re given. Want to establish “ground truth” about what is on the media. Looking at residual and system data.

Handle issues of privacy via forensics formats and techniques such as cryptographic hashing and unique identifiers. Working with the bulk_extractor tool to process data with proactive detection and decompression, stream processing during disk imaging, and parallelized processing. Also using fiwalk, which creates DFXML from disk images using SleuthKit, creates Dublin Core metadata for files, and does file-level hashing. Tools, APIs, etc. available at afflib.org.
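To make the file-level hashing idea concrete, here is a minimal sketch in Python. It is not fiwalk or bulk_extractor, and the function names (`hash_file`, `hash_tree`) are made up for illustration. It only shows the general technique: compute a cryptographic digest for every file in a folder so that later copies can be checked against it.

```python
import hashlib
from pathlib import Path


def hash_file(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of one file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks so very large files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root):
    """Map each file under root (by relative path) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hash_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

Running `hash_tree` on the same folder before and after a transfer and comparing the two dictionaries is one simple way to confirm that nothing changed along the way.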

Developing other projects mentioned in earlier session: BitCurator and Realistic corpora for archival education and training.

The Personal in the Organizational: Value and Ethics
Sam Meister (University of Maryland)

Discussion of the issues and implications of personal data embedded in the records of failed businesses. Framing privacy as an ethical matter. Part of a larger research endeavor.

Sherwood is a restructuring firm and offers a private bankruptcy option. Sherwood becomes the new owner of the company and winds the company down (managed liquidation). Therefore it also has a lot of private data and records.

Records of start-ups are messy. No records management. “These small organizations are like big people”: both are a bit messy and disorganized. Lots of personal, private data goes to Sherwood when Sherwood is given company.

No comprehensive legislation when it comes to preserving personal data. Therefore it is an ethical issue. Can look at Codes of Ethics: Privacy statements from Society of American Archivists and International Council on Archives.

Strategies/Questions
Looking at selection and appraisal: How can we collect these records? Difficult to know how much personal information is located in the records and where it is in the files. Options: redaction, not collecting employee records: each method has downsides.
Access: How do we give access to the records? How do we keep the private data private?

Transition of private records to public. There is complexity of rights and ethical questions about access and privacy (or the tension between the two). All about maintaining trust.

Take away: Many issues to think about in regards to forensics, privacy, and access. More questions than answers at this point, but definitely learned more about valuable projects furthering our understanding and practical ability to deal with the records coming to our archives.

Personal Health Data Panel

Notes on Personal Health Data Panel. Allons-y!

Panelists: Dave Marvit (Fujitsu Laboratories of America), Gordon Bell (Microsoft Research), Linda Branagan (Telemedicine Products, Medweb), and Khaled Hassounah (MedHelp)

The Quantitative Self aka Quantified Self (QS)
Gordon Bell

Started MyLifeBits over 10 years ago. One use of SenseCam: capturing health data. Challenges: privacy and an entrenched, structured growth industry. Scanned everything he had in regards to health documents. You can record and keep lots of personal health data digitally now. Need to do wellness monitoring. Recommends a pedometer.

Bringing Personal Health Archiving to the Masses
Khaled Hassounah

MedHelp is the largest online health community (12 million users monthly), leading provider of PHRs and health applications, over 300 active condition specific communities and forums, partnerships with leading medical institutions, over 200 experts responding to users’ questions, and live chats with medical experts.

Community is very important to MedHelp. Three years ago decided to do Personal Health Records (PHR), have patients involved. Built it and no one came.

Why didn’t they come?
People are more interested in managing their health or a medical condition; records and archives are not relevant to most of the population most of the time; and users wanted to share, but privacy is selectively necessary. People want to decide what they want to share and what they want to keep private. To build community, you need to be able to share.

Need to give people something that is relevant to them, right now. Give them a tool they can use now. For example, Birth of a Tracker: Ovulation/Fertility Tracker. Very popular tool and other communities wanted trackers too. Key: need to have instant relevancy and benefits for the people in order for it to be popular. Other trackers created: Mood Tracker, Sleep Tracker, Pain Tracker, etc. MedHelp has over 50 trackers now.

Make it really easy for people to decide whether they want to make the tracker information public or private. Make it obvious for people instead of hiding options (yay!). 85% of the people choose to make their trackers public.

What We Learned:
Have to focus on the activity; records and archives are foreign concepts; sharing is important and privacy should be an option; and the question that is most important to people is “Am I normal?”

Health Data
Linda Branagan

Electronic Medical Records (EMR) are maintained by your doctors (aka your chart). Governed by HIPAA and stored by healthcare provider.

Personal Health Records (PHR)

  1. Type 1: Patient owned and operated. Online record of interactions with all your healthcare providers. Not covered by HIPAA. Example is Google Health.
  2. Type 2: Tethered PHRs (aka “patient window”). Probably not details of your physician encounter, still stored and maintained by healthcare providers, you may have a separate one for each provider, and may be provided by your insurance company. HIPAA applies.
  3. Type 3: PHRs. Self-collected data store, created by you, often stored on vendor websites, might incorporate or access via a Type 1 PHR, and some home health monitoring devices will deliver data to a physician’s EMR.

PHRs are not universally embraced by healthcare providers. Worry about correctness, liability, and usefulness. Not universally rejected either because: PHR-using patients are more likely to be participative, engaged, and compliant to treatment, may help avoid duplicate diagnostic tests, can help coordination during a complex episode of care (ex. having difficult diagnosis or if you have a chronic condition and an acute disease/illness), and can assist family/friend advocates.

Take away: First, this was not really a panel at all. This session consisted of three separate presentations and no interactions among the panelists. Interesting information about how people create and use personal health data online. I would have liked to hear discussion among the panelists and have a dialogue with the audience.

Teaching, Professional Development & Theory

First session after lunch. Time to talk about teaching and how it relates to personal digital archiving. Let’s get into the nitty gritty.

Digital Forensic Training
Cal Lee (University of North Carolina)
Forensication: The incorporation of digital forensics methods, tools, and concepts in contexts other than criminal investigations.

Forensication of Archives: recovering data when technology fails, capturing evidence from places that are not always immediately visible, ensuring that actions don’t make irreversible changes, attending to order of volatility, documenting what we do so others will know what we might have changed, and taking advantage of the information associated with files to ensure that users of the files understand their context of creation.

Collecting institutions are getting removable media and want to collect the online traces of individuals. The digital forensics field provides training and tools, but they are primarily focused on law enforcement.

Example: School of Information at UNC Chapel Hill
Created lab for learning about the application of digital forensics to the acquisition of digital materials. Check out digitalcorpora.org

Want to build the capacity at UNC and translate industry models and techniques to the archival world. Lots of questions to answer about: how to apply the tools, how much adaptation is required, and what software is most useful. Also looking at ethics of access: can’t be avoided because users can exploit forensic methods, even if we don’t.

Vision: widespread incorporation of forensic methods into routine processing of archival materials. BitCurator–a modular software environment that implements various batch processes on bitstreams to support two contexts: established forensic programs at institutions and those institutions/individuals getting started with digital forensics.

Personal Digital Archiving, the Diminishing Information Age, and the Archival Paradigm
Richard Cox (University of Pittsburgh)

Big picture context type of discussion. We are so immersed in technology that we are not listening to each other. Need to see how projects connect to each other and what are the practical applications.

People are losing confidence about being able to access their content, so information is diminishing. This is also why people are interested in personal digital archiving. Worried about losing information with the transition to an online/digital way of doing things. Cox’s example: ebooks as “ghost books” versus the physical book.

Problems: libraries closing, losing browsability, end of slow reading, students don’t know how to read and think critically, disappearing bookstores, declining newspaper sales and end of journals, worried about authority of news online, and library and information schools changing/transitioning to iSchools.

Archival paradigm needs to change to have archivists become enablers of others to be able to curate/archive their own data (personal archiving). People are worried about losing their data.

We need to think more deeply and broadly about digital archives and collaborate with each other.

Archival Sense-making: Personal Digital Archiving as an Iteration
Mark Matienzo (Yale University Library) and Amelia Abreu (University of Washington)

Frame personal digital archiving within the context of appraisal and archivalization, examine the contexts of archival sensemaking and identity creation.

Archival sensemaking is a situated action. Archivalization is a conscious or unconscious decision process about whether something is worth archiving. Sensemaking is a theoretical guideline for the analyses of this study. They have taken sensemaking from other disciplines and are drawing heavily on Brenda Dervin’s work.

How does sensemaking take place in personal digital archiving?
Collecting as meaningful negotiation. Also looking at context. Looking at archival genres (influenced by Derrida): collections and spaces where you can dwell on text and create new materials.

While sensemaking may be a promising framework in archival research, there are limits to using sensemaking as a theoretical framework. (This is true of most imported theories, but it is great that these researchers are explicitly documenting the limitations.)

Take away: Library and information science education is changing and should change. We need to collaborate more and break down the silos among our projects. Theories from allied fields may be imported for archival research successfully, but we must be aware of the limitations. Final thoughts? Interesting things happening at graduate schools and we need to figure out how to share information in a more efficient and meaningful way.

Perspectives from Computer Industry Founders

Session on perspectives from computer industry founders. Fingers getting tired from typing, but we will carry on. To the notes!

Ted Nelson (Xanadu)

Considers himself the only dissenter in the computer industry. Started Xanadu in 1960. It was easy to create your own computer world in 1960 because no constraints. Worst problem now is the myth of technology. Most of what people consider technology are constructs and conventions.

Talking about lack of marginalia in digital documents. (I find this personally hilarious because Collin and I were talking about this issue on the way to the conference this morning. And we talked about how you can still do marginalia digitally and will hopefully be able to do more when things such as NoteSlate come out.) Nelson is talking about his idea for creating documents that have connections to show marginalia.

Need to represent connections. (Totally agree. Life is about connections because we are social creatures.)

Scholars Building a Personal Archive for Scholarly Use
Ed Feigenbaum (Stanford University)

Talking about SALT: Self Archiving Legacy Toolkit
Self= Probably means Professors Emeriti, especially those with archives worth preserving for scholarly use + DIY with only a little help from professional librarians
Toolkit=webpage formats and software to facilitate DIY

SALT’s JANUS Approach: two faces, looks outward to give access to researchers and students of today and years from now and other face uses Zotero to facilitate the work of scholars doing their archive building and enrichment

The two faces talk to each other on a regular basis. Need to sync between Stanford Digital Repository and Zotero cloud servers.

SALTworks is the name of the experimental system at Stanford. It supports full text search over the entire Feigenbaum digital archive. It contains 15,000 documents. It has users already, even though it is still experimental.

Learnings from a life’s work: The Doug Engelbart Archives
Christina Engelbart (Doug Engelbart Institute)

About her father’s archives. Doug Engelbart started research lab and created computer software, the computer mouse, and more. Definitely a computer pioneer. Came up with lots of innovations and terminology.

Saved a lot of materials for the archives. Lots of archiving happened in real-time because archiving function was built into their computer programs.

Then, re-archiving by placing the information on the web. First website was created in 1995. Had already given a lot of documents to Stanford beginning in the 1980s. The material is housed in many different places online: Stanford, Computer History Museum, and the Internet Archive.

Lots of work always to do. Connecting technology to the vision is very important.

Take away: I must be distracted because I’m hungry for lunch as I don’t have an overarching take away from this session. Basically, think about what you are doing and how you might archive it…eventually. Back with more after lunch.

User Studies and User Behavior

Next up: User studies: careful observation of archival practices reveals some surprising things about user behavior. To the session notes: Allons-y!

CTRL-S is Poor Archival Practice
Devin Becker (University of Idaho) and Collier Nogues (University of California, Irvine)

Did a study of writers via an online, open-ended survey; about 100 people responded. Writers serve as a sort of focusing agent for the field: increased value assigned to digital files by writers themselves and by archival community. 75% of the respondents were poets, 77% had published one or more books.

Why is this an important issue? Because people don’t save earlier drafts of their works. So you can’t see earlier drafts/versions like you can in, say, the Ernest Hemingway Collection at the JFK Library.

53% claimed to save over their files primarily, but only about 20% always did this. 35% saved drafts all in one file. 9% only saved drafts as printouts and have only one digital file.

Only 8% work exclusively digitally; most work in both paper and digital formats. Many have very strong views about what points of their workflow they use digital versus physical to do their work. “There is really no feeling of management whatsoever when it comes down to it.” People save things everywhere (not surprising to archivists).

Only 7% admitted to never backing up their files. Over 70% said they back up at least monthly. However, this backup is not always done really well. Most back up on external hard drives.

Implications
Benign neglect does seem to be these writers’ basic curatorial mode. People have a fear that electronic files all look alike unlike manuscript drafts. Anxiety about confusing files because they look the same. Writers are more anxious about the management of their files than they are about losing their files.

Recommendations for Archivists
Don’t meddle too much with writers’ files
Meddle a little: 80% would be interested in receiving information about recommended digital archiving practices
Propose: Writers’ Digital Preservation Awareness Week (Why don’t writers just participate in ALA’s Preservation Week? It’s coming up–April 24-30)

File Folders on Computers in Personal Digital Archiving
Hong Zhang (University of Illinois)

Talking about the filing systems people use on their computers, which can be seen as organic archives created by people. More hard drives coming to archives with lots of digital files. How do we decipher these files?

Methodology: multiple case studies with 12 participants, two rounds of interviews, disk scan, re-finding task observations (part of Zhang’s dissertation work)

How do people archive their files?
Explicitly indicate archives folders via folder names, for example: “archive”
Implicitly indicate archives folders via dating folders, for example
Keeping the original structure when archiving because used to the structure and no motivation to change it when archiving because won’t be using the information again
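The explicit/implicit distinction above could be sketched in a few lines of Python. This is only an illustration, not Zhang’s method: the function name `classify_folder` and both patterns are assumptions (the word “archive” in the name for explicit folders, a four-digit year starting with 19 or 20 for dated folders).

```python
import re

# Assumed pattern: the folder name contains "archive" in any case.
EXPLICIT = re.compile(r"archive", re.IGNORECASE)
# Assumed pattern: a standalone four-digit year such as 1998 or 2009.
DATED = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def classify_folder(name):
    """Return 'explicit', 'implicit', or None for a single folder name."""
    if EXPLICIT.search(name):
        return "explicit"
    if DATED.search(name):
        return "implicit"
    return None


if __name__ == "__main__":
    for folder in ["Archive", "2009_taxes", "Projects"]:
        print(folder, "->", classify_folder(folder))
```

Folder names alone will miss the third case (an untouched original structure), which is part of why the interviews and re-finding observations matter.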

Relationships among files may be complicated and important or almost non-existent. This is an important idea to remember when trying to appraise, process, and archive personal digital collections.

Gmail is a Storyworld
Jason Zalinger (Rensselaer Polytechnic Institute)

We are all digital storytellers, historians, curators, etc. of our own lives. We are very good at capturing personal data, but we are not good at helping people make sense of it all. We are not good at encouraging people to explore their archives for self-reflection. When Gmail changes the interface, it changes your storyworld. Thousands of clues to our life stories are sitting in our archives.

User study: conducted six interviews, 3 male, 3 female, highly educated, ages 27-39, 3 via audio recording, 3 via IM, asked about their archives and about stories.

Findings and Design Recommendations

  • A Label Named “Forget”: everything a person wants to forget, but wants to archive. Design Recommendation: Forget & Remember labels built into Gmail. Pop-up message years later to read message and see if want to delete
  • Digital Regret: send emails that you regret later. Design Recommendation: Gmail has the “Undo” send button already. Gmail’s Mail Goggles makes you solve math problems in order to send emails (aka friends don’t let friends email drunk). Wants “Sleep On It”: sends email to your archive and then pop-up lets you re-read your email before sending it the next morning.
  • Characters: Conversation View (email threaded conversations). Design Recommendations: Storyfox would format your conversation thread to a Google Doc formatted to look like a screenplay or as a comic strip (Geomic)
  • How do you know what is meaningful? Design Recommendation: Gmail Meaning Labeler (crowdsourcing)
  • Design Recommendation from Interviewees: word clouds for email

Note: design recommendations are at the conceptual stage and Zalinger hasn’t created them.

Cognitively Motivated Lifelog Software
Aiden Doherty, Cathal Gurrin, Alan F. Smeaton (CLARITY: Centre for Sensor Web Technologies, Dublin City University)

People have talked about personal life archives for years. People have taken this further and created weird technologies to capture their life. However, the researchers use wearable sensors: SenseCam is a Microsoft Research Prototype, now the Vicon Revue: contains a camera and various sensors, GPS, Bluetooth and takes about 5,500 photos per day. Researchers have their own smartphone App: integrates all sensors, can connect to external capture devices, and uploads to a server in real-time.

What is an e-memory archive?
“We use sensors to capture and understand life activities.” Lots of information via the information captured by the sensors. (That’s a lot of data to mine) Don’t record audio because people stopped talking to the researchers. 4.5 years= around 7 million photos.

In one year: 12,500 events or moments, 20 million accelerometer and temperature and compass readings, 2.3 million GPS points, 25,000 unique Bluetooth encounters (wow!)
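A quick back-of-the-envelope check of the figures above, as a small Python sketch. The 5,500 photos per day, 4.5 years, and 7 million total come from the talk; the 365-day year and the variable names are my assumptions. The totals need not match exactly, since a wearable camera is presumably not worn all day, every day.

```python
PHOTOS_PER_DAY = 5_500     # from the talk: approximate daily capture rate
YEARS = 4.5                # from the talk: length of the archive
REPORTED_TOTAL = 7_000_000 # from the talk: approximate photo count
DAYS_PER_YEAR = 365        # assumption: ignore leap days

days = YEARS * DAYS_PER_YEAR
# Total if the camera captured at the full daily rate every single day.
full_time_total = PHOTOS_PER_DAY * days
# Average daily rate implied by the reported total.
implied_daily_average = REPORTED_TOTAL / days
# Share of the full-time total that the reported figure represents.
implied_wear_fraction = REPORTED_TOTAL / full_time_total

print(f"Full-time total: {full_time_total:,.0f} photos")
print(f"Implied daily average: {implied_daily_average:,.0f} photos")
print(f"Implied share of full-time capture: {implied_wear_fraction:.0%}")
```

Either way, it is millions of images per person, which explains the push for better search.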

Want to build search engines for these e-memory archives because visuals are powerful memory clues. Great for remembering different parts of one’s life. Make search engines based on cognitive science. Biomimicry of how human mind stores and organizes memory to model for the search engines. (wow, again) Can determine unique events and moments out of the mundane and then finally display in the browser.

Applying 12 years of video/image search experiences showed many different axes of retrieval for information. Designed initial browser 4 years ago, larger images are more important, and some search functionalities. Then designed a new browser with more flexible search options. Newer browser is much better at finding events, but still at 2 minutes for retrieval. Need to think of new ways to tackle challenge of efficient and fast searching.

Take away: Users are idiosyncratic in their use and creation of digital files. This is not surprising, but kind of sad, for archivists–it means a lot of work to decipher the information when it comes to the archives. (Yay for job security, though.) Lots of data being created and need ways to search and display it visually. Very interesting session, especially the information about lifelog search engines.

Images: Capture and Collection

First morning session: “Images: Capture and Collection.” On to the session notes!

What is everyone doing with all these cheap cameras?
Daniel Reetz (DIY Book Scanner)

Created own book scanner and shared instructions online: diybookscanner.org. Lots of people are using these scanners for great projects all over the world. Cheap cameras can change the world, can liberate information and help others share information with others.

Cheap cameras are very cheap. The cheaper a camera is, the harder it is to control. Cameras define our aesthetics. Photographs have become the basis of our memories. Aging of photographs as aging of memory.

Cheap cameras are everywhere. The most common camera in the world is in your computer mouse. Color in digital images is calibrated to what is most liked by people, not by math. It is what sells that defines the color settings in the camera. People like saturated photos and sharper images. (Interesting and scary at the same time) Lots of processing done within the camera before you ever get the image into Photoshop.

We can’t trust photos like we might like (this is not a new idea). “Consumer preference undermines control.” Technology is affected by desire and fantasy. People don’t want to show reality–they want to show idealized world.

Need to construct tools that help us determine how reliable the images are that we have in the archives. Lots of potential for use of cheap cameras and digital photos. Need to show people how to do more with their cameras.

The Center for Home Movies Digitization and Access Summit
Dwight Swanson (Center for Home Movies)

Talking about Summit at Library of Congress Packard Campus in September of 2010 (46 attendees). Problem addressing: limitations on access to home movies have resulted in limitations to our understanding and use of them. Want a way to easily find home movies online and way to upload/access home movies online.

Where are home movies online now?
YouTube, Internet Archive, regional film archives, and film transfer companies. Center for Home Movies has an arrangement with Internet Archive for their home movies. Regional film archives have historically been the most active in collecting and providing access to home movies, but have been restricted by budget.

Challenges
What would we need to do and spend in order to implement a mass digitization and web portal project involving home movies and video from both public and private collections–getting them online for free public access?
What impact would the availability of these collections have on their use and analyses?

Summit Topics (the final report is available for download)

  1. Taxonomy of home movies. LoC wanted a taxonomy: definitions, genres and tropes
  2. Cataloging and description: metadata structures and management as well as crowdsourced tagging. Coming up with list of terms and fields needed to be included when describing home movies
  3. Legal issues: documents created for terms of use, privacy, takedown policies and had discussion of rights issues of orphan films
  4. Technical issues: comparison of film digitization systems, recommended technical standards, different workflow scenarios
  5. Use and users: scholarly users (why do home movies matter?) and commercial users (who are the people using home movies and what are they looking for?)
  6. The Film Collectors’ Community: perceptions of value of home movies due to companies such as eBay and engagement with collectors

Lingering questions from the Summit:
Who would be the primary users of a home movie portal?
What could it do that YouTube and the Internet Archive can’t already do?
What types of media do we want to deal with?
What is the relationship between preservation and access?
What form should the project take?

Archiving Space: Capturing personal and shared spaces with explorable gigapixel imagery
Rich Gibson (Gigapan Project)

“The world is the set where we live our lives.”

Gigapan allows users to upload photos and pan/zoom throughout the panoramic images. Software stitches the images together.

Spaces are changed and images allow us to see these changes. Many programs allow us to explore these changes and spaces online.

“Explorable gigapixel images change the way we see.” (I think that photographs in general change the way we see. Taking photographs definitely changes the way we see, the way we compose our lives, and the way we constrain our world through the viewfinder.)

Gigapixel allows for different ways of curating art exhibits and displaying art. It can also be seen as a way of “archiving” transient, ephemeral activities and exhibits.

(You should check out the website–lots of very cool images to explore. I could see spending a lot of time on the site.)

Take away: Images are important to our memories, our lives, and our identities. We need to think critically about how we interact with these visual images and the people that care about the images. We should also empower people to capture images and to think critically about their own visual record of their lives. (I love photography so this was a very interesting talk to me, personally. Also, if you want a wonderful book that will have you thinking about many of the issues brought up in this session in greater depth, check out Susan Sontag’s Regarding the Pain of Others.)

Day 2 Keynote: Clifford Lynch

Happy Friday! It’s Day 2 of the Personal Digital Archiving Conference 2011. Time for the morning keynote by Clifford Lynch from Coalition for Networked Information (CNI).

Talking about some of the key issues on Lynch’s mind. We are moving into second generation understanding of personal archives. We can see tensions around this evolution. By the mid-90s, we had realized there was a revolution in personal archiving. We were taking ideas from personal archiving in the physical space into the digital. See problems about saving digital files on bad media, concerned with loss of information esp. via drafting documents online (who keeps different versions of drafts), worried about ephemeral correspondence. But everything was extrapolated from the ideas of personal papers.

Now we are seeing a problem because now there is a shared space online. Material that is shared by groups and made public in limited sense, such as contained social media networks. How do we relate this to personal archives in the earlier sense?

Everything is being shared online and we find that the shared versions have more value because of added comments. We also face a problem of ownership. Example: family archives. Need collective decision making process. Not “pure” individual archives. This can lead to confusion.

Lots of emphasis on what happens to your stuff after you die and about honoring interests of the individuals. But passage of digital objects becomes much less clear when in shared spaces. It is a collective issue.

Implications:
Changes in decision making: collective.
Shared spaces are a vulnerable platform. We’ll see more abrupt shutdowns of online spaces in the coming years.
Digital records are very vulnerable when individuals change jobs.
Platform migrations of all kinds in social settings are periods of peril/vulnerability for continuity of material. We need to think carefully about this issue.
Need to think about the length of relationships individuals have with a social platform. (very interesting point with emerging technologies)
How do these relationships with social platforms relate to length of relationships individuals have with memory institutions and archives? Need to figure this out.

Large scale of social media systems: LoC archiving Twitter. Need to have arrangements to preserve this massive amount of public information. We don’t understand this relationship in any complex way. Need to be thoughtful and understand these relationships and how to create these relationships to save this digital information.

Notion of public lives and a sense that some minimum record of information about an individual is held by many. We’ve built many systems to record and manage this type of information over the years. These are becoming much more open, connected, and extensive now. For example, look at the scale of online genealogy. Lots of moves to make information more transparent and more public. Need to think about how public, online social spaces interconnect with ideas of identity and societal relationships.

Question: What is a public part of a life? Do we have consensus? Not really.

What are actions that people can take that can become permanently public? How does this connect with public social spaces?

Many questions about how the individual and his/her information relate to the social setting and issues of public and private.

If we simply extrapolate the challenge from personal papers and shoehorn the development of the shared social spaces into this historic view, we will miss a tremendous amount of the complexity and issues (and potential solutions).

Take away: Personal digital archiving must be seen in a socially connected manner and we need to ask the difficult questions of how the individual relates to the social public spaces and their wishes about how their data is connected and viewed. Wonderful speaker, great ideas, fabulous talk, and a great start to Day 2!

PDA 2011: Day 1 Closing Keynote

Brian W. Fitzpatrick (The Data Liberation Front)

The Data Liberation Front wants to get people to think about how to get their data out of the cloud. The Data Liberation Front started in 2007. It’s very important that people have control over their data and that it is easy for them to take their data with them.

There are business benefits to making it easy for users to take out their data. You get users’ trust by doing something good for the users. It’s not altruism, but a long-term strategy.

Choice: It is easier than ever for users to choose your product and to leave your product.

Trust: You need to get the users’ trust in order to get their business for the long-term.

“Lock-in is not a business model.” It’s not good for users to not have control over their data. The Internet breaks all the distribution rules. It costs almost nothing. Now you get lock-in through innovation. Need to make product so good that the user doesn’t want (or need) to go anywhere else.

Most users don’t think about data liberation until the moment they want to leave.

Three questions to ask:
Can I get my data out?
How much is it going to cost to get my data out?
How much of my time is it going to take to get my data out?

Need a big download button to batch download your data. But there are issues: conversion issues, huge downloads, proprietary formats, and the largest issue: businesses that still try to lock people in.

APIs are only the first step. Many users can’t use APIs. There is still a lot of work to do with data liberation. Want to make it even easier.

Take away: Data liberation for the win! Spread the word to your family, friends, and library/archives users, and share the three questions to ask before giving a company your data.

Economics: What are the costs of Personal Archiving?

Last session. Let’s talk money!

Wishful Thinking
Jeff Ubois (PrestoCentre)

Issues: Predictability, Boundaries (commercial and non-commercial), Institutions/Individuals

Will always produce more than we have the resources to save.

Numeric was a study in Europe looking at scanning costs. Film is much, much more expensive to ingest than text pages. (Need to get study) Local contractor (price to store a box): $200 to $700. Storage costs vary widely.

NIH collects gene sequencing information and is going to deaccession data because the price of sequencing is dropping much more quickly than the cost of storage space.

Millions of dollars have been spent creating complex cost models for archives. There is a gap between project-based funding and the need for perpetuity of data. Real estimates: Princeton: $5,000/TB (100x media cost), PrestoPRIME: 40x raw media cost. Basis for “buy a brick”: endow a TB (interesting idea).
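A quick back-of-the-envelope sketch of what those multipliers imply. This is my own arithmetic, not something shown in the talk: the function names are made up, and applying the PrestoPRIME multiplier to the media cost implied by the Princeton figure is purely an assumption for illustration.

```python
def implied_media_cost(endowment_per_tb, multiplier):
    """Raw media cost per TB implied by an endowment figure and its multiplier."""
    return endowment_per_tb / multiplier


def endowment_per_tb(media_cost_per_tb, multiplier):
    """Endowment per TB if preservation costs `multiplier` times the raw media."""
    return media_cost_per_tb * multiplier


# Princeton estimate from the talk: $5,000/TB at 100x media cost.
princeton_media = implied_media_cost(5000, 100)

# ASSUMPTION: reuse that implied media cost with the PrestoPRIME 40x multiplier.
presto_style_endowment = endowment_per_tb(princeton_media, 40)

print(f"Implied raw media cost: ${princeton_media:,.2f}/TB")
print(f"Endowment at 40x:       ${presto_style_endowment:,.2f}/TB")
```

The point is only that the multiplier, not the media price, dominates the "endow a TB" number.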

Roles for Commerce: Ingest scales well, cataloging & indexing, but what are the long term promises? Partnerships: huge commercial uncertainties can mean harsh terms.

Digitizing archives is a way to engage the public and bridge individuals and institutions. Lots of room for collaborations.

Paying for long-term storage
David S. H. Rosenthal (LOCKSS Program at Stanford)

Business Models
Rent (Amazon), Monetize the content (Gmail selling adverts on accesses to it), Endow the data (sufficient capital up-front to pay for preservation)

Digital preservation is vulnerable to interruption of the revenue stream. Endowment provides a relatively predictable return, but need to figure out how much will be needed in the future. However, the endowment model rests on the assumptions that storage is the major cost of preservation and that Kryder’s Law (storage costs go down exponentially) will continue for at least another decade. So the endowment business model may not work. Storage costs go down, but associated costs are not going down as much (e.g., cooling, space, power costs).
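To see why the Kryder's Law assumption matters so much, here is a minimal sketch of an endowment calculation. It is my own simplification, not Rosenthal's model: it assumes a single annual storage cost that falls by a fixed fraction each year and an endowment that earns a fixed interest rate, and every parameter value below is hypothetical.

```python
def endowment_needed(annual_cost, kryder_rate, interest_rate, years):
    """Present value of `years` of storage costs.

    annual_cost   -- cost of storing the data in year 0
    kryder_rate   -- fraction by which that cost drops each year (0.0 to 1.0)
    interest_rate -- annual return earned on the endowment
    """
    total = 0.0
    for t in range(years):
        cost_in_year_t = annual_cost * (1 - kryder_rate) ** t
        # Discount back to today: money set aside now grows until it is spent.
        total += cost_in_year_t / (1 + interest_rate) ** t
    return total


# Hypothetical numbers: $100/year today, 2% return, 100-year horizon.
for rate in (0.40, 0.20, 0.10, 0.0):
    needed = endowment_needed(100, rate, 0.02, 100)
    print(f"Storage cost drops {rate:.0%}/year -> endow about ${needed:,.0f}")
```

The slower storage costs fall, the larger the up-front endowment has to be, which is why a slowdown in Kryder's Law undermines the model. The sketch also leaves out the associated costs (cooling, space, power) that are not falling as fast.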

Why Kryder’s Law Might Not Hold:
Desktop PC market is going away, next drive technology transition problematic, solid state disks.

You will get what you paid for: pay now, get service later (no leverage if service not delivered), need escrow service, if service fails, transfer data to successor.

All this means that estimating endowment need is very difficult. Also, there is a marketing problem if you have to tell people they need 70x the cost of storing raw data to have perpetual preservation of the digital data. So, once again, we have an issue with figuring out how to get a reliable revenue stream in the archives.

Internet Archive
Brewster Kahle

Cost of hardware = 20% of the total cost. Lots of the cost goes to people’s salaries. Luckily, increasing storage does not mean the same amount of increase in the number of people at the Internet Archive. If costs go down, expectations go up.

What helps us is that a petabyte is a lot of storage space. “So people may be running out of stuff.” And Kahle believes that preservation must be done in a non-commercial way. Non-profits last a lot longer than many corporations.

“Love the Data”
Preserve the data in a way that people care about and make it so people can get to the data easily. “Access drives preservation.” Dark archives are not a good idea. (Out of sight, out of mind.)

Three issues at the Internet Archive
Costs
Perception of Rights Issue
Therapy (ego stuff)

How much does it cost to digitize a box of stuff? $100-$750 per box, because there is a lot of variation in the type of stuff in boxes. You find a lot of random things in boxes. Video costs about $15 per hour to digitize and film costs about $300 per program hour. Books and microfilm = $0.10/page to digitize.
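For anyone who wants to play with these numbers, here is a rough estimator sketch using only the per-unit figures quoted above. The function names and the sample collection are hypothetical, and real projects will have setup and staffing costs that this ignores.

```python
# Per-unit rates from the talk, stored in cents to avoid floating-point drift.
VIDEO_CENTS_PER_HOUR = 1500    # about $15 per video hour
FILM_CENTS_PER_HOUR = 30000    # about $300 per film program hour
PAGE_CENTS = 10                # $0.10 per book or microfilm page
BOX_CENTS_RANGE = (10000, 75000)  # $100-$750 per box of mixed material


def media_estimate(video_hours=0, film_hours=0, pages=0):
    """Estimated digitization cost in dollars for known media quantities."""
    cents = (video_hours * VIDEO_CENTS_PER_HOUR
             + film_hours * FILM_CENTS_PER_HOUR
             + pages * PAGE_CENTS)
    return cents / 100


def box_estimate(boxes):
    """Low and high dollar estimates for unsorted boxes."""
    low, high = BOX_CENTS_RANGE
    return boxes * low / 100, boxes * high / 100


# Hypothetical personal collection.
print(media_estimate(video_hours=10, film_hours=2, pages=500))
print(box_estimate(4))
```

The wide per-box range is the interesting part: until you open the box, the estimate can be off by more than a factor of seven.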

Born digital: Have upload button on Internet Archive website, then they back-up and add metadata.

Costs $1-$2 million to start up scanning/ingesting a new type of media in order to build relationships, get hardware, adapt software, etc.

Really want to start digital archives project for individuals, working with personal archives. New avenue for the Internet Archive.

Perspectives on Funding
Steve Griffin (Library of Congress/National Science Foundation)

Need to ask whether research funding is keeping up with the way research is now happening. May need to change funding models. Need effort by scholarly researchers to get federal funding agencies to change models so they work for today’s scholars.

Take away: Very difficult to estimate costs of long-term digital preservation and it costs a lot so we have a marketing problem when soliciting funding. But we need funding, so we’ve got to figure this out. Also, economies of scale are very important and if you give people an easy way to upload their data (a la Internet Archive), people will upload a ton of stuff. So let’s keep positive and make the changes in funding structures that will allow us to preserve our digital data for the long-term.

Social Network Data: Making Sense of What's Online

A bit of confusion about where we were in the program. But we’re all good now, so let’s get into the session notes on social network data. Allons-y!

Open Standards for Social Data Exchange and Archiving
Evan Prodromou (StatusNet)

Talking about social network data and standards.

Classes of social data include: profile data (who user is, contact information, what user likes, etc.), social media (text, images, audio, video, polls, checkins, events, Q&A), social graph (record of relationships and connections), social curation (commenting, tagging, sharing).

Challenges to archiving social network data:
Most social networks have limits on what users can do to archive their own data. They have API access rules, winner-take-all business models, etc.

Motivations for preservation of digital social network data: digital civil liberties, open source implementers, enterprise social networks, and social network federation. More pressure to create open data formats in order to preserve social network data.

Standards used in Social Network Media
FOAF “Friend of a Friend”: RDF-based
RSS and/or Atom
SIOC: RDF-based (pronounced “shock”), works with RSS and FOAF
Portable Contacts aka PoCo, VCard-like, XML
Activity Streams: social media linked, upward compatible with Atom and RSS, JSON version available, exciting and keep your eye on it, increasing use in libraries

Interesting to hear about standards being used, but presentation was too fast to get down all the important information. Check out the links above for more information.

Recommendations: Produce Activity Streams and consume ActivityStreams, RSS, and Atom.
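Since the talk moved fast, here is a minimal sketch of what producing an activity in the JSON flavor of Activity Streams might look like. The field names (published, actor, verb, object, objectType, displayName) follow my reading of the JSON Activity Streams 1.0 format, so check the spec before relying on them. The helper function, identifiers, and URLs are all hypothetical.

```python
import json


def make_activity(actor_id, actor_name, verb, object_type, object_id,
                  content, published):
    """Build one activity: who (actor) did what (verb) to which thing (object)."""
    return {
        "published": published,  # timestamp string, e.g. RFC 3339
        "actor": {
            "objectType": "person",
            "id": actor_id,
            "displayName": actor_name,
        },
        "verb": verb,
        "object": {
            "objectType": object_type,
            "id": object_id,
            "content": content,
        },
    }


# Hypothetical example values.
activity = make_activity(
    actor_id="tag:example.org,2011:alice",
    actor_name="Alice",
    verb="post",
    object_type="note",
    object_id="tag:example.org,2011:note-1",
    content="Notes from the social network data session.",
    published="2011-02-25T10:30:00Z",
)

serialized = json.dumps(activity, indent=2, sort_keys=True)
print(serialized)
```

Keeping the actor, verb, and object separate is what makes this kind of record useful for archiving: it preserves the action and the relationship, not just the text.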

Charting Collections of Connections in Social Media: Mapping and Measuring Social Media Networks to Find Key Positions and Structures
Marc Smith (Connected Action Consulting Group)

Talking about NodeXL and the fact that most people do not capture information about their networks. People are social and crowds are important. Crowds now gather online (interaction with physical crowds is very interesting too). Social media sites where people come together online now serialize comments.

NodeXL builds a network graph from social media data. For example, you can create graphs from Tweets that mention a certain word. You can find some examples of these graphs on Flickr.

In social networks, the most important thing is “position, position, position.” Archiving connections is possible, but few of the resellers or archives of social media do so. Archiving connections is as important as archiving digital objects (great for contextualization).
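To give a feel for what "archiving connections" means in practice, here is a small sketch that turns tweets mentioning a keyword into a list of who-mentions-whom edges and counts how often each person is mentioned. This is not NodeXL (which does far more, including the visualization); it is a plain Python stand-in, and the sample tweets and function names are made up.

```python
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")


def mention_edges(tweets, keyword):
    """Return (author, mentioned_user) pairs for tweets containing `keyword`."""
    edges = []
    for author, text in tweets:
        if keyword.lower() not in text.lower():
            continue
        for mentioned in MENTION.findall(text):
            edges.append((author.lower(), mentioned.lower()))
    return edges


def in_degree(edges):
    """Count how many times each user is mentioned: a crude measure of position."""
    return Counter(target for _, target in edges)


# Hypothetical sample data: (author, tweet text).
tweets = [
    ("alice", "Great #pda2011 talk, thanks @bob and @carol"),
    ("dave", "Agreeing with @bob about #pda2011"),
    ("erin", "Lunch with @bob"),
]

edges = mention_edges(tweets, "#pda2011")
print(edges)
print(in_degree(edges).most_common())
```

The edge list is the part worth archiving alongside the tweets themselves, because it records the connections that give each message its context.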

NodeXL makes really interesting, sometimes confusing, but cool looking graphics. My colleague who researches social networks is all over these types of data representations and analyses. Very interesting.

“We envision hundreds of NodeXL data collectors around the world collectively generating a free and open archive of social media network snapshots on a wide range of topics.”

The Social Networks and Archival Context (SNAC) Project
Ray Larson (UC Berkeley)

Dealing with metadata surrounding collections held in archives. Project funded by NEH.

Data from: EAD finding aids from LoC, OAC, Northwest Digital Archives, and Virginia Heritage; authority records from LoC, Getty Vocabulary Program, Virtual International Authority File; other biographical sources (e.g., DBpedia).

EAD is now complemented by EAC, or Encoded Archival Context: an XML-based standard for descriptions of record creators = authority control. Want to have controlled vocabularies because we have the problem of many different names for the same person and the same name for different people. (Are they also adding these authority files to LoC? We need standards, but we don’t need a ton of standards that overlap so we have issues about deciding which one to use.)
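The name problem is easier to see with a toy example. The sketch below is my own illustration of authority control, not the SNAC project's method and not the EAC format: the class, the normalization rule, and the record identifiers are all hypothetical.

```python
from collections import defaultdict


def normalize(name):
    """Reduce 'Last, First' and 'First Last' forms to one lowercase key."""
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    return " ".join(name.lower().split())


class AuthorityFile:
    """Maps name variants to the identifiers of the records they may refer to."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, record_id, variants):
        for variant in variants:
            self._index[normalize(variant)].add(record_id)

    def lookup(self, name):
        """Return every candidate record for a name (may be more than one)."""
        return sorted(self._index.get(normalize(name), set()))


authority = AuthorityFile()
# Many names, one person.
authority.add("person-001", ["Twain, Mark", "Samuel Clemens",
                             "Clemens, Samuel Langhorne"])
# One name, different people (hypothetical records).
authority.add("person-002", ["John Smith"])
authority.add("person-003", ["Smith, John"])

print(authority.lookup("Mark Twain"))
print(authority.lookup("John Smith"))
```

Variant names collapse to one record, while a shared name returns several candidates that still need a human (or more context, such as dates) to tell apart.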

Very nice looking interface for the authority files. Nice touch: noting from which archives they are deriving the names for the authority file. They are also using the data to create a pretty infographic of connections, which is still under development. See the SNAC website for the latest version to download and try out.

Take away: Connections are super-important and we need sophisticated ways to capture this information. I’m definitely going to download NodeXL and play around with it. If you use it, let me know how you like it.