Approaching Digital Preservation Holistically

Seamus Ross
HATII, University of Glasgow

1. INTRODUCTION

The pervasiveness of information and communication technology (ICT) has transformed the way we create, access, use, and need to manage digital entities. The dependence of companies and public sector institutions on ICT is producing a massive reservoir of material waiting to be assessed for disposal or retention, which in some cases may mean bringing it into memory institutions. The quantities and diversity of the material pose obstacles even to its assessment by archivists, records managers and other information curators. The cultural and scientific heritage of the contemporary world that comes to be held in our memory institutions will provide historians with raw materials for interpreting the twenty-first century. These assets, moreover, serve as sustainable and renewable resources to be exploited in an ever-increasing diversity of ways, and users will expect to be able to do this. In their digital guise these materials provide core resources for enabling education, supporting life-long learning, underpinning the development of new products by creative industries, and improving our quality of life. E-commerce and e-government initiatives continue to raise awareness of the need for reliable and trustworthy information sources.

Cunningham and Phillips argued that '[k]eeping information in electronic formats available for e-governance and e-democracy is a public good, just as health services, education and bridges are'.1 Our trust in the accountability of e-government and its success, therefore, depends upon the institution of transparent, secure and workable digital curation mechanisms within public sector environments. Delivering this vision depends upon the survival of digital data in accessible, usable, reliable and authentic form. As a result, curation and preservation impact on the working practices of public bodies, memory organisations, researchers and business sectors including, to mention only five, aerospace, entertainment, finance, pharmaceuticals and publishing. Long-term access to digital materials depends upon the active intervention by archivists, records managers and other digital curators. The notion that individuals or organisations can just keep everything and some piece of 'novel' software will eventually make it possible for them to sort it out is akin to believing that the 'Elves of Cologne' will one day return.2 If archives are to function in this new technological environment, they will need to be transparent, accessible and responsive to user needs and expectations.

The umbrella term 'digital curation' encapsulates the many activities involved in caring for digital entities, such as selection, documentation, management, storage, conservation, security, preservation, and provision of access. Curation focuses not just on preserving digital entities but on keeping them functional, supporting their continuous annotation and maintaining their fitness for purpose. Preservation is a lot narrower in focus. To paraphrase the Report of the Task Force on Archiving of Digital Information, it has the objective of retaining the ability to display, retrieve, manipulate and use digital information in the face of constantly changing technology.3 The problem with digital materials is that many factors seem to conspire to make them inaccessible. Technological advances foster obsolescence of access mechanisms and accelerate the loss of material. For example, while it is true that media degrade over time, even before they do, devices to access particular classes of media become scarce and curators can find it impossible to get the contents off the media. Often, even where the digital object is accessible it remains unintelligible because insufficient descriptive, technical, structural and management information about the object survives. So, while the aim of digital preservation is to preserve digital information or objects that are authentic, understandable and accessible over time, digital curation involves not only the preservation of digital materials but also the updating, correcting and annotating of materials. Digital curation and preservation are broad fields of enquiry and practice.4 Here we aim to provide an overview of digital curation and preservation and to provide you with an intellectual framework within which to think about the challenges of curation and preservation.5 The chapter will approach these discussions against the backdrop of new research in this area.6

2. OBSTACLES TO LONG-TERM ACCESS

At the heart of preservation lies planning and the recognition that 'digital curation and preservation is a risk management activity at all stages of the longevity pathway'.7 In undertaking preservation, individuals and organisations must 'right size' their risk. Many of the approaches to preservation described in the literature are designed to be implemented and used by large organisations, and in particular national archives and libraries. Do the costs and processes perceived to be associated with preservation mean that small and medium-sized institutions have no chance of preserving or actively curating digital entities in their care? As we shall see, there are proactive steps and scalable approaches that make the application of preservation strategies accessible to nearly all institutions.

The preservation literature makes much play of the fragility of digital materials although, as we know from the recovery of data held on damaged media such as the tapes recovered from the Atlantic Ocean crash site of the Challenger Space Shuttle, this claim is overstated.8 As a result, we often identify media degradation as a significant culprit in the loss of digital materials. It can be, but it is not the reason most commonly cited by data recovery companies. Moreover, Ontrack Data Recovery has demonstrated a disjuncture between customers' perception of the causes of data loss and the actual causes of loss (see Table 6.1):

Table 6.1: Data Loss – Perceptions and Causes

Cause                                        As Perceived by    As noted by Ontrack    As noted by Ontrack for
                                             the Customer       Engineer (2005)        the period 1995-6
Human Error                                  11%                26%                    36%
Computer Viruses                             2%                 4%                     8%
Natural Disaster                             1%                 2%                     3%
Hardware or System Problem                   78%                56%                    49%
Software Corruption or Programme Problem     7%                 9%                     4%

Other causes of loss include deliberate destruction; in late 2004 the Spanish Prime Minister, Señor Zapatero, reported that: 'In the Prime Minister's Office we did not have a single document or any data on computer because the whole Cabinet of the previous Government (Aznar) carried out a massive erasure. That means that we have nothing about what happened, information that might have been received, meetings or decisions that were taken from March 11 until March 14 [2004].'9

Technological developments and obsolescence can lead to the loss of digital materials. Digital objects are represented as streams of binary digits, commonly known as bits. These streams must be interpreted (using hardware and software) before they can be manipulated or rendered, whether for display, printing or analysis. Raw bit streams are generally of little value and often meaningless.10 Sometimes it may be impossible to access and render them because a syntactical interpretation of the bit stream cannot be performed. In other instances semantic opaqueness arising from loss of context or process, and their dynamic nature, may leave the objects unintelligible.

Archivists and records managers will be aware that the organisational structure of institutions, and the way information creation and management tools are deployed by them, have in themselves become a preservation obstacle. The lack of collaboration between records managers, creators and IT staff contributes its share of problems as well. The failure of many records management approaches to link records management strategies with organisational objectives leaves records management without a strong corporate base. The role that records management plays in the area of compliance and risk management could be exploited to move preservation into a core business focus. Preservation is perceived as expensive, and often the scale of this investment is under-appreciated and certainly almost never balanced against recognisable benefits.

The likelihood that digital materials will be properly curated over time is closely tied to their recurring value or to their continued active usage. Recurring value arises from the use of digital objects for their evidentiary value, say to limit corporate liability, to demonstrate primary rights to an idea, invention or property, to meet compliance or regulatory requirements, or to achieve competitive advantage. Recurring value can also arise when a resource can be re-exploited, whether through repackaging or release in some new and unexpected way. Certain data sets that are regularly exploited for commercial or research purposes, such as meteorological or scientific data sets (e.g. protein databases), are likely to benefit from a level of care that will ensure their longer-term accessibility.

1 Cunningham, A. and Phillips, M., 'Accountability and accessibility: ensuring the evidence of e-governance in Australia', Aslib Proceedings: New Information Perspectives, 57(4), 2005, p. 314.
2 Tharlet, Eve, The Elves of Cologne, (Zurich, 2005), and Kopisch, A., Die Heinzelmännchen (Cologne, 1836).
3 Commission on Preservation and Access and the Research Libraries Group, Preserving Digital Information, Report of the Task Force on Archiving of Digital Information (Mountain View, 1996), http://www.rlg.org/ArchTF/.
4 An excellent source for information about work in digital preservation is the PADI subject gateway (http://www.nla.gov.au/padi/). Another is the Kennisbank/Knowledge Bank of Digitale Duurzaamheid, http://www.digitaleduurzaamheid.nl/index.cfm?paginakeuze=65&categorie=2.
5 Ross, S. and Day, M. (eds.), DCC Digital Curation Manual, (Glasgow, 2005 onwards), http://www.dcc.ac.uk/resource/curation-manual/, which will include some forty-five chapters on issues surrounding digital preservation and is managed by the Digital Curation Centre.
6 Discussion of the issues of web archiving figures hugely in the current literature. We have excluded them from discussion here because website preservation is, from an archival point of view, a sub-set of good records management. See, for instance, Dollar, Charles, Archival Preservation of Smithsonian Web Resources: Strategies, Principles, and Best Practices, (Washington, DC, 2001), http://www.si.edu/archives/archives/dollar%20report.html; McClure, Charles, and Sprehe, Timothy, Guidelines for Electronic Records Management on State and Federal Agency Websites, (New York and Washington, 1998), http://slis-two.lis.fsu.edu/~cmcclure/guidelines.html; Phillips, John T., 'The Challenge of Web Site Records Preservation', Information Management Journal, 2003, 3(1), pp. 42-51. There are tools for harvesting websites remotely, such as HTTrack, http://www.httrack.com/page/1/en/index.html.
7 Ross, S. and McHugh, A., 'Audit and Certification: Creating a Mandate for the Digital Curation Centre', Diginews, 9(5), 2005, http://www.rlg.org/en/page.php?Page_ID=20793#article1.
8 An excellent summary of the problem is Cordeiro, M.I., 'From rescue to long-term maintenance: preservation as a core function in the management of digital assets', VINE, 34(1), 2004, pp. 6-16.
9 Sharrock, D., 'Aznar accused of destroying Madrid Bomb Evidence and Deceiving the Public', The Times, (London, 14 December 2004), http://www.timesonline.co.uk/article/0,,3-1402824,00.html.
10 Ross, S. and Gow, A., Digital Archaeology? Rescuing Neglected or Damaged Data Resources, (London, 1999), http://www.ukoln.ac.uk/services/elib/papers/supporting/pdf/p2.pdf.
11 InterPARES Authenticity Task Force, Authenticity Task Force Report, in The Long-term Preservation of Authentic Electronic Records: Findings of the InterPARES Project, (Vancouver, 2002), http://www.interpares.org/book/index.cfm.
12 Ross, S., 'Position Paper on integrity and authenticity of digital cultural heritage objects', Integrity and Authenticity of Digital Cultural Heritage Objects, Thematic Issue 1, 2002, pp. 7-8; also available at http://www.digicult.info.
13 ERPANET, with funding from the Swiss Federal Government and the European Commission (IST-2001-32706), led by the Humanities Advanced Technology and Information Institute (HATII) at the University of Glasgow (United Kingdom), with its partners the Schweizerisches Bundesarchiv (Switzerland), ISTBAL at the Università di Urbino (Italy) and the Nationaal Archief van Nederland (Netherlands), worked between November 2001 and the end of October 2004 to enhance the preservation of cultural and scientific digital objects. http://www.erpanet.org
14 http://www.dpc.delos.info. DELOS: A Network of Excellence on Digital Libraries, funded under the 6th Framework's IST Programme (thematic priority IST-2002-2.3.1.12, Technology-enhanced Learning and Access to Cultural Heritage; project number 507618). DELOS focuses on six primary research domains ranging from digital library architectures to evaluation. The DELOS digital preservation cluster (DELOS-DPC) brings together researchers from seven European countries to lead cutting-edge research in digital preservation.

3. ENSURING EVIDENTIAL VALUE – AUTHENTICITY AND TRUST

Digital preservation aims to ensure the value of digital entities. 'When we work with digital objects we want to know they are what they purport to be and that they are complete and have not been altered or corrupted.'11 These twin concepts are encapsulated in the terms Authenticity and Integrity. As digital objects are more easily altered and corrupted than, say, paper documents and records, creators and preservers often find it challenging to demonstrate their authenticity. As digital objects that lack authenticity and integrity have limited value as evidence or usefulness as an information resource, the ability to establish authenticity of and trust in a digital object is crucial.12 A well-documented chain of custody helps with establishing authenticity, and we shall return to this later.

Each rendition of a digital object must carry the same force as the initial instantiation, sometimes labelled as the original. Reflecting on the conclusions from research conducted by ERPANET,13 the DELOS Digital Preservation Cluster14 and InterPARES, the very concept of 'original' seems an inappropriate label for digital objects. If there ever is an original of a digital document it exists only for a fleeting moment in the memory of the computer on which the digital object was created at the time it was created. Perhaps first renderings of digital objects might best be referred to as an initial 'representation or instantiation' (II). The problem is: how can we record the functionality and behaviour as well as the content of that initial instantiation so that we can validate subsequent instantiations? Where subsequent instantiations (SI) share precision of resemblance in content, functionality and behaviour with the initial instantiations, the 'SIs' can be said to have the same authenticity and integrity as the 'IIs'. So there is no copy in the digital age; every validated re-representation is in a sense 'the original'.

Could general characteristics of authenticity be identified that would apply to all digital objects? Or do different types of digital objects, record-keeping procedures and digital object creation practices, alongside the diversity of institutional requirements, mean that digital object preservation would require a variety of mechanisms to enable users and preservers to ascertain the authenticity of material? InterPARES argued that reliability and authenticity were two areas of independent responsibility: the creator was responsible for reliability and the preserver took responsibility for the authenticity.15 While authenticity could be the subject of much new research at both practical and theoretical levels, here we can only draw attention to the issue. Confronted with digital representations, users appear to assume that, unless there is obvious evidence to the contrary, if the creator or holder of a digital object says that it is authentic then it is.
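To make the idea of demonstrating integrity slightly more concrete, the sketch below shows one way a small repository might record fixity information when an object arrives and re-verify it later as part of a documented chain of custody. It is a minimal illustration rather than a method drawn from InterPARES, ERPANET or DELOS; the manifest layout, file name and function names are invented for the example, and a matching digest demonstrates only that the bit stream is unchanged, not that identity, context or meaning have been maintained.

```python
import hashlib
import json
import pathlib
from datetime import datetime, timezone

def record_fixity(path, manifest="fixity_manifest.json", algorithm="sha256"):
    """Compute a message digest for a file and append it, with a timestamp,
    to a simple JSON manifest that can travel with the object."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    entry = {
        "object": str(path),
        "algorithm": algorithm,
        "digest": digest.hexdigest(),
        "recorded": datetime.now(timezone.utc).isoformat(),
    }
    manifest_path = pathlib.Path(manifest)
    entries = json.loads(manifest_path.read_text()) if manifest_path.exists() else []
    entries.append(entry)
    manifest_path.write_text(json.dumps(entries, indent=2))
    return entry

def verify_fixity(path, manifest="fixity_manifest.json"):
    """Recompute the digest and compare it with the most recent manifest entry.
    A mismatch signals that the bit stream has changed since it was recorded."""
    entries = [e for e in json.loads(pathlib.Path(manifest).read_text())
               if e["object"] == str(path)]
    if not entries:
        return None  # no baseline recorded, so there is nothing to verify against
    latest = entries[-1]
    digest = hashlib.new(latest["algorithm"])
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest() == latest["digest"]
```

Fixity evidence of this kind is only one ingredient; users also need the documentation and background services discussed next in order to interpret it.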
As authenticity depends upon ‘establishing identity and demonstrating integrity’16 users require background services to allow them to verify the inferences they have drawn about the status of materials and the documentation.17 Archivists and records managers need mechanisms to assist in the maintenance of authenticity. For example, users need to know where the digital materials came from, how they came to be deposited, how they were ingested (e.g. under what conditions, using what technology, how the success of the ingest was validated), why they were created, where they were created, how they were created, and they need information as to how the digital object was maintained after its creation (e.g. was it maintained in a secure environment, was the software used to store and represent it changed?). This ‘data about data’ or metadata provides, as we shall see in an examination of a few legal cases, a crucial source of information to support assessments of the authenticity of digital entities. Where metadata is severable from the content, users will require evidence to demonstrate that the link has been maintained and that no new unvalidated links have been created. While some professions such as archival, legal and accountancy emphasise the significance of authenticity it does not appear to be the primary focus of current digital repositories. A small-scale survey conducted by Rachel Bradley in mid-2003 received responses from twenty-two out of forty digital repositories contacted in the USA and Canada. ‘The majority felt that ensuring authenticity and integrity represented a low priority compared with increasing access and preserving content.’18 She noted that many of these institutions were aware that they needed to address the authenticity and integrity issue, and planned to do so. Of course, if we adopt the InterPARES point of view, preservation of 15 Duranti L. (ed.), The Long-term Preservation of Authentic Electronic Records: Findings of the InterPARES Project, (San Miniato, 2005). 16 InterPARES Authenticity Task Force, (2002), op. cit. 17 Ross, S., ‘Reflections on the Impact of the Lund Principles on European Approaches to Digitisation’ in Strategies for a European Area of Digital Cultural Resources: Towards a Continuum of Digital Heritage, Conference Report, European Conference, The Hague, The Netherlands, 15-16 September 2004, pp. 88-98. 18 Bradley, R., ‘Digital Authenticity and Integrity: Digital Cultural Heritage Documents as Research Resources’, portal: Libraries and the Academy, 5(2), 2005, pp. 165-175. content without ensuring authenticity and integrity is pointless. Her conclusions are at odds with the work of Park (McGill University) who examined how practitioners viewed authenticity.19 At the heart of establishing authenticity lies trust and this is an area where, as Lynch has noted, we are just beginning to understand the issues.20 The maintenance of authenticity and integrity requires control of ingest and its verification and depends on immutability of the data store. But it is not just technological; it also depends upon the organisation that is managing the digital store and how they are perceived. This is one reason why the most commonly employed archival reference model, the Open Archival Information System (OAIS), puts such emphasis on designated communities and also why there is a resurgent interest in the processes of auditing and certifying of repositories. 4. HOW DO EUROPEAN INSTITUTIONS VIEW DIGITAL PRESERVATION? 
ERPANET, in an effort to understand better how record keepers, information technology staff and business managers viewed electronic records and their longerterm retention conducted nearly 100 case studies between 2002 and 2004.21 Our studies provide insights into current preservation practices in different institutional, juridical and business contexts as well as across both the public and private sectors. These insights we hoped would help us to understand contemporary approaches to digital longevity, enable cross-sectoral comparisons, provide an indication of the kinds of tools and education needed, and identify issues requiring further research. The research was conducted by a combination of structured questionnaire and interview. Interviewees received the questionnaire in advance of the interview, which was normally conducted by telephone. In the course of developing the questionnaire and interview protocol we examined how other projects, including the Pittsburgh Project and InterPARES I, conducted 19 Park, E., ‘Understanding authenticity in records and information management: analyzing practitioner constructs,’ The American Archivist, 64(2), 2001, pp. 270291. 20 MacNeil, H., ‘Providing Grounds for Trust: Developing Conceptual Requirements for the Long-term Preservation of Authentic Electronic Records,’ Archivaria, 50, 2000, pp. 52-78; MacNeil, H., ‘Providing Grounds for Trust II: The Findings of the Authenticity Task Force of InterPARES’, Archivaria, 54, 2002, pp. 24-58. 21 ERPANET conducted around 100 case studies between 2002 and the end of 2004, of which seventy-eight are published on the ERPANET website and are forthcoming in Ross, S. et al. (forthcoming, 2006), ERPANET Case Studies in Digital Preservation (San Miniato, 2006). S Ross, 2006, ‘Approaching Digital Preservation Holistically’, in A Tough and M Moss (eds.), Information Management and Preservation, (Oxford: Chandos Press) © Seamus Ross -5- surveys. We aimed to avoid pitfalls that these projects encountered, but of course created our own. The template is available on the ERPANET website; perhaps those conducting similar studies will adopt it. To obtain a holistic picture of organisational attitudes towards digital preservation, we aimed to interview three different classes of individual within an organisation: a business manager, an IT manager and an archivist/records manager. Where many other studies aim to provide reports of this kind in as anonymous a way as possible, we took the decision that we wanted to identify the organisations that participated in the case studies. With a few exceptions the participating organisations allowed us to do just this. Our sample was drawn from across Europe, although countries in which ERPANET had a presence (Italy, Netherlands, Switzerland and the United Kingdom) are over-represented. We contacted just over 500 organisations and by the close of ERPANET we had achieved a participation rate of around 15.6 per cent. Convincing organisations to take part was a challenge. More than 60 per cent of the organisations did not respond to the initial enquiry or subsequent follow-up attempts. Others initially expressed willingness to take part but subsequently withdrew. A good example of the latter case was the Banca di Roma where archival and ICT staff indicated they wished to participate, but superiors in the Bank could not be encouraged to sign off on participation. The case studies investigated five themes. 
First, we aimed to understand how aware organisations were of the risks posed by storing material in digital form and how they perceived the potential impact of those risks on their organisation. Second, the survey was intended to provide us with information about how digital preservation impacted on the organisation. Third, we wanted to gather an impression of the actions that organisations took to prevent the loss of digital materials. Fourth, the study was intended to provide us with an appreciation of how organisations with preservation activities monitored them. Finally, the studies were designed to give us an indication as to how organisations would be planning to address their future preservation requirements. Drawing on the case studies we hoped to be able to establish evidence of best practice and to identify preservation approaches and justifications that other institutions could use to build business cases for preservation. In this chapter, I can not discuss the findings in detail; preliminary findings are available in print elsewhere,22 22 Ross, S., Greenan, M. and McKinney, P., ‘Digital Preservation Strategies: The Initial Outcomes of the ERPANET Case Studies’ in the Preservation of Electronic Records: New Knowledge and Decision-making, (Ottawa, 2004), pp. 99-111, or Ross, S., Greenan M. and McKinney, P., ‘Strategie per la conservazione digitale: Descrizione e and others will appear shortly. Awareness of the issues surrounding digital preservation varied across organisations more than across sectors. When we accessed the drivers for preservation action we found that cultural and historical value (as noted above) was given the lowest priority; this may reflect in part the nature of our target cohort, which included few cultural heritage institutions. Four core drivers stood out: core business focus, re-use, legal and regulatory compliance, and experience of information loss. Broadcasters recognised preservation as essential if they were to maximise the re-use potential of their resources, whereas pharmaceuticals were motivated to address preservation issues to ensure regulatory compliance. Others noting re-use were public sector bodies (European Patent Office), news agencies (Deutsche Presse Agentur, Swiss News Agency) and oil companies. Discussions by participants and presenters at the ERPANET Workshop on policies for digital preservation indicated that preservation policies and procedures ‘represent an issue that still needs a lot of attention. Little practical experience yet exists and most of the ideas are still rather theoretical. Although there are organisations that have a relatively longstanding experience with digital preservation.’23 This conclusion was borne out in the results of the case studies. When developing or purchasing new systems, respondents noted that preservation strategies were not usually articulated in the specifications. Retention policies were not often noted and where they were they were not necessarily implemented across the organisation.24 There was a general recognition that preservation and storage problems were aggravated by the complexity, diversity of types or formats, and size of the digital entities. Few organisations took a longterm perspective and those that did were either national information curating institutions (e.g. libraries) or institutions that felt regulatory risk exposure. In general, a sense emerged from the case studies that preservation required a pragmatic approach. 
risultati dei primi studi di casi di ERPANET’, Archivi e Computer, XIV/3.04, 2004, pp. 99-122. 23 ERPANET, 2003, Policies for Digital Preservation, ERPANET Training Seminar, Paris, 29-30 January 2003, http://www.erpanet.org/events/2003/paris/ERPAtrainingParis_Report.pdf , p. 16. 24 The findings of ERPANET in Europe are also borne out by evidence in the USA. In the recent case of In re Old Banc One Shareholders Securities Litigation, 2005 U.S. Dist. LEXIS 32154 (N.D. Ill., 8 December 2005), ‘Bank employees testified they did not know missing documents should have been retained, and the bank did not inform employees of the need to retain documents for this litigation or have employees read and follow the electronic version of the policy that was established.’ S Ross, 2006, ‘Approaching Digital Preservation Holistically’, in A Tough and M Moss (eds.), Information Management and Preservation, (Oxford: Chandos Press) © Seamus Ross -6- The organisations participating in the research acknowledged that they have little information about the costs of digital preservation and, where they had tried to predict the costs, they had done so poorly – this is a conclusion borne out by other studies.25 Respondents noted that when they could quantify the costs they would be difficult to justify in the corporate environment. While the Deutsche Presse-Agentur (dpa) ‘was not in a position to reveal detailed figures’ it did acknowledge that ‘long-term preservation costs are roughly in the dimension of one per cent of the company’s turnover’.26 The Centraal Bureau voor de Statistiek (CBS)27 in the Netherlands reported that it had: …identified the cost benefits of digital records management and archiving as threefold: first, the records management can become an integral part of the automated working processes of the organisation; secondly, a decrease in the use of paper and increase in the management of digital records enable better sharing of documents and fewer localised collections of records; and thirdly, digital records management and archiving allows for organised maintenance of the organisational historical memory. There is no separate funding available for digital preservation activities, and the budget of the IT department is expected to cater for ongoing maintenance of the records.28 Benefits to commercial businesses to be derived from long-term digital preservation have proved elusive and ERPANET’s seminar on Business Models did not actually succeed in identifying them.29 In general, the value of digital preservation is only apparent long after the initial investment has had to be made. One of the primary justifications given in the literature for digital preservation is access. Within the cohort interviewed by ERPANET access was seen as primarily for internal use. Where external access was provided it was done with different approaches: intermediaries, information provided on CDs, and more rarely through online portals. The obstacles to access were security, privacy and technical challenges (e.g. lack of standard file format). What was surprising among the seventy-eight case studies analysed here was just how great the variation 25 Sanett, S., ‘Toward Developing a Framework of Cost Elements for Preserving Authentic Electronic Records into Perpetuity’, College & Research Libraries, 63.5, 2002, pp. 388-404. 26 ibid. 27 http://www.cbs.nl . 28 ERPANET Case Studies, http://www.erpanet.org . 
29 ERPANET, 2004, Business Models Related to Digital Preservation, http://www.erpanet.org/events/2004/amsterdam/Amsterdam _Report.pdf , p. 17. of awareness of risk was – some were not aware that there was any risk and a very small number had a highly attuned sense of risk. The value placed on the digital materials by organisations depended in part on how dependent the organisation was on the material for business activity, with the highest value placed on information by organisations who either saw or exploited the potential re-use of information or identified the risks associated with its not being available. Responsibility for digital preservation was rarely taken at corporate level. Organisations did not have a single point of contact for preservation and within organisations there was not always a clearly identified individual who had responsibility for the activity. Preservation strategies were rare. The secretive nature of many organisations does not support collaborative action to address the preservation problem. What really stood out was the preponderance of the point of view that organisations should not invest internally in defining solutions but should wait for them to be provided externally. The findings of the ERPANET studies are complemented by those conducted elsewhere.30 For instance, a working group jointly sponsored by the Online Computer Library Center (OCLC) and the Research Libraries Group (RLG) on Preservation Metadata: Implementation Strategies (PREMIS) reported that ‘there is very little experience with digital preservation.’31 More recently a survey that ICABS conducted for the Koninklijke Bibliotheek, Building Networks in Digital Preservation: Recent Developments in Digital Preservation in 15 National Libraries (draft July 2005), found that libraries have not adopted a single strategy to achieve long-term preservation and access to the diversity of digital objects entering their collections. In fact, some have still not adopted any strategy, despite being cognisant of the risks posed by the poor curation of digital materials. In many cases this appears to reflect a lack of access to appropriate information resources, training, repository support, need for audit and certification services, and a need for access to research results. Surveys of national and local archives tell the same story; as Hans Hofman noted in the report on Enabling Persistent And Sustainable Digital Cultural Heritage in Europe (September 2004): [d]espite the resolutions and charters decided upon by the European Council of Ministers, the General Assembly of UNESCO and the NRG [National Representatives Group], and despite the fact that they 30 e.g. InterPARES I (http://www.interpares.org). 31 PREMIS Working Group, 2004, Implementing Preservation Repositories for Digital Materials, (Dublin, OH, and Mountain View, CA, 2004, (available at http://www.oclc.org/research/projects/pmwg/surveyreport.p df, p. 13. S Ross, 2006, ‘Approaching Digital Preservation Holistically’, in A Tough and M Moss (eds.), Information Management and Preservation, (Oxford: Chandos Press) © Seamus Ross -7- have raised a lot of awareness, the consequences have not yet been integrated or formulated into concrete action plans nor have they taken it beyond the level of a stand-alone topic. 
As long as the practical integration of persistence into our daily economic, social, cultural and policy issues is not achieved, it will be difficult to raise it and to make it politically appealing and interesting for funding.32 What was intriguingly absent from the discussions with the ERPANET case study interviewees was a focus on technology and this is despite IT staff having been interviewed. 5. A CASE STUDY – E-MAIL Preservation of e-mail is a challenge, even without the attachments. We are producing them by the billions each day, some 9.6 billon a day in 2001. For thirty years we have recognised the evidential significance of e-mails, although in an unpublished survey of National European Archives conducted in 1994 Edward Higgs and I found only three of the twenty responding archives had acquired any e-mails and even these were from their own administration systems. We see e-mails exploited in contemporary litigation (e.g. Zubulake v. UBS Warburg33 or the use of the Presidential Emails in US v. Philip Morris USA, Inc., et al.34) or in the case of long-term accountability of government (e.g. Armstrong v. Executive Office of the President). In Armstrong v. Executive Office of the President, commonly referred as the PROFS case, the courts found that in creating printed versions of electronic communications government agencies had not met their obligations under the Federal Records Act. The appellate court affirmed stating, ‘[w]ithout the missing information, the paper print-outs – akin to traditional memoranda with the ‘to’ and ‘from’ cut off and even the ‘received’ stamp pruned away – are dismembered documents indeed.’35 In the original decision, Judge Charles R. Richey had noted that the loss of searchability and linkages between messages made the 32 Hofman, H. and Lunghi, M., 2004, Enabling persistent and sustainable digital cultural heritage in Europe: The Netherlands questionnaire responses summary and Position Paper, http://www.minervaeurope.org/publications/globalreport/gl obalrepdf04/enabling.pdf; XLIV, presented at the Dutch Presidency on Towards A Continuum Of Digital Heritage – Strategies For A European Area Of Digital Cultural Resources. 33 Zubulake v. UBS Warburg, 217 F.R.D. 309 (S.D.N.Y. 2003). This is one of a series of rulings in this case. 34 See Memorandum Opinion in U.S. v. Philip Morris USA, Inc., et al., Civ. Action No. 99-2946 (D.D.C.), dated 7 July 2004, at 2 n.3, available at http://www.dcd.uscourts.gov/992496ak.pdf . 35 Armstrong v. Executive Office of the President, 1 F.3d 1274 (D.C. Cir. 1993) (Note: The decision was reversed on other grounds, 90 F.3d 553 (D.C. Cir. 1996)). printouts inadequate substitutes.36 Few organisations actually succeed in managing their e-mails effectively. Indeed few employees consider e-mail to be records or, as Jean Samuel put it, ‘User perception of e-mail status is in direct contradiction to the legal perception and the matter of unguarded remarks in e-mail (made because the writer deems the message to be unofficial)…’ exposes organisations to unanticipated risks that may result in financial loss or litigation (or both).37 Perhaps this claim is old but it is still relevant, as an AIIM survey in early 2005 demonstrated.38 Predicting the kinds of research that might be possible in digital archives of the future is difficult. As Perer, Shneiderman and Oard have noted: Historians and social scientists believe that archives are important artefacts for understanding the individuals and communities they represent. 
However, there are currently few methods or tools to effectively explore these archives…By presenting new ways to approach the exploration of email archives, not only do we provide a new step for exploration, but also raise awareness for the difficult task of understanding email archives.39 In the instance of e-mail, we might consider for a moment how we could use contextual information in email (e.g. ‘to’ and ‘from) data to identify both formal and informal communities within organisations and even to identify those individuals who play leadership roles within these communities. Josh Tyler and his colleagues at Hewlett-Packard developed a tool for doing this and then applied it. They ‘…found that it does an effective job of uncovering communities of practice with nothing more than email (‘to:’ and ‘from’) 36 Wallace, D.A, ‘Recordkeeping and Electronic Mail Policy: The State of Thought and the State of the Practice’, Society of American Archivists, (Orlando, 1998), http://www.mybestdocs.com/wallace.html . 37 Samuel, J., ‘Electronic Mail’, in E. Higgs (ed.), History and Electronic Artefacts, (Oxford, 1998) p. 111. 38 A study in early 2005 conducted by AIIM and Kahn Consulting ‘Electronic Communication Policies and Procedures,’ which attracted over 1000 respondents found among responding organisations that while ‘86% tell employees how they should use e-mail, … only 39% tell employees where, how, or by whom email messages should be retained’ (Skjekkeland, Atle, ‘Email mismanagement: A looming disaster’, M-iD, 2005, pp. 49-50). 39 Perer, A., Shneiderman, B. and Oard, D.W., ‘Using Rhythms of Relationships to Understand Email Archives’ examined a novel,’ (n.d. but probably 2005), (http://hcil.cs.umd.edu/trs/2005-08/2005-08.pdf), p. 18. In this study the team applied a new approach to understanding e-mail archives in a study of 45,000 messages collected over fifteen years by a single individual. S Ross, 2006, ‘Approaching Digital Preservation Holistically’, in A Tough and M Moss (eds.), Information Management and Preservation, (Oxford: Chandos Press) © Seamus Ross -8- data.’40 In other instances, we might wish to apply visualisation tools to reveal ‘the data and patterns that are hidden within the email archive.’41 What is evident is that users of digital archives will expect to be able to access, manipulate and analyse material in ways that were never possible in the past and the relationship between the user and the archive will shift. This really changes the way we need to think about digital curation and preservation. Here I have used e-mail as an example. Databases, especially scientific ones, might as easily have been selected. ‘Databases have been, and continue to be, a key technology for the storage, organisation and interrogation of information. They are a core module in most of today’s information systems.’42 In ‘Archiving Scientific Data’ Buneman et. al. explore many of the challenges of database preservation from performance to ensuring semantic continuity.43 The National Library of Australia has developed a tool, Xinq, to support the automated creation of interfaces to archived databases.44 The Swiss Federal Archives identified ingest of 40 Tyler, J.R., Wilkinson, D.M. and Huberman, B.A., ‘Email as Spectroscopy: Automated Discovery of Community Structure within Organizations,’ Communities and Technologies, 2003, pp. 81-96, (alternatively at: http://www.hpl.hp.com/research/idl/papers/email.email.pdf) . These communities often transcend organisational structures. 
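As an aside, a minimal sketch of the kind of analysis Tyler and his colleagues describe is given below: build a graph of who writes to whom from nothing more than the 'From', 'To' and 'Cc' headers, then look for densely connected groups and well-connected individuals. This is not their tool; it is a generic illustration that assumes a mailbox in mbox format (the file name is hypothetical) and uses the open-source networkx library.

```python
import mailbox
from email.utils import getaddresses

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def build_correspondence_graph(mbox_path):
    """Build a weighted graph of correspondents from 'From', 'To' and 'Cc' headers."""
    graph = nx.Graph()
    for message in mailbox.mbox(mbox_path):
        senders = [addr.lower() for _, addr in getaddresses(message.get_all("From", []))]
        recipients = [addr.lower() for _, addr in
                      getaddresses(message.get_all("To", []) + message.get_all("Cc", []))]
        for sender in senders:
            for recipient in recipients:
                if sender and recipient and sender != recipient:
                    # edge weight counts how often the pair corresponds
                    weight = graph.get_edge_data(sender, recipient, default={}).get("weight", 0)
                    graph.add_edge(sender, recipient, weight=weight + 1)
    return graph

if __name__ == "__main__":
    g = build_correspondence_graph("archive.mbox")   # hypothetical mailbox file
    communities = greedy_modularity_communities(g)   # candidate informal groups
    central = sorted(nx.degree_centrality(g).items(), key=lambda kv: kv[1], reverse=True)
    print(f"{len(communities)} communities found; most connected correspondents:")
    for address, score in central[:10]:
        print(f"  {address}  {score:.3f}")
```

An experiment of this sort could be run against a corpus such as the Enron collection described below.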
Currently there are few instances where it is possible to obtain data to carry out more complex studies. One possible source that archivists might use to investigate how researchers in the future might examine archives of email will be to experiment with the Enron Email Dataset. The dataset is available at http://www.cs.cmu.edu/~enron and contains over 500,000 messages. B. Klimt and Y. Yang noted that the original dataset contained some 619,446 messages from 158 users before they produced a ‘cleaned Enron corpus’ and this includes some 225,000 e-mails of 151 senior executives during the period 1997 to 2004 (see, for example, http://sonic.ncsa.uiuc.edu/enron/about.htm). There is a discrepancy between the scale of ‘clean’ corpus as described on the website and by Klimt and Yang, 2004, in ‘Introducing the Enron Corpus’, http://www.ceas.cc/papers-2004/168.pdf. Most of the interest in the dataset has so far been from communications and information retrieval experts. The cleaning of the dataset removed for example duplicates, but that reduces the archival value of the dataset as duplicates tell their own story. 41 Donath, J., Visualizing Email Archives – draft, 2004, http://smg.media.mit.edu/papers/Donath/Email/Archives.dra ft.pdf , p. 2. 42 ERPANET, 2003, The Long-term Preservation of Databases, http://www.erpanet.org/events/2003/bern/Bern_Report_fina l.pdf 43 Buneman, P., Khanna, S., Tajima, K. and Tan, W.-C., ‘Archiving Scientific Data’, ACM Transactions on Database Systems, 29(1), 2002, pp. 2-42. 44 For Xinq, see http://www.nla.gov.au/xinq/download.html. databases into the Archives as a fundamental challenge.45 The crucial point is that archives will be confronted with a broad range of digital materials from databases to documents to e-mail to websites, and the current models all require a tremendous amount of human intervention on the part of the ingesting organisation to prepare material appraised for ingest to be ingested, documented and made accessible. Current approaches do not facilitate the effective initial ingest and longer-term management within the archival environment of digital materials. There is a growing demand for tools to support ingest and management of digital entities ranging from databases to documents to e-mails to software. Some archives, to reduce costs, minimise the diversity of skills required, and the variety of file formats convert all incoming files to a single format type for each type class, chosen because the type maximises the likelihood of longevity. This can lead to problems. First, what are the appropriate mechanisms to use to evaluate the ability of a file format to support long-term preservation? Second, what impact might ‘standardisation’ have on information embedded in the proprietary functionality of particular software packages? For instance, the functionality inherent in some documentary editing software to track changes in documents will provide invaluable documentary evidence for future scholars. Contemporary evidence for this comes, for example, from the release in October 2005 by the United Nations of the Microsoft Word version of the UN report into the murder of the former Lebanese Prime Minister Rafik Hariri. It emerged that key names had been dropped from the official report when ‘an electronic version distributed by UN officials on Thursday night allowed recipients to track editing changes’.46 The fact that hidden information in digital 45 Heuscher, S., Järmann, S., Keller-Marxer, P. 
and Möhle, F., ‘Providing Authentic Long-Term Archival Access to Complex Relational Data, European Space Agency Symposium’, Ensuring Long-Term Preservation and Adding Value to Scientific and Technical Data, (Frascati, Italy, 2004), http://arxiv.org/pdf/cs.DL/0408054. 46 Bone, J. and Blanford, N., ‘UN office doctored report on murder of Hariri’, 22 October 2005, Times Online, http://www.timesonline.co.uk/article/0,,2511837848,00.html. Failure of authors to note that Microsoft Word also holds revision history metadata enabled Richard M. Smith to identify the individuals who had been responsible for the last edits of the document and this contributed to our understanding of how the document evolved.‘IRAQ – Its Infrastructure of Concealment, Deception and Intimidation’, released by the Prime Minister’s Office on 6 February 2003. An analysis of the log can be found at http://www.computerbytesman.com/privacy/blair.htm. These tracking facilities within documentary editing software are often refereed to as ‘hidden dangers’, but in litigation and to future scholars they may provide sources of valuable information. As a reaction to this, there is an S Ross, 2006, ‘Approaching Digital Preservation Holistically’, in A Tough and M Moss (eds.), Information Management and Preservation, (Oxford: Chandos Press) © Seamus Ross -9- documents often provides a window on intentions and original ideas, and can enable subsequent users to track how arguments and ideas developed will make the ability to delve into these digital footprints as crucial to future scholars as the editing and interlinear and marginal notes of authors (or even users) are to those working with analogue documents.47 6. TECHNICAL PRESERVATION APPROACHES Some archivists and records managers say that the only good approach to the preservation of electronic records is to print them to paper or to microfilm. During the ERPANET case study on preservation practices at the Council of Europe, conducted in mid-2003, representatives of the organisation reported that ‘[f]or certain categories of records, print-to-paper is the only means of preservation available. This holds especially true for email, where no possibility of digital archiving exists…’ [within the organisation].48 As a solution, this is a non-starter. As we saw in the PROFS case above, paper representations lose the richness associated with digital media including searching capabilities, linkages between digital entities, and relationships between information elements.49 There is an increasing increasing tendency for official documents to be released as PDFs rather than in their native word-processed formats (e.g. Danish Prime Minister, Anders Fogh Rasmussen, see Norup, T., ‘Danish PM’s private communications disclosed by MS Word’, The Risks Digest: Forum on Risks to the Public in Computers and Related Systems, 23(12), 12 January 2004, http://catless.ncl.ac.uk/Risks/23.12.html#subj4). See also the case of the SCO Group suit against DaimlerChrysler (2004) where a document created in Microsoft Word enabled lawyers to see that the SCO Group had spent time trying to aim the suit at Bank of America. 47 For a further discussion of this from a risk point of view, see Byres, S., ‘Scalable Exploitation of, and Responses to Information Leakage Through Hidden Data in Published Documents’, 2003, http://www.useragent.org/word_docs.pdf, or Byres, S., ‘Information Leakage Caused by Hidden Data in Published Documents’, IEEE Security and Privacy, 2(2), 2004, pp. 23-27. 
Remarkably, Byres found that among 100,000 documents downloaded from the web all had hidden information with 50 per cent having more than 50 words, 33 per cent having up to 500, and 10 per cent more that 500 words. This provides us with a contemporary example of the creation of tracks of document development that will help scholars to understand how they were shaped. 48 ERPANET Case Studies. We had similar findings at other institutions, such as the International Labour Organisation. 49 In earlier cases the court recognised the plaintiff’s need for access in electronic form in National Union Elec. Corp. v. Matsushita Elec. Ind. Co., 494 F. Supp. 1257 (E.D. Pa. 1980), and In re Air Crash Disaster at Detroit Metro, 130 F.R.D. 634 (E.D. Mich. 1989) flight simulation data was ordered to be provided on computer-readable nine-track magnetic tape. In printouts it lacked the operational expectation that digital materials will be produced electronically in litigation. As early as 1972, courts in the USA had recognised that digital representations of records had benefits in terms of accuracy and place in the chain of evidence that printouts lacked.50 They also benefited in terms of functionality and essentially the expectation is that, if evidence is created and used in digital form, then it should be made available in that way to litigants. 51 Computer forensics and digital archaeology lie at the other extreme.52 This approach suggests that we do nothing now. Conversion or recovery only happens after the original mechanisms, such as a hardware device or software application, for accessing the digital entity ceases to work effectively. This approach has the advantage that it reduces near-term costs. It does not necessarily make the costs higher in the future either, as it is possible that, as approaches to software design become more sophisticated, reverse engineering will become a much more cost-effective and viable process. Of course, you will recognise this as a high-risk strategy as it assumes capability, financial resource, media durability, motivation and continued development of computer forensic technologies. Other approaches more commonly cited in the literature include conversion and migration. The Library of Medicine (USA) has developed tools to enable users to convert from fifty types of electronic files into five output types (i.e. Portable Document Format (PDF), Multi-page Tagged Image File Format (TIFF), Singlepage TIFF, Text, and Synthesized Speech).53 The CAMiLEON project devised the concept of ‘migration on request’.54 Migration essentially involves moving a capabilities. Even when data was provided on paper in legal cases courts have ordered their production on digital media; see for instance Timken Co. v. United States, 659 F. Supp 239 (Ct. Int’l Trade, 1987). 50 See, for example, the ruling in employment discrimination case Adams v. Dan River Mills, Inc., 54 F.R.D. 220 (W.D. Va. 1972). 51 Linnen v. A.H. Robins Co., 1999, WL 462015 (Mass. Super. 16 June 1999). There are numerous other cases such as Storch v. IPCO Safety Prods. Co., 1997 WL 401589 (E.D. Pa. 16 July 1997), where the court ruled that in a world… ‘where much of our information is transmitted by computer and computer discs, it is not unreasonable for the defendant to produce the information on computer disc for the plaintiff.’ 52 Ross and Gow, (1999), op. cit. 53 http://docmorph.nlm.nih.gov/docmorph/docmorph.htm, see also Walker, F. 
and Thoma, G.,’ A Web-Based Paradigm for File Migration’, Proceedings of IS&T’s 2004 Archiving Conference, San Antonio, TX, April 2004, http://docmorph.nlm.nih.gov/docmorph/IST2004.pdf, pp. 93-97. 54 Wheatley, P., ‘Migration – CAMiLEON discussion paper’, Ariadne, 29, 2001, http://www.ariadne.ac.uk/issue29/camileon/ S Ross, 2006, ‘Approaching Digital Preservation Holistically’, in A Tough and M Moss (eds.), Information Management and Preservation, (Oxford: Chandos Press) © Seamus Ross -10- file of one type to a new environment before the pathway from the older format to the newer one disappears. Migrations will be time and labour dependent and will be influenced by processes, systems and best practices. Rarely will a migration be clean. Migration options can range from format change to rerepresentation of data. There is certainly going to be loss of information, functionality, meaning or renderability as a result of migrations. Even six years on, it is still the case that: [b]efore we can see migration as a viable aid to preservation, more work is needed in the development of metrics for benchmarking and supporting the evaluation of the risks to the functionality of the data set, or losses resulting from particular changes. The question of ‘how much loss is acceptable’, whether this is in functionality, integrity, authenticity, or meaning has not been adequately addressed by any commercial or research initiatives.55 Part of the problem is that we do not have a clear indication as to the properties that a digital object must retain if that object is to have been migrated successfully. If we defined the preservation action and documented the process, could this be a way to compensate for the intrinsic loss of character as a result of migration? The National Archives of Australia introduced the concept of performance to explain the interaction between the source and the process. Some approaches migrate the source, while others migrate the process. Rarely are both migrated accurately. Emulation is another promising approach to preservation. In one guise it involves running a piece of software on machine 2 which makes it possible to execute code on machine 2 that was designed for machine 1. An early example was Trimble’s emulation of an IBM S/360 on a microcomputer. His reason for conducting the work was to investigate the price– performance ratio, which was shown to be better in the emulated environment than in the native S/360.56 Emulation is designed not merely to maintain the ability to represent and manipulate the digital resources but to do it in the native environment. approaches to ensuring that digital materials remain accessible. Work being conducted under the digital preservation cluster of the DELOS Network of Excellence in Digital Libraries examines the use of utility analysis to select the most appropriate preservation option.57 This research is delivering tools to support preservation choices that can be applied within a test-bed environment, whether the DELOS DPC test-bed or some other, which will enable those doing preservation planning to select the optimum preservation approach.58 Preservation of digital information requires active intervention; left unsecured it is susceptible to loss through the physical breakdown of the media, rendered inaccessible by technological advances, or left meaningless through lack of or insufficient contextual evidence.59 7. 
METADATA Courts have come to recognise, where digital documents are concerned, that different types of digital representations have different values to the end user. For instance, in a class action securities case in the USA, the court noted that production of documents in the tiff format was insufficient and it ordered that the defendant produce ‘responsive electronic documents in their native .pst format if that is how they were stored in Defendants’ usual course of business’.60 Furthermore, in this case the court noted that the searchable electronic format should include all available metadata. The contemporary value of metadata is increasingly recognised within the legal profession, although it is still poorly understood and little studied. One of its crucial roles is to support processes of validating the authenticity of the record; as printouts rarely include such system metadata they are not adequate 57 Rauch, C. and Rauber, A., ‘Preserving Digital Media: Towards a Preservation Solution Evaluation Metric’, Proceedings of the International Conference on Asian Digital Libraries (Shanghai, 2004) (alternatively accessible at http://www.dpc.delos.info/private/output/rau_icadl04.pdf). See a case study on the application of the approach to audio and video files, Rauch, C., Pavuza, F., Strodl, S. and Rauber, A., ‘Evaluating preservation strategies for audio and video files’, Proceedings of the DELOS Workshop on Digital Repositories: Interoperability and Common Service, (Crete, 2005), http://www.dpc.delos.info/private/output/rau_digrep05.pdf 58 Rauber, A. et. al., ‘DELOS DPC Testbed: A Framework for Documenting the Behaviour and Functionality of Digital Objects and Preservation Strategies, in Thanos C. (ed.), Delos Research Activities 2005, (Pisa, 2005), pp. 57-59. 59 Ross, (2000), op. cit., p. 5. 60 In re Verisign, Inc. Sec. Litig., 2004 WL 2445243 (N.D. Cal. 10 March 2004). In order to approach preservation effectively you need to characterise what you have, define the options that can enable you to handle it most effectively, and work out the way in which you carry out these processes. As we have seen, there are a number of different 55 Ross, S., Changing Trains at Wigan: Digital Preservation and the Future of Scholarship, (London, 2000), http://eprints.erpanet.org/45/01/seamusross_wigan_paper.pd f, p. 20. 56 Trimble, G.R., ‘Emulation of the IBM system/360 on a microprogrammable computer’, International Symposium on Microarchitecture, Conference record of the 7th annual workshop on Microprogramming, Palo Alto, CA, 1974, pp. 141-150. S Ross, 2006, ‘Approaching Digital Preservation Holistically’, in A Tough and M Moss (eds.), Information Management and Preservation, (Oxford: Chandos Press) © Seamus Ross -11- substitutes.61 In the case of Zenith Electronics Corporation v. WH-TV Broadcasting Corporation the court ruled that the defendant should supply the documents to the plaintiff electronically because the printed copies already provided lacked metadata and could not be easily searched.62 way or another, although we may not refer to it as metadata.66 There are a huge number of metadata initiatives and a broad range of metadata standards. 
In 2004 James Turner of the Université de Montréal developed the MetaMap ‘as a study aid and reference tool for understanding metadata standards, sets and initiatives (MSSIs) in the area of information management’.67 Contemporary legal cases, particularly in the USA, indicate that metadata are valuable in understanding and verifying the reliability and authenticity of digital materials from documents to databases. But the metadata discussed so far is intrinsic to the digital object itself and not something that was specially structured and mapped on top of the object to support its management and reuse. From the point of view of preservation there is general theoretical agreement that ‘the right metadata is key to preserving digital objects’.63 As Ballegooie and Duff recently commented in an essay that explores ‘Archival Metadata’, ‘[a]rchivists and records managers have always been metadata experts’.64 Predicting the value of metadata to future users of digital materials is difficult, but we can be fairly certain that future users will include both people and machines. They can play a role in rendering, understanding and validating digital materials. Metadata, ‘data about data’, serves the primary function of making other data useful. As NISO noted, metadata is ‘structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage’ digital materials.65 It is common now to classify metadata by function: descriptive, structural, administrative and contextual. Although we tend to consider metadata as an esoteric and specialised topic, we are all familiar with it in one 61 Williams v. Sprint/United Mgmt Co., 2005, WL 2401626 (D.Kan. 29 September 2005). The Memorandum and Order in the case (http://www.ksd.uscourts.gov.opinions/032200JWLDJW3333.pdf). 62 Zenith Elec. Corp. v. WH-TV Broad. Corp., 2004, WL 1631676 (N.D. Ill., 19 July 2004). 63 Duff, W., Hofman, H. and Troemel, M., ‘Getting what you want, knowing what you have, and keeping what you need’, ERPANET Training Seminar Marburg, Briefing Paper, (Glasgow, 2003), http://www.erpanet.org/events/2003/marburg/erpaTrainingMarburg_BriefingPaper.pdf, p. 3. 64 van Ballegooie, M. and Duff, W, ‘Archival Metadata’, Ross, S. and Day, M. (eds.), DCC Digital Curation Manual, (Glasgow, 2006), http://www.dcc.ac.uk/resource/curationmanual/chapters/archival-metadata. They provide detailed discussions of the Pittsburgh Project’s Business Acceptable Communications (BAC) model, Australian Recordkeeping Metadata Schema (RKMS), ISO Records Management Metadata, and the Public Record Office Victoria (VERS). There are many others. 65 NISO, Understanding Metadata, (Bethesda, 2004), http://www.niso.org/standards/resources/UnderstandingMet adata.pdf, p. 3. Based on our contemporary evidence we suppose that metadata are important, but in reality ‘[w]e do not have enough experience to indicate whether the metadata these systems record, or plan to record, are adequate for the purpose’.68 To manage digital objects it will be essential to adopt good descriptive, structural and administrative metadata standards and there is much good guidance that can be given in this area. The crucial metadata that needs concern us is preservation metadata. PREMIS defined preservation metadata ‘as the information a repository uses to support the digital preservation process’.69 This includes, for example, evidence of provenance and relationships, as well as technical, administrative and structural metadata. 
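A small sketch may make both the functional classification and this working definition more tangible. The record below groups the metadata for a single digital object by function and notes one preservation event; every field name is invented for illustration and is not drawn from NISO, PREMIS or any other published element set.

```python
from datetime import datetime, timezone

# Illustrative only: the field names below are invented for this sketch and
# are not taken from NISO, PREMIS or any other published element set.
record = {
    "descriptive": {                      # what the object is and is about
        "title": "Annual accessions report, 2005",
        "creator": "Collections department",
    },
    "structural": {                       # how its parts relate to one another
        "format": "application/pdf",
        "part_of": "collection/annual-reports",
    },
    "administrative": {                   # how it is managed over time
        "rights": "open",
        "fixity_sha1": "9f2c...",         # truncated purely for the example
    },
    "contextual": {                       # why and under what circumstances it was created
        "provenance_note": "Transferred from the records management system, March 2006",
    },
    "events": [                           # preservation events recorded against the object
        {
            "type": "ingest",
            "datetime": datetime(2006, 3, 1, tzinfo=timezone.utc).isoformat(),
            "agent": "repository-ingest-service",
            "outcome": "success",
        }
    ],
}

print(record["events"][0]["type"])
```

Even so slight a sketch shows how quickly evidence of provenance, fixity and rights accumulates around a single object, and why agreed element sets matter.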
PREMIS has proposed a core set of preservation metadata elements, each accompanied by a detailed description, in the form of a Data Dictionary. The PREMIS team adopted the point of view that the critical metadata were those that function in ‘maintaining viability, renderability, understandability, authenticity, and identity in a preservation context.’70 In deciding which metadata to use, PREMIS puts the stress on core metadata, which are ‘things that most working repositories are likely to need to know in order to support digital preservation.’71 The PREMIS data model rests on a conceptualisation of metadata that is clear and implementable. From a records management vantage point, other emerging models such as ISO 23081-1:2005 will probably be of even greater pertinence, but at the beginning of 2006 ISO 23081 is still a work in progress.72
66 Day, M., ‘Metadata’, DCC Digital Curation Manual, Ross, S. (ed.), (Glasgow, 2005), available at http://www.dcc.ac.uk/resource/curation-manual/chapters/metadata.
67 Turner, J.M., ‘The MetaMap: a Tool for Learning about Metadata Standards, Sets, and Initiatives’, in Bischoff, F.M., Hofman, H. and Ross, S. (eds.), Metadata in Preservation: Selected Papers from an ERPANET Seminar at the Archives School Marburg, Veröffentlichungen der Archivschule Marburg, Institut für Archivwissenschaft, 40, 2003, pp. 219-232. For the map itself, see: http://mapageweb.umontreal.ca/turner/. Other documentation schemata of note are those used in the audiovisual sector; see for instance Bauer, C., Rosensprung, F., Lajtos, S., Boch, L., Poncin, P. and Herben-Leffring, C., PrestoSpace Deliverable D15.1 MDS1: Analysis of current audiovisual documentation models, (Paris, 2005), http://www.prestospace.org/project/deliverables/D151_Analysis_AV_documentation_models.pdf.
68 PREMIS Working Group, 2004, p. 5.
69 PREMIS Working Group, 2005, Data Dictionary for Preservation Metadata, (Dublin, OH and Mountain View, CA: OCLC and RLG), http://www.oclc.org/research/projects/pmwg/premisfinal.pdf.
70 PREMIS Working Group, 2005, p. 9.
We might now wonder where metadata come from. Some metadata is inherent in the digital object itself (as we have seen from the discussions of legal cases above). A small amount can be extracted automatically (e.g. the National Library of New Zealand (NLNZ) tool extracts technical metadata from a narrow class of digital object types) or is captured at creation, modification or ingest (e.g. event metadata). However, most metadata must be generated by human intervention, and this is especially true for contextual metadata. It is this need for human intervention that makes metadata so costly and the relationship between cost and benefit so difficult to justify. How much metadata is enough and what is too much? Crucial in all this is establishing a cost–benefit relationship between the effort necessary to create the metadata and the usefulness of the metadata in ensuring access, management and preservation. Currently metadata is labour intensive to create and one of our objectives should be to change this. Recovering digital objects and recreating the interrelationships between them without adequate metadata is complex, as the data recovery industry shows.73 Metadata interoperability will become increasingly important.
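Before taking up interoperability, a brief illustration of what automatic capture can and cannot do may help. The following sketch gathers purely technical metadata for a single file; it stands in for no particular tool (it is not the NLNZ extractor mentioned above), and it shows why descriptive and contextual metadata still call for human intervention: none of them can be read off the bit stream.

```python
import hashlib
import mimetypes
import os
from datetime import datetime, timezone

def extract_technical_metadata(path):
    """Gather the kind of technical metadata that can be captured without
    human intervention: size, timestamp, a checksum and a rough format
    guess based on the file name.  Nothing here says what the object is
    about, who created it or why it matters."""
    stat = os.stat(path)
    sha1 = hashlib.sha1()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            sha1.update(block)
    mime, _encoding = mimetypes.guess_type(path)
    return {
        "path": path,
        "size_bytes": stat.st_size,
        "last_modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
        "sha1": sha1.hexdigest(),
        "mime_type_guess": mime or "application/octet-stream",
        "extracted_at": datetime.now(timezone.utc).isoformat(),
    }

if __name__ == "__main__":
    # Run against this script itself simply to show the output shape.
    print(extract_technical_metadata(__file__))
```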
The Clever Recordkeeping Metadata Project (CRKM) (at Monash University) is leading the effort to support the exchange of metadata between different business, archival and records management systems. This ability to translate metadata between systems will allow us to go beyond the use of metadata to support the re-use over time of digital entities and their exchange between repositories to: enable archivists in the 21st century to go beyond Scott’s original vision of sequential multiple provenance to build archival systems that encompass Chris Hurley’s ‘parallel provenance and Jeannette Bastian’s communities of records and negotiate the complex matrices of mutual rights and obligations that Eric Ketelaar’s vision of shared ownership and joint heritage invokes.74 Curiously, we have few practical examples of the implementation of preservation metadata and, to the best of my knowledge, can point to no cases as of the end of 2005 where they have been used in practice to support preservation processes. 8. AUTOMATION AND PRESERVATION What is evident is that automation is a critical step in the development of preservation solutions. 75 For example, the quantities, quality and level of metadata consistency required for managing digital objects within repositories require that its extraction be in some way automated.76 To aid the process of ingest, selection and appraisal, for the preservation of digital material, the goal of a Glasgow-led team is to look at ways of automating the semantic metadata extraction process and create a prototype tool, and to integrate this tool with other metadata extraction tools and ingest processes used to underpin the automatic population of document repositories.77 While our research focuses on metadata extraction in the area of textual documents, some very good work has been done with audiovisual content elsewhere.78 We are using linguistic and layout 74 75 Ross, S. and Hedstrom, M., ‘Preservation Research and Sustainable Digital Libraries’, International Journal of Digital Libraries, vol. 5.4, 2005, http://eprints.erpanet.org/archive/00000095/, pp. 317-325. 76 Greenberg, J., Spurgin, K. and Crystal, A., ‘Final Report for the AMeGA (Automatic Metadata Generation Applications) Project’, International Journal of Metadata, Semantics and Ontologies, vol. 1.1, 2006, http://www.loc.gov/catdir/bibcontrol/lc_amega_final_report.p df, pp. 3-20. 77 Ross, S. and Kim, Y., ‘Digital Preservation Automated Ingest and Appraisal Metadata’, in Thanos, C. (ed.), DELOS Research Activities, (Pisa, 2005). 78 71 72 ibid. ISO 23081-1:2005, Information and Documentation Records Management Process – Metadata for Records – Part 1– Principles, (Geneva, 2005), in conjunction with subsequent two sections, will provide mechanisms to implement and use metadata within the framework of ISO 15489, Information and documentation – Records management, (Geneva, 2001). 73 Evans, J., McKemmish, S. and Bhoday, K.,‘Create Once, Use Many Times: The Clever Use of Recordkeeping Metadata for Multiple Archival Purposes’, 15th Annual International Congress on Archives, (Vienna, 2004), http://www.wien2004.ica.org/imagesUpload/pres_174_MCK EMMISH_Z-McK%2001E.pdf, p. 13. PrestoSpace, the FP-funded project in the area of digital curation of audiovisual materials has produced two state-of-the-art reports that deal with these issues – one looking at approaches to automated analysis of audiovisual content, Bailer, W., Höller, F., Messina, A., Airola, D., Schallauer,P. 
and Hausenblas, M., PrestoSpace Deliverable D15.3 MDS3 State of the Art of Content Analysis Tools for Video, Audio and Speech, (Paris, 2005), Ross and Gow, (1999), op.cit.. S Ross, 2006, ‘Approaching Digital Preservation Holistically’, in A Tough and M Moss (eds.), Information Management and Preservation, (Oxford: Chandos Press) © Seamus Ross -13- analysis techniques to automate this process of metadata extraction. The research within this task can be divided into six domains: (a) selecting metadata to be extracted and that can be extracted; (b) integrating previous and current related research; (c) designing a prototype metadata extraction tool; (d) implementing a prototype metadata extraction tool; (e) establishing a well-designed corpus of documents to validate the effectiveness of the prototype; and (f) testing and refining the prototype. Progress has been made with (a) to (c) above. As this research focuses mainly on automation of the acquisition of descriptive metadata, it will be of most immediate value to the digital library community and will only provide a part of the metadata required in the archival environment. Even where the exchange of a limited range of metadata can be instituted it results in reduced costs and improved accessibility of digital information by end users. There are numerous projects that aim to improve access to information through maximising the availability and use of metadata. For instance, the Commonwealth Metadata Pilot Project aims to improve access to Australian Government information published online by automating the contribution of metadata to the national bibliographic database, and by automating the archiving of content associated with the metadata in PANDORA, Australia’s web archive.79 In general, however, if preservation is to become mainstream then automation of as wide a range of processes as is possible is essential. Automation combined with workflow modelling and streamlining will reduce costs and help to focus preservation processes. Stephan Heuscher, formerly of the Swiss Federal Archives, has argued that, if we are to manage the process of ingest effectively, we need both methods and tools for modelling these processes.80 Essentially the objective has to be to shift as much of the preservation activity from a manual process to an automatic (or at least a semi-automatic) one. By automating the work flows we can integrate services (e.g. checking of format or representation information registries to the form in which a particular piece of data is represented is appropriate), logging and record creation can be standardised, costs can be reduced, errors eliminated, and security and reliability enhanced. 9. INGEST One of the challenges for digital preservation professionals is to decide at what level to ingest materials. Imagine for a moment that your archives or http://www.prestospace.org/project/deliverables/D153_Content_Analysis_Tools.pdf. 79 http://www.nla.gov.au/ntwkpubs/gw/65/html/p04a01.html 80 Heuscher, S., ‘Workflow Modelling Language Evaluation for an Archival Environment,’ Archivi & Computer, XIV (3/04), 2004, pp. 123-140. library was offered an unplanned bequest of the computer files of a local Member of Parliament who had unexpectedly died. The executors of his will offer to make copies of the files on to portable media themselves or to deposit copies along with the laptop and desktop computers the parliamentarian had been using for the past three and six years respectively. What do you do? 
Obviously the copy on to some portable medium has advantages; it is easy to handle and can probably be put on a shelf until you have time to deal with it (say, for a year or so), but a copy, as any computer forensic professional will assure you, is not necessarily complete – metadata may be missing and certainly deleted files will not appear in the copy although they might be present on the existing hardware.81 If you do not accept the computers, you should at least seek a clone of the discs. In most situations this kind of scenario will not arise as most ingests will be planned and the archive will have some influence over the process of preparation of material for submission. Indeed it may even have had a hand in the specification of the system that was used to generate the digital materials. The ingest process is not just about ensuring that the digital materials that you have acquired are effectively placed in the repository; it is also about controlling the ingest process so that you can verify that the items ingested have not compromised the reliability, usability or authenticity of the digital entities themselves. This ensures that no extra baggage (e.g. viruses) is inadvertently brought into the repository, which might put the contents of the repository at risk. Archives will be ingesting digital objects from one of four main classes. Each of these classes has its own unique properties, which have an impact on an archive’s ability to select, acquire, manage, preserve and provide access to them. The ease with which the archive can control the process by which materials are prepared for appraisal and acquisition and the amount of effort that will be involved in ingesting the material into its repository will depend not only on the types of objects. It will also reflect such factors as the number of instances of the object type, the complexity of any individual instance, the file formats, and to a lesser extent the volume of the object. For example, it may prove less labour intensive and technically challenging to ingest large digital objects of low complexity in comparison with large numbers of smaller composite digital objects created using specialised software and having a high degree of interconnectedness between elements. The ease of handling composite objects will also be influenced by the degree of stickiness of the bindings between the composite elements. 81 A good summary of this digital persistence can be found in Chapter 7 of Farmer, D. and Venema, W., Forensic Discovery, (Boston, 2004). See also Ross and Gow, 1999. S Ross, 2006, ‘Approaching Digital Preservation Holistically’, in A Tough and M Moss (eds.), Information Management and Preservation, (Oxford: Chandos Press) © Seamus Ross -14- Digital entities are likely to be presented to collecting or institutional archives in one of four ways. The handling of each class of digital material will require different policies and procedures. In each case it is likely that collecting archives will encounter a greater diversity of content than, say, corporate or institutional archives. Portable Objects – includes CD-ROMs, tapes, solid state devices and other portable media that house content ranging from databases, documents, audio, images or even software. The Collecting Archive is likely to have little control over how this material is submitted, although some archives will be in a position to collaborate with the content creators to improve the process by which digital objects are created and presented for archiving. 
In many instances organisations will put in place systems that add metadata and functionality to ease ingest. The diversity of organisations producing digital objects on portable media and the variety of types of objects mean that the archives will be confronted with an expanding, rather than narrowing, range of digital objects. Some will require specialised analysis and attention if they are to be ingested into the digital repository and even then it may only prove feasible to preserve the bit stream and not the capability to render the content of the object or to recreate its functionality. The archive will need to decide whether ingesting the bit stream is sufficient or whether the original medium (and any packaging) needs to be retained as well, even though it is unlikely that suitable peripheral devices (e.g. tape drives) will be available in the future to access the material subsequently.82 Transfers – these will usually arrive as electronic transfers (frequently online) from within the organisation itself, although in the case of national or local archives the depositing bodies may be different institutional units. In general, these will be planned deposits and the institutional archive will have some control over how the content was created and submitted (e.g. formats, documentation). Unpublished Personal Digital Materials – these digital objects will be mainly documents (e.g. drafts of 82 Johan Steenbakkers in The NEDLIB Guidelines – Setting up a Deposit System for Electronic Publications, (The Hague, 2000), argued that digital documents should be separated from their original carrier because the carriers were intended for publishing and not for archiving. While in digital management terms he is absolutely correct, there may be some curatorial benefits from retaining the original carriers. D. Swade, Science Museum (London), has for more than a decade promoted this view (‘Collecting Software: Preserving Information in an Object-Centred Culture’, in Ross, S. and Higgs, E. (eds.), Electronic Information Resources and Historians: European Perspectives, (St Katharinen, 1993), pp. 93-104). Indeed, in at least one legal case the carrier was considered to be metadata. publications, e-mails) of authors and politicians and are most likely to be encountered in a collecting archive. For the most part these will in the near term be produced with fairly standard application packages and be primarily stand-alone documents, suites of images, small-scale personal databases or e-mail records. Discussions with potential depositors would provide information about how the material should be configured and presented for deposit and enable the depositor to provide contextual metadata which otherwise might be unobtainable. For instance, where depositors can be encouraged to produce crucial metadata or where they can note the interrelationship between particular materials, the processes of ingest and cataloguing can be enhanced and the labour required reduced. Outputs of Digitisation Programmes – archives increasingly aim to represent their analogue holdings in digital form and, as Anderson’s chapter in this volume makes evident, users expect access in digital form even to materials that were created in analogue. By controlling how digital objects are created, the metadata that is created along with them, and the processes by which they are delivered to the digital repository, the effort required to ingest can then be contained. 
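Whichever of these routes material arrives by, the archive needs a way of confirming, both at receipt and at any later point, that what it holds is what it received. One common and simple device is a checksum manifest created when a submission is taken in and re-checked thereafter. The sketch below is illustrative only; the function names are invented and no particular packaging standard is implied.

```python
import hashlib
import os

def build_manifest(root):
    """Walk a submission directory and record a SHA-1 checksum for every
    file, keyed by its path relative to the submission root."""
    manifest = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            sha1 = hashlib.sha1()
            with open(path, "rb") as handle:
                for block in iter(lambda: handle.read(65536), b""):
                    sha1.update(block)
            manifest[os.path.relpath(path, root)] = sha1.hexdigest()
    return manifest

def compare(manifest_then, manifest_now):
    """Report files that have been altered, lost or newly introduced
    since the earlier manifest was made."""
    changed = [p for p, digest in manifest_now.items()
               if p in manifest_then and manifest_then[p] != digest]
    missing = sorted(set(manifest_then) - set(manifest_now))
    added = sorted(set(manifest_now) - set(manifest_then))
    return {"changed": changed, "missing": missing, "added": added}
```

Re-running build_manifest over the stored copy and passing both results to compare gives a simple fixity report; it says nothing, of course, about deleted files or system metadata that never made it into the copy in the first place.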
The nature of the objects has an impact on the effort that is involved in bringing the objects into the archives holdings. Standards in handling digital objects have an impact on their preservation, authenticity and integrity, and how they can be delivered to users. It is widely recognised that, where preservation functionality can be built into systems that are employed to create the digital objects, the costs of selection, ingest, preservation and access can be reduced.83 The application of the Continuum model (see chapter 1 in this volume) should mean that those institutions and individuals using a records management system would not encounter situations of this kind. In general, collectors of digital objects rarely have control over the construction of any of the digital objects that they will be ingesting. Cunningham and Phillips, commenting on the situation in Australia in relation to government publications, noted that ‘[t]here is no standard approach by government agencies for creating, describing and organising their publications, and each is different.’84 They recognise the need to impose standards if the costs of ingesting these materials are to be contained.85 ERPANET investigated issues surrounding the ingest 83 Ross, (2000), op. cit., p. 15. 84 Cunningham and Phillips, (2005), op. cit., p. 310. 85 Ross, S., Digital Library Development Review, National Library of New Zealand, (Wellington, 2003), http://www.natlib.govt.nz/files/ross_report.pdf, pp. 43-52. S Ross, 2006, ‘Approaching Digital Preservation Holistically’, in A Tough and M Moss (eds.), Information Management and Preservation, (Oxford: Chandos Press) © Seamus Ross -15- of some classes of digital materials and released guidance on the process.86 10. PERHAPS A MODEL WILL HELP Adopting a shared model to approach digital preservation may improve the communication between curators and users. One preservation model in common use is the Reference Model for an Open Archival Information System (OAIS) (ISO14721:2003), 87 which specifies a conceptual framework for a generic archival system. OAIS was developed by key players in the space community under the aegis of the Consultative Committee for Space Data Systems and is now an ISO Standard. In creating OAIS, space researchers noted that observations made in space science were both irreplaceable and were not reproducible.88 If the data were to be made accessible in the future, they and their associated metadata would need to be moved across different technologies. The model reflects a recognition that information will be represented in different formats and that these representations will change over time. OAIS details all the functions of a preservation environment. It charts the preparation, submission, storage, maintenance, retrieval and delivery of digital objects. The model is implementation independent and could be delivered using a range of technologies and at a variety of scales. The OAIS reference model establishes a common framework of terms and concepts, and maps the basic functions of an archival system: ingest, data management, archival storage, administration, preservation planning, and access. Representation Information (RI) details how the intellectual content is represented and how to extract meaningful information from a stream of bytes. Content Information (CI) is the combination of the original bit stream along with the Representation Information. 
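The pairing of a bit stream with its Representation Information can be made concrete with a small sketch; the classes below are illustrative only and are not defined by the OAIS standard itself.

```python
from dataclasses import dataclass

@dataclass
class RepresentationInformation:
    """Whatever is needed to turn the bits back into meaningful content:
    format details and pointers to documentation and semantics."""
    format_name: str
    format_documentation: str
    semantic_notes: str = ""

@dataclass
class ContentInformation:
    """An OAIS-style pairing of the original bit stream with the
    Representation Information required to interpret it."""
    data_object: bytes
    representation_info: RepresentationInformation

# A trivial example: a comma-separated table is only intelligible if we also
# record how the bytes are encoded and what the columns mean.
ci = ContentInformation(
    data_object=b"year,accessions\n2004,132\n2005,158\n",
    representation_info=RepresentationInformation(
        format_name="CSV, UTF-8 encoded",
        format_documentation="local format description, version 1",
        semantic_notes="column 1 = calendar year; column 2 = accessions received",
    ),
)
print(ci.representation_info.format_name)
```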
Preservation Description Information (PDI) contains the additional information required to identify the object, describe the processes applies to it, and support the understanding of the content over time. A Designated Community identifies the users of the preserved data, and itself will change over time. The Knowledge Base for a Designated Community indicates that the group has some prior level of knowledge and familiarity with the content, 86 87 88 ERPANET, 2004, erpaguidance: Ingest Strategy, http://www.erpanet.org/guidance/docs/ERPANETIngestToo l.pdf Reference Model for an Open Archival Information System (OAIS) – ISO 14721,( 2002), http://www.ccsds.org/documents/650x0b1.pdf. Esanu, J., Davidson, J., Ross, S. and Anderson, W., ‘Selection, Appraisal, and Retention of Digital Scientific Data: Highlights of an ERPANET/CODATA Workshop’, Data Science Journal, 3, 30 December 2004, http://journals.eecs.qub.ac.uk/codata/journal/Contents/3_04/ 3_04pdfs/DS390.pdf, p. 230. allowing the Representation Information required to be limited. The model classifies information into object information classes – content, preservation description, packaging and description. A Content Data Package is then represented using a series of information packages; a container encapsulates the content information and the preservation description information (PDI). There are also information packages for submission, archival storage/management and dissemination, which identify the changes that are made to the information package as it passes through the system. Key to OAIS is the concept of the designated community and the relationship between the producer and the consumer of information. The key step is to use OAIS to develop repositories. It is the repositories that will ‘carry the load’. Where these repositories instantiate concepts expressed in the OAIS model, they go a long way towards ensuring that the architectures will be robust enough to ensure the ability to ingest, manage and make accessible the materials the designated communities expect them to handle. Do bear in mind, however, that the OAIS model is just one model preservation environment. The OAIS model is not without its critics,89 and there are other approaches emerging such as a bottom-up approach proposed by David Rosenthal and his colleagues90 or the one that underpins ISO15489. 11. Get a Digital Repository Underpinning all digital preservation activities is repository design. Repositories are not unlike the buildings that house traditional archives and libraries – they need to be renewed and the contents they hold shifted to upgraded shelving or newer environments.91 As we have seen earlier, one of the characteristics of technology is its fluidity. This means that a repository is only a temporary holding bay, even if we are thinking in five- to ten-year time spans. Although taking a strongly OAIS and library-centric view, OCLC and RLG’s report on Attributes of Trusted Repositories92 provides a high-level model for the design, delivery and maintenance of a digital repository. It outlines the processes that need to be certified and auditable if an institution is to be said to be running a trusted digital repository. For example, they press for clear statements 89 See, for instance, PREMIS Working Group, 2004, pp. 27-28. 90 Rosenthal, D.S H., Robertson, T., Lipkis, T., Reich, V. and Morabito, S., ‘Requirements for Digital Preservation Systems’, D-Lib Magazine, 11(11), 2005. 91 Anderson, S. 
and Heery, R., Digital Repositories Review, (London, 2005), http://www.jisc.ac.uk/uploaded_documents/digitalrepositories-review-2005.pdf 92 RLG/OCLC Working Group on Digital Archive Attributes, 2002, Trusted Digital Repositories: Attributes and Responsibilities, http://www.rlg.org/longterm/repositories.pdf S Ross, 2006, ‘Approaching Digital Preservation Holistically’, in A Tough and M Moss (eds.), Information Management and Preservation, (Oxford: Chandos Press) © Seamus Ross -16- by repository owners on such matters as policies and assumptions (e.g. practices, environment and security), definition of processes in place to manage fidelity checks for ingest, and metadata creation and management processes. Central to the RLG/OCLC model is the recognition that all processes related to the running of the repository need to be well documented, auditable and validated. At the simplest level, a repository must be able to accept digital objects regardless of type, format or medium.93 Once the items have been ingested into the repository they must be held in a secure way and the authenticity and integrity of the digital entities must not be compromised. Materials ingested into the repository must be capable of being output in formats that could be ingested into a ‘next-generation repository’. Documentation must be accessible and disaster recovery functionality inherent in the design. Repositories will be used by both people and machines. Repositories will be of value to depositors and to organisations needing to ensure secure access to their digital records and assets. During the process of ingesting materials, metadata including descriptive, administrative and contextual types will be attached to the objects. Among these metadata will be persistent identifiers. Persistent identifiers provide a method of uniquely identifying a digital object. There are almost as many persistent identifier schemes as there are flavours of ice cream. Examples include DOI (Digital Object Identifier), Persistent Handles, ARK (Archival Resource Key) and Persistent Uniform Resource Locator (PURL). Each one has its strengths.94 Your repository will need to adopt one scheme and decide at what level of granularity it will be used. Change will be a feature of all repositories. The underlying storage technologies will be replaced on a regular basis, services will be closed down and new ones started, and workflows will be adapted as technology, policies or processes change. The holdings of the repositories will need to be moved to new storage media (i.e. refreshed), migrated or just emulated. If change is a feature of repositories, then flexibility in technical infrastructure and organisational approach is the necessary response. risks. Among the risks are accidental loss of records and digital assets, information leakage,95 record duplication, unauthorised modification, loss of accessibility, severing of the relationship between the data and their metadata, and loss of provenance evidence, 96 integrity and authenticity. The Swiss Federal Archives’ Archiving of Electronic Digital Data and Records (ARELDA) project sought a solution to the permanent archiving of digital records to enable it to fulfil its obligations under the Swiss Federal Archives Act. They are investing some eleven million euros in the development of the digital archive during the period between 2001 and 2008. They expect their data storage requirements will grow at around 20 TB per year. 
We could have chosen the UK National Archives (Kew), the Dutch National Archives, Swedish National Archives, the National Library of New Zealand (NLNZ) or US National Archives as our examples. All these institutions are making substantial investments in establishing digital repositories to facilitate the storage of and access to digital assets, whether records or published resources. These are all expensive operations with development costs ranging from twenty-four million NZ$ in the case of the NLNZ to more than three hundred million US$ in the case of National Archives and Records Administration (NARA). These will all be gold-plated solutions. Most institutions cannot afford solutions of this kind. This does not mean that you should do nothing. The National Archives of Australia (NAA), in defining its digital preservation repository for public records, has interpreted the OAIS model in a way that has made it possible for them to establish a cost-effective The functional preservation environment.97 requirements at repository level can be made quite simple where the need is to link the digital object, which might be stored as a bit stream in the file system 95 There are many high-profile cases of this, among them the loss by United Parcel Service (May 2005) of a backup tape belonging to the retail division of Citigroup in the USA and containing social security numbers and transaction histories on both open and closed accounts for nearly four million customers. 96 For a helpful discussion of provenance from the vantage of scientific datasets, see pages 179-183 of the Digital Archiving Consultancy (DAC), the Bioinformatics Research Centre (University of Glasgow (BRC)) and the National eScience Centre (NeSC), 2005, Large-scale data sharing in the life sciences: Data standards, incentives, barriers and funding models (The ‘Joint Data Standards Study’), prepared for the Biotechnology and Biological Sciences Research Council, the Department of Trade and Industry, the Joint Information Systems Committee for Support for Research, the Medical Research Council, the Natural Environment Research Council, the Wellcome Trust, http://www.mrc.ac.uk/pdf-jdss_final_report.pdf 97 Stephen Ellis and Andrew Wilson of the NAA described the plans during the March 2003 interview. In this volume Currall examines the security issues associated with digital information. Here we note only a few of the risks associated with information because, with an effective repository supported by robust policies and procedures, it is possible to manage these 93 It might be possible to run a repository that specialised in handling a narrow range of object representation types, for example only handling image formats. 94 ERPANET, 2004, Workshop on Persistent Identifiers, http://www.erpanet.org/events/2004/cork/Cork%20Report.p df S Ross, 2006, ‘Approaching Digital Preservation Holistically’, in A Tough and M Moss (eds.), Information Management and Preservation, (Oxford: Chandos Press) © Seamus Ross -17- with an XML or SQL database containing the metadata about the object. The digital repository structure deployed by the NAA is designed to minimise access to the underlying repository layer by users. By isolating users from the raw storage, the repository provides an additional layer of information security. Another key feature is the decision by the designers to enable the objects to be examined at ingest (e.g. 
checked for such unwanted payloads as viruses), processed and wrapped before being placed into the repository.98 The heart of repositories is not the technology. It is the policies and procedures that underlie them: deposit agreements, submission information guidelines, management plans, access policies, disaster recovery plans and preservation strategies (e.g. migration). The greatest challenge to the survival of repositories is not the technology, but the organisational and cultural apparatus that makes the operations work and how the institution establishes the trust of the communities of repository users. How can a repository secure the trust of depositors, users (people and machines) and regulatory bodies that it has the mechanisms in place to secure digital assets for the long term? What steps will it need to take to maintain that trust? And, most importantly, what happens if it loses it? Repository management can be a highly complex task. One way to reduce the complexity is to identify a set of basic repository management functions such as storing, copying, depositing and maintaining disparate types of data. For the objects and metadata it holds, a digital repository must provide secure storage, facilitate the maintenance of integrity and authenticity, and permit the authorised destruction of items. Five primary functions that must be enabled at an administrative level are ingest, retrieve, track, verify and destroy; at a user level, retrieval and verification are the key services needed.99 A number of projects have focused on laying down the foundations for the long-term storage of digital objects.100 There are projects that have developed off-the-shelf architectures and solutions: Flexible Extensible Digital Object and Repository Architecture (Fedora),101 DSpace102 and LOCKSS.103 None of these is a general information preservation environment, nor can any of them fulfil the requirements of being a trusted repository application. More importantly, none of these repository models or applications integrates preservation functionality. The DELOS Preservation Cluster’s Cologne Team has led the establishment of a complete design for the persistency modules that need to be included in the design process of any digital repository.104 The work completed includes a Unified Modelling Language (UML) model for preservation components that could be used in the design of any digital repository. One way to address the trust problem noted above is to establish an audit and certification process for repositories. So far, mechanisms to support the audit and certification of repositories are still in the developmental phase. In the USA, RLG and NARA established the Digital Repository Certification Task Force, which in 2005 published a draft checklist of certifiable elements of a digital archive.105 There are, however, approaches open to institutions of all sizes, from archives to companies, that will position your organisation to take advantage of audit and certification schemes that may come into place. With the best will in the world, audit processes do not shield organisations against security breaches or disasters that result in leakage or loss of information.
98 http://www.naa.gov.au
99 Ross, S., (2003), op. cit.
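The five administrative functions can be sketched in outline. The toy example below assumes, purely for illustration, a plain directory for the bit streams and an SQLite database for the metadata and an event log, in the spirit of the file-system-plus-database arrangement described above; the class and schema are invented, and it deliberately omits access control, authenticity safeguards and every policy element that the surrounding text argues is the real heart of a repository.

```python
import hashlib
import os
import sqlite3
from datetime import datetime, timezone

class MiniRepository:
    """Toy illustration of the administrative functions named above:
    ingest, retrieve, track, verify and destroy.  Objects are kept as bit
    streams on disk; metadata and an event log live in SQLite."""

    def __init__(self, root):
        self.root = root
        os.makedirs(os.path.join(root, "objects"), exist_ok=True)
        self.db = sqlite3.connect(os.path.join(root, "metadata.db"))
        self.db.execute("CREATE TABLE IF NOT EXISTS objects "
                        "(id TEXT PRIMARY KEY, sha1 TEXT, title TEXT)")
        self.db.execute("CREATE TABLE IF NOT EXISTS events "
                        "(object_id TEXT, event TEXT, at TEXT)")

    def _log(self, object_id, event):
        self.db.execute("INSERT INTO events VALUES (?, ?, ?)",
                        (object_id, event, datetime.now(timezone.utc).isoformat()))
        self.db.commit()

    def ingest(self, object_id, data, title):
        with open(os.path.join(self.root, "objects", object_id), "wb") as f:
            f.write(data)
        self.db.execute("INSERT INTO objects VALUES (?, ?, ?)",
                        (object_id, hashlib.sha1(data).hexdigest(), title))
        self._log(object_id, "ingest")

    def retrieve(self, object_id):
        self._log(object_id, "retrieve")
        with open(os.path.join(self.root, "objects", object_id), "rb") as f:
            return f.read()

    def verify(self, object_id):
        stored = self.db.execute("SELECT sha1 FROM objects WHERE id=?",
                                 (object_id,)).fetchone()[0]
        with open(os.path.join(self.root, "objects", object_id), "rb") as f:
            ok = hashlib.sha1(f.read()).hexdigest() == stored
        self._log(object_id, "verify:" + ("ok" if ok else "FAILED"))
        return ok

    def track(self, object_id):
        return self.db.execute("SELECT event, at FROM events WHERE object_id=?",
                               (object_id,)).fetchall()

    def destroy(self, object_id):
        os.remove(os.path.join(self.root, "objects", object_id))
        self.db.execute("DELETE FROM objects WHERE id=?", (object_id,))
        self._log(object_id, "destroy")
```

Nothing in code of this kind addresses deposit agreements, disaster recovery or the winning and keeping of trust; those remain organisational matters, which is precisely the point.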
If your organisation wishes to establish itself as a trusted repository, the following nine steps should lay the foundation for a sustainable repository infrastructure: • Define the objectives and aims of your repository and from those specify the services it will provide. The objectives and services should be documented. • Determine whether your organisation is best placed to develop a repository on its own or whether it should establish a shared repository or purchase use of repository services. • Develop policies and procedures for managing all processes: ingest, data management, archival storage, administration, preservation planning, and access. • Put in place mechanisms to monitor the application of these policies and how effective they are. • Define senior management roles and responsibilities in relationship to repositories. 100 ibid. 104 101 http://www.fedora.info An excellent discussion of Fedora can be found in ‘The Mellon Fedora Project: Digital Library Architecture Meets XML and Web Services’, (Payette, S. and Staples, T., in Agosti, M. and Thanos, C. (eds.), ECDL 2002, LNCS 2458, pp. 406-421). Herrmann, V. and Thaller, M., ‘Integrating Preservation Aspects in the Design Process of Digital Libraries’, DELOS Deliverable 6.5.1, 2005, http://www.dpc.delos.info/private/output/DELOS_WP6_d6 51_finalv3_5_cologne.pdf 105 RLG/NARA Task Force on Digital Repository Certification: Audit Checklist for Certifying Digital Repositories, http://www.rlg.org/en/pdfs/rlgnararepositorieschecklist.pdf. 102 http://dspace.org/index.html 103 http://lockss.stanford.edu/ S Ross, 2006, ‘Approaching Digital Preservation Holistically’, in A Tough and M Moss (eds.), Information Management and Preservation, (Oxford: Chandos Press) © Seamus Ross -18- • Ensure that all services, technologies (hardware and software), exceptions and practices are documented. • Develop and maintain risk registers that clearly identify risks, indicate their likelihood, specify their probable impact, describe how you would address the risk if it did occur, and note what you are doing to avoid its arising. • Maintain status reports and minutes of meetings. • Define, implement, monitor and test disaster recovery services. 12. FUTURE RESEARCH DIRECTIONS Digital curation and preservation is a fertile research domain. Research problems include theoretical issues, methodological challenges and practical needs. After more than twenty years of research in digital curation and preservation, the actual theories, methods and technologies that can either foster or ensure digital longevity remain startlingly limited. If you contrast Roberts (1994) with Tibbo (2003) it is obvious that, although our understanding of the problems surrounding digital preservation has advanced, the approaches to preservation remain limited.106 There are many possible explanations for this situation; for instance, there has been a lack of appreciation of the research challenges posed by digital preservation, a lack of a sense of urgency, the lack of proven business cases that might have encouraged the development of this as a research or technology sector, the fact that in the past the research agenda has been driven by information professionals working in memory institutions or corporate records management teams, the limited funding for this kind of research and, of course, the speed of technological development. 
Recently, changes in the research and technology landscape have raised research interest in the challenges surrounding digital curation and made it evident that there are substantial commercial opportunities. For instance, in 2001 the National Science Foundation (NSF) in the United States, through its Digital Library Programme (DLI2), and the European Commission, through the Network of Excellence it funds in the area of Digital Libraries (DELOS), supported a workgroup to propose a research agenda in the area of digital preservation.107 This research agenda has been used by the European Commission and others to plan funding programmes. More recently an international workshop co-sponsored by the Joint Information Systems Committee (JISC), the Digital Curation Centre (DCC), The Council for the Central Laboratory of the Research Councils (CCLRC) and the British Library brought together experts to examine the challenges to digital curation and what our research focus should be during the next decade.108
13. CONCLUSIONS
Long-term access to digital materials is a process. Charles Dollar commented that to secure digital materials for 100 years we should think in shorter periods because there is currently no 100-year solution.109 Preservation is hard and the hype has made it harder. There is a general belief that digital technologies make work easier and securing access cheaper. In many ways this is so. Preservation requires active engagement. The solutions that you put in place today you will replace tomorrow. You should think of digital preservation as a dynamic process that requires focus, policies, procedures and planning. You should not think of it as primarily a technical activity, because it is not. Digital preservation processes should ensure that we pass usable, authentic and reliable evidence to the future. Current approaches are inadequate. There are seven proactive steps that practising archivists and records managers can take to ensure that they are acting to secure digital records and resources in their care:
• Keep appropriately skilled up.
• Act as an active advocate for digital preservation activities.
• Ensure that your organisation has effective policies and procedures governing the creation, management (both retention and disposal) and curation of digital materials.
• Be attentive to the maintenance of the digital materials in your care (e.g. note when it is time to refresh media, track the formats in which your holdings are represented to make certain that you migrate before ‘migration pathways’ for those formats disappear).
• Avoid proprietary standards for representation, encoding, software, hardware and especially for backup services.
• Do not assume that there is a single solution to all your preservation challenges or that, if you adopt one approach for a set of digital materials at a given time, you will not in ten years’ time use a different approach.
• Whatever preservation approaches you apply (e.g. media refreshing, migration, emulation), they must be controlled, monitored, documented, audited and validated.
The conceptual and methodological developments in the creation and management of trusted repositories, user needs assessment and evaluation, and the appraisal and ingest of digital materials are radically altering how we think about and handle digital materials. By understanding just how these developments enable us to document society in the digital age, we can appreciate the impact that changes in the way we communicate and create documents are having on the record we will be passing to future generations. The commoditisation of information leads to a change in the perception of how and why it should be managed. The archival profession is at the heart of this change.
14. NOTE ON WEB SITE CITATIONS
All sites cited in this article were accessed in February 2006.
15. ACKNOWLEDGEMENTS
Thanks to my Digital Curation Centre colleagues Andrew McHugh, Maureen Pennock and Adam Rusbridge, and to Hans Hofman of the Dutch National Archives, Professor Andrew Prescott of the University of Sheffield, Professor Helen Tibbo of the University of North Carolina (Chapel Hill) and Alistair Tough of the University of Glasgow, who all made valuable suggestions. I am grateful to Michael Day for discussions of PREMIS. Any errors or omissions are, of course, my own.
106 Roberts, D., ‘Defining Electronic Records, Documents and Data’, Archives and Manuscripts, 22 (May 1994), pp. 14-26; Tibbo, H.R., ‘On the Nature and Importance of Archiving in the Digital Age’, Advances in Computers, 57, 2003, pp. 1-67.
107 Ross and Hedstrom, (2005), op. cit.; Hedstrom, M. et al., Invest to Save: Report and Recommendations of the NSF-DELOS Working Group on Digital Archiving and Preservation, Report of the European Union DELOS and US National Science Foundation Workgroup on Digital Preservation and Archiving, (Pisa & Washington DC, 2003), http://delosnoe.iei.pi.cnr.it/activities/internationalforum/JointWGs/digitalarchiving/Digitalarchiving.pdf (alternatively at http://eprints.erpanet.org/94/01/NSF_Delos_WG_Pres_final.pdf).
108 Digital Curation and Preservation: Defining the research agenda for the next decade, report of the Warwick workshop, 7-8 November 2005, http://www.dcc.ac.uk/training/warwick_2005/Warwick_Workshop_report.pdf.
109 Charles Dollar, HATII University of Glasgow and Digital Curation Centre, conversation, 6 December 2004.