How it could work…

The following workflows show how the open source tools and preservation techniques that the project has tested and explored could fit into a wider process for the effective management of electronic records.

The approach relies on:

  • adapting the available functionality of a well-managed file server system
  • training and relationship building with relevant staff by the records manager
  • the identification of sets or classes of electronic records held on shared drives that need to be captured and managed in a structured way
  • a commitment to transform use of the ‘shared drive’ from a long-term, unstructured collection of documents of varying relevance and quality to a temporary, collaborative area with strict controls on size and data retention

I’ve included the full document, Records management workflows, which sets out several other ‘scenarios’ for managing electronic records using these tools.

‘The DROID we are looking for?’ – Using the open source tools

One of the objectives of this project is to examine how useful and practical the open source preservation tools are.

I thought it would be worth recording my experience of installing and using the software in order to carry out the tests for the project. I’ve heard so many people over the years, from my mother to system administrators, use the phrase ‘I’m not very technical or IT-savvy’, so I won’t say that. I will be honest and confess that I am just IT-savvy enough to embark on something like this, but not savvy enough to master it. I think many of us working in records management occupy this sort of space, which is why none of the following should be attempted without the input and guidance of your IT department.

Deciding to use open source tools to support records management is a different approach for a records manager. We are used to stages such as building a business case, procuring a system and implementing it. Open source tools are a lot simpler: find one on SourceForge and – if your IT department says it’s OK – click download, and there it is on your desktop.

DROID

According to its website DROID (Digital Record Object Identification) is “a software tool developed by The National Archives to perform automated batch identification of file formats”.

I found downloading this tool very simple, and it is very intuitive to use.

The functionality of DROID that stood out for me was the analysis and reporting that it could produce. Fundamental to my conception of the business case for electronic records management is the amount spent on storage and back-up of large, unstructured shared drives. Whilst I had used simple ‘search’ analysis in Windows Explorer to assess shared drives, DROID did the same reporting more quickly and could export it as a CSV file for further analysis. The possibility of producing metrics to support a business case is therefore much enhanced with DROID.
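To illustrate the sort of analysis I mean, here is a minimal sketch that totals file counts and sizes per format from a DROID CSV export. The column names (TYPE, FORMAT_NAME, SIZE) match the exports I have seen, but they can vary between DROID versions, so treat them as placeholders:

```python
import csv
from collections import defaultdict

# Totals per identified format: format name -> [file count, total bytes].
totals = defaultdict(lambda: [0, 0])

with open("droid_export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        if row.get("TYPE") == "Folder":      # skip directory entries
            continue
        fmt = row.get("FORMAT_NAME") or "unidentified"
        totals[fmt][0] += 1
        totals[fmt][1] += int(row.get("SIZE") or 0)

# Largest formats first -- the storage figures for a business case.
for fmt, (count, size) in sorted(totals.items(), key=lambda kv: -kv[1][1]):
    print(f"{fmt}: {count} files, {size / (1024 ** 2):.1f} MB")
```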

This also provided me with the idea of using DROID to run regular reports on a structured, organised file servers repository. One of the challenges of a ‘file servers/existing tools’ approach to electronic records management is producing meaningful analysis and reporting. DROID allows a simple manual process, taking a few minutes at most, that could provide a regular ‘snapshot’ to map the development, growth and reduction of an electronic repository over time. I have demonstrated how this might work in scenarios 2 and 3 of my project output ‘Records management workflows’.
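Comparing two such snapshots is equally straightforward. A minimal sketch, again assuming the DROID column names above and two hypothetical export files:

```python
import csv
from collections import Counter

def format_counts(path):
    """Count files per identified format in one DROID CSV snapshot."""
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row.get("TYPE") != "Folder":
                counts[row.get("FORMAT_NAME") or "unidentified"] += 1
    return counts

# Two hypothetical exports of the same repository, a quarter apart.
before = format_counts("snapshot_2012_q1.csv")
after = format_counts("snapshot_2012_q2.csv")

for fmt in sorted(set(before) | set(after)):
    change = after[fmt] - before[fmt]
    if change:
        print(f"{fmt}: {before[fmt]} -> {after[fmt]} ({change:+d})")
```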

DROID was therefore an immediate help and has the potential to be a valuable tool for records managers.

XENA
According to its website (http://xena.sourceforge.net/index.php), XENA (XML Electronic Normalising for Archives) is a tool intended to “aid in the long term preservation of digital records”. The tool was developed by the National Archives of Australia.

XENA needed a few other things installed on my PC before I could get it up and running, so I had to enlist the help of my IT support people to install OpenOffice and upgrade my Java installation accordingly.

Once it was installed, XENA was quite easy to use. It follows the OAIS model by converting the document into a preservable digital object. You can view some of the content through a ‘XENA viewer’, but to make it properly available you can publish or ‘normalise’ the document, usually as an Open Office document or PDF.  (Further details on the outcome of our testing are available elsewhere on the project blog).

The concept of a safely preserved set of documents, from which a copy can be created when required, is quite a departure from the idea of an EDRMS master repository of documents which a user can search. Could a user – a member of staff or a member of the public – search a catalogue of document titles and keywords rather than the documents themselves? This might be impractical for ‘short retention / regularly accessed’ types of records. But it could be ideal for permanent retention documents (such as governance minutes, regulations, strategies) that cover both bases as valuable administrative records and documents of archival or public interest.
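As a rough sketch of the catalogue idea – the catalogue file and its columns here are entirely invented for illustration – a search over titles and keywords needs nothing more than this:

```python
import csv

def search_catalogue(path, term):
    """Search a CSV catalogue of titles and keywords, returning
    matching entries without touching the preserved objects."""
    term = term.lower()
    with open(path, newline="", encoding="utf-8") as f:
        return [row for row in csv.DictReader(f)
                if term in f"{row['title']} {row['keywords']}".lower()]

# Hypothetical catalogue.csv columns: reference, title, keywords, location.
for hit in search_catalogue("catalogue.csv", "governance"):
    print(hit["reference"], "|", hit["title"], "->", hit["location"])
```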

Challenges / Risks

1) There is an assumption that XENA and DROID will continue to be supported and relevant in years to come. The fact that they are products of national archives programmes does suggest a longevity that makes the software worth using for long-term preservation. Would the Open Office output of XENA run more of a risk of obsolescence than the XENA objects themselves?

2) The use of these tools requires a lot of manual work for the records manager – running reports, converting documents etc. – which can be automated by an EDRMS. However, you could argue that this sort of work was a key part of past – and still persisting – paper records management processes. Additionally, an EDRMS brings, alongside significant costs, a host of system administration and training overheads which would be sidestepped by this sort of approach.

Conclusion

Both these open source tools, which are free to download and took little more than 30 minutes to get up and running, provide some really interesting and valuable functionality for records managers. Converting them from a desktop ‘toy’ to a scalable working component in a records management system or process is a much more difficult step, but my testing of DROID and XENA has certainly given me plenty to think about.

Significant Properties in Office documents

Significant properties are properties of digital objects or files which are considered to be essential for their interpretation, rendering or meaningful access. One of the aims of digital preservation is to make sure we don’t lose or corrupt these properties through our preservation actions, especially migration. For this particular project, we want to be assured that a Xena normalisation has not affected the significant properties of our electronic records.

Much has been written on this complex subject. Significant properties are to do with preserving the “performance” and continued behaviour of an object; they aren’t the same thing as the technical metadata we’ve already assessed, nor the same as descriptive metadata. For more information, see The InSPECT project, which is where we obtained advice about significant property elements for various file formats and digital object types.

However, the guidance we used for significant properties of document types was Document Metadata: Document Technical Metadata for Digital Preservation, Florida Digital Archive and Harvard University Library, 2009.

To summarise, Florida and Harvard suggest we should be interested in: system counts of pages, words, paragraphs, lines and characters in the text; the use of embedded tables and graphics; the language of the document; the use of embedded fonts; and any special features of the document.
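Most of these counts are simple to inspect programmatically. As a rough illustration (for OOXML .docx packages only – legacy binary .doc files store their statistics differently), the counts live in docProps/app.xml inside the package:

```python
import zipfile
import xml.etree.ElementTree as ET

NS = "{http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}"

def word_statistics(path):
    """Read the counts Word stores in docProps/app.xml inside a .docx."""
    with zipfile.ZipFile(path) as docx:
        root = ET.fromstring(docx.read("docProps/app.xml"))
    return {prop: int(root.find(NS + prop).text)
            for prop in ("Pages", "Words", "Paragraphs", "Lines", "Characters")
            if root.find(NS + prop) is not None}

print(word_statistics("minutes.docx"))   # hypothetical test file
```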

When normalising MS Office documents to Open Office, the Xena process produces a block of code within the AIP wrapper. I assume the significant properties of the documents are somewhere in that block of code, but we can’t actually view them using the Xena viewer.

We can, however, view the properties if we look at the end of the chain of digital preservation and go to the DIP. For documents, the Open Office rendering of a Xena AIP has already been shown to have retained many of the significant properties of the original MS Word file (see previous post). The properties are visible in Open Office, via File, Properties, Statistics.

The language property of the document is visible in Tools, Options, Language Settings. I’m less certain about the meaning or authenticity of this, and wonder if Open Office is simply restating the default language settings of my own installation of Open Office, whereas what we want is the embedded language property of the original document.

The other area where Open Office fails to satisfy us is embedded fonts. However, this feature is more normally associated with PDF files, which depend for their portability on embedding fonts as part of the conversion process.

For our project, what we could do is take the Open Office normalisation and use a handy Open Office feature: a ‘magic button’ which can turn any of its products into a PDF.

We could use the taskbar button to do this, but I’d rather use File, Export as PDF. This gives me the option to select PDF/A-1a, an archival PDF format, and lossless compression for any images in the document. Both of those options give me more confidence in producing an object that is more authentic and can be preserved long-term.
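For batch work, the same export can in principle be scripted. This is only a sketch: it assumes an ‘soffice’ binary that supports headless conversion (older OpenOffice installs may lack this), and since selecting PDF/A-1a and image compression from the command line requires version-specific filter options, it produces a plain PDF only:

```python
import subprocess
from pathlib import Path

def export_pdf(doc: Path, outdir: Path) -> None:
    """Batch equivalent of File, Export as PDF, via headless conversion.
    PDF/A-1a and image-compression settings need version-specific
    filter options, so this sketch produces a plain PDF only."""
    subprocess.run(
        ["soffice", "--headless", "--convert-to", "pdf",
         "--outdir", str(outdir), str(doc)],
        check=True,
    )

export_pdf(Path("normalised_minutes.odt"), Path("dip_output"))  # hypothetical names
```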

This experiment yields a PDF whose embedded fonts are recorded among the Document Property fields which Adobe supports.
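Where Adobe isn’t to hand, one alternative way to verify the embedding – not something we used in the project, just a suggestion – is the pdffonts utility from the poppler toolset:

```python
import subprocess

# List the fonts in the exported PDF with poppler's pdffonts utility;
# the 'emb' column should read 'yes' for every font if embedding worked.
result = subprocess.run(
    ["pdffonts", "minutes.pdf"],        # hypothetical exported DIP
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```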

The above demonstrates how, in the chain of transformation from MS Word > Xena AIP > Open Office equivalent > Exported PDF, these significant properties may not be viewable all in one place, but they do survive the normalisation process.

Meeting up

It is always great to share ideas and experiences with others in your field, and so the meeting of the Digital infrastructure programme – of which our project is part – was enjoyable and very useful.

The first thing, after arriving at the impressive LSE Library building, was to introduce ourselves and give short five-minute presentations on our projects. A full list is here:
http://www.jisc.ac.uk/whatwedo/programmes/preservation/12-11projectlist.aspx

Through our project and my growing general interest, digital archiving techniques are helping me to shape some new approaches to more conventional electronic records management problems.

The outcome of Manchester’s Carcanet Case Study, which deals with the acquisition and preservation of email accounts, will be invaluable to records managers trying to define approaches to managing emails as records across their organisations.

The Institute of Education’s Digital Directorate project was interesting to me as an example of archivists approaching what is essentially a records management challenge. The inter-disciplinary approach can surely only benefit both fields.

I am a keen advocate of training and awareness and several projects – DICE, PrePARE, SHARD – looked at ways of engaging the producers of research data with preservation issues. I’ll also be very interested in the outcomes of Bristol’s DataSafe project. There are so many methods now of delivering training; these projects should give some useful pointers to what works best and what our target audiences want.

The discussions that followed covered the value and benefits of digital preservation and how best to establish a business case for DP projects. I’m still convinced that it will be the hard numbers around reduced storage costs that will give a bedrock to the other DP benefits. The implied processes of appraisal and selection around what groups of records to keep will surely give us the opportunity to get rid of the huge amounts of unstructured information many organisations are storing.

There was a debate about the best approaches to ‘community engagement’ around digital preservation, including an account of a previous three-day ‘hackathon’ as part of the AQuA project. Whilst I am a big fan of listservs and online forums, this event showed that getting a few people in a room to share their experience is still the best way to build and maintain a ‘community’.

Thanks to JISC and the SPRUCE project for organising and to the attendees for a really enjoyable and useful meeting.

Testing the DPSP

I did a day’s testing of the Digital Preservation Software Platform (DPSP) in December. The process has been enough to persuade me it isn’t quite right for what we intend with our project, though that’s not a criticism of the system, which has many strengths.

The heart of DPSP is the normalisation of office documents to their nearest Open Office equivalent, using Xena. Around this, it builds a workflow that enables automated generation of checksums, quarantining of ingested records, virus checking, and deposit of converted records into an archival repository. It also generates a “manifest” file (not far removed from a transfer list) and logs all of the steps in its database.
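As a point of comparison, here is a minimal sketch of the checksum-and-manifest step on its own, assuming nothing about DPSP itself – the folder and file names are invented, and DPSP’s own manifest format will differ:

```python
import hashlib
from pathlib import Path

def write_manifest(folder: Path, manifest: Path) -> None:
    """Hash every file in a transfer folder and record the results,
    so the transfer can be re-verified after ingest."""
    with manifest.open("w", encoding="utf-8") as out:
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                out.write(f"{digest}  {path.relative_to(folder)}\n")

write_manifest(Path("transfer_2012_01"), Path("transfer_2012_01.manifest"))
```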

The workflow is one that clearly suits the National Archives of Australia (NAA) and matches their practice, which involves thorough checking and QA to move objects in and out of quarantine, in and out of a normalisation stage, and finally into the repository. All of these steps require the generation of unique IDs and folder destinations which probably match an accessioning or citation system at NAA; there’s no workaround for this, and I simply had to invent dummy IDs and folders for my purpose. The steps also oblige a user to log out of the database and log back in so as to perform a different activity; this is required three times. This process is undoubtedly correct for a national archives and I would expect nothing less. It just feels a bit too thorough for what we’re trying to demonstrate with the current project: our preservation policy is not yet this advanced, and there aren’t enough different staff members to cover all the functions. To put it another way, we’re trying to find a simple system that would enable one person (the records manager) to perform normalisations, and DPSP could be seen as “overkill”. I seem to recall PANDORA, the NLA’s web-archiving system, was similarly predicated on a series of workflows where tasks were divided up among several members of staff with extensive curation and QA.

My second concern is that multiple sets of AIPs are generated by DPSP. Presumably the intention is that the verified AIPs which end up in the repository will be the archive copies, and anything generated in the quarantine and transformation stages can be discarded eventually. However, this is not made explicit in the workflow, nor is the removal of such copies described in the manual.

My third problem is an area which I’ll have to revisit, because I must have missed something. When running the Xena transformations, DPSP creates two folders – one for “binary” and one for “normalised” objects. The distinction here is not clear to me yet. I’m also worried because out of 47 objects transformed, I ended up with only 31 objects in the “normalised” folder.
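To investigate, I could reconcile the folders with a short script. This is only a sketch: the ‘binary’ and ‘normalised’ folder names mirror the ones DPSP created during my test, but the ‘input’ folder is invented, and matching on filename stems is an assumption, since Xena may rename its outputs:

```python
from pathlib import Path

# Which of the submitted objects ended up 'normalised', which stayed
# 'binary', and which are unaccounted for? Matching on filename stems
# is an assumption -- Xena may rename its outputs.
submitted = {p.stem for p in Path("input").iterdir() if p.is_file()}
normalised = {p.stem for p in Path("normalised").iterdir() if p.is_file()}
binary = {p.stem for p in Path("binary").iterdir() if p.is_file()}

print("normalised:", len(normalised), "binary:", len(binary))
print("unaccounted for:", sorted(submitted - normalised - binary))
```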

The gain in using DPSP is that we get checksums and virus checks built into the workflow, and complete audit trails too; so far, with our method of “manual” transformations with XENA, we have none of the above, although we do get checksums if we use DROID. But I found the DPSP workflow a little clunky and time-consuming, and somewhat counter-intuitive to navigate through the stages.