Monthly Archives: January 2012

Significant Properties in Office documents

Significant properties are properties of digital objects or files which are considered to be essential for their interpretation, rendering or meaningful access. One of the aims of digital preservation is to make sure we don’t lose or corrupt these properties through our preservation actions, especially migration. For this particular project, we want to be assured that a Xena normalisation has not affected the significant properties of our electronic records.

Much has been written on this complex subject. Significant properties are to do with preserving the “performance” and continued behaviour of an object; they aren’t the same thing as the technical metadata we’ve already assessed, nor the same as descriptive metadata. For more information, see The InSPECT project, which is where we obtained advice about significant property elements for various file formats and digital object types.

However, the guidance we used for significant properties of document types was Document Metadata: Document Technical Metadata for Digital Preservation, Florida Digital Archive and Harvard University Library, 2009.

To summarise it, Florida and Harvard suggest we should be interested in system counts of pages, words, paragraphs, lines and characters in the text; use of embedded tables and graphics; the language of the document; the use of embedded fonts; and any special features in the document.

When normalising MS Office documents to Open Office, the Xena process produces a block of code within the AIP wrapper. I assume the significant properties of the documents are somewhere in that block of code. But we can’t actually view them using the Xena viewer.

We can however view the properties if we look at the end of the chain of digital preservation, and go to the DIP. For documents, The Open Office rendering of a Xena AIP has already been shown to have retained many of the significant properties of the original MS Word file. (See previous post). The properties are visible in Open Office, via File, Properties, Statistics:

The language property of the document is visible in Tools, Options, Language Settings (see image below). I’m less certain about the meaning or authenticity of this, and wonder if Open Office is simply restating the default language settings of my own installation of Open Office; whereas what we want is the embedded language property of the original document.

The other area Open Office fails to satisfy us is with the embedded fonts. However, these features are more normally associated with PDF files, which depend for their portability success on embedding fonts as part of their conversion process.

For our project, what we could do is take the Open Office normalisation and use the handy Open Office feature, a magic button which can turn any of its products into a PDF.

We could use the taskbar button to do this, but I’d rather use File, Export as PDF. (See screenshot below.) This gives me the option to select PDF/A-1a, an archival PDF format; and lossless compression for any images in the document. Both those options will give me more confidence in producing an object that is more authentic and can be preserved long-term.

This experiment yields a PDF with the following significant properties about the embedded fonts:

These are the Document Property fields which Adobe supports.

The above demonstrates how, in the chain of transformation from MS Word > Xena AIP > Open Office equivalent > Exported PDF, these significant properties may not be viewable all in one place, but they do survive the normalisation process.

Meeting up

It is always great to share ideas and experiences with others in your field, and so the meeting of the Digital infrastructure programme – of which our project is part – was enjoyable and very useful.

The first thing, after arriving at the impressive LSE Library building, was to introduce ourselves and provide some 5 min presentations on our projects. A full list is here:

Through our project and my growing general interest, digital archiving techniques are helping me to shape some new approaches to more conventional electronic records management problems.

The outcome of Manchester’s Caracanet Case Study, which deals with the acquisition and preservation of email accounts, will be invaluable to records managers trying to define approaches to managing emails as records across their organisations.

The Institute of Education’s Digital Directorate project was interesting to me as archivists approaching essentially a records management challenge. The inter-disciplinary approach can surely only benefit both fields.

I am a keen advocate of training and awareness and several projects – DICE, PrePARE, SHARD – looked at ways of engaging the producers of research data with preservation issues. I’ll also be very interested in the outcomes of Bristol’s DataSafe project. There are so many methods now of delivering training; these projects should give some useful pointers to what works best and what our target audiences want.

The discussions that followed covered the value and benefits of digital preservation and how best to establish a business case for DP projects. I’m still convinced that it will be the hard numbers around reduced storage costs that will give a bedrock to the other DP benefits. The implied processes of appraisal and selection around what groups of records to keep will surely give us the opportunity to get rid of the huge amounts of unstructured information many organisations are storing.

There was a debate about the best approaches to ‘community engagement’ around digital preservation, including an account of a previous three day ‘hackathon’ as part of the AQuA project. Whilst I am a big fan of listservs and online forums, this event showed that getting a few people in a room to share their experience is still the best way to build and maintain a ‘community’.

Thanks to JISC and the SPRUCE project for organising and to the attendees for a really enjoyable and useful meeting.

Testing the DPSP

I did a day’s testing of the Digital Preservation Software Platform (DPSP) in December. The process has been enough to persuade me it isn’t quite right for what we intend with our project, but that’s not intended as a criticism of the system, which has many strengths.

The heart of DPSP is performing the normalisation of office documents to their nearest Open Office equivalent, using Xena. Around this, it builds a workflow that enables automated generation of checksums, quarantining of ingested records, virus checking, and deposit of converted records into an archival repository. It also generates a “manifest” file (not far apart from a transfer list), and logs all of the steps in its database.

The workflow is one that clearly suits The National Archives of Australia (NAA) and matches their practice, which involves thorough checking and QA to move objects in and out of quarantine, in and out of a normalisation stage, and finally into the repository. All of these steps require the generation of Unique IDs and folder destinations which probably match an accessioning or citation system at NAA; there’s no workaround for this, and I simply had to invent dummy IDs and folders for my purpose. The steps also oblige a user to log out of the database, and log back in so as to perform a different activity; this is required three times. This process is undoubtedly correct for a National Archives and I would expect nothing less. It just feels a bit too thorough for what we’re trying to demonstrate with the current project, our preservation policy is not yet this advanced, and there aren’t enough different staff members to cover all the functions. To put it another way, we’re trying to find a simple system that would enable one person (the records manager) to perform normalisations, and DPSP could be seen as “overkill”. I seem to recall PANDORA, the NLA’s web-archiving system, was similarly predicated on a series of workflows where tasks were divided up among several members of staff with extensive curation and QA.

My second concern is that multiple sets of AIPs are generated by DPSP. Presumably the intention is that the verified AIPs which end up in the repository will be the archive copies, and anything generated in the quarantine and transformation stages can be discarded eventually. However, this is not made explicit in the workflow, neither is the removal of such copies described in the manual.

My third problem is an area which I’ll have to revisit, because I must have missed something. When running the Xena transformations, DPSP creates two folders – one for “binary” and one for “normalised” objects. The distinction here is not clear to me yet. I’m also worried because out of 47 objects transformed, I ended up with only 31 objects in the “normalised” folder.

The gain with using DPSP is that we get checksums and virus checks built into the workflow, and complete audit trails also; so far with our method of “manual” transformations with XENA we have none of the above, although we do get checksums if we use DROID. But I found the DPSP workflow a little clunky and time-consuming, and somewhat counter-intuitive in navigating through the stages.