Tag Archives: digital preservation

Significant Properties in Office documents

Significant properties are properties of digital objects or files which are considered to be essential for their interpretation, rendering or meaningful access. One of the aims of digital preservation is to make sure we don’t lose or corrupt these properties through our preservation actions, especially migration. For this particular project, we want to be assured that a Xena normalisation has not affected the significant properties of our electronic records.

Much has been written on this complex subject. Significant properties concern preserving the “performance” and continued behaviour of an object; they aren’t the same thing as the technical metadata we’ve already assessed, nor the same as descriptive metadata. For more information, see the InSPECT project, which is where we obtained advice about significant property elements for various file formats and digital object types.

However, the guidance we used for significant properties of document types was Document Metadata: Document Technical Metadata for Digital Preservation, Florida Digital Archive and Harvard University Library, 2009.

To summarise it, Florida and Harvard suggest we should be interested in system counts of pages, words, paragraphs, lines and characters in the text; use of embedded tables and graphics; the language of the document; the use of embedded fonts; and any special features in the document.
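As a rough illustration of how such a checklist could be used, the sketch below compares the properties recorded for an original file with those recorded for its normalised version and reports any that differ. The property names and the sample values are my own assumptions for the example; they are not taken from the Florida and Harvard document or from our test files.

```python
# Hypothetical property names, loosely following the Florida/Harvard list.
COUNT_PROPERTIES = ("pages", "words", "paragraphs", "lines", "characters")
FEATURE_PROPERTIES = ("tables", "graphics", "language", "embedded_fonts")


def compare_properties(original, normalised):
    """Return {property: (original_value, normalised_value)} for each property that differs."""
    differences = {}
    for name in COUNT_PROPERTIES + FEATURE_PROPERTIES:
        before = original.get(name)
        after = normalised.get(name)
        if before != after:
            differences[name] = (before, after)
    return differences


# Made-up values for illustration only; not measurements from the project.
word_file = {"pages": 3, "words": 1200, "paragraphs": 40, "lines": 110,
             "characters": 7100, "tables": 1, "graphics": 2,
             "language": "en-GB", "embedded_fonts": False}
open_office_file = dict(word_file, lines=112)

print(compare_properties(word_file, open_office_file))
# prints {'lines': (110, 112)}
```

An empty result would suggest that, for the properties on the checklist at least, the normalisation left the document unchanged.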

When normalising MS Office documents to Open Office, the Xena process produces a block of code within the AIP wrapper. I assume the significant properties of the documents are somewhere in that block of code. But we can’t actually view them using the Xena viewer.

We can, however, view the properties if we go to the end of the digital preservation chain and look at the DIP. For documents, the Open Office rendering of a Xena AIP has already been shown to retain many of the significant properties of the original MS Word file (see previous post). The properties are visible in Open Office, via File, Properties, Statistics:

The language property of the document is visible in Tools, Options, Language Settings (see image below). I’m less certain about the meaning or authenticity of this, and wonder if Open Office is simply restating the default language settings of my own installation of Open Office; whereas what we want is the embedded language property of the original document.

The other area where Open Office fails to satisfy us is embedded fonts. However, embedded fonts are more usually associated with PDF files, whose portability depends on embedding fonts as part of the conversion process.

For our project, what we could do is take the Open Office normalisation and use the handy Open Office feature, a magic button which can turn any of its products into a PDF.

We could use the taskbar button to do this, but I’d rather use File, Export as PDF. (See screenshot below.) This gives me the option to select PDF/A-1a, an archival PDF format; and lossless compression for any images in the document. Both those options will give me more confidence in producing an object that is more authentic and can be preserved long-term.

This experiment yields a PDF with the following significant properties about the embedded fonts:

These are the Document Property fields which Adobe supports.

The above demonstrates how, in the chain of transformation from MS Word > Xena AIP > Open Office equivalent > Exported PDF, these significant properties may not be viewable all in one place, but they do survive the normalisation process.
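The document statistics can also be read without opening Open Office at all. An OpenDocument file is a ZIP package, and its meta.xml part can carry a meta:document-statistic element holding the counts shown in the Statistics tab. The sketch below is a minimal example, assuming the file exported from the Xena AIP is a standard OpenDocument package and that the application saved the statistics; the sample package built at the end is invented purely so the example runs on its own.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used for metadata elements in OpenDocument files.
META_NS = "urn:oasis:names:tc:opendocument:xmlns:meta:1.0"


def read_odf_statistics(odf_file):
    """Return the counts stored in meta:document-statistic as a dict of ints."""
    with zipfile.ZipFile(odf_file) as package:
        root = ET.fromstring(package.read("meta.xml"))
    statistic = root.find(f".//{{{META_NS}}}document-statistic")
    if statistic is None:
        return {}  # the application did not record any statistics
    prefix = f"{{{META_NS}}}"
    return {
        name[len(prefix):]: int(value)
        for name, value in statistic.attrib.items()
        if name.startswith(prefix)
    }


# Build a tiny stand-in package in memory (illustrative values only).
SAMPLE_META = (
    '<office:document-meta'
    ' xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
    ' xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0">'
    '<office:meta>'
    '<meta:document-statistic meta:page-count="3" meta:word-count="1200"/>'
    '</office:meta>'
    '</office:document-meta>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("meta.xml", SAMPLE_META)
buffer.seek(0)

print(read_odf_statistics(buffer))
# prints {'page-count': 3, 'word-count': 1200}
```

For a real file, the function can be given a path such as a saved .odt document instead of the in-memory buffer. The counts are only as current as the last save by the application that wrote them, so they are a convenience check rather than proof.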

Meeting up

It is always great to share ideas and experiences with others in your field, and so the meeting of the Digital Infrastructure Programme – of which our project is part – was enjoyable and very useful.

The first thing, after arriving at the impressive LSE Library building, was to introduce ourselves and give five-minute presentations on our projects. A full list is here:
http://www.jisc.ac.uk/whatwedo/programmes/preservation/12-11projectlist.aspx

Through our project and my growing general interest, digital archiving techniques are helping me to shape some new approaches to more conventional electronic records management problems.

The outcome of Manchester’s Carcanet Case Study, which deals with the acquisition and preservation of email accounts, will be invaluable to records managers trying to define approaches to managing emails as records across their organisations.

The Institute of Education’s Digital Directorate project interested me as an example of archivists approaching what is essentially a records management challenge. The inter-disciplinary approach can surely only benefit both fields.

I am a keen advocate of training and awareness, and several projects – DICE, PrePARE, SHARD – looked at ways of engaging the producers of research data with preservation issues. I’ll also be very interested in the outcomes of Bristol’s DataSafe project. There are so many methods now of delivering training; these projects should give some useful pointers to what works best and what our target audiences want.

The discussions that followed covered the value and benefits of digital preservation and how best to establish a business case for DP projects. I’m still convinced that it will be the hard numbers around reduced storage costs that will give a bedrock to the other DP benefits. The implied processes of appraisal and selection around what groups of records to keep will surely give us the opportunity to get rid of the huge amounts of unstructured information many organisations are storing.

There was a debate about the best approaches to ‘community engagement’ around digital preservation, including an account of a previous three-day ‘hackathon’ held as part of the AQuA project. Whilst I am a big fan of listservs and online forums, this event showed that getting a few people in a room to share their experience is still the best way to build and maintain a ‘community’.

Thanks to JISC and the SPRUCE project for organising and to the attendees for a really enjoyable and useful meeting.

Adapting OAIS for records management

In planning this project we’re obviously referencing the OAIS standard for a digital archive.

To quote from Wikipedia, ‘Open Archival Information System (or OAIS) is an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community’.

The OAIS approach reminds me of the old style paper records store method for managing records. We (the records manager) receive the closed files and store them safely and securely. When a record is required we are asked by the department for a specific box or file. We then make it available.

This is being replicated in several ways in current records management approaches. At one conference I heard of one organisation moving its paper files to offsite storage, then providing a ‘scan on demand’ approach when a file was requested. The offsite store would provide a scanned copy and email it to the requesting department. I thought this was a nice way of sidestepping a massive digitisation project – a digital ‘copy’ is provided when required, the mass of original paper retained in a low-cost storage environment.

In-house electronic document management solutions, from a simple intranet to a full-blown EDRMS, have often relied on the information being available instantly via a search. That obviously relies on the quality of the stored records and the speed and accuracy of the search engine.

Could the OAIS approach work in an internal business context? Could electronic records be submitted or ‘declared’ to the records management team and stored, with the records that need long-term retention converted into preservable formats? On demand, the records could be made available through a print-to-PDF facility. The audit trail of access would be inherent in the very record produced.

There are several visual depictions of the standard, including one on page 4-1 of http://public.ccsds.org/publications/archive/650x0b1.pdf, to which I add my ‘scrawled on the back of a napkin’ version to explain how this approach might work:

The benefit, you’d hope, would be ‘future proofed’ electronic documents managed and made available at a reasonable cost over decades. The drawback would perhaps be in the user experience, since people are used to the instant search return of a Google query. Now the user would effectively be searching a catalogue of record references and requesting access to the one they want.
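To make the idea a little more concrete, here is a minimal sketch of the declare, catalogue search and request-on-demand steps, with each request written to an access log. The class names, fields and reference scheme are all hypothetical, and no real conversion to PDF is performed; it only models the flow and is not the design we will necessarily build.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Record:
    reference: str         # catalogue reference the user searches for
    title: str
    preserved_format: str  # e.g. "odt" after normalisation
    access_log: list = field(default_factory=list)


class RecordsStore:
    """A toy records store: departments declare records, then request them by reference."""

    def __init__(self):
        self._catalogue = {}

    def declare(self, record):
        """Accept a record from a department and add it to the catalogue."""
        self._catalogue[record.reference] = record

    def search(self, term):
        """Search the catalogue of titles, not the content of the records."""
        term = term.lower()
        return [r.reference for r in self._catalogue.values()
                if term in r.title.lower()]

    def request(self, reference, requester):
        """Log the request and return the name of the access copy to supply."""
        record = self._catalogue[reference]
        record.access_log.append((requester, datetime.now(timezone.utc)))
        return f"{record.reference}.pdf"


store = RecordsStore()
minutes = Record("REC-0001", "Council minutes 2011", "odt")
store.declare(minutes)

print(store.search("minutes"))                 # prints ['REC-0001']
print(store.request("REC-0001", "Registry"))   # prints REC-0001.pdf
```

The point of the sketch is the trade-off described above: the user searches a catalogue and asks for a copy, rather than searching the stored content directly.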

The challenge is: would the approach actually work? That is one of the things we are trying to explore with this project.

What and why…

Most of the information now created in the University is ‘born digital’. Staff create documents in file formats that have much shorter life spans than paper. We need to mitigate the risk that these files become unreadable.

Buying an EDRMS is a huge cost for an organisation in terms of licences, implementation and training. Is there a way that electronic records can be effectively managed and preserved without this cost?

We thought: “Can we use our existing infrastructure, with some open source tools, to build a practical, cost-effective solution to the long-term management of our key electronic records?”

This project will build a test environment for converting files into preservable formats. Will they be fit-for-purpose as university records?

The aim of the project is to identify the opportunities and challenges to this approach to electronic records management. Is it a viable alternative? Is it practical for records managers?

Kicking off…

My name is Kit Good, University Records Manager and Freedom of Information Officer at the University of London. I’m lucky enough to work in an institution where the technology department, the University of London Computer Centre (ULCC), has a dedicated Digital Archiving team with a lot of experience in delivering preservation projects around electronic records.

I’d met with Ed Pinsent, Digital Archivist, several times since I started last year to discuss the challenges around electronic records management. It was back at the start of September when Ed approached me about submitting a bid to the JISC 12/11 Digital Infrastructure Programme.

Ed and Kit

Our bid proposed testing the concept of ‘a simple toolkit of services and software that can plug into a network drive and create preservation copies of core business documents that require permanent preservation’. We were delighted to find out earlier this month that our bid was successful. 

We are hoping that the outcomes of this project will be useful to records managers, the digital archivist community and the Higher Education Sector as a whole. This blog will track the development of the project from its November kick-off to its February close.

More posts to follow…