Significant Properties in Office documents

Significant properties are properties of digital objects or files which are considered to be essential for their interpretation, rendering or meaningful access. One of the aims of digital preservation is to make sure we don’t lose or corrupt these properties through our preservation actions, especially migration. For this particular project, we want to be assured that a Xena normalisation has not affected the significant properties of our electronic records.

Much has been written on this complex subject. Significant properties concern preserving the “performance” and continued behaviour of an object; they aren’t the same thing as the technical metadata we’ve already assessed, nor the same as descriptive metadata. For more information, see the InSPECT project, which is where we obtained advice about significant property elements for various file formats and digital object types.

However, the guidance we used for significant properties of document types was Document Metadata: Document Technical Metadata for Digital Preservation, Florida Digital Archive and Harvard University Library, 2009.

To summarise it, Florida and Harvard suggest we should be interested in system counts of pages, words, paragraphs, lines and characters in the text; use of embedded tables and graphics; the language of the document; the use of embedded fonts; and any special features in the document.

When normalising MS Office documents to Open Office, the Xena process produces a block of code within the AIP wrapper. I assume the significant properties of the documents are somewhere in that block of code. But we can’t actually view them using the Xena viewer.

We can, however, view the properties if we look at the end of the digital preservation chain and go to the DIP. For documents, the Open Office rendering of a Xena AIP has already been shown to have retained many of the significant properties of the original MS Word file (see previous post). The properties are visible in Open Office via File, Properties, Statistics.
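The same counts can also be read without opening the Open Office interface. The following is a minimal sketch, assuming the DIP has been saved as an OpenDocument file (for example .odt), which is a ZIP package containing a meta.xml part with a meta:document-statistic element. The function name read_odf_statistics is my own invention for illustration, and which counts appear depends on what the application wrote when saving.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URI used by the OpenDocument format for metadata elements.
META_NS = "urn:oasis:names:tc:opendocument:xmlns:meta:1.0"


def read_odf_statistics(path_or_file):
    """Return the meta:document-statistic attributes of an ODF file as a dict.

    Hypothetical helper: accepts a file path or a file-like object.
    """
    with zipfile.ZipFile(path_or_file) as odf:
        root = ET.fromstring(odf.read("meta.xml"))
    stat = root.find(f".//{{{META_NS}}}document-statistic")
    if stat is None:
        # The element is optional, so an empty result is possible.
        return {}
    # Attribute keys arrive as "{namespace}page-count"; keep the local name only.
    return {key.split("}", 1)[-1]: int(value) for key, value in stat.attrib.items()}
```

Calling read_odf_statistics on an exported document (any path you supply) should return a dictionary keyed by names such as page-count, word-count and paragraph-count, which can be set beside the figures shown in the Statistics tab.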

The language property of the document is visible in Tools, Options, Language Settings (see image below). I’m less certain about the meaning or authenticity of this, and wonder if Open Office is simply restating the default language settings of my own installation of Open Office; whereas what we want is the embedded language property of the original document.
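One way to separate what is stored in the file from what my installation reports is to look inside the document package itself. This is a sketch only, assuming the normalised document is an OpenDocument file in which language is recorded as fo:language and fo:country attributes on style:text-properties elements. The function name read_odf_languages is hypothetical. Note that this shows what the file carries, not whether that value came from the original Word document or from the converter's defaults.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs used by the OpenDocument format.
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
FO_NS = "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"


def read_odf_languages(path_or_file):
    """Return the set of (language, country) pairs declared inside an ODF file.

    Hypothetical helper: scans styles.xml and content.xml if they are present.
    """
    found = set()
    with zipfile.ZipFile(path_or_file) as odf:
        members = odf.namelist()
        for member in ("styles.xml", "content.xml"):
            if member not in members:
                continue
            root = ET.fromstring(odf.read(member))
            for props in root.iter(f"{{{STYLE_NS}}}text-properties"):
                language = props.get(f"{{{FO_NS}}}language")
                if language:
                    # country may be None if only a language code was stored.
                    found.add((language, props.get(f"{{{FO_NS}}}country")))
    return found
```

If the result differs from the language shown under Tools, Options, Language Settings, that would suggest the dialog is reporting the installation default rather than the document's own value.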

The other area where Open Office fails to satisfy us is embedded fonts. However, embedded fonts are more normally associated with PDF files, whose portability depends on embedding fonts as part of the conversion process.

For our project, we could take the Open Office normalisation and use a handy Open Office feature: a magic button which can turn any of its products into a PDF.

We could use the taskbar button to do this, but I’d rather use File, Export as PDF. (See screenshot below.) This gives me the option to select PDF/A-1a, an archival PDF format; and lossless compression for any images in the document. Both those options will give me more confidence in producing an object that is more authentic and can be preserved long-term.

This experiment yields a PDF with the following significant properties about the embedded fonts:

These are the Document Property fields which Adobe supports.

The above demonstrates how, in the chain of transformation from MS Word > Xena AIP > Open Office equivalent > Exported PDF, these significant properties may not be viewable all in one place, but they do survive the normalisation process.
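Because the properties are scattered across several places, it may help to gather them into one record per stage and compare the records. The following is a minimal sketch under that assumption. The function name compare_properties and the property names in the usage note are hypothetical, and how each record is populated (by hand from the dialogs, or by a script) is left open.

```python
def compare_properties(before, after):
    """Return {name: (before_value, after_value)} for every property that differs.

    A property missing from one side is reported with None on that side.
    """
    differences = {}
    for name in sorted(set(before) | set(after)):
        if before.get(name) != after.get(name):
            differences[name] = (before.get(name), after.get(name))
    return differences
```

For example, comparing a record of the original MS Word file against a record of the Open Office rendering would return an empty dictionary if every recorded property survived, and otherwise a list of the properties that changed, with both values side by side.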