Here’s a copy of our final report, published today. The PDF is 1.63 MB and 55 pages long.
Kit Good will describe a potential records management workflow using these open source tools; this post describes a potential preservation workflow. Since UoL does not yet have a dedicated digital preservation service, the workflow must remain hypothetical, but the question at its heart is whether these tools can create preservable objects with sufficient technical metadata to ensure long-term continuity.
The records to be preserved would arrive as Submission Information Packages (SIPs). As a first step we would want to ingest selected records into the digital archive repository we haven’t got yet. In our hypothetical scenario, this would probably be done by the records manager (much the same way he transfers records of permanent value to the University archivist). If we had a digital archivist, all we would need is an agreed methodology between that person and the records manager – for example, a web-based front end like Dropbox that allows us to move the files from one place to another, with some metadata and other documentation.
At a bare minimum, the archive service would want to run the following QA steps on the submitted records:
- Virus check
- Confirm submission formats can be supported in the repository
DROID can perform some of these steps for us: it has a built-in checksum facility, and it identifies formats by comparing them against its online registry. (This would need to be aligned with a written preservation policy of some sort.) At the moment we lack a virus checker, but it would be feasible to add a free checker such as Avast. This is one step where the DPSP works out nicely, with its built-in virus check and quarantine stage.
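As a sketch of how the checksum part of this QA could be scripted (a hypothetical helper of our own, not a feature of any of the tools named above):

```python
import hashlib
from pathlib import Path

def checksum_submission(folder, algorithm="md5"):
    """Walk a SIP folder and compute a digest for every file.

    Returns a dict of relative path -> hex digest, which could then
    be compared against the checksums DROID records in its CSV output.
    """
    digests = {}
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            h = hashlib.new(algorithm)
            h.update(path.read_bytes())
            digests[str(path.relative_to(folder))] = h.hexdigest()
    return digests
```

The same dictionary could be written out as a manifest at ingest and re-used for later fixity checks.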
Next, we might want to validate the formats in more detail, since DROID identifies a format but does not validate the file against the format’s specification. This is where JHOVE comes in, although the limitations of that tool with regard to Office documents have already been noted.
You’ll notice that by this point we are not advocating use of the NZ Metadata Extraction Tool, since its output has been found somewhat lacking, but this would be the stage to deploy it.
The next step is to normalise the records using Xena. This action is as close as we get to migration in our repository, and the actions of Xena have already been described.
These steps create a little “bundle” of objects:
- The DROID report in CSV format
- The JHOVE output in XML format
- The normalised Xena object in .xena format
For our purposes, this bundle represents the archival copy which in OAIS terms is an Archival Information Package (AIP). The DROID and JHOVE files are our technical metadata, while the actual data we want to keep is in the Xena file.
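Assembling that bundle could be scripted along these lines (the folder layout is our own invention, not an OAIS or Xena requirement):

```python
import shutil
from pathlib import Path

def build_aip(bundle_dir, droid_csv, jhove_xml, xena_files):
    """Gather the bundle described above into one AIP folder.

    The DROID CSV and JHOVE XML go into a metadata subfolder as the
    technical metadata; the .xena files, which carry the actual
    normalised content, go into a content subfolder.
    """
    bundle = Path(bundle_dir)
    (bundle / "metadata").mkdir(parents=True, exist_ok=True)
    (bundle / "content").mkdir(exist_ok=True)
    shutil.copy(droid_csv, bundle / "metadata")
    shutil.copy(jhove_xml, bundle / "metadata")
    for f in xena_files:
        shutil.copy(f, bundle / "content")
    return bundle
```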
Move to archive
Step four would be to move this AIP bundle into archival storage, keeping the “original” submissions elsewhere in the managed store.
A fifth step would be to re-render the Xena objects as dissemination copies, as needed when requested by a member of staff who wanted to have access to a preserved record. For this, we would open the Xena object using the Xena viewer and create a Dissemination Information Package (DIP) version by clicking the OpenOffice button to produce a readable document. (As noted previously, we can also make this OO version into a PDF).
This access step creates yet more digital objects. In fact if we look at the test results for the 12 spreadsheets in our test corpus, we have created the following items in our bundle:
- One DROID file which analysed all 12 spreadsheets – in CSV
- One JHOVE file which analysed all 12 spreadsheets – in XML
- 12 normalised Xena objects – in XENA
- A test Open Office rendition of one of the spreadsheets in ODS
- A PDF rendering of that OO version
In a separate folder, there are 12 MET files in XML – one for each spreadsheet.
In real life we certainly wouldn’t want to keep all of these objects in the same place; the archival files must be stored separately from the dissemination versions.
Many of the above stages present opportunities for a checksum script to be run, to ensure that the objects have not been corrupted in the processes of transformation and moving them in and out of the archival store. If we wanted to go further with checking, we would re-validate the files using JHOVE.
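A minimal fixity-check script of the kind described might look like this (assuming MD5 digests, e.g. from DROID’s CSV output, were recorded at ingest):

```python
import hashlib
from pathlib import Path

def verify_fixity(folder, manifest):
    """Re-compute MD5 digests for files in the archival store and
    compare them with the digests recorded at ingest.

    manifest: dict of relative path -> expected hex digest.
    Returns the list of paths whose checksum no longer matches,
    i.e. files possibly corrupted in transformation or transfer.
    """
    failures = []
    for rel_path, expected in manifest.items():
        data = (Path(folder) / rel_path).read_bytes()
        if hashlib.md5(data).hexdigest() != expected:
            failures.append(rel_path)
    return failures
```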
Sounds simple, doesn’t it? But there are quite a few preservation gaps with this bare-bones toolkit.
The technical metadata outputs from DROID and JHOVE are all “loose”, not tightly fused with the object they actually relate to. To put it another way, the processes described above have involved running several separate actions on the object that is the target for preservation. It has also created several separate outputs, which have landed in several different destinations on my PC.
We could manually move everything into a “bundle” as suggested above, but this is extra work and feels a bit insecure and unreliable. For real success, we need a method (a database, perhaps) that manages all the loose bits and pieces under a single UID, or other reliable retrieval method. Xena does create such a UID for each of its objects – it’s visible there in the dc:identifier tag. So we may stand a chance of using that element in future iterations of this work.
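To illustrate, a very small registry of this kind could be kept in SQLite, keyed on the UID that Xena assigns in dc:identifier (the table layout here is purely illustrative):

```python
import sqlite3

def create_registry(db_path=":memory:"):
    """A minimal registry that gathers the 'loose' outputs (DROID row,
    JHOVE report, Xena object) under one UID.  We reuse the
    dc:identifier from the Xena wrapper as the key."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS objects (
        uid  TEXT,
        role TEXT,   -- e.g. 'droid', 'jhove', 'xena'
        path TEXT,
        PRIMARY KEY (uid, role))""")
    return conn

def register(conn, uid, role, path):
    conn.execute("INSERT OR REPLACE INTO objects VALUES (?, ?, ?)",
                 (uid, role, path))
    conn.commit()

def lookup(conn, uid):
    """Return every registered output for one UID as role -> path."""
    return dict(conn.execute(
        "SELECT role, path FROM objects WHERE uid = ?", (uid,)))
```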
Another point may seem trivial: JHOVE, DROID and Xena can do batch processing, and MET cannot. This results in a mismatch of outputs, which are also created in different formats.
There is also some duplication among the detail of the technical metadata that has been extracted from the objects.
Ideally we’d like the technical metadata to be embedded within a single wrapper, along with the object itself. The Xena wrapper seems the most obvious place for this. I lack the technical ability to understand how to do it, though. What I would like is some form of XML authoring tool that enables me to write the DROID and JHOVE output directly into the Xena wrapper.
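For what it’s worth, a crude version of that “XML authoring” step can be sketched with a standard XML library. The element names and namespace below are invented for illustration, and the real Xena schema may well not tolerate foreign elements:

```python
import xml.etree.ElementTree as ET

# Our own (hypothetical) namespace -- NOT part of the Xena schema.
LOCAL_NS = "http://example.org/uol-preservation/1.0"

def embed_metadata(xena_xml_path, droid_csv_path, jhove_xml_path):
    """Append the DROID and JHOVE output as extra elements inside
    the Xena XML wrapper, so object and technical metadata travel
    together in a single file."""
    tree = ET.parse(xena_xml_path)
    root = tree.getroot()
    techmd = ET.SubElement(root, "{%s}techMD" % LOCAL_NS)
    droid = ET.SubElement(techmd, "{%s}droid" % LOCAL_NS)
    droid.text = open(droid_csv_path).read()
    jhove = ET.SubElement(techmd, "{%s}jhove" % LOCAL_NS)
    jhove.text = open(jhove_xml_path).read()
    tree.write(xena_xml_path, encoding="unicode")
```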
All the steps described above are manual. Obviously if we were going to do this on a larger scale we would want to automate the actions. This is not surprising, and we knew it would be the case before we embarked on the project, but it’s instructive to see the extent to which our workflow remains non-joined-up.
Likewise, our audit trail for preservation actions is a bit distributed. Both DROID and JHOVE give us a ‘Last Modified’ date for each object processed, and Xena embeds naa:last-modified within the XML output for each object, but ideally these dates ought to be retrievable by the preservation metadata database and presented as a kind of linear chain of events. We’d also like to have a field identifying what the process was that triggered the ‘Last Modified’ date.
How do we manage the descriptive metadata in this process? In order to deliver a DIP to our consumer, they would have to know the record exists, and would need to know something useful about it to enable them to retrieve it. We have confirmed that descriptive metadata survives the Xena process, but how can we expose it for our users?
What we’re talking about here is a searchable front end to the archival store which we haven’t yet built. Kit is proposing a structured file store for his records, so maybe we need to expand on this approach and think about ways of delivering a souped-up archival catalogue for these assets.
Significant properties are properties of digital objects or files which are considered to be essential for their interpretation, rendering or meaningful access. One of the aims of digital preservation is to make sure we don’t lose or corrupt these properties through our preservation actions, especially migration. For this particular project, we want to be assured that a Xena normalisation has not affected the significant properties of our electronic records.
Much has been written on this complex subject. Significant properties are to do with preserving the “performance” and continued behaviour of an object; they aren’t the same thing as the technical metadata we’ve already assessed, nor the same as descriptive metadata. For more information, see The InSPECT project, which is where we obtained advice about significant property elements for various file formats and digital object types.
However, the guidance we used for significant properties of document types was Document Metadata: Document Technical Metadata for Digital Preservation, Florida Digital Archive and Harvard University Library, 2009.
To summarise it, Florida and Harvard suggest we should be interested in system counts of pages, words, paragraphs, lines and characters in the text; use of embedded tables and graphics; the language of the document; the use of embedded fonts; and any special features in the document.
When normalising MS Office documents to Open Office, the Xena process produces a block of code within the AIP wrapper. I assume the significant properties of the documents are somewhere in that block of code. But we can’t actually view them using the Xena viewer.
We can however view the properties if we look at the end of the chain of digital preservation, and go to the DIP. For documents, the Open Office rendering of a Xena AIP has already been shown to retain many of the significant properties of the original MS Word file (see previous post). The properties are visible in Open Office, via File, Properties, Statistics:
The language property of the document is visible in Tools, Options, Language Settings (see image below). I’m less certain about the meaning or authenticity of this, and wonder if Open Office is simply restating the default language settings of my own installation of Open Office; whereas what we want is the embedded language property of the original document.
The other area where Open Office fails to satisfy us is embedded fonts. These features are more normally associated with PDF files, which depend for their portability on embedding fonts as part of the conversion process.
For our project, what we could do is take the Open Office normalisation and use the handy Open Office feature, a magic button which can turn any of its products into a PDF.
We could use the taskbar button to do this, but I’d rather use File, Export as PDF. (See screenshot below.) This gives me the option to select PDF/A-1a, an archival PDF format; and lossless compression for any images in the document. Both those options will give me more confidence in producing an object that is more authentic and can be preserved long-term.
This experiment yields a PDF with the following significant properties about the embedded fonts:
These are the Document Property fields which Adobe supports.
The above demonstrates how, in the chain of transformation from MS Word > Xena AIP > Open Office equivalent > Exported PDF, these significant properties may not be viewable all in one place, but they do survive the normalisation process.
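Incidentally, the Open Office export step could also be driven headlessly from the command line, which would matter if we ever automate this workflow. The standard soffice --convert-to invocation below produces plain PDF; selecting PDF/A-1a specifically may still require the export dialog (or filter options we have not verified):

```python
import subprocess

def pdf_export_command(doc_path, out_dir):
    """Build the soffice command line for a headless PDF conversion.
    Kept separate from the call itself so it can be inspected/tested."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", out_dir, doc_path]

def export_pdf(doc_path, out_dir):
    """Run the conversion. Requires OpenOffice/LibreOffice on PATH."""
    subprocess.run(pdf_export_command(doc_path, out_dir), check=True)
```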
In OAIS terms, the Xena digital object is our Archival Information Package (AIP). This is the object that would be stored and preserved in UoL’s archival storage system (if we had one).
For the purposes of this project, the records manager needs to be assured we can render and deliver a readable, authentic version of the record from the Xena object. In OAIS terms, this could be seen as a Dissemination Information Package (DIP) derived from an AIP. Among other things, the DIP package ought to include sufficient descriptive metadata.
We’ve defined a minimal set of what we would need for records management purposes (names, dates for authenticity) and for information retrieval, and criteria for assessing overall success of the transformations – such as legibility, presentation, look and feel, and basic functionality. See our previous post for the documentation of how we arrived at our criteria.
For most of the examples below we are looking at standard Xena normalisations, which can produce a DIP by the following methods:
- Export the AIP to nearest Open Office equivalent
- Export the AIP to native format
- Export the AIP to a format of our choice (an option we haven’t tried as yet)
Emails are slightly more complex – see below for more detail, and previous post on emails.
Documents, spreadsheets and PowerPoints
For these MS Office documents, Xena normalises by converting them to their nearest Open Office equivalent. We can view this Open Office version via the Xena Viewer simply by clicking on Show in OpenOffice.org.
In the case of a sample docx, this shows us the document with its original formatting intact. In Open Office, we can now look at File, Properties, and confirm the following descriptive metadata are intact (some of these we will revisit when we look at significant properties):
- Company Name
- Number of pages
- Number of tables
- Number of graphics
- Number of OLE objects
- Number of paragraphs
- Number of words
- Number of characters
- Number of lines
- Date of creation / created by who
- Date of modification / modified by who
One missing property is a discrete ‘Author’ field in the OO version.
Xena Viewer has its own Adobe-like viewer for PDFs. When we look at the Document Properties for a normalised PDF, we get the following descriptive metadata:
- File size
- Page count
- Creator (i.e. software that created the file)
- Producer (i.e. software that created the file)
- Date of creation
- Date of modification
We may not need much descriptive metadata for individual images. We looked at the converted images using a free EXIF viewer tool and the date values (creation and modification) are intact in the normalised object.
For emails, the AIP will either be a standard Xena normalised object or a binary Xena object. Broadly, the former can export to an XML version and the latter can export back to its native format (.msg file). We’re guaranteed a minimum of descriptive metadata whichever transformation we enact.
Results – progress so far
Below is a PDF of my table of analysis of descriptive metadata (and significant properties).
We also have Kit’s assessment of the transformations, including his comments on whether a reasonable amount of metadata is intact.
For this post, I just want to look at how emails are rendered as Xena transformations. I should stress that Xena doesn’t expect you to normalise individual .msg files, as we have done, but instead to convert entire Outlook PST files using the readpst plugin. Nevertheless we still got some usable results with the 23 message files in our test corpus, by running the standard normalisation action.
In our test corpus, the emails generally worked well. Almost all of them had an image file attached to the message – a way for the writer to identify their parent organisation. Two test emails had attachments (one a PDF, one a spreadsheet). Four of them contained a copy of an e-Newsletter, content which for some reason we can’t open in the Xena viewer. I’ll want to revisit that, and also try out some other emails with different attachment types.
When we open a normalised email in the Xena Viewer, we can see the following as shown in the screenshot below:
- The text of the email message in the large Package NAA window.
- Above this, a smaller window with metadata from the message header. (In Xena, this is unique to email as far as we can see, and it’s very useful. Why can’t it do this for properties of other documents?)
- Below, a window displaying the technical metadata about the rendition.
- Below the large window on the left, a small window which is like a mini-viewer for all attachments.
- To the right of this window is technical metadata about each attachment.
- Right at the bottom, a window showing the technical metadata about the entire package. Even from this view, it’s clear that Xena is managing an email as separate objects, each with their own small wrapper, within a larger wrapper.
The immediate problem we found here was truncation of the message in the large window. This is being caused by a character rendering issue, and the content isn’t actually lost, as we’ll see when we look at another view.
In the Raw XML view (above image), we can see the XML namespaces being used for preservation of metadata from the email header (http://preservation.naa.gov.au/email/1.0) and a plaintext namespace (http://preservation.naa.gov.au/plaintext/1.0) for the email text.
If we copy the email text from here into a Notepad file, we can see the problem causing the truncation. Something in the system doesn’t like inverted commas. The tag plaintext:bad_char xml:space="preserve" tells us something is wrong.
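Rather than editing the record, a small script could at least flag affected exports by counting the plaintext:bad_char elements (the namespace is the one visible in the Raw XML view):

```python
import xml.etree.ElementTree as ET

# The plaintext namespace observed in Xena's Raw XML view.
PLAINTEXT_NS = "http://preservation.naa.gov.au/plaintext/1.0"

def count_bad_chars(xml_path):
    """Count plaintext:bad_char markers in a Xena XML export, so that
    records with the truncation problem can be flagged for review
    rather than silently hand-edited."""
    tree = ET.parse(xml_path)
    return len(tree.getroot().findall(".//{%s}bad_char" % PLAINTEXT_NS))
```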
In the XML Tree View (image below) we can also see the email being managed as a Header and Parts.
Exporting from Xena
I tried clicking the Export button for an email object. This does the following:
- Puts the message into XML. It can now open in a browser and appears reasonably well formatted.
- Creates an XSL stylesheet – probably this is what formats the message correctly
- Detaches the attachment and stores it in an OO equivalent, or its native format (might be worth investigating this for more examples and seeing what comes out.)
So for a single email with a small image footer and a PDF attachment, I get the following four objects:
- Message rendered in XML
- XSL stylesheet
- One image file
- One PDF file
When opened in a browser, the XML message file also contains hyperlinks to its own attachments, which has some potential use for records management purposes (provided these links can be managed and maintained properly).
So far so good. Unfortunately this XML rendition still has the bad_char problem! It would be possible to open this XML file in UltraEdit and do a find and replace action, but editing a file like this undermines the authenticity of the record, and completely goes against the grain for records managers and archivists.
The second option we have for .msg files is to normalise them using Xena’s Binary Normalisation method instead. This creates something that can’t be read at all in the Xena Viewer, but when exported, this binary object converts back into an exact original copy of the email in .msg format, with no truncation or other rendering problems. Attachments are intact too. It creates the same technical metadata, as expected.
We’ve also tried the PST method, which works using the readpst plugin that comes with Xena. This is pretty nifty and speedy too. The normalisation process produces a Xena digital object for every single email in the PST, and a PST directory also. When exported, we get the XML rendition and stylesheet also. So far, a few PROs and CONs with the PST method:
- PRO1 – no truncation of the message.
- PRO2 – lots of usable metadata.
- CON1 – file names of the XML renditions are not meaningful. Each email becomes a numbered item in an mbox folder.
- CON2 – attachments simply don’t work in this process.
- CON3 – PST files are not really managed records. In fact Kit Good’s initial reaction to this was “If I could ban .pst files I would as I think they are a bit of a nightmare for records management (and data management generally). I could see the value of this in the case of archiving an email account of a senior employee (such as the VC) or eminent academic member of staff but I think it would be impractical for staff, especially as .pst files are often used as a ‘dump’ when inboxes get full.”
In this post, I’m just going to look at the normalised Xena object, which is an Archival Information Package (AIP). It’s encoded in a file format whose extension is .xena and which is identified by Windows as a ‘Xena preserved digital object’ type.
We’ve created these for normalisations of each of our principal records types – documents, spreadsheets, PowerPoint, PDFs, images and emails.
When we click on one of these it launches the Xena Viewer application. This gives an initial view of the AIP in the default NAA Package View. This is very useful as it gives a quick visual on how well the conversion / normalisation has worked. (It seems particularly strong on rendering documents and common image formats.)
The AIP is actually quite a complex object though. It is in fact an XML wrapper which contains the transformed object, and the metadata about it. This conforms with the preservation model as proposed by the Library of Congress Metadata Encoding and Transmission Standard, which suggests that building an XML profile like this is the best way to make a preserved object self-describing over time.
The main components of a Xena AIP are (A) technical metadata about the wrapper and the conversion process, some of which is expressed as Dublin Core-compliant metadata, and some of which conforms to an XML namespace defined by the NAA; (B) a string of code that is the transformed file; and (C) a checksum.
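Since the AIP is plain XML, component (A) can be read programmatically. For instance, the dc:identifier mentioned earlier can be pulled out with a standard XML parser (the Dublin Core namespace is standard, but the surrounding element structure is assumed and may differ between Xena versions):

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core elements namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"

def read_identifier(xena_path):
    """Pull the dc:identifier out of a Xena wrapper -- the UID we
    earlier suggested as a candidate key for a preservation metadata
    database. Returns None if no identifier is present."""
    tree = ET.parse(xena_path)
    el = tree.getroot().find(".//{%s}identifier" % DC_NS)
    return None if el is None else el.text
```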
The Xena Viewer offers three ways of seeing the package:
1) The default NAA Package view. For documents, this shows the representation of the transformed object in a large window. Underneath that window is a smaller scrolling window which displays the transformation metadata. Underneath this is an even smaller window displaying the package signature checksum.
2) The Raw XML view. In this view we can see the entire package as a stream of XML code. This makes clear the use of XML namespaces and Dublin Core elements for the metadata.
3) The XML Tree View. This makes clear the way the AIP is structured as a package of content and metadata.
The default view changes slightly depending on object type:
- For documents, the viewer window shows the OO representation in a manner that keeps some of the basic formatting intact. To be precise, it’s a MIME-compliant representation of the OO document.
- For images, the viewer shows a Base64 rendering of an image file.
- For PDF documents, the viewer integrates a set of Adobe Reader-like functions – search, zoom, properties, page navigation etc. This seems to be done by a JPedal Library GUI.
- For spreadsheets and PowerPoint files, the viewer doesn’t show anything in the big window (but we’ll get to this when we follow the “Show in OpenOffice” option, which will be the subject of another post).
- For emails, the Xena Viewer has a unique view that not only displays the message content, but also the email header, and any attachments, both in separate windows. We’ll look at this again in a future post about how Xena works with emails.
As an Archival Information Package in OAIS terms, a Xena object is clearly compliant and suitable for preservation purposes. The only problem for our project is getting more of our metadata into the AIP – in other words, integrating metadata that records our digital preservation actions. Ideally, we would like to be able to perform a string of actions (DROID identification of an object, virus checking, checksum etc) and integrate the accumulated metadata into a single AIP. However, this is not in the scope of the project.
I did a day’s testing of the Digital Preservation Software Platform (DPSP) in December. The process has been enough to persuade me it isn’t quite right for what we intend with our project, but that’s not intended as a criticism of the system, which has many strengths.
The heart of DPSP is performing the normalisation of office documents to their nearest Open Office equivalent, using Xena. Around this, it builds a workflow that enables automated generation of checksums, quarantining of ingested records, virus checking, and deposit of converted records into an archival repository. It also generates a “manifest” file (not far apart from a transfer list), and logs all of the steps in its database.
The workflow is one that clearly suits The National Archives of Australia (NAA) and matches their practice, which involves thorough checking and QA to move objects in and out of quarantine, in and out of a normalisation stage, and finally into the repository. All of these steps require the generation of Unique IDs and folder destinations which probably match an accessioning or citation system at NAA; there’s no workaround for this, and I simply had to invent dummy IDs and folders for my purpose. The steps also oblige a user to log out of the database and log back in so as to perform a different activity; this is required three times. This process is undoubtedly correct for a National Archives, and I would expect nothing less. It just feels a bit too thorough for what we’re trying to demonstrate with the current project: our preservation policy is not yet this advanced, and there aren’t enough different staff members to cover all the functions. To put it another way, we’re trying to find a simple system that would enable one person (the records manager) to perform normalisations, and DPSP could be seen as “overkill”. I seem to recall PANDORA, the NLA’s web-archiving system, was similarly predicated on a series of workflows where tasks were divided among several members of staff with extensive curation and QA.
My second concern is that multiple sets of AIPs are generated by DPSP. Presumably the intention is that the verified AIPs which end up in the repository will be the archive copies, and anything generated in the quarantine and transformation stages can be discarded eventually. However, this is not made explicit in the workflow, nor is the removal of such copies described in the manual.
My third problem is an area which I’ll have to revisit, because I must have missed something. When running the Xena transformations, DPSP creates two folders – one for “binary” and one for “normalised” objects. The distinction here is not clear to me yet. I’m also worried because out of 47 objects transformed, I ended up with only 31 objects in the “normalised” folder.
The gain with using DPSP is that we get checksums and virus checks built into the workflow, and complete audit trails; so far with our method of “manual” transformations with Xena we have none of the above, although we do get checksums if we use DROID. But I found the DPSP workflow a little clunky and time-consuming, and somewhat counter-intuitive in navigating through the stages.
We extracted technical metadata from our test corpus of common office documents. Within that corpus, there was a small range of versions of MS Office, PDFs, images and email (12 distinct file formats in all). The attached PDF is a comparative table detailing what was extracted by DROID, JHOVE and the Metadata Extraction Tool.
A report on findings will go into our final report. Early indications are that DROID provides the most helpful and consistent technical metadata. The Metadata Extraction Tool has been rather disappointing, particularly in the way that it incorrectly identifies Microsoft Office Open XML files as Open Office formats, and sometimes struggles with versions. As is already known, JHOVE works for a limited range of file formats, and seems to default to reporting “application/octet-stream” for the mimetype of a format it doesn’t recognise.
Both Kit Good and myself are now equipped with the Xena software and various other tools (DROID, JHOVE, the Metadata Extraction Tool) on our desktops. We’ve been making progress with the first round of evaluation of these tools. Kit kindly provided a corpus of test documents we could work on – including documents, spreadsheets, images, emails, and presentations. (They were of course copies, not original electronic documents.)
To evaluate the tools in a methodical way, I drew up these draft evaluation sheets for each digital object type.
For the technical metadata elements, I reused the AHDS minimum PREMIS requirements as a starting point. My thinking was that I’m really only interested in object characteristics, so as yet no Events, Agents, or Rights metadata (all strong PREMIS features) because we’re not actually building a digital repository for this project.
Lastly our table on assessing successful conversions is inspired by Clausen’s fine work on Handling File Formats.
Our hope is that this amounts to a fair set of criteria that conforms to existing literature, and can work for assessing the value of tools, and the converted outputs, which can be easily marked and scored.