Kit Good will describe a potential records management workflow using these open source tools. This post describes a potential preservation workflow. Since UoL does not yet have a dedicated digital preservation service, this workflow will have to remain hypothetical, but the question at its heart is whether these tools can create preservable objects with sufficient technical metadata to ensure long-term continuity.
The records to be preserved would arrive as Submission Information Packages (SIPs). As a first step we would want to ingest selected records into the digital archive repository we haven’t got yet. In our hypothetical scenario, this would probably be done by the records manager (in much the same way as he transfers records of permanent value to the University archivist). If we had a digital archivist, all we would need is an agreed methodology between that person and the records manager, for example a web-based front end like Dropbox that allows us to move the files from one place to another, with some metadata and other documentation.
At a bare minimum, the archive service would want to run the following QA steps on the submitted records:
- Virus check
- Confirm submission formats can be supported in the repository
DROID can perform some of these steps for us with its built-in checksum, and the way it identifies formats by comparing them with the signatures in its online registry, PRONOM. (This would need to align itself with a written preservation policy of some sort.) At the moment we lack a virus checker, but it would be feasible to use an open source virus checker such as ClamAV. This is one step where the DPSP works out nicely, with its built-in virus check and quarantine stage.
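As a rough illustration of the "can we support this format?" check, the sketch below reads a DROID CSV export and lists any file whose PRONOM identifier (PUID) is not on an agreed list. The list `SUPPORTED_PUIDS` and the function `unsupported_files` are hypothetical names, the PUID values are examples only, and the column names `NAME`, `TYPE` and `PUID` are assumed to match the export produced by the DROID version in use.

```python
import csv
import io

# Hypothetical policy list: the PUIDs the repository has agreed to accept.
# Example values only; the real list would come from the preservation policy.
SUPPORTED_PUIDS = frozenset({"fmt/61", "fmt/214", "fmt/40", "fmt/412"})


def unsupported_files(droid_csv, supported=SUPPORTED_PUIDS):
    """Return (file name, PUID) pairs for files we cannot accept.

    droid_csv is an open text file (or any iterable of CSV lines).
    Files that DROID could not identify are reported as "unidentified".
    """
    rejected = []
    for row in csv.DictReader(droid_csv):
        # Assumed column: DROID marks directories with TYPE "Folder".
        if row.get("TYPE") == "Folder":
            continue
        puid = (row.get("PUID") or "").strip()
        if puid not in supported:
            rejected.append((row.get("NAME", ""), puid or "unidentified"))
    return rejected
```

In practice this would be called with something like `open("droid_report.csv", newline="")`, and anything it returns would be referred back to the records manager before ingest.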
Next, we might want to validate the formats in more detail, since DROID identifies a format but does not check that each file is well-formed and valid for that format. This is where JHOVE comes in, although the limitations of that tool with regard to Office documents have already been noted.
You’ll notice by this point we are not advocating use of the NZ Metadata Extraction Tool, since its output has been found a bit lacking, but this would be the stage to deploy it.
Next step is to normalise the records using Xena. This action is as close as we get to migration in our repository and the actions of Xena have already been described.
These steps create a little “bundle” of objects:
- The DROID report in CSV format
- The JHOVE output in XML format
- The normalised Xena object in .xena format
For our purposes, this bundle represents the archival copy, which in OAIS terms is an Archival Information Package (AIP). The DROID and JHOVE files are our technical metadata, while the actual data we want to keep is in the Xena file.
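Gathering the bundle by hand is easy to get wrong, so here is a minimal sketch of how it might be scripted. The function name `build_aip` and the `metadata` / `data` folder layout are assumptions made for illustration, not something any of the tools prescribe.

```python
import shutil
from pathlib import Path


def build_aip(aip_dir, droid_report, jhove_report, xena_objects):
    """Copy the reports and normalised objects into one AIP folder.

    Technical metadata goes into <aip_dir>/metadata and the Xena
    objects into <aip_dir>/data (a hypothetical layout).
    Returns the relative paths of everything in the bundle.
    """
    aip = Path(aip_dir)
    metadata_dir = aip / "metadata"
    data_dir = aip / "data"
    metadata_dir.mkdir(parents=True, exist_ok=True)
    data_dir.mkdir(parents=True, exist_ok=True)

    # copy2 keeps file timestamps where the platform allows it.
    for report in (droid_report, jhove_report):
        shutil.copy2(report, metadata_dir / Path(report).name)
    for xena_object in xena_objects:
        shutil.copy2(xena_object, data_dir / Path(xena_object).name)

    return sorted(
        p.relative_to(aip).as_posix() for p in aip.rglob("*") if p.is_file()
    )
```

The returned list could be written out as a simple packing list to travel with the AIP.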
Move to archive
Step four would be to move this AIP bundle into archival storage, keeping the “original” submissions elsewhere in the managed store.
A fifth step would be to re-render the Xena objects as dissemination copies, as and when a member of staff requests access to a preserved record. For this, we would open the Xena object using the Xena viewer and create a Dissemination Information Package (DIP) version by clicking the OpenOffice button to produce a readable document. (As noted previously, we can also make this OO version into a PDF.)
This access step creates yet more digital objects. In fact if we look at the test results for the 12 spreadsheets in our test corpus, we have created the following items in our bundle:
- One DROID file which analysed all 12 spreadsheets – in CSV
- One JHOVE file which analysed all 12 spreadsheets – in XML
- 12 normalised Xena objects – in XENA
- A test OpenOffice rendition of one of the spreadsheets in ODS
- A PDF rendering of that OO version
In a separate folder, there are 12 MET files in XML – one for each spreadsheet
In real life we certainly wouldn’t want to keep all of these objects in the same place; the archival files must be stored separately from the dissemination versions.
Many of the above stages present opportunities for a checksum script to be run, to ensure that the objects have not been corrupted while being transformed or moved in and out of the archival store. If we wanted to go further with checking, we would re-validate the files using JHOVE.
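A checksum script of this kind need not be complicated. The sketch below uses only the Python standard library; the function names `sha256_of`, `make_manifest` and `verify_manifest` are made up for this example, and SHA-256 is simply one reasonable choice of algorithm.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_manifest(folder):
    """Map each file's path (relative to folder) to its checksum."""
    root = Path(folder)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(folder, manifest):
    """Return the files that are missing or whose checksum has changed."""
    current = make_manifest(folder)
    return sorted(
        name for name, digest in manifest.items() if current.get(name) != digest
    )
```

We would run `make_manifest` before a move or transformation and `verify_manifest` afterwards; an empty list means nothing recorded in the manifest has changed or gone missing.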
Sounds simple, doesn’t it? But there are quite a few preservation gaps with this bare-bones toolkit.
The technical metadata outputs from DROID and JHOVE are all “loose”, not tightly fused with the object they actually relate to. To put it another way, the processes described above have involved running several separate actions on the object that is the target for preservation. They have also created several separate outputs, which have landed in several different destinations on my PC.
We could manually move everything into a “bundle” as suggested above, but this is extra work and feels a bit insecure and unreliable. For real success, we need a method (a database, perhaps) that manages all the loose bits and pieces under a single UID, or other reliable retrieval method. Xena does create such a UID for each of its objects – it’s visible there in the dc:identifier tag. So we may stand a chance of using that element in future iterations of this work.
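To make the database idea a little more concrete, here is a minimal sketch using SQLite, which ships with Python. The table name `artefact`, its columns and the three helper functions are all hypothetical; the identifier is assumed to be whatever value we settle on, such as the one Xena writes into dc:identifier.

```python
import sqlite3


def open_registry(path=":memory:"):
    """Open (or create) a small registry of artefacts keyed by object ID."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS artefact (
               object_id TEXT NOT NULL,  -- e.g. the Xena dc:identifier value
               role      TEXT NOT NULL,  -- e.g. 'xena', 'droid', 'jhove'
               location  TEXT NOT NULL,  -- where the file currently lives
               PRIMARY KEY (object_id, role, location)
           )"""
    )
    return conn


def register(conn, object_id, role, location):
    """Record one loose output against an object; duplicates are ignored."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO artefact VALUES (?, ?, ?)",
            (object_id, role, location),
        )


def artefacts_for(conn, object_id):
    """Return every (role, location) pair recorded for one object."""
    rows = conn.execute(
        "SELECT role, location FROM artefact "
        "WHERE object_id = ? ORDER BY role, location",
        (object_id,),
    )
    return rows.fetchall()
```

With something like this in place, a single lookup on the identifier would return the normalised object and every report that mentions it, wherever they happen to be stored.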
Another point, which may seem trivial: JHOVE, DROID and Xena can do batch processing, but MET cannot. This results in a mismatch of outputs (one report per batch from DROID and JHOVE, one file per object from MET), and the outputs are created in different formats.
There is also some duplication among the detail of the technical metadata that has been extracted from the objects.
Ideally we’d like the technical metadata to be embedded within a single wrapper, along with the object itself. The Xena wrapper seems the most obvious place for this. I lack the technical ability to understand how to do it, though. What I would like is some form of XML authoring tool that enables me to write the DROID and JHOVE output directly into the Xena wrapper.
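I can at least sketch the general shape of the idea. The example below does not touch the real Xena wrapper, whose structure I have not studied closely enough to edit safely; instead it builds a separate, entirely hypothetical `package` element that carries one object's DROID row and JHOVE report side by side. The element and attribute names are invented for illustration.

```python
import xml.etree.ElementTree as ET


def wrap_metadata(object_id, droid_row, jhove_xml):
    """Combine one object's DROID and JHOVE output in a single XML string.

    droid_row is a dict of column name -> value (one row of the CSV).
    jhove_xml is the JHOVE report for the same object, as a string.
    The <package>/<technicalMetadata> structure is hypothetical.
    """
    package = ET.Element("package", {"id": object_id})

    droid = ET.SubElement(package, "technicalMetadata", {"source": "DROID"})
    for column, value in droid_row.items():
        field = ET.SubElement(droid, "field", {"name": column})
        field.text = value

    # Parse the JHOVE report and embed it as a child element, unchanged.
    jhove = ET.SubElement(package, "technicalMetadata", {"source": "JHOVE"})
    jhove.append(ET.fromstring(jhove_xml))

    return ET.tostring(package, encoding="unicode")
```

Whether the same blocks could be written into the Xena wrapper itself, without upsetting the Xena viewer, is something we would need to test.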
All the steps described above are manual. Obviously if we were going to do this on a larger scale we would want to automate the actions. This is not surprising and we knew this would be the case before we embarked on the project, but it’s good to see just how far our workflow is from being joined-up.
Likewise, our audit trail for preservation actions is a bit distributed. Both DROID and JHOVE give us a ‘Last Modified’ date for each object processed, and Xena embeds naa:last-modified within the XML output for each object, but ideally these dates ought to be retrievable by the preservation metadata database and presented as a kind of linear chain of events. We’d also like to have a field identifying what the process was that triggered the ‘Last Modified’ date.
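The “linear chain of events” could be modelled very simply. In the sketch below, `PreservationEvent` and `timeline` are hypothetical names, and the `process` field is the extra piece of information we currently lack: a label saying which action produced each date.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class PreservationEvent:
    object_id: str       # identifier shared by all outputs for one record
    timestamp: datetime  # e.g. a 'Last Modified' value from a tool report
    process: str         # which action produced the date, e.g. "Xena normalisation"
    detail: str = ""     # optional free-text note


def timeline(events, object_id):
    """Return one object's events in chronological order."""
    return sorted(
        (event for event in events if event.object_id == object_id),
        key=lambda event: event.timestamp,
    )
```

Each tool run would add one event, and `timeline` would then give the audit trail for a record in a single ordered list.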
How do we manage the descriptive metadata in this process? In order for us to deliver a DIP to our consumer, they would have to know that the record exists, and would need enough useful information about it to be able to retrieve it. We have confirmed that descriptive metadata survives the Xena process, but how can we expose it for our users?
What we’re talking about here is a searchable front end to the archival store which we haven’t yet built. Kit is proposing a structured file store for his records, so maybe we need to expand on this approach and think about ways of delivering a souped-up archival catalogue for these assets.