Here’s a copy of our final report, published today. The PDF is 1.63MB, 55 pages long.
How it could work…
The following workflows show how the open source tools and preservation techniques that the project has tested and explored could fit into a wider process for the effective management of electronic records.
The approach relies on:
- adapting the available functionality of a well-managed file server system
- training and relationship building with relevant staff by the records manager
- the identification of sets or classes of electronic records held on shared drives that need to be captured and managed in a structured way
- a commitment to transform use of the ‘shared drive’ from a long-term, unstructured collection of documents of varying relevance and quality to a temporary, collaborative area with strict controls on size and data retention
I’ve included the full document here, which has several other ‘scenarios’ for managing electronic records using these tools: Records management workflows
A preservation workflow using our toolkit
Kit Good will describe a potential records management workflow using these open source tools; this post describes a potential preservation workflow. Since UoL does not yet have a dedicated digital preservation service, the workflow has to remain hypothetical, but the question at its heart is whether these tools can create preservable objects with sufficient technical metadata to ensure long-term continuity.
Ingest
The records to be preserved would arrive as Submission Information Packages (SIPs). As a first step we would want to ingest selected records into the digital archive repository we haven’t got yet. In our hypothetical scenario, this would probably be done by the records manager (in much the same way as he transfers records of permanent value to the University archivist). If we had a digital archivist, all we would need is an agreed transfer method between that person and the records manager – for example, a web-based front end like Dropbox that allows us to move the files from one place to another, along with some metadata and other documentation.
Validation
At a bare minimum, the archive service would want to run the following QA steps on the submitted records:
- Fixity check
- Virus check
- Confirmation that submission formats can be supported in the repository
DROID can perform some of these steps for us: it has a built-in checksum facility, and it identifies formats by comparing them against its online registry of signatures. (This would need to align with a written preservation policy of some sort.) At the moment we lack a virus checker, but it would be feasible to add a free scanner such as Avast. This is one step where the DPSP works out nicely, with its built-in virus check and quarantine stage.
Next, we might want to validate the formats in more detail, since DROID identifies formats but does not validate files against their format specifications. This is where JHOVE comes in, although the limitations of that tool with regard to Office documents have already been noted.
You’ll notice that by this point we are not advocating use of the NZ Metadata Extraction Tool, since we found its output a little lacking, but this would be the stage at which to deploy it.
Transformation
The next step is to normalise the records using Xena. This is as close as we get to migration in our repository, and the actions of Xena have already been described.
These steps create a little “bundle” of objects:
- The DROID report in CSV format
- The JHOVE output in XML format
- The normalised Xena object in .xena format
For our purposes, this bundle represents the archival copy which in OAIS terms is an Archival Information Package (AIP). The DROID and JHOVE files are our technical metadata, while the actual data we want to keep is in the Xena file.
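To make the shape of this bundle a little more concrete, here is a minimal sketch in Python of how the three outputs could be gathered into a single AIP folder with a simple checksum manifest. All of the file and folder names are hypothetical, and this is not part of the toolkit we tested – just an illustration of the structure described above.

```python
import hashlib
import json
import shutil
from pathlib import Path

# Hypothetical locations of the three outputs described above.
DROID_REPORT = Path("output/droid_report.csv")   # DROID identification report (CSV)
JHOVE_REPORT = Path("output/jhove_output.xml")   # JHOVE validation output (XML)
XENA_OBJECTS = Path("output/xena")               # folder of normalised .xena files

AIP_ROOT = Path("archive/aip_0001")              # destination folder for this bundle


def sha256(path):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_bundle():
    """Copy the reports and normalised objects into one folder with a manifest."""
    AIP_ROOT.mkdir(parents=True, exist_ok=True)
    manifest = {}

    for source in [DROID_REPORT, JHOVE_REPORT] + sorted(XENA_OBJECTS.glob("*.xena")):
        target = AIP_ROOT / source.name
        shutil.copy2(source, target)
        manifest[source.name] = sha256(target)

    # Record what is in the bundle, plus a checksum for each file.
    (AIP_ROOT / "manifest.json").write_text(json.dumps(manifest, indent=2))


if __name__ == "__main__":
    build_bundle()
```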
Move to archive
Step four would be to move this AIP bundle into archival storage, keeping the “original” submissions elsewhere in the managed store.
Access
A fifth step would be to re-render the Xena objects as dissemination copies, as and when requested by a member of staff who wants access to a preserved record. For this, we would open the Xena object using the Xena viewer and create a Dissemination Information Package (DIP) version by clicking the OpenOffice button to produce a readable document. (As noted previously, we can also turn this OO version into a PDF.)
This access step creates yet more digital objects. In fact if we look at the test results for the 12 spreadsheets in our test corpus, we have created the following items in our bundle:
- One DROID file which analysed all 12 spreadsheets – in CSV
- One JHOVE file which analysed all 12 spreadsheets – in XML
- 12 normalised Xena objects – in XENA
- A test Open Office rendition of one of the spreadsheets in ODS
- A PDF rendering of that OO version
In a separate folder, there are 12 MET files in XML – one for each spreadsheet.
In real life we certainly wouldn’t want to keep all of these objects in the same place; the archival files must be stored separately from the dissemination versions.
Checking
Many of the above stages present opportunities for a checksum script to be run, to ensure that the objects have not been corrupted in the processes of transformation and moving them in and out of the archival store. If we wanted to go further with checking, we would re-validate the files using JHOVE.
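To give an idea of what such a checksum script might look like, here is a minimal sketch in Python that writes a checksum manifest for everything in a (hypothetical) archival store and can later re-verify it. Anything used in earnest would need to be agreed with IT and scheduled properly.

```python
import hashlib
from pathlib import Path

STORE = Path("archive")                 # hypothetical archival store
MANIFEST = STORE / "checksums.sha256"   # one "<digest>  <relative path>" line per file


def sha256(path):
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest():
    """Record a checksum for every file currently in the store."""
    lines = []
    for path in sorted(STORE.rglob("*")):
        if path.is_file() and path != MANIFEST:
            lines.append(f"{sha256(path)}  {path.relative_to(STORE)}")
    MANIFEST.write_text("\n".join(lines))


def verify_manifest():
    """Return the files whose current checksum no longer matches the manifest."""
    failures = []
    for line in MANIFEST.read_text().splitlines():
        digest, _, relative = line.partition("  ")
        path = STORE / relative
        if not path.is_file() or sha256(path) != digest:
            failures.append(relative)
    return failures


if __name__ == "__main__":
    write_manifest()                       # run after objects are moved into the store
    print("Mismatched files:", verify_manifest())
```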
Sounds simple, doesn’t it? But there are quite a few preservation gaps with this bare-bones toolkit.
Gap #1
The technical metadata outputs from DROID and JHOVE are all “loose”, not tightly fused with the object they actually relate to. To put it another way, the processes described above have involved running several separate actions on the object that is the target for preservation. It has also created several separate outputs, which have landed in several different destinations on my PC.
We could manually move everything into a “bundle” as suggested above, but this is extra work and feels a bit insecure and unreliable. For real success, we need a method (a database, perhaps) that manages all the loose bits and pieces under a single UID, or other reliable retrieval method. Xena does create such a UID for each of its objects – it’s visible there in the dc:identifier tag. So we may stand a chance of using that element in future iterations of this work.
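As a small proof of concept only, the dc:identifier could be pulled out of a Xena file with a few lines of Python. The folder name is hypothetical, and the element lookup assumes the identifier sits somewhere in the wrapper as a standard Dublin Core element, which may vary between Xena versions.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

DC_NS = "http://purl.org/dc/elements/1.1/"   # standard Dublin Core namespace


def xena_identifier(xena_file):
    """Return the first dc:identifier found in a .xena wrapper, or None."""
    root = ET.parse(xena_file).getroot()
    element = root.find(f".//{{{DC_NS}}}identifier")
    return element.text if element is not None else None


if __name__ == "__main__":
    for path in sorted(Path("output/xena").glob("*.xena")):   # hypothetical folder
        print(path.name, "->", xena_identifier(path))
```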
Another thing may seem trivial: JHOVE, DROID and Xena can do batch processing, but MET cannot. This results in a mismatch of outputs, which are also created in different formats.
There is also some duplication among the detail of the technical metadata that has been extracted from the objects.
Gap #2
Ideally we’d like the technical metadata to be embedded within a single wrapper, along with the object itself. The Xena wrapper seems the most obvious place for this. I lack the technical ability to understand how to do it, though. What I would like is some form of XML authoring tool that enables me to write the DROID and JHOVE output directly into the Xena wrapper.
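In principle a script could do this too, by appending the DROID and JHOVE output as extra elements inside the Xena wrapper. The sketch below does exactly that under a made-up namespace, but it is an untested assumption: whether Xena and its viewer would still accept a wrapper modified in this way would need to be confirmed, and editing the AIP may well have implications for its embedded checksum.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Made-up namespace for our own additions; not part of Xena or the NAA schemas.
LOCAL_NS = "http://example.org/uol-preservation/1.0"
ET.register_namespace("uol", LOCAL_NS)


def embed_reports(xena_file, reports):
    """Append the text of each report as an element inside the Xena wrapper."""
    tree = ET.parse(xena_file)
    wrapper = tree.getroot()

    for label, report_file in reports.items():
        element = ET.SubElement(wrapper, f"{{{LOCAL_NS}}}{label}")
        element.text = report_file.read_text(errors="replace")

    # Write to a copy rather than overwriting the original AIP.
    tree.write(xena_file.with_suffix(".enriched.xena"),
               encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    embed_reports(Path("output/xena/example.xena"), {        # hypothetical paths
        "droid-report": Path("output/droid_report.csv"),
        "jhove-report": Path("output/jhove_output.xml"),
    })
```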
Gap #3
All the steps described above are manual. Obviously, if we were going to do this on a larger scale we would want to automate the actions. This is not surprising, and we knew it would be the case before we embarked on the project, but it is useful to see the extent to which our workflow remains un-joined-up.
Gap #4
Likewise, our audit trail for preservation actions is a bit distributed. Both DROID and JHOVE give us a ‘Last Modified’ date for each object processed, and Xena embeds naa:last-modified within the XML output for each object, but ideally these dates ought to be retrievable by the preservation metadata database and presented as a kind of linear chain of events. We’d also like to have a field identifying what the process was that triggered the ‘Last Modified’ date.
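A very small preservation metadata database would go a long way here. As a sketch only (the table and field names are invented), a SQLite event log could record each action against the object's UID, so the chain of events can be read back in date order along with the process that triggered it.

```python
import sqlite3
from datetime import datetime, timezone

# Invented schema: one row per preservation action against an object's UID.
SCHEMA = """
CREATE TABLE IF NOT EXISTS events (
    object_uid  TEXT NOT NULL,   -- e.g. the dc:identifier from the Xena wrapper
    event       TEXT NOT NULL,   -- e.g. 'droid-identification', 'xena-normalisation'
    tool        TEXT NOT NULL,
    occurred_at TEXT NOT NULL    -- ISO 8601 timestamp
);
"""


def log_event(db, object_uid, event, tool):
    db.execute("INSERT INTO events VALUES (?, ?, ?, ?)",
               (object_uid, event, tool, datetime.now(timezone.utc).isoformat()))
    db.commit()


def history(db, object_uid):
    """Return the chain of events for one object, oldest first."""
    return db.execute(
        "SELECT occurred_at, event, tool FROM events "
        "WHERE object_uid = ? ORDER BY occurred_at", (object_uid,)).fetchall()


if __name__ == "__main__":
    db = sqlite3.connect("preservation_events.db")
    db.executescript(SCHEMA)
    log_event(db, "urn:example:0001", "droid-identification", "DROID 6")
    log_event(db, "urn:example:0001", "xena-normalisation", "Xena")
    for row in history(db, "urn:example:0001"):
        print(row)
```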
Gap #5
How do we manage the descriptive metadata in this process? In order to deliver a DIP to our consumer, they would have to know the record exists, and would need to know something useful about it in order to retrieve it. We have confirmed that descriptive metadata survives the Xena process, but how can we expose it for our users?
What we’re talking about here is a searchable front end to the archival store which we haven’t yet built. Kit is proposing a structured file store for his records, so maybe we need to expand on this approach and think about ways of delivering a souped-up archival catalogue for these assets.
‘The DROID we are looking for?’ – Using the open source tools
One of the objectives of this project is to examine how useful and practical the Open Source preservation tools are.
I thought it would be worth recording my experience of installing and using the software in order to carry out the tests for the project. I’ve heard so many people over the years, from my mother to system administrators, use the phrase ‘I’m not very technical or IT-savvy’, so I won’t say that. I will be honest and confess that I am just IT-savvy enough to embark on something like this, but not savvy enough to master it properly. I think many of us working in records management occupy this sort of space, which is why none of the following should be attempted without the input and guidance of your IT department.
Deciding to use open source tools to support records management is a different approach for a records manager. We are used to stages such as building a business case, procuring a system and implementing it. Open source tools are a lot simpler: find the tool on SourceForge and – if your IT department says it’s OK – click download, and there it is on your desktop.
DROID
According to its website DROID (Digital Record Object Identification) is “a software tool developed by The National Archives to perform automated batch identification of file formats”.
I found downloading this tool very simple, and it is very intuitive to use.
The functionality of DROID that stood out for me was the analysis and reporting it can produce. Fundamental to my conception of the business case for electronic records management is the amount spent on storage and back-up of large, unstructured shared drives. Whilst I had used simple ‘search’ analysis in Windows Explorer to assess shared drives, DROID did this reporting more quickly and was able to produce it as a CSV export that allowed for further analysis. The possibility of producing metrics to support a business case is therefore much enhanced with DROID.
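As an illustration of the kind of metric this makes possible, the sketch below totals file counts and sizes per identified format from a DROID CSV export. The column names (‘FORMAT_NAME’, ‘SIZE’) are the ones I believe DROID uses in its export, but they may differ between versions, so check the header row of your own report first.

```python
import csv
from collections import defaultdict
from pathlib import Path

REPORT = Path("droid_report.csv")   # a DROID CSV export of a shared drive


def summarise(report):
    """Total the number of files and bytes per identified format."""
    counts, sizes = defaultdict(int), defaultdict(int)
    with report.open(newline="", encoding="utf-8-sig") as handle:
        for row in csv.DictReader(handle):
            name = row.get("FORMAT_NAME") or "Unidentified"
            counts[name] += 1
            sizes[name] += int(row.get("SIZE") or 0)
    return counts, sizes


if __name__ == "__main__":
    counts, sizes = summarise(REPORT)
    for name in sorted(sizes, key=sizes.get, reverse=True):
        print(f"{name:40} {counts[name]:6} files {sizes[name] / 1_048_576:10.1f} MB")
```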
The reporting capability also gave me the idea of using DROID to run regular reports on a structured, organised file server repository. One of the challenges of a ‘file servers/existing tools’ approach to electronic records management is producing analysis and reporting. DROID allows a simple manual process, taking a few minutes at most, that could provide a regular ‘snapshot’ to map the development, growth and reduction of an electronic repository over time. I have demonstrated how this might work in scenarios 2 and 3 of my project output ‘Records management workflows’.
DROID therefore was an immediate help and has the potential to be a valuable tool for records managers.
XENA
According to its website (http://xena.sourceforge.net/index.php), XENA (XML Electronic Normalising for Archives) is a tool intended to “aid in the long term preservation of digital records”. The tool was developed by the National Archives of Australia.
XENA needed a few other things installed on my PC before I could get it up and running, so I had to enlist the help of my IT support people to install OpenOffice and upgrade my Java settings accordingly.
Once it was installed, XENA was quite easy to use. It follows the OAIS model by converting the document into a preservable digital object. You can view some of the content through the XENA viewer, but to make it properly available you ‘publish’ or export the normalised document, usually as an Open Office document or PDF. (Further details on the outcome of our testing are available elsewhere on the project blog.)
The concept of a safely preserved set of documents, from which a copy can be created when required, is quite a departure from the idea of an EDRMS master repository of documents which a user can search. Could a user – a member of staff or a member of the public – search a catalogue of document titles and keywords rather than the documents themselves? This might be impractical for ‘short retention / regularly accessed’ types of records. But it could be ideal for permanent retention documents (such as governance minutes, regulations, strategies) that cover both bases as valuable administrative records and documents of archival or public interest.
Challenges / Risks
1) There is an assumption that XENA and DROID will continue to be supported and relevant in years to come. That they are products of National Archives programmes suggests a longevity that makes the software worth using for long-term preservation. Would the Open Office output of XENA run more of a risk of obsolescence than the XENA objects themselves?
2) The use of these tools requires a lot of manual work for the records manager – running reports, converting documents etc. – which can be automated by an EDRMS. However, you could argue that this sort of work was a key part of past – and still persisting – paper records management processes. Additionally, an EDRMS brings, alongside significant costs, a host of system administration and training overheads which would be sidestepped by this sort of approach.
Conclusion
Both these open source tools, which are free to download and took little longer than 30 minutes to get up and running, provide some really interesting and valuable functionality for records managers. Converting them from a desktop ‘toy’ to a scalable working component in a records management system or process is a much more difficult step, but my testing of DROID and XENA has certainly given me plenty to think about.
Significant Properties in Office documents
Significant properties are properties of digital objects or files which are considered to be essential for their interpretation, rendering or meaningful access. One of the aims of digital preservation is to make sure we don’t lose or corrupt these properties through our preservation actions, especially migration. For this particular project, we want to be assured that a Xena normalisation has not affected the significant properties of our electronic records.
Much has been written on this complex subject. Significant properties are to do with preserving the “performance” and continued behaviour of an object; they aren’t the same thing as the technical metadata we’ve already assessed, nor the same as descriptive metadata. For more information, see The InSPECT project, which is where we obtained advice about significant property elements for various file formats and digital object types.
However, the guidance we used for significant properties of document types was Document Metadata: Document Technical Metadata for Digital Preservation, Florida Digital Archive and Harvard University Library, 2009.
To summarise it, Florida and Harvard suggest we should be interested in system counts of pages, words, paragraphs, lines and characters in the text; use of embedded tables and graphics; the language of the document; the use of embedded fonts; and any special features in the document.
When normalising MS Office documents to Open Office, the Xena process produces a block of code within the AIP wrapper. I assume the significant properties of the documents are somewhere in that block of code. But we can’t actually view them using the Xena viewer.
We can, however, view the properties if we look at the end of the chain of digital preservation and go to the DIP. For documents, the Open Office rendering of a Xena AIP has already been shown to have retained many of the significant properties of the original MS Word file (see previous post). The properties are visible in Open Office, via File, Properties, Statistics:
The language property of the document is visible in Tools, Options, Language Settings (see image below). I’m less certain about the meaning or authenticity of this, and wonder if Open Office is simply restating the default language settings of my own installation, whereas what we want is the embedded language property of the original document.
The other area where Open Office fails to satisfy us is embedded fonts. However, these features are more normally associated with PDF files, which depend on embedding fonts for their portability.
For our project, what we could do is take the Open Office normalisation and use the handy Open Office feature, a magic button which can turn any of its products into a PDF.
We could use the taskbar button to do this, but I’d rather use File, Export as PDF. (See screenshot below.) This gives me the option to select PDF/A-1a, an archival PDF format; and lossless compression for any images in the document. Both those options will give me more confidence in producing an object that is more authentic and can be preserved long-term.
This experiment yields a PDF with the following significant properties about the embedded fonts:
These are the Document Property fields which Adobe supports.
The above demonstrates how, in the chain of transformation from MS Word > Xena AIP > Open Office equivalent > Exported PDF, these significant properties may not be viewable all in one place, but they do survive the normalisation process.
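As an aside on the export step: if this conversion ever needed automating, recent versions of LibreOffice can perform a similar export from the command line. The sketch below (Python calling a headless ‘soffice’) produces a plain PDF only; whether the PDF/A-1a and lossless image options described above can be passed as export filter options will depend on the version installed, so treat this strictly as a starting point rather than a tested recipe.

```python
import subprocess
from pathlib import Path

SOURCE = Path("documents/minutes.ods")   # hypothetical Open Office rendition
OUTDIR = Path("dip")                     # where the dissemination copy should go

OUTDIR.mkdir(parents=True, exist_ok=True)

# Plain PDF export via a headless LibreOffice install ('soffice' must be on PATH).
# Selecting PDF/A-1a and lossless images needs extra export filter options that
# are not shown here and should be checked against the installed version.
subprocess.run(
    ["soffice", "--headless", "--convert-to", "pdf",
     "--outdir", str(OUTDIR), str(SOURCE)],
    check=True,
)
print("Wrote", OUTDIR / (SOURCE.stem + ".pdf"))
```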
DIPs from a Xena object – descriptive metadata
In OAIS terms, the Xena digital object is our Archival Information Package (AIP). This is the object that would be stored and preserved in UoL’s archival storage system (if we had one).
For the purposes of this project, the records manager needs to be assured that we can render and deliver a readable, authentic version of the record from the Xena object. In OAIS terms, this could be seen as a Dissemination Information Package (DIP) derived from an AIP. Among other things, the DIP ought to include sufficient descriptive metadata.
We’ve defined a minimal set of what we would need for records management purposes (names, dates for authenticity) and for information retrieval, and criteria for assessing overall success of the transformations – such as legibility, presentation, look and feel, and basic functionality. See our previous post for the documentation of how we arrived at our criteria.
For most of the examples below we are looking at standard Xena normalisations, which can produce a DIP by the following methods:
- Export the AIP to nearest Open Office equivalent
- Export the AIP to native format
- Export the AIP to a format of our choice (an option we haven’t tried as yet)
Emails are slightly more complex – see below for more detail, and previous post on emails.
Documents, spreadsheets and PowerPoint files
For these MS Office documents, Xena normalises by converting them to their nearest Open Office equivalent. We can view this Open Office version via the Xena Viewer simply by clicking on Show in OpenOffice.org.
In the case of a sample docx, this shows us the document with its original formatting intact. In Open Office, we can now look at File, Properties, and confirm the following descriptive metadata are intact (some of these we will revisit when we look at significant properties):
- Title
- Subject
- Keywords
- Comments
- Company Name
- Number of pages
- Number of tables
- Number of graphics
- Number of OLE objects
- Number of paragraphs
- Number of words
- Number of characters
- Number of lines
- Size
- Date of creation / created by whom
- Date of modification / modified by whom
One missing property is a discrete ‘Author’ field in the OO version.
PDFs
Xena Viewer has its own Adobe-like viewer for PDFs. When we look at the Document Properties for a normalised PDF, we get the following descriptive metadata:
- File size
- Page count
- Title
- Author
- Creator (i.e. the application that created the original document)
- Producer (i.e. the software that generated the PDF)
- Date of creation
- Date of modification
Images
We may not need much descriptive metadata for individual images. We looked at the converted images using a free EXIF viewer tool and the date values (creation and modification) are intact in the normalised object.
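The same check can be scripted if we ever need to do it at scale. The sketch below uses the Pillow library (not part of our toolkit) to read the standard EXIF DateTime tag from an image before and after normalisation; the file paths are hypothetical.

```python
from pathlib import Path
from PIL import Image   # Pillow; not part of the project toolkit

DATETIME_TAG = 306      # standard EXIF 'DateTime' tag (date of last modification)


def exif_datetime(path):
    """Return the EXIF DateTime string for an image, or None if it is absent."""
    with Image.open(path) as img:
        return img.getexif().get(DATETIME_TAG)


if __name__ == "__main__":
    original = Path("corpus/photo.jpg")   # hypothetical test image
    exported = Path("dip/photo.jpg")      # the copy exported from the Xena object
    print("original  :", exif_datetime(original))
    print("normalised:", exif_datetime(exported))
```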
Emails
For emails, the AIP will either be a standard Xena normalised object or a binary Xena object. Broadly, the former can export to an XML version and the latter can export back to its native format (.msg file). We’re always guaranteed to get a minimum of descriptive metadata whichever transformation we enact. This would be:
- Title
- Author
- Recipient
- Date
Results – progress so far
Below is a PDF of my table of analysis of descriptive metadata (and significant properties).
We also have Kit’s assessment of the transformations, including his comments on whether a reasonable amount of metadata is intact.
Meeting up
It is always great to share ideas and experiences with others in your field, and so the meeting of the Digital infrastructure programme – of which our project is part – was enjoyable and very useful.
The first thing, after arriving at the impressive LSE Library building, was to introduce ourselves and give five-minute presentations on our projects. A full list is here:
http://www.jisc.ac.uk/whatwedo/programmes/preservation/12-11projectlist.aspx
Through our project and my growing general interest, digital archiving techniques are helping me to shape some new approaches to more conventional electronic records management problems.
The outcome of Manchester’s Carcanet Case Study, which deals with the acquisition and preservation of email accounts, will be invaluable to records managers trying to define approaches to managing emails as records across their organisations.
The Institute of Education’s Digital Directorate project was interesting to me as an example of archivists approaching what is essentially a records management challenge. The inter-disciplinary approach can surely only benefit both fields.
I am a keen advocate of training and awareness and several projects – DICE, PrePARE, SHARD – looked at ways of engaging the producers of research data with preservation issues. I’ll also be very interested in the outcomes of Bristol’s DataSafe project. There are so many methods now of delivering training; these projects should give some useful pointers to what works best and what our target audiences want.
The discussions that followed covered the value and benefits of digital preservation and how best to establish a business case for DP projects. I’m still convinced that it will be the hard numbers around reduced storage costs that will give a bedrock to the other DP benefits. The implied processes of appraisal and selection around what groups of records to keep will surely give us the opportunity to get rid of the huge amounts of unstructured information many organisations are storing.
There was a debate about the best approaches to ‘community engagement’ around digital preservation, including an account of a previous three-day ‘hackathon’ as part of the AQuA project. Whilst I am a big fan of listservs and online forums, this event showed that getting a few people in a room to share their experience is still the best way to build and maintain a ‘community’.
Thanks to JISC and the SPRUCE project for organising and to the attendees for a really enjoyable and useful meeting.
Emails in Xena
For this post, I just want to look at how emails are rendered as Xena transformations. I should stress that Xena doesn’t expect you to normalise individual .msg files, as we have done, but instead convert entire Outlook PST files using the readpst plugin. Nevertheless we still got some useable results with the 23 message files in our test corpus, by running the standard normalisation action.
In our test corpus, the emails generally worked well. Almost all of them had an image file attached to the message – a way for the writer to identify their parent organisation. Two test emails had attachments (one a PDF, one a spreadsheet). Four of them contained a copy of an e-newsletter, content which for some reason we can’t open in the Xena viewer. I’ll want to revisit that, and also try out some other emails with different attachment types.
When we open a normalised email in the Xena Viewer, we can see the following as shown in the screenshot below:
- The text of the email message in the large Package NAA window.
- Above this, a smaller window with metadata from the message header. (In Xena, this is unique to email as far as we can see, and it’s very useful. Why can’t it do this for properties of other documents?)
- Below, a window displaying the technical metadata about the rendition.
- Below the large window on the left, a small window which is like a mini-viewer for all attachments.
- To the right of this window is technical metadata about each attachment.
- Right at the bottom, a window showing the technical metadata about the entire package. Even from this view, it’s clear that Xena is managing an email as separate objects, each with their own small wrapper, within a larger wrapper.
The immediate problem we found here was truncation of the message in the large window. This is caused by a character rendering issue, and the content isn’t actually lost, as we’ll see when we look at another view.
In the Raw XML view (above image), we can see the XML namespaces being used for preservation of metadata from the email header (http://preservation.naa.gov.au/email/1.0) and a plaintext namespace (http://preservation.naa.gov.au/plaintext/1.0) for the email text.
If we copy the email text from here into a Notepad file, we can see the problem causing the truncation: something in the system doesn’t like inverted commas. The tag plaintext:bad_char xml:space="preserve" tells us something is wrong.
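Rather than eyeballing the Raw XML for every message, a short script can flag any normalised email that contains these bad_char markers. The sketch below uses the plaintext namespace shown above; the folder name is hypothetical and the element structure is inferred from what the Raw XML view shows, so treat it as provisional.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace visible in the Raw XML view of a normalised email.
PLAINTEXT_NS = "http://preservation.naa.gov.au/plaintext/1.0"


def has_bad_chars(xena_file):
    """True if the wrapper contains any plaintext:bad_char elements."""
    root = ET.parse(xena_file).getroot()
    return root.find(f".//{{{PLAINTEXT_NS}}}bad_char") is not None


if __name__ == "__main__":
    for path in sorted(Path("output/emails").glob("*.xena")):   # hypothetical folder
        if has_bad_chars(path):
            print("Possible truncation / rendering issue:", path.name)
```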
In the XML Tree View (image below) we can also see the email being managed as a Header and Parts.
Exporting from Xena
I tried clicking the Export button for an email object. This does the following:
- Puts the message into XML. It can now be opened in a browser and appears reasonably well formatted.
- Creates an XSL stylesheet – probably this is what formats the message correctly
- Detaches the attachment and stores it as an OO equivalent, or in its native format (might be worth investigating this with more examples to see what comes out).
So for a single email with a small image footer and a PDF attachment, I get the following four objects:
- Message rendered in XML
- XSL stylesheet
- One image file
- One PDF file
When opened in a browser, the XML message file also contains hyperlinks to its own attachments, which has some potential use for records management purposes (provided these links can be managed and maintained properly).
So far so good. Unfortunately this XML rendition still has the bad_char problem! It would be possible to open this XML file in UltraEdit and do a find and replace action, but editing a file like this undermines the authenticity of the record, and completely goes against the grain for records managers and archivists.
The second option we have for .msg files is to normalise them using Xena’s Binary Normalisation method instead. This creates something that can’t be read at all in the Xena Viewer, but when exported, this binary object converts back into an exact original copy of the email in .msg format, with no truncation or other rendering problems. Attachments are intact too. It creates the same technical metadata, as expected.
We’ve also tried the PST method, which works using the readpst plugin that comes with Xena. This is pretty nifty, and speedy too. The normalisation process produces a Xena digital object for every single email in the PST, plus a PST directory. When exported, we again get the XML rendition and stylesheet. So far, a few PROs and CONs with the PST method:
- PRO1 – no truncation of the message.
- PRO2 – lots of useable metadata.
- CON1 – file names of the XML renditions are not meaningful. Each email becomes a numbered item in an mbox folder.
- CON2 – attachments simply don’t work in this process.
- CON3 – PST files are not really managed records. In fact Kit Good’s initial reaction to this was “If I could ban .pst files I would as I think they are a bit of a nightmare for records management (and data management generally). I could see the value of this in the case of archiving an email account of a senior employee (such as the VC) or eminent academic member of staff but I think it would be impractical for staff, especially as .pst files are often used as a ‘dump’ when inboxes get full.”
Xena objects
In this post, I’m just going to look at the normalised Xena object, which is an Archival Information Package (AIP). It’s encoded in a file format whose extension is .xena and which is identified by Windows as a ‘Xena preserved digital object’ type.
We’ve created these for normalisations of each of our principal records types – documents, spreadsheets, PowerPoint, PDFs, images and emails.
When we click on one of these, it launches the Xena Viewer application. This gives an initial view of the AIP in the default NAA Package View, which is very useful as it gives a quick visual indication of how well the conversion/normalisation has worked. (It seems particularly strong on rendering documents and common image formats.)
The AIP is actually quite a complex object, though. It is in fact an XML wrapper which contains the transformed object and the metadata about it. This is in line with the preservation model proposed by the Library of Congress’s Metadata Encoding and Transmission Standard (METS), which suggests that building an XML profile like this is the best way to make a preserved object self-describing over time.
The main components of a Xena AIP are (A) technical metadata about the wrapper and the conversion process, some of which is expressed as Dublin Core-compliant metadata, and some of which conforms to an XML namespace defined by the NAA; (B) a string of code that is the transformed file; and (C) a checksum.
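A quick way to see these components for yourself is to walk the XML tree of a .xena file and print the element names with their namespaces. The sketch below assumes nothing about the wrapper beyond it being well-formed XML; the file path is hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def outline(xena_file, max_depth=3):
    """Print the element structure of a .xena wrapper down to a given depth."""
    def walk(element, depth):
        if depth > max_depth:
            return
        print("  " * depth + element.tag)   # tags appear as '{namespace}localname'
        for child in element:
            walk(child, depth + 1)

    walk(ET.parse(xena_file).getroot(), 0)


if __name__ == "__main__":
    outline(Path("output/xena/example.xena"))   # hypothetical path
```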
The Xena Viewer offers three ways of seeing the package:
1) The default NAA Package view. For documents, this shows the representation of the transformed object in a large window. Underneath that window is a smaller scrolling window which displays the transformation metadata. Underneath this is an even smaller window displaying the package signature checksum.
2) The Raw XML view. In this view we can see the entire package as a stream of XML code. This makes clear the use of XML namespaces and Dublin Core elements for the metadata.
3) The XML Tree View. This makes clear the way the AIP is structured as a package of content and metadata.
The default view changes slightly depending on object type:
- For documents, the viewer window shows the OO representation in a manner that keeps some of the basic formatting intact. To be precise, it’s a MIME-compliant representation of the OO document.
- For images, the viewer shows a Base64 rendering of the image file.
- For PDF documents, the viewer integrates a set of Adobe Reader-like functions – search, zoom, properties, page navigation etc. This seems to be provided by the JPedal library GUI.
- For spreadsheets and PowerPoint files, the viewer doesn’t show anything in the big window (but we’ll get to this when we follow the “Show in OpenOffice” option, which will be the subject of another post).
- For emails, the Xena Viewer has a unique view that not only displays the message content, but also the email header, and any attachments, both in separate windows. We’ll look at this again in a future post about how Xena works with emails.
As an Archival Information Package in OAIS terms, a Xena object is clearly compliant and suitable for preservation purposes. The only problem for our project is getting more of our metadata into the AIP – in other words, integrating more metadata that records our digital preservation actions. Ideally, we would like to be able to perform a string of actions (DROID identification of an object, virus checking, checksum etc.) and integrate the accumulated metadata into a single AIP. However, this is not within the scope of the project.
Testing the DPSP
I did a day’s testing of the Digital Preservation Software Platform (DPSP) in December. The process has been enough to persuade me it isn’t quite right for what we intend with our project, but that’s not intended as a criticism of the system, which has many strengths.
The heart of DPSP is the normalisation of office documents to their nearest Open Office equivalent, using Xena. Around this, it builds a workflow that enables automated generation of checksums, quarantining of ingested records, virus checking, and deposit of converted records into an archival repository. It also generates a “manifest” file (not far removed from a transfer list), and logs all of the steps in its database.
The workflow is one that clearly suits The National Archives of Australia (NAA) and matches their practice, which involves thorough checking and QA to move objects in and out of quarantine, in and out of a normalisation stage, and finally into the repository. All of these steps require the generation of unique IDs and folder destinations which probably match an accessioning or citation system at the NAA; there’s no workaround for this, and I simply had to invent dummy IDs and folders for my purposes. The steps also oblige a user to log out of the database and log back in to perform a different activity; this is required three times.

This process is undoubtedly correct for a national archives and I would expect nothing less. It just feels a bit too thorough for what we’re trying to demonstrate with the current project: our preservation policy is not yet this advanced, and there aren’t enough different staff members to cover all the functions. To put it another way, we’re trying to find a simple system that would enable one person (the records manager) to perform normalisations, and DPSP could be seen as “overkill”. I seem to recall that PANDORA, the NLA’s web-archiving system, was similarly predicated on a series of workflows where tasks were divided up among several members of staff, with extensive curation and QA.
My second concern is that multiple sets of AIPs are generated by DPSP. Presumably the intention is that the verified AIPs which end up in the repository will be the archive copies, and anything generated in the quarantine and transformation stages can eventually be discarded. However, this is not made explicit in the workflow, nor is the removal of such copies described in the manual.
My third problem is an area which I’ll have to revisit, because I must have missed something. When running the Xena transformations, DPSP creates two folders – one for “binary” and one for “normalised” objects. The distinction here is not clear to me yet. I’m also worried because out of 47 objects transformed, I ended up with only 31 objects in the “normalised” folder.
The gain in using DPSP is that we get checksums and virus checks built into the workflow, along with complete audit trails; so far, with our method of “manual” transformations with XENA, we have none of the above, although we do get checksums if we use DROID. But I found the DPSP workflow a little clunky and time-consuming, and somewhat counter-intuitive in navigating through the stages.