IS14 Diverse preservation risks in large archives with millions of objects
Detailed description Over the years in which we ingested millions of objects, our knowledge of the risks those objects carry has grown. However, before we can decide whether a risk requires immediate action, we need an idea of its severity. Re-ingesting millions of objects would be very costly, so a more efficient solution would save a lot of time and money.
Example issues:
- Some PDF files seem to be password-protected, which hinders access to them and also possible preservation actions. It would help to get a list of the affected “archive identifiers” in order to contact the supplier of these objects or to start a special action.
- Some objects have related zip files that contain unknown file formats. It would help if we could get an overview of the file extensions in the zip file, just to get an idea of the risk. When necessary, a preservation action could be planned.
- Objects are only superficially identified at ingest; a more thorough identification with a better tool would be preferable.
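
The zip file example above could be approached roughly as follows. This is a minimal sketch in Python, not an existing KB or SCAPE tool: the function names `zip_extension_overview` and `build_demo_zip` are hypothetical, and the sketch assumes the zip file is readable with the standard `zipfile` module. It reads only the zip directory, so the objects themselves are not changed or extracted.

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def zip_extension_overview(zip_source):
    """Count the file extensions of the members of a zip file.

    zip_source may be a path or a binary file-like object. Only the
    zip directory is read; no member is extracted or modified.
    """
    counts = Counter()
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no format information
            suffix = PurePosixPath(info.filename).suffix.lower()
            counts[suffix or "(none)"] += 1
    return counts


def build_demo_zip():
    """Build a small in-memory zip file (hypothetical sample data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("article.pdf", b"dummy")
        archive.writestr("supplement/figure.PDF", b"dummy")
        archive.writestr("supplement/data.xyz", b"dummy")
        archive.writestr("README", b"dummy")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for extension, count in sorted(zip_extension_overview(build_demo_zip()).items()):
        print(extension, count)
```

An extension is only a weak hint of the actual file format, so an overview like this gives a first impression of the risk rather than a reliable identification.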

The idea is to gain insight into certain risks (intelligent reporting), so that the organization can take measures if the reports show that a risk is high. The intention is not to change objects in any way, only to determine some of their characteristics.

Finally, there is a relationship with the institutional policies (WP.PW.2).
Scalability Challenge
Currently we hold millions of objects about which we would like more information than we captured at ingest, and which cannot be re-ingested merely to obtain this kind of information.
Issue champion Barbara Sierman (KB-NL) digital preservation manager
Other interested parties
Any other parties who are also interested in applying Issue Solutions to their Datasets. Identify the party with a link to their contact page on the SCAPE Sharepoint site, as well as identifying their institution in brackets. Eg: Schlarb Sven (ONB)
Possible Solution approaches - make existing tools suitable for running in the background of an archive and producing a consolidated list of results, by extracting information from the objects or the metadata
- create a new tool that runs in the background of an archive and can perform a set of specified actions, such as counting file extensions or reading a specific metadata field in an object
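
The second approach, a tool that runs a set of specified checks and produces a consolidated list, might look like the sketch below. All names (`pdf_looks_encrypted`, `scan_objects`, `write_report`) are hypothetical, and how objects and their archive identifiers are read from the repository is left open: the sketch simply assumes an iterable of `(archive_id, bytes)` pairs. The PDF check is a rough heuristic that relies on encrypted PDFs referencing an `/Encrypt` dictionary in their trailer. It does not distinguish a password needed to open the file from one that only restricts printing or editing, and it can give a false positive if the byte string occurs elsewhere in the file.

```python
import csv
import io


def pdf_looks_encrypted(data: bytes) -> bool:
    """Heuristic check: does this look like a PDF with an /Encrypt entry?"""
    return b"%PDF-" in data[:1024] and b"/Encrypt" in data


def scan_objects(objects, checks):
    """Yield one (archive_id, check_name, result) row per object and check.

    objects: iterable of (archive_id, bytes) pairs (assumed interface).
    checks:  dict mapping a check name to a function taking bytes.
    """
    for archive_id, data in objects:
        for name, check in checks.items():
            yield archive_id, name, check(data)


def write_report(rows, out):
    """Write the consolidated list as CSV to a text file-like object."""
    writer = csv.writer(out)
    writer.writerow(["archive_id", "check", "result"])
    writer.writerows(rows)


if __name__ == "__main__":
    # Hypothetical sample data standing in for repository content.
    sample = [
        ("obj-1", b"%PDF-1.4\ntrailer << /Root 1 0 R /Encrypt 5 0 R >>"),
        ("obj-2", b"%PDF-1.4\ntrailer << /Root 1 0 R >>"),
    ]
    report = io.StringIO()
    write_report(scan_objects(sample, {"pdf_encrypted": pdf_looks_encrypted}), report)
    print(report.getvalue())
```

Keeping the checks in a dictionary means further checks, such as reading a specific metadata field, could be added without changing the scanning or reporting code.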
Context Details of the institutional context to the Issue. (May be expanded at a later date)
Lessons Learned Notes on Lessons Learned from tackling this Issue that might be useful to inform the development of Future Additional Best Practices, Task 8 (SCAPE TU.WP.1 Dissemination and Promotion of Best Practices)
Training Needs Is there a need for providing training for the Solution(s) associated with this Issue? Notes added here will provide guidance to the SCAPE TU.WP.3 Sustainability WP.
Datasets KB-IR content
Solutions Reference to the appropriate Solution page(s), by hyperlink


Objectives Scalability, reliability, automation
Success criteria Ideal situation: a program running in the background of our repository that automatically detects criteria specified beforehand, for example whether PDFs require a password, or what is in the zip file of additional material in an object, and that reports these findings in a clear and understandable way.
Automatic measures What automated measures would you like the solution to give to evaluate the solution for this specific issue? which measures are important?
The action should not use too much CPU capacity, as it needs to run in the background and should not conflict with production activities; this should be adjustable. Under these constraints, scanning 20 million objects might take a month.
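
To make the figures above concrete: scanning 20 million objects within a month of 30 days (an assumption) requires a sustained rate of roughly 7.7 objects per second. The sketch below computes that rate and shows one simple way to make the load adjustable, by capping the number of objects processed per second. The names `required_rate` and `throttled` are hypothetical, and a rate cap is only an indirect proxy for CPU usage; a real tool might also lower its process priority.

```python
import time


def required_rate(total_objects: int, days: float) -> float:
    """Objects per second needed to scan total_objects within the given days."""
    return total_objects / (days * 24 * 60 * 60)


def throttled(items, max_per_second: float, sleep=time.sleep, clock=time.monotonic):
    """Yield items no faster than max_per_second.

    sleep and clock are parameters only so that the pacing can be
    tested without really waiting.
    """
    interval = 1.0 / max_per_second
    next_slot = clock()
    for item in items:
        now = clock()
        if now < next_slot:
            sleep(next_slot - now)  # wait until the next permitted slot
        next_slot = max(now, next_slot) + interval
        yield item


if __name__ == "__main__":
    rate = required_rate(20_000_000, 30)
    print(f"required rate: {rate:.2f} objects per second")
    for object_id in throttled(["obj-1", "obj-2", "obj-3"], max_per_second=rate):
        print("scanning", object_id)
```

Lowering `max_per_second` reduces the load on the production system at the cost of a longer total scan time.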
Manual assessment Apart from automated measures that you would like to get do you foresee any necessary manual assessment to evaluate the solution of this issue?
Yes, it should be possible for a curator who is not a developer to run this tool on a sample set, in order to verify specific material beforehand.
Actual evaluations links to actual evaluations of this Issue/Scenario
