Digital Preservation and Data Curation Requirements and Solutions

About these pages

The pages referenced below form a network of Datasets, preservation and curation Issues with those Datasets, and Solutions to those Issues. As such, these pages capture information and requirements about concrete digital preservation and curation challenges that are present in specific datasets and collections. The experiences of solving these Issues are written up on Solution pages. These in turn link to pages in the OPF Tool Registry, and to actual code that can be downloaded and re-used.

The purpose of these pages is to share experiences in solving preservation and curation problems, so we can learn from each other, and to articulate practitioners' needs and requirements to those who are in a position to produce practical solutions to their problems.

Support

The work collated on this page is supported by: Open Preservation Foundation, Jisc, European Commission, Digital Preservation Coalition, SPRUCE Project, AQuA Project, SCAPE Project, and you!

Practitioners need better characterisation tools

Analysis of the Datasets, Issues and Solutions collated on this page indicated a broad cross section of preservation requirements, but an overriding need for more effective characterisation. Practitioners need to understand more about their data and its condition, typically for quality assurance, for appraisal and assessment, and for identifying preservation risks. This analysis and details of these conclusions are described in this poster, published at the 8th International Digital Curation Conference, Amsterdam, January 2013.

Get involved

Anyone can contribute to these pages. All you need to do is register for an OPF account (it's quick, free, and anyone can do it), and then start adding comments, adding value to existing pages, or contributing new ones. Please help us make this a valuable resource for all!

Datasets


These are the Datasets or collections that relate to specific preservation Issues which in turn (may) have Solutions developed for them. The Datasets are categorised by their media type. Click this link to create a new Dataset, then edit the italicised text.


Audio datasets

Label: audio

Page: M3P audio CD Collection
Page: Ri Archive collections - audio, video and image examples
Page: User-generated audio field recordings
Page: Audio Collection (York)
Page: LAVC audio
Page: Endangered Archives Programme (EAP)

Disk image datasets

Label: disk_image

Page: M3P audio CD Collection
Page: Environmental Artists Datasets
Page: OST archive with attachments
Page: Wang Laboratories, Inc. Records 5.25" disk images
Page: Lynn Igoe Floppy Discs
Page: When the Weather is Uggianaqtuq
Page: Spreadsheets & word processing documents from the Rhoda K. Channing record group
Page: FSU Science Education Curriculum Collection of multimedia programs
Page: Amiga OFS file system images
Page: Computer game disc images
Page: Lion King floppy disc Dataset
Page: MS-DOS 3.30 Floppy Installation Dataset
Page: Disk Images
Page: Realistic Disk Image Collection for Research and Education

Document datasets

Label: document

Page: Ida Roper Herbarium archive
Page: PDF files from the Archaeology Data Service's grey literature library collection
Page: Northumberland Estates - Current Electronic Records
Page: Scanned photograph album of early 20th Century images of Goldsmiths University
Page: Middlesex University eprints repository full text documents
Page: Spreadsheets & word processing documents from the Rhoda K. Channing record group
Page: Database containing a unique list of Danish words
Page: KB Open Access Journals PDFs
Page: Deposited personal collection
Page: Big Dance Archive
Page: History Workshop Journal - Digital Archive Deposit
Page: Open Access PDFs
Page: Externally Generated Content
Page: MS Word 97-2003 Documents (NANETH)
Page: Seven Stories author & illustrator files
Page: McLean Museum
Page: Sgrin Archive
Page: eTheses
Page: ADS Grey Literature Library

Email datasets

Label: email

Page: Environmental Artists Datasets
Page: Program on Public Life administrative records and director email
Page: Email archive in OST format (LeFurgy)
Page: Email Mailbox Collections
Page: PRONI Digital Preservation Project
Page: Web based emails

Geodatasets

Label: geodatasets

Page: Archaeology Data Service archive

Image datasets

Label: image

Page: Ida Roper Herbarium archive
Page: Collection of audio-visuals and digitised images
Page: Dorset History Centre collection of digitised images
Page: Scanned photograph album of early 20th Century images of Goldsmiths University
Page: Ri Archive collections - audio, video and image examples
Page: Valid and well-formed TIFF's with scanline corruption dataset
Page: Big Dance Archive
Page: Vanley Burke Archive - sample for digital asset audit
Page: University Photographs with embeded metadata
Page: Camera raw file images
Page: 10 IDP samples from BL
Page: Wellcome Library digitisation
Page: User generated content (images)
Page: Mass digitisation of images (York)
Page: JISC1 19th Century Digitised Newspapers (BL)
Page: Historic photographic collection
Page: East London Theatre Archive
Page: Digitised Books (ONB)
Page: Digitised Books (ONB, Google Books)
Page: Brightsolid digitisation of British Library newspapers
Page: BOPCRIS
Page: 19th Century Books (BL)
Page: India Papers Collection
Page: Seven Stories author & illustrator files
Page: National Fairground Archive
Page: McLean Museum
Page: Malformed TIFF images
Page: Leeds image duplicates and versions
Page: BL 19th Century digitised newspaper collection
Page: Endangered Archives Programme (EAP)

Mixed/Misc datasets

Label: mixed_misc

Page: Blackwater Estuary Fish Traps Monitoring Survey
Page: Northumberland Estates - Current Electronic Records
Page: London 2012 Partnership governance and communication records
Page: Laser Scanning data of Gabo sculptures
Page: 3D Modelling Data for Terracotta Roundels
Page: Ri Archive collections - audio, video and image examples
Page: Exact For DOS Bookkeeping Data Dataset
Page: UCLan Corporate Records on Sharepoint
Page: Outputs from born-digital ingest workflow

Research datasets

Label: researchdata

Page: NeXus Data Collection ISIS - STFC
Page: Nexus data files from instruments
Page: STFC Scientific Datasets
Page: Catalogue Data for the Scientific datasets
Page: Processed scientific datasets
Page: UK Web Domain Dataset - Format Profile
Page: Geospatial data
Page: Electoral Roll Data
Page: British National Bibliography
Page: British Library - Research Datasets

Software datasets

Label: software

Page: Computer games - KB private donations
Page: National Videogame Archive Dataset
Page: Obsolete multimedia software
Page: Script and Programming Language File Identifications

Web datasets

Label: web

Page: Ida Roper Herbarium archive
Page: Internet Memory Web Archive
Page: Database containing a unique list of Danish words
Page: Malta Music Memory Project (M3P)
Page: French Web Archives

Video datasets

Label: video

Page: Collection of audio-visuals and digitised images
Page: When the Weather is Uggianaqtuq
Page: National Motor Museum Film and Video Collection
Page: Big Dance Archive
Page: McLean Museum
Page: Endangered Archives Programme (EAP)
Page: Gormley Flash (BL)

Other / untagged

Page: Imperial College Exploration Board Adventure 2001 Comprising Overland Pakistan and Biafo Climbing Nick Adlam Alain Hosley James Smyth Tim Harris Nick Saunders
Page: PDF Creator Validator
Page: Danish newspaper - Morgenavisen Jyllandsposten
Page: UoN MSS - Tiffs and corresponding Jpegs
Page: Lovebytes Festival Media Archive
Page: ttt test

Issues


These are the preservation or other business-driven Issues that relate to a specific Dataset and may have one or more specific Solutions developed to solve them. The aim of an Issue page is to provide a detailed description of the preservation challenge and the requirements of the Issue Owner, which will help to inform development of a Solution that solves the Issue.
Click this link to create a new Issue, then edit the italicised text.


Unsolved issues

Issues that do not have linked solutions. Why not suggest or contribute a solution?

Label: unsolved_issue

Page: Jhove reports error for non-standard violating criterion (imbalanced page trees)
Page: Common validation error messages from PDF to PDFA conversion
Page: PDF Creator Validator (Issue)
Page: Sorting Error Messages by Pdf Creation Software
Page: PDF to PDF A Conversion
Page: Decoding JP2 with OpenJPEG goes wrong in case of embedded ICC profiles
Page: Data Extraction from real world Android Phone Images through BW-FLA Emulation as a service
Page: Classification of files within a disk image
Page: Extraction of keywords (and images) from large collections of text based files
Page: Checking significant properties of images have been retained after migration
Page: Correct File Formats
Page: Checking significant properties of documents have been retained after migration
Page: EAP Issue 2 TIFF images that will not to open in Photoshop or Adobe Bridge
Page: EAP Issue 6 Identify Missing or Out of sequence files
Page: EAP Issue 4 Detecting Visual Errors

Appraisal and assessment issues

Issues related to appraising or assessing digital content as the first step in deciding how to proceed with preservation activities.

Label: appraisal_assessment

Page: Content profiling
Page: Appraisal and preservation of 3D modelling data
Page: Image content identification and categorisation
Page: Audit and identification of current electronic records
Page: Parsing PST and OST email files for textual mining and searching
Page: Apprasing OST file for restricted data
Page: Analyzing a disk image of a 12-year old laptop
Page: Digital Preservation Planning
Page: Sorting, appraising and metadata creation for deposited personal collections
Page: Produce a report summarising collection metadata and content
Page: Identifying content and Sorting
Page: Automatically extracting metadata for Grey Literature reports
Page: Extraction of keywords (and images) from large collections of text based files
Page: Identification of file format and last modified or created dates of files within a disk image
Page: Unknown born-digital file history
Page: Identifying the content of MS Office documents

Bit rot issues

Issues related to Datasets that exhibit bit rot (files damaged by imperfect storage, failed write operations or software/processing errors) and require a Solution to identify, and if possible repair, problematic files.

Label: bit_rot

Page: Valid and well-formed TIFF's with scanline corruption
Page: Truncated JPEG2000
Page: Shifted Crop Corruption
Page: Corrupted JPEG and JPEG2000 files
Page: Black areas or pixels in TIFF files
Page: EAP Issue 1 Broken TIFF images
Page: EAP Issue 4 Detecting Visual Errors
Page: Player stops part way through some of the performances
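One narrow form of damage listed above, truncation, can sometimes be spotted without decoding the image at all. The following is a minimal sketch, not the method used by any Solution on this page: it assumes that a complete JPEG begins with the start-of-image marker (FF D8) and ends with the end-of-image marker (FF D9), and the function names are hypothetical. A result of "ok" only means both markers are present; it says nothing about corruption inside the image data.

```python
from pathlib import Path

SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def jpeg_status(data: bytes) -> str:
    """Classify a byte string as 'not_jpeg', 'truncated' or 'ok'."""
    if not data.startswith(SOI):
        return "not_jpeg"
    # Tolerate NUL padding that some writers append after the EOI marker.
    if data.rstrip(b"\x00").endswith(EOI):
        return "ok"
    return "truncated"


def scan_for_truncated_jpegs(root: Path) -> list:
    """Return paths under root whose JPEG data lacks a closing EOI marker."""
    suspects = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in {".jpg", ".jpeg"}:
            if jpeg_status(path.read_bytes()) == "truncated":
                suspects.append(path)
    return suspects
```

A check like this is cheap enough to run over a whole collection as a first pass, with suspect files then passed to a full decoder or validator.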

Conformance issues

Issues where Dataset content does not match a required profile, or needs to be checked or validated against a particular profile. These profiles are typically determined by an organisation's collection or preservation policy.

Label: conformance

Page: IS2 Do acquired files conform to an agreed technical profile, are they valid and are they complete?
Page: PDFA Validation tools give different results
Page: IS48 Validate archival files against an institutional content policy regarding formats
Page: Document content and utility preservation
Page: IS31 Semantic checking of very large data files
Page: IS29 Characterisation and validation of very large data files
Page: Nexus Characterisation - STFC
Page: ePub Version 2.0 Validation
Page: Black areas or pixels in TIFF files
Page: Audit images against criteria
Page: Historic photographic collection - consistency across time or suppliers
Page: Unknown JPEG2000 characteristics presents risks to quality, preservation and access

Contextual issues

Issues related to the wider context of a particular Dataset.

Label: context

Page: Transferring metadata from JPEGs to TIFFs
Page: IS37 Preserving the verifiability and provenance of processed datasets
Page: E-mail Threads - relinking the conversation

Data capture issues

Issues related to the capture, harvesting or extraction of data in order to facilitate effective preservation and access.

Label: data_capture

Page: At risk and decaying audio data on CDs
Page: Identification, transfer and preservation of audio and video files
Page: Data Extraction from real world Android Phone Images through BW-FLA Emulation as a service
Page: Moving records from Sharepoint to Eprints for preservation
Page: Web based email "harvesting"
Page: Ensuring appropriate interface for data capture or deposit

Duplication issues

Duplicated files can arise from a number of causes. Identical duplicates are relatively easy to detect. Similar duplicates (e.g. one file processed from another, or the same item scanned on a different device) can require much more complicated Solutions.

Label: duplication

Page: Duplicate detection
Page: Identifying Aggregations of Duplicates in a Dataset
Page: Checksumming
Page: Deduplication
Page: De-duplication of multiple scanned images of same object
Page: Finding duplicate images
Page: Identifying missed or duplicated pages
Page: Identification of same image at different levels of rotation
Page: Duplicate images within a collection or job
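The "relatively easy" case described at the top of this section, byte-for-byte identical duplicates, can be sketched by grouping files on a cryptographic digest. This is an illustrative example only: the function names are hypothetical, and SHA-256 is an assumed choice of hash. It will not find similar duplicates such as rescans or rotated images, which need perceptual comparison.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_identical_duplicates(root: Path) -> list:
    """Return groups (lists of paths) of files under root with identical content."""
    by_digest = defaultdict(list)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        by_digest[sha256_of(path)].append(path)
    # Only digests shared by two or more files indicate duplication.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Each returned group holds files with the same digest, so a curator can decide which copy to keep.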

Embedded objects issues

Objects embedded within other objects (such as OLE, OOXML, PDF, ZIP) can pose identification, appraisal or risk assessment challenges.

Label: embedded_objects

Page: Web based email "harvesting"
Page: Extracting embedded objects from docx files
Page: Preserving MS Outlook (.msg) E-mails with Attachments
Page: Embedded objects in PDFs
Page: Identifying the content of MS Office documents
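Because OOXML files are ZIP packages, a first look at what is embedded in one can be taken with a ZIP reader alone. The sketch below is an assumption-laden illustration rather than a description of any Solution listed here: it assumes the conventional layout written by Microsoft Office, where embedded objects sit in an "embeddings" folder and images in a "media" folder, and the function name is hypothetical. Packages produced by other software may place parts elsewhere.

```python
import zipfile


def list_embedded_parts(package_file) -> list:
    """List parts of an OOXML package that look like embedded objects or media.

    package_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(package_file) as package:
        return sorted(
            name
            for name in package.namelist()
            # Folder names are a convention, not a guarantee (see lead-in).
            if "/embeddings/" in name or "/media/" in name
        )
```

The returned part names can then be extracted and passed to a format identification tool for separate appraisal.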

External dependency issues

Issues relating to digital objects that have dependencies on other objects or content on the web.

Label: dependency

Page: Embedded links within the PDF

Integrity issues

Issues relating to ensuring the integrity or fixity of Datasets.

Label: integrity

Page: Straightforward calculation and verification of checksums
Page: IS30 Fixity capturing and checking of very large data files
Page: IS18 Verify bitstream integrity
Page: Checksumming
Page: Authenticity and Integrity of Digital Artwork
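Fixity checking of the kind these Issues ask for usually comes down to two steps: record a checksum for every file, then later recompute and compare. The following is a minimal sketch under stated assumptions: the manifest is an in-memory dictionary of relative path to SHA-256 digest, and all function names are hypothetical rather than taken from any tool referenced on this page.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute a SHA-256 digest by reading the file in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its digest."""
    return {
        path.relative_to(root).as_posix(): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(root: Path, manifest: dict) -> dict:
    """Compare the current state of root with a previously stored manifest."""
    current = build_manifest(root)
    return {
        "missing": sorted(set(manifest) - set(current)),
        "unexpected": sorted(set(current) - set(manifest)),
        "changed": sorted(
            name
            for name in set(manifest) & set(current)
            if manifest[name] != current[name]
        ),
    }
```

In practice the manifest would be written to a file and stored apart from the content it describes, so that damage to one does not silently affect the other.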

Obsolescence, preservation risk and business constraint issues

Issues that relate to the obsolescence of Datasets, preservation risk or business constraints placed on the way that Datasets are managed.

Label: obsolescence

Page: IS13 wmv to Video Format-X Migration Results in Out-of-sync Sound and Video
Page: IS3 Large media files are difficult to characterise without mass processing + We cannot identify preservation risks in uncharacterised files
Page: IS21 Migration of mp3 to wav
Page: IS22 Characterise and Validate very large mpeg-1 and mpeg-2 files
Page: Simple preservation actions with few IT resources
Page: Document content and utility preservation
Page: Content separation
Page: Appraisal and preservation of 3D modelling data
Page: Identification, transfer and preservation of audio and video files
Page: Apprasing OST file for restricted data
Page: IS11 PDF files may face preservation risks
Page: IS38 (W)ARC to HBASE migration
Page: IS1 Digitised TIFFs do not meet storage and access requirements
Page: IS42 Detecting Encryption and DRM in Digital Content
Page: IS40 Complexity of camera raw files
Page: IS39 Format obsolescence detection
Page: IS36 Examine the long term value of the preserved datasets
Page: IS35 Mantid website or software no longer applicable or available
Page: IS33 Enhanced migration of RAW to NeXus data
Page: IS34 ISIS instrument website no longer applicable or available
Page: IS32 Basic Migration of RAW to NeXus data
Page: IS28 Structural and visual comparisons for web page archiving
Page: IS25 Web Content Characterisation
Page: IS16 Normalisation of JPEG 2000 images
Page: IS15 Long-term access and decoding of JP2 images
Page: IS14 Diverse preservation risks in large archives with millions of objects
Page: IS12 ARC to WARC migration
Page: IS8 Diversity of office document formats in digital objects archive
Page: IS6 Determine render-ability of displayable web objects
Page: IS5 Digital objects archive contains unidentified content
Page: National Videogame Archive - issues with preserving games for public display
Page: Identifying content and Sorting
Page: Emulation and authenticity issues
Page: Embedded links within the PDF
Page: PDF Characterisation Tool
Page: EAP Issue 1 Broken TIFF images
Page: BOPCRIS issue - Mix of compressed and uncompressed TIFFS
Page: Validating JPEG2000 files on conversion from TIFF, identifying and tracing source of errors
Page: Normalization of digital audio files
Page: EAP Issue 2 TIFF images that will not to open in Photoshop or Adobe Bridge
Page: BOPCRIS issue - ABBYY "Unknown error"
Page: Identifying the content of MS Office documents
Page: Unknown JPEG2000 characteristics presents risks to quality, preservation and access

Planning and management issues

Issues that relate to the general planning and management of digital preservation.

Label: planning_management 

Page: Preservation plan
Page: Appraisal and preservation of 3D modelling data
Page: Identifying and preserving image files
Page: Preservation planning for images - access, preservation and annotation
Page: Storage and compression of AVI files from Film and Video Collection
Page: National Videogame Archive - issues with preserving games for public display

Quality issues

Issues relating to Datasets containing quality issues caused by digitisation, processing or format migration.

Label: qa

Page: IS7 Incompleteness and and inconsistency of web archive data
Page: IS2 Do acquired files conform to an agreed technical profile, are they valid and are they complete?
Page: IS13 wmv to Video Format-X Migration Results in Out-of-sync Sound and Video
Page: IS20 Detect audio files with very bad sound quality
Page: IS21 Migration of mp3 to wav
Page: IS44 Migrated image metadata must map or match to those of the original
Page: IS1 Digitised TIFFs do not meet storage and access requirements
Page: IS10 Potential bit rot in image files that were stored on CD
Page: IS27 Quality assurance in redownload workflows of digitised books
Page: IS43 Determining general 'document' properties
Page: IS28 Structural and visual comparisons for web page archiving
Page: IS19 Migrate whole archive to new archiving system
Page: IS16 Normalisation of JPEG 2000 images
Page: IS12 ARC to WARC migration
Page: IS9 Archive system migration preserving and enriching AIPs
Page: IS8 Diversity of office document formats in digital objects archive
Page: Check content of e-pub against digitized book
Page: Truncated JPEG2000
Page: Sound files, type and quality checking
Page: Shifted Crop Corruption
Page: PDF to PDF-A conversion
Page: Checking that significant properties are preserved after migration
Page: Black areas or pixels in TIFF files
Page: Checking significant properties of images have been retained after migration
Page: Checking significant properties of documents have been retained after migration
Page: Quality assurance of a migration from TIFF to JPEG2000
Page: Historic photographic collection - consistency across time or suppliers
Page: Identifying missed or duplicated pages
Page: Born-digital - migration success
Page: Validating JPEG2000 files on conversion from TIFF, identifying and tracing source of errors
Page: Using METS data to inform analysis
Page: Quality issues in digitised pages
Page: Use of OCR metadata
Page: BOPCRIS issue - ABBYY "Unknown error"
Page: EAP Issue 6 Identify Missing or Out of sequence files
Page: EAP Issue 4 Detecting Visual Errors
Page: Born-digital - metadata validation
Page: Born-digital - log file checks
Page: Audit audio batch against criteria
Page: Player stops part way through some of the performances
Page: Quality issues may be present in digitised pages

Retention/disposal issues

Issues relating to the retention, disposal and/or deletion of digital objects.

Label: retention

Page: Verify if data (a file) is not existing anymore

Rights issues

Issues related to rights or permissions that cause difficulties in managing or preserving Datasets.

Label: rights

Page: Permission Overlays

Structural relationship issues

Digital entities can be made up of a number of objects (e.g. masters, service copies, metadata). Structural relationships are important for understanding which objects are part of an entity and what they are for.

Label: structural_relationships

Page: Transferring metadata from JPEGs to TIFFs
Page: IS37 Preserving the verifiability and provenance of processed datasets
Page: E-mail Threads - relinking the conversation
Page: Disassociation of files and metadata
Page: Historic photographic collection - check for versions of an image
Page: Born-digital - metadata validation
Page: Inconsistencies between metadata and content

System obsolescence issues

Issues related to the obsolescence of software or other systems that manage Datasets.

Label: system_obsolescence

Page: IS19 Migrate whole archive to new archiving system
Page: IS9 Archive system migration preserving and enriching AIPs
Page: New Operating System

Unknown characteristics issues

Issues related to Datasets with unknown characteristics that are necessary for a preservation, management or other business need.

Label: unknown_characteristics

Page: IS24 Characterisation of large amounts of wav audio
Page: IS45 Audio and Video Recordings have unreliable broadcast time information
Page: At risk and decaying audio data on CDs
Page: Content profiling
Page: Image content identification and categorisation
Page: Parsing PST and OST email files for textual mining and searching
Page: Apprasing OST file for restricted data
Page: IS41 Analyse huge text files containing information about a web archive
Page: IS43 Determining general 'document' properties
Page: How to maintain a list of metadata mappings outside of the script
Page: Metadata extraction
Page: Sound files, type and quality checking
Page: Identifying web content
Page: Identifying content and Sorting
Page: Automatically extracting metadata for Grey Literature reports
Page: Extraction of keywords (and images) from large collections of text based files
Page: Identification of file format and last modified or created dates of files within a disk image
Page: Identifing the software application that a file was created in
Page: Unknown born-digital file history
Page: Unknown PDF characteristics
Page: OCR'ing mixed content (text and non-text)
Page: EAP Issue 3 Metadata Extraction from audio, video and image files
Page: Extraction of metadata from digital audio files
Page: Newspaper issue dates

Unknown file formats issues

Datasets containing unknown file formats tend to pose a preservation risk and make management of them difficult.

Label: unknown_file_formats

Page: Identifying and preserving image files
Page: Audit and identification of current electronic records
Page: IS26 Dealing with difficult identification cases
Page: IS17 Characterisation of text-based formats
Page: Ability to automatically identify script files
Page: Identification of file format and last modified or created dates of files within a disk image
Page: Identification of file formats with incorrect file extensions
Page: Identification and validation of esoteric audio file formats
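Several of the Issues above, such as files with incorrect extensions, come down to identifying a format from its content rather than its name. The sketch below illustrates the basic idea of matching leading "magic" bytes. It is a toy example with a small hand-picked signature table and hypothetical names; identification tools named on this page, such as DROID and Apache Tika, are far more thorough.

```python
from typing import Optional

# (leading bytes, format label); a tiny illustrative table, not a registry.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"PK\x03\x04", "ZIP container (also used by OOXML and EPUB)"),
]


def identify(header: bytes) -> Optional[str]:
    """Return a format label for the given leading bytes, or None if unknown."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return None


def identify_file(path) -> Optional[str]:
    """Read only the first few bytes of a file and identify them."""
    with open(path, "rb") as handle:
        return identify(handle.read(16))
```

Comparing the label returned here with the file's extension gives a simple way to flag candidates for closer inspection.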

Value and cost issues

Issues relating to the cost of Dataset management or the Value of the Dataset to its owners and users.

Label: value_cost

Page: IS36 Examine the long term value of the preserved datasets

Other / untagged

Issues that haven't been tagged with any of the labels listed in this column. This provides a useful mechanism for catching Issues that have not been tagged with sufficient detail, or identifying the need to add new labels to this page.

Page: Matching equivalent files of different formats
Page: IS49 Large scale ingest of a large book collection
Page: IS47 Identify Preservation Risks from audio+video characterisation information
Page: Exact For DOS Bookkeeping Data Issue
Page: Media migration and emulation of computer games
Page: MS-DOS 3.30 Floppy Installation Issue
Page: Legacy Environment Issues
Page: IS46 Book page image duplicate detection within one book

Solutions


These are Solutions that address specific Issues encountered in particular Datasets. Solutions are typically quite specific to a particular Issue and Dataset but many will have a wider application. For details of tools utilised in a Solution, either follow links from individual Solution pages or see the Tool Registry.
Click this link to create a new Solution, then edit the italicised text.


Appraisal and assessment solutions

Solutions for assessing or appraising datasets.

Label: appraisal_assessment

Page: Visual Analysis of Preflight Output
Page: Parsing PST OST file using TIKA
Page: Analysis of Lucene Index Word Frequency
Page: Characterising Externally Generated Content

Bit rot detection and repair solutions

Solutions for detecting and possibly repairing bit rot Issues in Datasets.

Label: bit_rot_detection

Page: Solving TIFF malformation using exiftool
Page: Corrupted JPEG and JPEG2000 files solution
Page: Malformed TIFF images solution
Page: Identify Files Affected by Truncated-Fuzzy JPEG2000
Page: Identify Shifted Crop Issue in JPEG2000
Page: EAP File Verification
Page: Diagnosing FLV problems using FLVmeta's flvdump

Characterisation solutions

Solutions for characterising content.

Label: characterisation

Page: Identifying differences between metadata in files and copying metadata between files
Page: freqy - word clouds for directories
Page: File Format Identification and Metadata Extraction using FITS
Page: Parsing PST OST file using TIKA
Page: SO18 Comparing two web page versions for web archiving
Page: Extracting and aggregating metadata with Apache Tika
Page: Maintain a list of metadata mappings outside of the script
Page: Solving TIFF malformation using exiftool
Page: Distinguishing Files with Descriptive Metadata
Page: Using Perl to write scripts for reporting on the content of the collection
Page: SO20 Extending JHOVE to characterise NeXus data format
Page: SO29 Extending JHOVE to characterise very large NeXus data file
Page: SO23 Pushing additional metadata into NeXus metadata fields
Page: Characterising Externally Generated Content
Page: tiff2RDF - visualising image collection consistency
Page: SO25 Rosetta v3.0 Implementation Integrated with DROID 6, JHOVE1, NLNZ tool and more...
Page: SO27 Analyse huge text files containing information about a web archive using Hadoop
Page: Mediainfo output viewer
Page: Open Planets Foundation - File Scanner
Page: PDF to PDF-A Conversion Pre-Processor
Page: PDF Characterisation Tool
Page: Detect, extract and analyse embedded objects in PDFs
Page: Apache POI Office Document Analyser
Page: Audio Auditing Script
Page: EAP File Verification
Page: Identify compressed TIFFs and convert them to uncompressed TIFFs
Page: AQUAdio - characterization of user-generated audio field recordings
Page: EAP Compare Metadata with Requirements

Data capture solutions

Solutions for capturing data from an external source, or imaging data from hand held media.

Label: data_capture

Page: Audio CD Preservation
Page: Backup Nintendo Wii Discs
Page: Backing up PS3 or XBox 360 Games
Page: Backing up Nintendo DS ROMs
Page: Extracting content from Facebook to Mediawiki
Page: Moving records from Sharepoint to Eprints for preservation solution
Page: Harvest webmail accounts

De-duplication solutions

Solutions for detecting and managing duplicated digital objects or datasets.

Label: de-duplication

Page: CSV listing of Aggregations of Duplicates in a Dataset
Page: ssdeep for duplicate image detection
Page: java image blocks comparison
Page: Perceptual Image Diff comparison
Page: Identifying rotated, duplicate images using pHash

Embedded object solutions

Solutions for managing and preserving embedded digital objects.

Label: embedded_objects

Page: Preserving MS Outlook (.msg) E-mails with Attachments - Solution
Page: Extracting embedded objects from Office OpenXML documents
Page: Detect, extract and analyse embedded objects in PDFs
Page: Apache POI Office Document Analyser

Emulation solutions

Solutions utilising emulation or virtualisation technologies.

Label: emulation

Page: Nintendo Wii Emulation
Page: Exact For DOS Bookkeeping Data Solution
Page: Virtual Box Legacy Windows Environment
Page: MS-DOS 3.30 Floppy Installation Solution
Page: Virtualisation using VirtualBox
Page: Creation of a virtual machine to run old or deprecated software

File format identification solutions

Solutions for identifying file formats.

Label: identification

Page: File Format Identification and Metadata Extraction using FITS
Page: Identifying the content of Email Mailboxes - Solution
Page: Distinguishing Files with Descriptive Metadata
Page: NeXus Data Collection ISIS - STFC - solution
Page: Tika Batch File Identification
Page: SO17 Web Archive Mime-Type detection workflow based on Droid and Apache Tika
Page: Determine the format of a digital object
Page: Validate and report filetypes per file
Page: Open Planets Foundation - File Scanner
Page: Server MIME Type Correction
Page: Use ohcount to detect source code text files
Page: EAP File Verification
Page: Identify compressed TIFFs and convert them to uncompressed TIFFs

Fixity solutions

Solutions for addressing integrity issues using approaches for generating and verifying fixity information such as manifests and checksums.

Label: fixity

Page: Using Perl to write scripts for reporting on the content of the collection
Page: Characterising Externally Generated Content
Page: Checking the Authenticity and Integrity of Digital Content

Migration solutions

Solutions for migrating data from one format to another.

Label: migration

Page: Converting PST & OST files to MBOX format
Page: FFMPEG as Video Transcoder
Page: SO22 Developing a Raw-to-NeXus migration tool
Page: SO28 A heuristic measure for detecting undesired influence of lossy JP2 compression on OCR in the absence of ground truth
Page: Convert embedded fonts to outlines
Page: Identify compressed TIFFs and convert them to uncompressed TIFFs

Miscellaneous solutions

Solutions for miscellaneous topics.

Label: miscellaneous

Page: DVD Migration and Video Splitting
Page: Image content identification and categorisation solution
Page: National Videogame Archive - Game Preservation & Public Access Solutions
Page: SO24 Use Preservation Network Model to record "deep" dependencies and to allow tracking over time

Quality assurance solutions

Solutions for assessing or identifying quality Issues in Datasets.

Label: quality_assurance

Page: AQDC - Document Compare
Page: SO26 Automated RAW to DNG migration+QA
Page: SO19 Recognize inaccurate graphical image files based on a pattern-set
Page: SO16 QA for estimation of affine transformation (image comparison tool based on SSIM algorithm)
Page: Validate and report filetypes per file
Page: Server MIME Type Correction
Page: Newspaper issue dates - solution
Page: jp2 header analysis
Page: java image blocks comparison
Page: Perceptual Image Diff comparison
Page: EAP File Verification
Page: Identifying rotated, duplicate images using pHash
Page: Identify compressed TIFFs and convert them to uncompressed TIFFs
Page: EAP Compare Metadata with Requirements
Page: Validating TIFF to JPEG2000 migration
Page: Compare OCR results of the same source material in different formats (TIFF, JP2)

Rights management solutions

Solutions for managing permissions and rights Issues.

Label: rights

Page: Convert embedded fonts to outlines
Page: Permissions Overlays

Structural relationship solutions

Solutions for preserving or checking the structural relationships between digital objects belonging to a particular entity.

Label: structural_relationships

Page: Identifying differences between metadata in files and copying metadata between files
Page: File management and matching of tif, htm and pdf files solution
Page: Check consistency between metadata and content
Page: Newspaper issue dates - solution

Validation solutions

Solutions for validating the conformance of digital objects to file format specifications or institutional profiles.

Label: validation

Page: Fixes for some common PDF to PDFA conversion validation errors
Page: PDFBox Preflight 2 - Uses and Abuses
Page: Windows checksumming and verification tools
Page: SO30 Automated assessment of JP2 against a technical profile
Page: SO20 Extending JHOVE to characterise NeXus data format
Page: SO21 Extending the NeXus validation toolkit to cope with very large data files
Page: PDF Characterisation Tool
Page: EAP File Verification
Page: Validating TIFF to JPEG2000 migration
Page: Diagnosing FLV problems using FLVmeta's flvdump

Other / untagged

Solutions that haven't been tagged with any of the labels listed in this column. This provides a useful mechanism for catching Solutions that have not been tagged with sufficient detail, or identifying the need to add new labels to this page.

Page: PDF to PDFA Conversion
Page: Visual Analysis of Preflight Output
Page: OSX GUI based checksum tool
Page: CSV listing of Aggregations of Duplicates in a Dataset
Page: Search Web Archive Data for Highlighted Text in Chrome
Page: SO37 Connector API Technical Compability Kit
Page: SO36 Perform scalable search for small sound chunks in large audio archive
Page: SO35 Use schematron as the content profile language to validate files by evaluating their characterisation information
Page: SO34 Use Manzanita Crosscheck to validate mpeg transport streams
Page: Full System Preservation of Apple iMac (PPC)
Page: SO31 Preservation Grade TIFF to JPEG2000 Migration
Page: SO32 Image Metadata Extractor
Page: SO33 Image Metadata Compare
Page: Analysis of Lucene Index Word Frequency
Page: ssdeep for duplicate image detection
Page: Corrupted JPEG and JPEG2000 files solution
Page: Preserving MS Outlook (.msg) E-mails with Attachments - Solution
Page: Malformed TIFF images solution
Page: SO14 Fuse mounting (w)arc files
Page: Identify Files Affected by Truncated-Fuzzy JPEG2000
Page: Identify Shifted Crop Issue in JPEG2000
Page: Extracting embedded objects from Office OpenXML documents
