| *Title* \\ | IS3 Large media files are difficult to characterise without mass processing, and preservation risks in uncharacterised files cannot be identified |
| *Description* | At SB, data from broadcasters contains huge media files such as MPEG2 transport streams (MPEG2-TS). An end user agreement allows this data to be streamed, but not distributed as copies of the archived content. SB captures broadcast television as complex MPEG2-TS. The video content is accompanied by metadata, typically used to support the production of TV guides. SB preserves the MPEG2-TS files as the preservation masters. Chunks of this data that relate to specific programmes are extracted, migrated and served to users as streaming Flash video. The master MPEG2-TS files are so large that characterisation is a significant challenge. \\
The difficulty lies in extracting metadata from these huge media files at large scale. Deep characterisation, in this context, means that for container formats the contained streams (typically MPEG-2 or MPEG-4 (H.264) video and AAC audio) are also identified and characterised. \\
It is difficult to apply typical validation tools to such large files. A detailed characterisation of the MPEG2-TS is needed in order to identify technical dependencies for extracting or rendering the content embedded in the MPEG2-TS. This would enable preservation risks related to current access services to be monitored, and action to be taken as necessary to ensure continued access and preservation. \\
| *Scalability Challenge* \\ | Extremely large files. Checksumming the collection currently takes around 3 months on existing hardware. |
| *Issue champion* | [Blekinge, Asger Askov|https://portal.ait.ac.at/sites/Scape/_layouts/userdisp.aspx?ID=9] (SB) |
| *Other interested parties* \\ | \\ |
| *Possible approaches* | * ALL
** A deep characterisation service is required for MPEG2-TS. Analysis of the characterisation results would facilitate risk identification.
** It is not yet clear how this scenario fits in. The TB work package is meant to design test cases for the work of the platform team (WP PT). The platform team should find an appropriate test scenario for whatever metadata extractor is developed or found that can work on large files.
** Checksumming can be done on chunks of a predefined size, and can therefore be map/reduced: the final checksum is a checksum of the partial checksums. The checksum algorithm to be used needs to be specified. The problem is not clearly stated: does JHOVE crash, and what is the maximum file size JHOVE can handle?
** Watch may contribute to the solution with triggers such as:
*** Monitor characterisation tools
*** Monitor new versions of rendering software
*** Monitor rendering software features or supported characteristics |
| *Context* | |
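As a minimal illustration of where deep characterisation of an MPEG2-TS would have to start (not a substitute for a full characterisation tool, and the function names below are our own), the sketch parses the fixed 188-byte transport stream packet layer and counts packets per PID. Identifying which PIDs carry which elementary streams is the first step towards characterising the contained video and audio.

```python
TS_PACKET_SIZE = 188  # fixed packet size defined by the MPEG-2 systems standard
SYNC_BYTE = 0x47      # every TS packet starts with this sync byte

def iter_pids(data):
    """Yield the 13-bit PID of each 188-byte transport stream packet."""
    for off in range(0, len(data) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        packet = data[off:off + TS_PACKET_SIZE]
        if packet[0] != SYNC_BYTE:
            raise ValueError("lost sync at offset %d" % off)
        # PID is the low 5 bits of byte 1 and all 8 bits of byte 2
        yield ((packet[1] & 0x1F) << 8) | packet[2]

def pid_histogram(data):
    """Count packets per PID -- a first, shallow characterisation step."""
    counts = {}
    for pid in iter_pids(data):
        counts[pid] = counts.get(pid, 0) + 1
    return counts
```

Because the packet layer is strictly fixed-size, this kind of scan can also be run on file chunks in parallel, which matters at the scale described above.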
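The map/reduce checksum approach proposed above can be sketched as follows. This is a sketch only: MD5 and the 64 MiB chunk size are assumptions for illustration, and both would need to be agreed and recorded, since the final digest depends on them.

```python
import hashlib

CHUNK_SIZE = 64 * 1024 * 1024  # assumed chunk size; must be fixed and recorded

def chunk_checksums(path, chunk_size=CHUNK_SIZE):
    """Map step: yield the hex digest of each fixed-size chunk of the file."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield hashlib.md5(chunk).hexdigest()

def final_checksum(partial_digests):
    """Reduce step: checksum over the concatenated partial checksums."""
    h = hashlib.md5()
    for digest in partial_digests:
        h.update(digest.encode("ascii"))
    return h.hexdigest()
```

The map step is trivially parallelisable across chunks (and across files), which is the point of the proposal: the 3-month sequential checksumming run becomes bounded by aggregate I/O rather than by a single reader.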