IS3 Large media files are difficult to characterise without mass processing + We cannot identify preservation risks in uncharacterised files
Description At SB, data from broadcasters includes very large media files such as MPEG2 transport streams (MPEG2-TS). An end-user agreement allows this data to be streamed but does not allow copies of the archived content to be distributed. SB captures broadcast television as complex MPEG2-TS. The video content is accompanied by metadata, typically used to support the production of TV guides. SB preserves the MPEG2-TS as the preservation masters. Chunks of this data that relate to specific programmes are extracted, migrated and served to users as streaming Flash video. The master MPEG2-TS files are so large that characterisation is a significant challenge.

The difficulty lies in extracting metadata from these huge media files at large scale. Deep characterisation, in this context, means that for container formats the contained streams (typically MPEG-2 or MPEG-4 (H.264) video and AAC audio) are also identified and characterised.

It is difficult to apply typical validation tools to such large files. A detailed characterisation of the MPEG2-TS is needed in order to identify technical dependencies for extracting from or rendering the embedded content in the MPEG2-TS. This would enable preservation risks related to current access services to be monitored and action taken as necessary to ensure continued access and preservation.
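
As a rough illustration of why large files need not be loaded whole to start characterising them, the sketch below counts packets per PID in a transport stream while reading it in bounded-size blocks. It relies only on the fixed 188-byte packet size and the 0x47 sync byte of MPEG-2 TS. The function name `count_pids` and the `packets_per_read` parameter are illustrative, and the sketch assumes the input starts on a packet boundary; it is not the characterisation service itself.

```python
from collections import Counter
from typing import BinaryIO

TS_PACKET_SIZE = 188  # fixed MPEG-2 TS packet length in bytes
SYNC_BYTE = 0x47      # first byte of every TS packet


def count_pids(stream: BinaryIO, packets_per_read: int = 4096) -> Counter:
    """Count packets per PID, reading the stream in bounded-size blocks.

    Assumes the stream starts on a packet boundary (hypothetical
    simplification); a trailing partial packet is ignored.
    """
    counts: Counter = Counter()
    block_size = TS_PACKET_SIZE * packets_per_read
    base = 0  # byte offset of the current block within the stream
    while True:
        block = stream.read(block_size)
        if not block:
            break
        for off in range(0, len(block) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
            if block[off] != SYNC_BYTE:
                raise ValueError(f"lost sync at byte offset {base + off}")
            # The PID is the low 5 bits of byte 1 followed by all of byte 2.
            pid = ((block[off + 1] & 0x1F) << 8) | block[off + 2]
            counts[pid] += 1
        base += len(block)
    return counts
```

Memory use stays constant regardless of file size, so the same loop could in principle run over independent byte ranges of one file in parallel, provided each range starts on a packet boundary.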

Scalability Challenge
Extremely large files. Checksumming the collection currently takes around 3 months on existing hardware.
Issue champion Blekinge, Asger Askov (SB)
Other interested parties

Possible approaches
  • ALL
    • A deep characterisation service is required for MPEG2-TS. Analysis of the characterisation results would facilitate risk identification.
  • EXL
    • We're not sure how this scenario fits in. The TB work package is meant to design test cases to test the work that the platform team (WP PT) does. The platform team should find an appropriate test scenario to test whatever MD extractor is developed or found which can work on large files.
    • Checksumming can be done on chunks of a predefined size and can therefore be map/reduced. The final checksum can be a checksum of the partial checksums. The checksum algorithm to be used needs to be specified. The problem is not clearly stated: does JHOVE crash? What is the maximum file size JHOVE can handle?
    • Watch may contribute to the solutions with the triggers:
      • Monitor characterization tools
      • Monitor new versions of new rendering software
      • Monitor rendering software features or supported characteristics
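
The chunked checksum idea in the EXL notes above can be sketched as follows. The map step hashes fixed-size chunks independently, and the reduce step hashes the chunk digests in file order. SHA-256, the 64 MiB chunk size, and the names `chunk_digest` and `combined_checksum` are assumptions made for illustration; the page leaves the algorithm unspecified. The result is not equal to a plain SHA-256 of the whole file, so the chunk size and algorithm would have to be recorded alongside the checksum.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 64 * 1024 * 1024  # assumed chunk size; must be fixed and recorded
READ_SIZE = 1024 * 1024        # read buffer used inside one chunk


def chunk_digest(path: str, offset: int, size: int) -> bytes:
    """Map step: SHA-256 digest of the chunk starting at `offset`."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        f.seek(offset)
        remaining = size
        while remaining > 0:
            buf = f.read(min(READ_SIZE, remaining))
            if not buf:  # end of file reached inside the last chunk
                break
            h.update(buf)
            remaining -= len(buf)
    return h.digest()


def combined_checksum(path: str, chunk_size: int = CHUNK_SIZE,
                      workers: int = 4) -> str:
    """Reduce step: SHA-256 over the chunk digests, in file order."""
    offsets = range(0, os.path.getsize(path), chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so digests stay ordered.
        digests = list(pool.map(
            lambda off: chunk_digest(path, off, chunk_size), offsets))
    return hashlib.sha256(b"".join(digests)).hexdigest()
```

A thread pool stands in here for the map/reduce platform; on a cluster, each chunk digest would be computed by a separate map task and the final hash by a single reducer.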
Lessons Learned Notes on Lessons Learned from tackling this Issue that might be useful to inform the development of Future Additional Best Practices, Task 8 (SCAPE TU.WP.1 Dissemination and Promotion of Best Practices)
Training Needs Is there a need for providing training for the Solution(s) associated with this Issue? Notes added here will provide guidance to the SCAPE TU.WP.3 Sustainability WP.
Datasets mpeg-2 transport stream with Danish TV broadcasts


Objectives scalability, coverage, preciseness, automation
Success criteria Being able to extract all the provided metadata 
  • The technical metadata, which is used by the player machines to decode the stream
  • The program metadata that is used to display program and channel information
  • The subtitles, which to some extent are a full-text dump of the program content.
  • TextTV information
    With this metadata extracted, search-machine integration should be very powerful. With the ability to extract this metadata, the transport stream could be used as a self-describing object.
Automatic measures Being able to process streams faster than their defined bitrate (i.e. not lose the race against time)
Manual assessment Which of the above-mentioned metadata sources we can extract
Actual evaluations links to actual evaluations of this Issue/Scenario
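
One possible way to express the automatic measure above as a number is a real-time factor: the playing time of the data processed divided by the wall-clock time spent processing it. A value above 1.0 means processing keeps ahead of the stream. The function names and the constant-bitrate assumption are illustrative only and are not taken from the page.

```python
def realtime_factor(bytes_processed: int, bitrate_bps: float,
                    elapsed_seconds: float) -> float:
    """Return stream playing time divided by processing time.

    Assumes a constant stream bitrate in bits per second (hypothetical
    simplification); values above 1.0 mean faster than real time.
    """
    if bitrate_bps <= 0 or elapsed_seconds <= 0:
        raise ValueError("bitrate and elapsed time must be positive")
    stream_seconds = bytes_processed * 8 / bitrate_bps
    return stream_seconds / elapsed_seconds


def keeps_up(bytes_processed: int, bitrate_bps: float,
             elapsed_seconds: float) -> bool:
    """True when processing is faster than the stream's own bitrate."""
    return realtime_factor(bytes_processed, bitrate_bps, elapsed_seconds) > 1.0
```

For example, processing 1,000,000 bytes of an 8 Mbit/s stream in half a second gives a factor of 2.0, since that data represents one second of broadcast.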