Label: webarchive

Content with label webarchive in SCAPE
Related Labels: planning, hadoop, lsdr, representationinformation, characterisation, watch, identification, obsolescence, issue, qa, formatprofile, arc, researchdata, unknown_file_formats, unknown_characteristics, dataset, scenario

Page: Austrian National Library - Web Archive
Title \\ Austrian National Library Web Archive Description The Austrian National Library uses a representative dataset from its web archive: \\ event-based selective crawls: sites harvested frequently during an event, e.g. EU election 2009, Olympia ...
Other labels: arc
Page: IS12 ARC to WARC migration
Title \\ IS12 ARC to WARC migration \\ Detailed description Migration from ARC to WARC is desirable as the WARC format is better suited to the future of web archiving. Scalability Challenge \\ ARC and WARC are both container formats. At present SB has ...
Other labels: qa, issue, planning, watch, obsolescence
Page: IS14 Diverse preservation risks in large archives with millions of objects
Title \\ IS14 Diverse preservation risks in large archives with millions of objects Detailed description While we ingested millions of objects in the past, we expanded our knowledge about the risks of the objects. However, before we could make a decision ...
Other labels: characterisation, identification, issue, watch, obsolescence
Page: IS17 Characterisation of text-based formats
Title \\ IS17 Characterisation of text-based formats Detailed description Problem: it is increasingly common for scientific journal articles (which are usually in PDF format) to be accompanied by supplemental files. These are often research data, software source code, or scripts. In the majority of cases ...
Other labels: identification, lsdr, issue, unknown_file_formats
Page: IS25 Web Content Characterisation
Title \\ IS25 Web Content Characterisation Detailed description \\ The issue with web content is mainly the fact that web archive data is very heterogeneous. Depending on the policy of the institution, data contains text documents in all kinds of text encoding, html content ...
Other labels: characterisation, identification, issue, obsolescence
Page: IS26 Dealing with difficult identification cases
Title \\ Dealing with difficult identification cases \\ Detailed description Identification Requirements, Format Languages, Requirements and Difficult Cases. Mutants and wild types. Strains. See below for specific examples. Scalability Challenge \\ The solution must be able to identify and describe the large ...
Other labels: lsdr, identification, issue, unknown_file_formats
Page: IS41 Analyse huge text files containing information about a web archive
Title \\ IS41 Analyse huge text files containing information about a web archive \\ Detailed description Some web archives periodically produce information about their content. The result is sometimes stored as huge text files ...
Other labels: issue, hadoop, characterisation, unknown_characteristics
Page: IS5 Digital objects archive contains unidentified content
Title \\ Digital objects archive contains unidentified content Detailed description From an archiving point of view, if there is no detailed information about the exact content of an archive, no preservation planning or any preservation actions can be undertaken. For example, if old ...
Other labels: characterisation, identification, issue, watch, obsolescence
Page: IS6 Determine render-ability of displayable web objects
Title \\ Determine renderability of displayable web objects Detailed description Whether a digital object is renderable depends on standards, agreements, and understandings in interfaces and hardware, and there are strong interdependencies between these conditions. Because of these technical dependencies, the content of the web archive might not be renderable ...
Other labels: characterisation, issue, obsolescence
Page: IS7 Incompleteness and/or inconsistency of web archive data
Title \\ Incompleteness and/or inconsistency of web archive data \\ Detailed description Best practice for preserving websites is to crawl them with a web crawler such as Heritrix. However, crawling is highly susceptible to errors. Often, essential data is missed by the crawler ...
Other labels: characterisation, qa, identification, issue, watch