Produce a report summarising collection metadata and content

Compared with Current by Thom Carter on Sep 20, 2012 13:29.

*Lessons Learned*
_Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice_
* Difficulties involved in comparing text data
* Relevant tools
* Apache Tika is a useful tool for extracting metadata and text content from files
* Perl is a powerful tool for creating reports on the contents of collections
* It is difficult to test large text documents for similarity with any degree of accuracy
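To illustrate the last point, here is a minimal sketch of one common way to score the similarity of two text documents: Jaccard similarity over word shingles. It is not the method used for this Issue (the Solution pages describe Perl scripts). The function names {{shingles}} and {{jaccard}} and the shingle size {{k=3}} are assumptions chosen for the example.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word runs) in text."""
    # Lowercase and split on non-word characters so that case and
    # punctuation differences do not affect the comparison.
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Texts shorter than k words yield a single short shingle, or none.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of two texts: shared shingles / all shingles."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        # Two empty texts are treated as identical (an assumption).
        return 1.0
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps"
    doc_b = "the quick brown fox leaps"
    print(jaccard(doc_a, doc_b))
```

A single changed word in a five-word text halves the score here, and the result depends heavily on the shingle size and on how the text is tokenised. This sensitivity is part of why accuracy is hard to achieve on large documents.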


_Reference to the appropriate Solution page(s), by hyperlink._
[Extracting and aggregating metadata with Tika|]
[Using Perl to write scripts to find duplicates and find keywords|]