Produce a report summarising collection metadata and content

*Lessons Learned*
_Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice_
* Difficulties involved in comparing text data
* Relevant tools
* Apache Tika is a useful tool for extracting metadata and text content from files
* Perl is a powerful tool for creating reports on the contents of collections
* It is difficult to test large text documents for similarity with any degree of accuracy
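
To illustrate why similarity testing on text is hard, here is a minimal sketch of one common approach, Jaccard similarity over word shingles. It is not the method used in the Solution pages (those use Perl); the function names {{shingles}} and {{jaccard}} and the default shingle size {{k=3}} are assumptions made for this example only.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation ignored)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Too short to form a full shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of two texts: shared shingles / all distinct shingles."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # assumption: two empty texts count as identical
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    print(jaccard("the cat sat on the mat", "the cat sat on the mat"))
    print(jaccard("the cat sat on the mat", "the cat sat on a mat"))
```

Changing a single word in a six-word sentence drops the score from 1.0 to about 0.33, so any threshold for "similar" depends heavily on the shingle size and on how the text was extracted and normalised. Holding every shingle in memory also becomes costly for large documents.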


_Reference to the appropriate Solution page(s), by hyperlink._
[Extracting and aggregating metadata with Tika|]
[Using Perl to write scripts to find duplicates and find keywords|]