Produce a report summarising collection metadata and content

Current version by Thom Carter on Sep 20, 2012 13:29.


*Lessons Learned*
_Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice_
* Difficulties involved in comparing text data
* Relevant tools
** Apache Tika is a useful tool for extracting metadata and text content from files
** Perl is a powerful tool for creating reports on the contents of collections
* It is difficult to test large text documents for similarity with any degree of accuracy
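To illustrate why similarity testing is hard, here is a minimal sketch of one common approach: Jaccard similarity over word shingles. It is written in Python rather than the Perl used for the Solutions, and it is not the method used for this Issue. The function names {{shingles}} and {{jaccard_similarity}} and the shingle size {{k=3}} are assumptions chosen for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation ignored)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than k words: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_similarity(a, b, k=3):
    """Jaccard similarity of the shingle sets of two texts, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    print(jaccard_similarity("the quick brown fox jumps",
                             "the quick brown fox leaps"))
```

A single changed word alters up to {{k}} shingles, so small edits can lower the score sharply while reordered or reformatted text may still score high. The score is also sensitive to the choice of {{k}} and to how the text was extracted, which is part of why accuracy is hard to achieve on large documents.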

*Datasets*

*Solutions*
_Reference to the appropriate Solution page(s), by hyperlink._
[Extracting and aggregating metadata with Tika|http://wiki.opf-labs.org/display/SPR/Extracting+and+aggregating+metadata+with+Apache+Tika]
[Using Perl to write scripts to find duplicates and find keywords|http://wiki.opf-labs.org/display/SPR/Using+Perl+to+write+scripts+to+find+duplicates+and+find+keywords]