To obtain a data storage system that offers high availability and a storage layout suited to distributed computation, a distributed database is considered as the main data store. IMF carried out a broad survey of novel approaches to distributed storage and search systems and selected several candidates for hands-on evaluation, namely HBase and Hypertable, both based on Google's BigTable approach. Both are open-source projects; HBase is maintained as a top-level Apache project. Aside from performance itself, other criteria, such as the size of the community behind each project, were considered. The evaluation was carried out by Internet Memory on real crawl data, to ensure that the proposed system is mature and performant enough for the large-scale deployment that IMF demands. These systems can provide real-time access to data while scaling to hundreds of nodes in the database cluster.
Data in the proposed system will be organized in tables, which have the same meaning as in regular relational databases, but whose column-oriented (rather than row-oriented) storage allows thousands of columns in millions of rows to be stored efficiently. Columns are grouped into column families, which are physically stored close together; because rows have variable length, sparse values do not hurt performance. The overall structure is a multidimensional map in which the key to each value is the tuple <row_id, column_family, column_qualifier, timestamp>. High availability of the data is provided by the inherent replication of the distributed filesystem used to store the database's data files.
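The multidimensional map described above can be illustrated with a small in-memory model. This is only a sketch of the data model, not HBase's API: the class `SparseTable`, its method names and the example column names are hypothetical, and a real store keeps the data on a distributed filesystem rather than in Python dictionaries.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of a BigTable-style table (illustrative, not HBase's API)."""

    def __init__(self, column_families):
        # Column families are fixed up front; qualifiers are free-form.
        self.column_families = set(column_families)
        # row_id -> {(family, qualifier) -> {timestamp -> value}}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_id, family, qualifier, timestamp, value):
        """Store one cell under <row_id, family, qualifier, timestamp>."""
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self._rows[row_id][(family, qualifier)][timestamp] = value

    def get(self, row_id, family, qualifier, timestamp=None):
        """Return the cell at `timestamp`, or the latest version if None."""
        versions = self._rows.get(row_id, {}).get((family, qualifier))
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)

    def cell_count(self, row_id):
        """Number of stored cells in a row; absent columns cost nothing."""
        return sum(len(v) for v in self._rows.get(row_id, {}).values())


table = SparseTable(["content", "meta"])
table.put("http://example.org/", "meta", "status", 1, "200")
table.put("http://example.org/", "meta", "status", 2, "404")
latest = table.get("http://example.org/", "meta", "status")
```

Only cells that were actually written occupy space, which is the property that makes rows with thousands of mostly empty columns affordable.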
After a thorough evaluation of both systems, HBase was selected for its stability, maturity and satisfactory performance.
At IMF, data are kept in several column families, each meant for a different purpose, e.g. the raw content, metadata acquired at crawl time, the results of analyses, etc. Each datum is then identified by its unique URL, semantics and crawl time. Because of HBase's limitations in storing large blobs, crawled resources larger than 10 MB are stored directly in HDFS, with a corresponding pointer stored in HBase.
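The size-based routing of crawled resources might look roughly like the sketch below. Everything except the 10 MB threshold is an assumption made for illustration: plain dictionaries stand in for the HBase table and for HDFS, and the column names (`content:raw`, `content:hdfs_path`), the path scheme and the function names are hypothetical. The threshold is read here as 10 MiB.

```python
import hashlib

# 10 MB threshold from the text; interpreting it as 10 MiB is an assumption.
LARGE_BLOB_THRESHOLD = 10 * 1024 * 1024


def store_resource(table, blob_store, url, crawl_time, content):
    """Store small content inline; store large content externally.

    `table` and `blob_store` are dicts standing in for HBase and HDFS.
    """
    if len(content) > LARGE_BLOB_THRESHOLD:
        # Hypothetical path scheme: hash of the URL plus the crawl time.
        digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
        path = f"/blobs/{digest}/{crawl_time}"
        blob_store[path] = content
        # Only the pointer goes into the table.
        table[(url, "content", "hdfs_path", crawl_time)] = path
    else:
        table[(url, "content", "raw", crawl_time)] = content


def load_resource(table, blob_store, url, crawl_time):
    """Return the content, following the pointer when one is present."""
    path = table.get((url, "content", "hdfs_path", crawl_time))
    if path is not None:
        return blob_store[path]
    return table[(url, "content", "raw", crawl_time)]


table, blob_store = {}, {}
small = b"<html>hello</html>"
large = b"x" * (LARGE_BLOB_THRESHOLD + 1)
store_resource(table, blob_store, "http://example.org/a", 1, small)
store_resource(table, blob_store, "http://example.org/b", 1, large)
```

Keeping the pointer in its own column, rather than mixing it with the raw bytes, means a reader never has to guess whether a cell holds content or a reference.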
Data can be accessed either through a random-access query by URL (the primary key) or through a high-performance sequential read compatible with the MapReduce paradigm.
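The two access patterns can be contrasted with a minimal sketch that keeps rows sorted by key, as BigTable-style stores do. The class `SortedRowStore`, the helper `count_by` and the sample rows are hypothetical; the aggregation only mimics, on a single machine, the kind of scan-and-aggregate work a MapReduce job would perform over the table.

```python
import bisect
from collections import Counter


class SortedRowStore:
    """Rows kept sorted by URL: point lookups plus ordered range scans."""

    def __init__(self):
        self._keys = []   # sorted list of row keys (URLs)
        self._rows = {}   # URL -> row data

    def put(self, url, row):
        if url not in self._rows:
            bisect.insort(self._keys, url)
        self._rows[url] = row

    def get(self, url):
        """Random access by primary key."""
        return self._rows.get(url)

    def scan(self, start="", stop=None):
        """Sequential read of rows with start <= key < stop, in key order."""
        i = bisect.bisect_left(self._keys, start)
        while i < len(self._keys):
            key = self._keys[i]
            if stop is not None and key >= stop:
                break
            yield key, self._rows[key]
            i += 1


def count_by(rows, field):
    """Map each scanned row to one field value, then reduce by counting."""
    return Counter(row[field] for _, row in rows)


store = SortedRowStore()
store.put("http://b.example/", {"status": "200"})
store.put("http://a.example/", {"status": "404"})
store.put("http://c.example/", {"status": "200"})

one_row = store.get("http://a.example/")
status_counts = count_by(store.scan(), "status")
```

A point lookup touches one row, whereas the scan streams rows in key order, which is what allows a batch job to split the key range among many workers.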