IBM’s BigSheets Text-mining the UK Web Archive

The UK Web Archive recently announced that, with the help of IBM, its decades of experience in text-mining, and its BigSheets software, it is going to store and make accessible every site in the .uk top-level domain. The goal is to support dynamic research with capabilities such as classifying pages into categories, extracting entities as metadata, and offering several approaches to querying and visualizing data.

Hadoop, the core technology used within BigSheets, is a distributed data storage and processing system that can scale to billions of items while requiring less structure and space than a relational database. It handles large amounts of traffic through parallel processing, and it supports the addition of new servers, replication, fail-over, and load balancing.
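
To give a rough feel for the parallel processing described above, here is a minimal sketch of the map-and-reduce pattern that Hadoop popularized, written in plain Python. It is not Hadoop's or BigSheets' API: the function names (`map_shard`, `reduce_counts`, `word_count`) and the sample pages are hypothetical, and a thread pool on one machine stands in for a cluster of servers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_shard(pages):
    """Map step: count the words in one shard of pages."""
    counts = Counter()
    for text in pages:
        counts.update(text.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-shard counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(pages, workers=2):
    """Split pages into shards, map them in parallel, then reduce."""
    # Round-robin split: shard i gets pages i, i + workers, i + 2*workers, ...
    shards = [pages[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_shard, shards))
    return reduce_counts(partials)


# Hypothetical sample pages standing in for archived .uk sites.
sample_pages = [
    "the archive stores the web",
    "the web archive indexes uk sites",
    "uk sites change often",
]
totals = word_count(sample_pages, workers=2)
print(totals.most_common(3))
```

Because each shard is counted independently, raising `workers` changes only how the work is divided, not the final totals. That independence is what lets a system of this kind grow by adding servers.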

