The Hadoop framework can be considered highly scalable, fault tolerant, and even simple to program, but it is definitely not a high-performance solution out of the box. While Hadoop was conceived to run on commodity servers, hardware has evolved rapidly over recent years. Cluster designs now incorporate affordable, faster options such as SSD disks, multi-core processors, faster networks, and the possibility of running in the Cloud as either IaaS or PaaS. To take advantage of the newly available hardware and services, and to reduce the Total Cost of Ownership (TCO), Hadoop often requires iterative and time-consuming benchmarking and fine-tuning over a myriad of software configuration options: tuning the OS and the JVM, adjusting the different Hadoop parameters, and comparing vendor versions and different types of jobs. This makes Hadoop cluster design complex, and cost-effective infrastructures a challenge to devise.

Aloja aims to explore upcoming hardware architectures for Big Data processing and to reduce the TCO of running Hadoop clusters. Aloja's approach is to create the most comprehensive open public Hadoop benchmarking repository, comparing not only software configuration parameters but also current and newly available hardware, including SSDs, InfiniBand networks, and Cloud services, while at the same time evaluating the TCO of each possible setup along with its running time to offer a recommendation. In this way, it serves as a reference guide for designing new Hadoop clusters, for exploring parameter relationships, and for reducing the TCO of existing data processing infrastructures.
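The TCO-aware recommendation described above can be sketched as a simple cost model: charge each benchmarked setup its cluster cost per hour multiplied by the measured job running time, and recommend the cheapest per executed job. This is a minimal illustrative sketch; the setup names, prices, and runtimes are hypothetical assumptions, not ALOJA data.

```python
# Hypothetical sketch of a TCO-aware recommendation: for each benchmarked
# setup, execution cost = cluster cost per hour * job running time.
# All names and numbers below are illustrative, not ALOJA measurements.

def execution_cost(cost_per_hour, runtime_hours):
    """Cost of one job run on a given cluster setup."""
    return cost_per_hour * runtime_hours

# (setup name, cluster cost in $/hour, measured job runtime in hours)
setups = [
    ("on-premise HDD", 4.0, 2.0),   # cheap per hour, but slow
    ("on-premise SSD", 5.0, 1.2),   # pricier hardware, faster run
    ("cloud IaaS",     6.0, 0.8),   # highest hourly rate, fastest run
]

# Recommend the setup with the lowest cost per executed job.
best = min(setups, key=lambda s: execution_cost(s[1], s[2]))
print(best[0], execution_cost(best[1], best[2]))  # → cloud IaaS 4.8
```

Note that the cheapest hourly rate does not win here: the faster setups finish sooner, so their per-job cost is lower, which is exactly the trade-off between running time and infrastructure price that the project evaluates.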

Project Team: