Basic cluster software
Ubuntu 10.04.4 LTS
OpenSSH as part of the OS installation
Cluster Setup (2012.06): Cloudera CDH3u4 (see current status below)
JDK 6u33 - x64 (jdk-6u33-linux-x64.bin)
At the time the experimental cluster setup was planned, CDH4u0 (based on Hadoop 2.0.0) and Ubuntu 12.04 LTS had just been released. The development team decided to go with CDH3u4 (which is supported on 10.04 LTS but not on 12.04 LTS) and Ubuntu 10.04.4 LTS for stability reasons.
Current cluster software setup & history
2012.09 - update to Cloudera CDH3u5
- Hadoop updated to 0.20.2+923.418 (previously 0.20.2+923.256)
- Pig updated to version 0.8.1+28.39-1 (previously pig-0.8.1+28.32)
- HADOOP_HEAPSIZE=2000 (previously 1000)
- MAP / REDUCE slot ratio 5/2 (previously 6/1)
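To make the slot change concrete, the following is a minimal sketch of the arithmetic. It assumes (this is not confirmed above) that "5/2" and "6/1" are the numbers of map and reduce slots per worker node, and it uses the five worker nodes with 8 hardware threads each listed under Cluster hardware. The names SlotConfig, WORKERS and THREADS_PER_NODE are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SlotConfig:
    """Per-node slot counts (assumed reading of the 'MAP / REDUCE slot ratio')."""
    map_slots: int
    reduce_slots: int

    def per_node(self) -> int:
        # Total task slots that can run concurrently on one worker node.
        return self.map_slots + self.reduce_slots

    def cluster_totals(self, workers: int) -> Tuple[int, int]:
        # Cluster-wide (map, reduce) slot totals for the given number of workers.
        return self.map_slots * workers, self.reduce_slots * workers


WORKERS = 5           # worker nodes 1 to 5
THREADS_PER_NODE = 8  # hardware threads per worker node

previous = SlotConfig(map_slots=6, reduce_slots=1)
current = SlotConfig(map_slots=5, reduce_slots=2)

for name, cfg in (("previous", previous), ("current", current)):
    maps, reduces = cfg.cluster_totals(WORKERS)
    print(
        f"{name}: {cfg.per_node()} slots per node "
        f"({THREADS_PER_NODE} hardware threads), "
        f"cluster total {maps} map / {reduces} reduce"
    )
```

Under this reading, both configurations use 7 slots per node, so the update shifts capacity from the map phase to the reduce phase without changing the total number of concurrent tasks.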
Cluster hardware
NETWORK infrastructure
The CONTROLLER and the NODEs are connected via a high-performance GBit network switch (which guarantees full GBit throughput on each port) to the ONB internal network infrastructure. Since all machines are connected to ports on the same physical switch, traffic between the cluster machines does not affect traffic on the ONB network.
CONTROLLER
Dell Poweredge R510
CPU: 2 x Xeon [email protected] Quadcore CPU (16 HyperThreading cores)
RAM: 24GB
NIC: 2 x GBit Ethernet (1 used)
DISK: 3 x 1TB DISKs; configured as RAID5 (redundancy) => 2TB effective disk space
NODE (worker nodes 1 to 5)
Dell Poweredge R310
CPU: 1 x Xeon [email protected] Quadcore CPU (8 HyperThreading cores)
RAM: 16GB
NIC: 2 x GBit Ethernet (1 used)
DISK: 2 x 1TB DISKs; configured as RAID0 (performance) => 2TB effective disk space
Comment: Some Hadoop documentation recommends JBOD (Just a Bunch Of Disks) instead of RAID0 (stripe set). Unfortunately, the RAID controller in the R310 does not offer JBOD functionality. If this affects performance at all, the impact is likely to be very small in our usage scenario.
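The effective disk sizes quoted above follow from the RAID levels, and the sketch below reproduces that arithmetic. The last step goes beyond what is documented here: it assumes the cluster keeps the HDFS default replication factor of 3 and ignores space taken by the OS and non-HDFS data, so the resulting figure is only a rough upper bound. All function and variable names are hypothetical.

```python
def raid0_capacity_tb(disks: int, disk_size_tb: float) -> float:
    """RAID0 stripes data across all disks: no redundancy, full capacity."""
    if disks < 2:
        raise ValueError("RAID0 needs at least 2 disks")
    return disks * disk_size_tb


def raid5_capacity_tb(disks: int, disk_size_tb: float) -> float:
    """RAID5 spends the equivalent of one disk on parity."""
    if disks < 3:
        raise ValueError("RAID5 needs at least 3 disks")
    return (disks - 1) * disk_size_tb


controller_tb = raid5_capacity_tb(disks=3, disk_size_tb=1.0)  # CONTROLLER
node_tb = raid0_capacity_tb(disks=2, disk_size_tb=1.0)        # each NODE

WORKERS = 5
raw_cluster_tb = WORKERS * node_tb

# Assumption: HDFS default replication factor (dfs.replication = 3);
# the actual setting of this cluster is not stated in this document.
ASSUMED_REPLICATION = 3
usable_hdfs_tb = raw_cluster_tb / ASSUMED_REPLICATION

print(f"controller: {controller_tb:.1f} TB, per node: {node_tb:.1f} TB")
print(f"raw cluster: {raw_cluster_tb:.1f} TB, "
      f"approx. usable at replication {ASSUMED_REPLICATION}: {usable_hdfs_tb:.2f} TB")
```

Note that with RAID0 the loss of a single disk takes the whole node's storage offline, which is one reason JBOD is often recommended; HDFS replication across nodes is what protects the data in that case.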