This is the first stage of my DIY Hadoop cluster build. Please note I have zero previous experience with Hadoop clusters and this project is for me to simply learn more about how it works and have some fun with old PCs and Linux.
The catalyst for this project was my work as an analytics professional. To date, I have largely worked with traditional server architecture and in-memory software such as R. With the push towards digital half-hourly metering in the energy industry, the sheer volume of data means that using it for analytical purposes requires a paradigm shift. This is a funny but interesting post I came across when researching this topic for work, and it explains why this is the first time I have needed to learn about Hadoop:
The hardware for this project is not based on optimal design choices but rather on what I had lying around the house, or could get at low cost or for free (yes, I am a tightarse). All I want to achieve with this first hit-out is a working platform comprising three physical servers in a cluster. It doesn’t need to be able to analytically process big data at this point. This project is intended to give me a bit more of an understanding of Linux administration, networking and how Hadoop works. I do not need to be a Linux administrator for my job (we have IT gurus for that), but I have found that knowing a little bit about it leads to much better outcomes when engaging with the technical people who will be supporting the platform. Also, Linux is free, so there is no excuse not to have a play and learn some new things along the way.
Some things to outline up front. I completed a little bit of pre-reading on this subject and I have compiled a list of sources that influenced my thinking in some way before the project commenced. I thank those who took the time to post these:
- NameNode and DataNode
- home built hadoop analytics cluster
- building a virtualized 5-node HDP 2.0 cluster (all within a mac)
- NanoLab – Index of Tekhead.it Posts on Intel NUC VMware vSphere Homelabs
- Open Homelab project
- HDFS Users Guide
- Understanding Hadoop Clusters and the Network
- Build a Compact 4 Node Raspberry Pi Cluster
- How-to: Deploy Apache Hadoop Clusters Like a Boss
So the first stage was to pull together the three PCs (or rather two PCs and one laptop) that I had gathering dust in the house, back up and remove all the content I had previously saved across them, and then install 64-bit CentOS over the top as a complete new install, replacing everything on the disks. Yes, I am using CentOS as my OS of choice, largely because of the “home built hadoop analytics cluster” article and the fact that Red Hat is used at work.

I went with the Minimal ISO option, copied to and booted from a USB. My, how things have progressed since I last tried this. Nowadays, with the latest versions of CentOS, it is possible to simply copy the ISO image to a USB and use it to boot and install. No more unetbootin or pendrivelinux! The install on all three PCs went fine. I ensured the user, password, timezone, language etc. were identical on all machines. I’m not sure if this is strictly needed, but it is easier to remember in any case. I also wrote down all of the settings I used for the initial install.

As for the hardware, each PC is different, as listed below:
- HP Workstation xw4400 (with an extra 1TB hard disk)
- HP Workstation xw4300
- HP Pavilion dv6000
I swear I am NOT an HP fanboy; that is just what I had lying around the house from past decades. You can google the specs of each PC if you need to, but for me that wasn’t important at this point because, remember, it is just a test exercise. In a real scenario there is no way I would consider using laptops or old power-hungry Pentium 4 workstations in any cluster. Here they are in all their glory (ignore the netbook in the image):
And this completes the first stage of the project. Next I will have to figure out how to set them all up ready for Hadoop but I’ll need to do some more reading before I get to that (hint: I will be using Hortonworks).
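A side note on the USB step mentioned earlier: copying an ISO image to a USB is essentially a raw, byte-for-byte write of the image file onto the device. The Python sketch below illustrates that idea only; it is not necessarily the exact method used for this build. The function name `write_image`, the ISO filename and the `/dev/sdX` device path are all hypothetical placeholders. Writing to a real device needs root privileges and will wipe whatever is on it, so triple-check the target before trying anything like this.

```python
import os


def write_image(iso_path, target_path, chunk_size=4 * 1024 * 1024):
    """Copy iso_path byte-for-byte onto target_path.

    target_path can be an ordinary file or (on Linux, as root) a block
    device. Returns the number of bytes written.
    """
    written = 0
    with open(iso_path, "rb") as src, open(target_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # an empty read means end of the image
                break
            dst.write(chunk)
            written += len(chunk)
        # Push buffered data out to the device before the file is closed.
        dst.flush()
        os.fsync(dst.fileno())
    return written


# Hypothetical usage (both paths are placeholders, not values from this build):
# write_image("CentOS-minimal.iso", "/dev/sdX")
```

The call is left commented out on purpose, since pointing it at the wrong device would overwrite that disk. Trying it against an ordinary file first is a safe way to see what it does.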