Saturday, October 25, 2014

Getting Hadoop Up and Running on Ubuntu

Hadoop is a framework written in Java that allows processing of large datasets in a distributed manner across large clusters of commodity hardware.

Hadoop was created by Doug Cutting, who got the inspiration for it after reading the Google papers "The Google File System" and "MapReduce: Simplified Data Processing on Large Clusters", published in 2003 and 2004 respectively.

As per the project main page, Hadoop includes four modules: Hadoop Common, Hadoop Distributed File System (HDFS), Hadoop YARN and Hadoop MapReduce. There is a whole ecosystem built around Hadoop, including projects like Apache Hive, Apache Pig and Apache ZooKeeper.

There are three modes to start a Hadoop cluster.
  1. Local (Standalone) Mode
  2. Pseudo-Distributed Mode
  3. Fully-Distributed Mode
In this post my aim is to get Hadoop up and running on an Ubuntu host in Local (Standalone) Mode and in Pseudo-Distributed Mode.

Linux is supported as a development and a production platform by Hadoop. Since Hadoop runs on any Linux distribution, my choice of Ubuntu should not matter to readers using other distributions. Note, however, that environment configurations can vary across distributions.

For this post I'll be using Ubuntu 14.04 LTS and Apache Hadoop 2.5.1.


Since Hadoop is written in Java, Java should be installed on your Ubuntu host. Refer to this link for recommended Java versions. Run the following commands to check whether Java is already installed on your machine.
$ javac
$ java -version
This link provides a good resource in case you have not already installed Java.

Once Java is installed, you should add JAVA_HOME/bin to your PATH, to ensure java is available from the command line. To set the JAVA_HOME environment variable persistently, open the ~/.profile file using the following command.
$ gedit ~/.profile
Append following lines to it and save.

export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export PATH=$JAVA_HOME/bin:$PATH

Note that after editing you would normally re-login for the variables to take effect, but you can use the following command to load them without logging in again. There are also several other ways to set environment variables in Ubuntu; refer to this link to find out what they are.
$ source ~/.profile
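As a quick sanity check, the effect of the two export lines can be seen by inspecting the first entry of the PATH (the JAVA_HOME value here is just the illustrative path used above):

```shell
# Illustrative: after sourcing ~/.profile, $JAVA_HOME/bin should be the
# first entry on the PATH (paths are the example values from above).
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export PATH="$JAVA_HOME/bin:$PATH"
echo "$PATH" | cut -d: -f1
# → /usr/lib/jvm/java-7-oracle/bin
```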
Downloading Hadoop
  1. Download the latest stable Hadoop release from this link. You'll most likely download a file named like hadoop-2.5.1.tar.gz (this is the latest version of Hadoop at the time of this writing).
  2. I prefer Hadoop to be installed in the /usr/local directory. Decompress the downloaded file using the following command (note that writing to /usr/local typically requires sudo).
    $ sudo tar -xf hadoop-2.5.1.tar.gz -C /usr/local/
  3. Add $HADOOP_PREFIX/bin directory to your PATH, to ensure Hadoop is available from the command line. Follow the same steps we followed for adding JAVA_HOME variable, except you should append following in .profile file.

    export HADOOP_PREFIX=/usr/local/hadoop-2.5.1
    export PATH=$HADOOP_PREFIX/bin:$PATH
  4. Define the following parameters in the etc/hadoop/ file.
    export JAVA_HOME=/usr/lib/jvm/java-7-oracle
    export HADOOP_PREFIX=/usr/local/hadoop-2.5.1
  5. If executing the following command displays the usage documentation for the Hadoop script, you are good to go and can start your Hadoop cluster in one of the three modes mentioned above.
    $ hadoop

Standalone Mode

By default, Hadoop is configured to run as a single Java process in non-distributed mode. Standalone mode is usually useful in the development phase since it is easy to test and debug. Also, the Hadoop daemons are not started in this mode. Since Hadoop's default properties are set for standalone mode and there are no Hadoop daemons to run, there are no additional steps to carry out here.

Pseudo-Distributed Mode

This mode simulates a small-scale cluster, with the Hadoop daemons running on the local machine. Each Hadoop daemon runs in a separate Java process. Pseudo-Distributed Mode is a special case of Fully-Distributed Mode.

To enable Pseudo-Distributed Mode, you should edit the following two XML files. These XML files contain multiple property elements within a single configuration element, and each property element in turn contains name and value elements.
  1. etc/hadoop/core-site.xml
  2. etc/hadoop/hdfs-site.xml
Edit core-site.xml and modify the following properties. The fs.defaultFS property holds the location of the NameNode.
Edit hdfs-site.xml and modify the following properties. The dfs.replication property holds the number of times each HDFS block should be replicated.
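For reference, the usual single-node values from the Hadoop setup documentation look like this; hdfs://localhost:9000 as the NameNode address and a replication factor of 1 are the standard pseudo-distributed choices, so adjust them if your setup differs.

core-site.xml:

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

hdfs-site.xml:

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```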
Setting up SSH

In a typical cluster, the Hadoop user should be able to execute commands on machines in the cluster without having to enter a password for every single log in. If we used a password to log in to machines in the cluster, we would have to go to each individual machine and start all the processes there.

As I mentioned earlier, in Pseudo-Distributed Mode we need to start the Hadoop daemons. Since the (single) host is localhost, we need a way to log in to localhost without entering a password and start the Hadoop daemons there. For that we use ssh, and we need to make sure we can ssh to localhost without giving a password. To allow password-less log in, we need to create an ssh key pair that has an empty password.

[ssh provides a way to securely log onto remote systems without using a password, using key-based authentication. Key-based authentication uses a pair of keys: a private key and a public key. The private key is kept secret on the client machine, while the public key can be placed on any server you wish to access. Briefly, when a client tries to connect, the server generates a challenge message using the client's public key, which only the client can read using its private key. Based on the response it gets back, the server can decide whether the client is authorized.]
  1. The ssh client is pre-packaged with Ubuntu, but we need to install the ssh server first to get the sshd daemon. Use the following command to install ssh and sshd.
    $ sudo apt-get install ssh
    Verify the installation using the following commands.
    $ which ssh
    ## Should print '/usr/bin/ssh'
    $ which sshd
    ## Should print '/usr/bin/sshd'
  2. Check if you can ssh to the localhost without a password.
    $ ssh localhost
    Note that if you try to ssh to the localhost without installing ssh first, an error message will be printed saying 'ssh: connect to host localhost port 22: Connection refused'. So be sure to install ssh first.
  3. If you cannot ssh to the localhost without a password, create an ssh key pair using the following command.
    $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
  4. Now that the key pair has been created, note that id_dsa is the private key and is the public key; both are in the ~/.ssh directory. We need to append the new public key to the list of authorized keys using the following command.
    $ cat ~/.ssh/ >> ~/.ssh/authorized_keys
  5. Now connect to the localhost and check if you can ssh to the localhost without a password.
    $ ssh localhost
Configuring the base HDFS directory

The hadoop.tmp.dir property within the core-site.xml file holds the location of the base HDFS directory. Note that this configuration doesn't depend on the mode Hadoop runs in. The default value for the hadoop.tmp.dir property is /tmp/hadoop-${}, and some Linux distributions discard the contents of the /tmp directory in the local file system on each reboot, which would lead to data loss. Hence, to be on the safer side, it makes sense to change the location of the base directory to a more reliable one.

Carry out following steps to change the location of the base HDFS directory.
  1. Create a directory for Hadoop to store its data locally and change its permissions so it is writable by any user (creating a directory under / requires sudo).
    $ sudo mkdir -p /app/hadoop/tmp
    $ sudo chmod 777 /app/hadoop/tmp
  2. Edit the core-site.xml and modify the following property.
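A hadoop.tmp.dir entry pointing at the directory created in step 1 would look like the following (add it inside the existing configuration element of core-site.xml):

```xml
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
</property>
```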
Formatting the HDFS filesystem

We need to format the HDFS file system before starting the Hadoop cluster in Pseudo-Distributed Mode for the first time. Note that formatting the file system multiple times will result in deleting the existing file system data.

Execute the following command on command line to format the HDFS file system.
$ hdfs namenode -format
Starting NameNode daemon and DataNode daemon

Use the following command to start the NameNode and DataNode daemons.
$ $HADOOP_PREFIX/sbin/
Once the daemons have started, you can access the NameNode web interface at http://localhost:50070/.

You can use the following command to see the Java processes which are currently running.
$ jps

Creating the home directory

In Hadoop, the home directories for each user are stored under the /user directory.

If there is no /user directory, create it using the following command. Note that even if you skip this step, the /user directory will be created automatically by Hadoop later, if necessary.
$ hdfs dfs -mkdir /user
Then create your home directory using the following command.
$ hdfs dfs -mkdir /user/hadoop
Note that this is an optional step; even if you don't carry it out, a home directory will be created automatically later by Hadoop, named after the user you are logged into your local machine as. For example, if I am logged into my local machine as a user called pavithra, my home directory in HDFS will be /user/pavithra. So to make use of the previous step, you should be logged in to your local machine as a user called hadoop.

Starting a MapReduce job

  1. Use the following command to create an input directory in HDFS.
    $ hdfs dfs -mkdir input
  2. Use the following command to copy the input files into HDFS.
    $ hdfs dfs -put $HADOOP_PREFIX/etc/hadoop/*.xml input
  3. Use the following command to run the MapReduce example job provided with Hadoop.
    $ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1.jar grep input output 'dfs[a-z.]+'
    Hadoop itself creates the output directory you specify in this command. If by any chance you specify a directory which already exists in HDFS, Hadoop will throw an exception saying "Output directory already exists". This way Hadoop ensures that data from previous jobs will not get replaced by data from the current job.
  4. Use the following commands to copy the output files from HDFS to the local file system and view them.
    $ hdfs dfs -get output output
    $ cat output/*
    Or you can use the following command to view the output files on HDFS itself.
    $ hdfs dfs -cat output/*
    Note that the result files inside the output directory follow a naming convention of part-nnnnn.
  5. Use the following command to stop the daemons.
    $ $HADOOP_PREFIX/sbin/
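As an aside, what the grep example job in step 3 computes can be approximated locally with standard Unix tools, which makes the map/reduce split easier to see: the map phase extracts every match of the regex, and the reduce phase counts occurrences of each distinct match. A rough sketch, using a made-up sample file rather than the real Hadoop runtime:

```shell
# Rough local approximation of the Hadoop grep example:
# "map" extracts every match of the regex dfs[a-z.]+,
# "reduce" counts the occurrences of each distinct match.
mkdir -p /tmp/grep-sketch
printf '<name>dfs.replication</name>\n<name>dfs.namenode.name.dir</name>\n<name>dfs.replication</name>\n' > /tmp/grep-sketch/sample.xml
grep -oE 'dfs[a-z.]+' /tmp/grep-sketch/sample.xml | sort | uniq -c | sort -rn | awk '{print $1 "\t" $2}'
# → 2	dfs.replication
#   1	dfs.namenode.name.dir
```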