Apache Hadoop has become a standard framework for processing and storing big data across many industries. It is an open-source framework designed to handle datasets of high volume and complexity, and it scales from a single machine to distributed clusters of hundreds or even thousands of computers or dedicated servers.
In this article, we will guide you through the process of installing Apache Hadoop on an Ubuntu 22.04 server, specifically focusing on the Pseudo-Distributed Mode of Hadoop deployment. We will cover the prerequisites, installing Java OpenJDK, setting up user and password-less SSH authentication, downloading Hadoop, setting up Hadoop environment variables, and configuring the Hadoop cluster.
Prerequisites
Before getting started, make sure you have the following requirements:
- An Ubuntu 22.04 server (e.g., hosted on Shape.host with the username shapehost)
- A non-root user with sudo/root administrator privileges
Installing Java OpenJDK
Hadoop is written primarily in Java, so the first step is to install Java OpenJDK 11, which is compatible with Hadoop 3.3.4, the version installed in this guide. Start by updating and refreshing the package lists/repositories on your Ubuntu system with the following command:
sudo apt update
Next, install Java OpenJDK 11 by executing the following command:
sudo apt install default-jdk
When prompted, type “y” to confirm and press ENTER to proceed with the installation. Once the installation is complete, verify the Java version by running the following command:
java -version
You should see the Java OpenJDK 11 installed on your Ubuntu system.
Setting up User and Password-less SSH Authentication
To run Apache Hadoop, the SSH service must be running on the system. In this step, we will create a new user named “hadoop” and set up password-less SSH authentication.
If you don’t have SSH installed on your system, you can install it by running the following command:
sudo apt install openssh-server openssh-client pdsh
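Note: on some systems, the Hadoop start-up scripts fail when pdsh is installed, because pdsh defaults to rsh rather than ssh as its remote command. If you run into connection errors later when starting the daemons, telling pdsh to use ssh usually resolves it (add the line to ~/.bashrc to make it persistent):
export PDSH_RCMD_TYPE=ssh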
Now, create a new user “hadoop” and set up a password for the user with the following commands:
sudo useradd -m -s /bin/bash hadoop
sudo passwd hadoop
Next, add the “hadoop” user to the “sudo” group so that it can execute the “sudo” command:
sudo usermod -aG sudo hadoop
Switch to the “hadoop” user by running the following command:
su - hadoop
Generate SSH public and private keys by executing the following command:
ssh-keygen -t rsa
You will be prompted to set a passphrase for the key; press ENTER to skip it and create a key without one. The SSH keys will be generated in the ~/.ssh directory. Verify the generated SSH key by running the following command:
ls ~/.ssh/
To enable password-less SSH authentication, append the SSH public key (id_rsa.pub) to the “authorized_keys” file and change its permissions to 600 with the following commands:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
You can verify the password-less configuration by connecting to the local machine using the following command:
ssh localhost
Type “yes” to confirm and add the SSH fingerprint, and you should be connected to the server without password authentication.
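Once you have confirmed that the login works, type “exit” to close the nested SSH session and return to your original shell.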
Downloading Hadoop
Now that the user and password-less SSH authentication are set up, we can proceed to download the Apache Hadoop binary package and set up the installation directory.
Begin by downloading the Hadoop binary package (version 3.3.4) to the current working directory with the following command:
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
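Optionally, verify the integrity of the download before extracting it. Apache publishes a SHA-512 checksum alongside each release tarball (this assumes the matching .sha512 file is still available at the same location):
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz.sha512
sha512sum -c hadoop-3.3.4.tar.gz.sha512
The check should report “hadoop-3.3.4.tar.gz: OK” before you continue.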
Once the download is complete, extract the package and move the extracted directory to the /usr/local/hadoop directory with the following commands:
tar -xvzf hadoop-3.3.4.tar.gz
sudo mv hadoop-3.3.4 /usr/local/hadoop
Change the ownership of the Hadoop installation directory to the “hadoop” user and group:
sudo chown -R hadoop:hadoop /usr/local/hadoop
Setting up Hadoop Environment Variables
To set up the Hadoop environment variables, open the ~/.bashrc file with a text editor:
nano ~/.bashrc
At the end of the file, add the following lines to set up the Hadoop environment variables:
# Hadoop environment variables
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Save the file and exit the editor. Apply the changes within the ~/.bashrc file by running the following command:
source ~/.bashrc
Verify the environment variables by checking each variable with the echo command, for example:
echo $JAVA_HOME
echo $HADOOP_HOME
echo $HADOOP_OPTS
You should see the output of each environment variable.
Additionally, configure the JAVA_HOME environment variable in the hadoop-env.sh script. Open the file with a text editor:
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Uncomment the line that sets the JAVA_HOME environment variable and change its value to the Java OpenJDK installation directory:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
Save the file and exit the editor.
To verify the Hadoop version on your system, run the following command:
hadoop version
You should see that Apache Hadoop 3.3.4 is installed on your system.
Setting up Apache Hadoop Cluster: Pseudo-Distributed Mode
In Hadoop, you can create a cluster in three different modes: Local Mode (Standalone), Pseudo-Distributed Mode, and Fully-Distributed Mode. In this article, we will set up an Apache Hadoop cluster in Pseudo-Distributed Mode on a single Ubuntu server.
To configure the NameNode and DataNode for the Hadoop cluster, open the core-site.xml file with a text editor:
sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
Add the following lines to the file, replacing 192.168.5.100 with your server’s IP address (or with localhost if everything will run on a single machine):
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://192.168.5.100:9000</value>
  </property>
</configuration>
Save the file and exit the editor.
Next, create the directories that will be used by the NameNode and DataNode on the Hadoop cluster and change their ownership to the “hadoop” user:
sudo mkdir -p /home/hadoop/hdfs/{namenode,datanode}
sudo chown -R hadoop:hadoop /home/hadoop/hdfs
Open the hdfs-site.xml file with a text editor:
sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Add the following configuration to the file, adjusting the “dfs.replication” value if needed and specifying the storage directories for the NameNode and DataNode. (The property names below are the current Hadoop 3 names; the older “dfs.name.dir” and “dfs.data.dir” still work but are deprecated.)
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hdfs/datanode</value>
  </property>
</configuration>
Save the file and exit the editor.
Format the Hadoop filesystem by running the following command:
hdfs namenode -format
Start the NameNode and DataNode by executing the following command:
start-dfs.sh
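Before opening the web interface, you can confirm that the daemons came up with jps, the JVM process listing tool that ships with the JDK:
jps
The output should include NameNode, DataNode, and SecondaryNameNode entries (alongside Jps itself).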
To verify that both processes are running, access the Hadoop NameNode web interface by opening a web browser and visiting the server IP address followed by port 9870 (e.g., http://192.168.5.100:9870/). You should see the NameNode status page indicating that it is active.
Click on the “Datanodes” menu to view the active DataNode on the Hadoop cluster. The page should display the DataNode running on port 9864.
Click on the DataNode’s “Http Address” to access detailed information about the DataNode, including the volume directory.
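You can also interact with HDFS from the command line. As a quick sanity check, create a home directory for the “hadoop” user in HDFS and list the filesystem root (the /user/<name> layout is an HDFS convention and is not created automatically):
hdfs dfs -mkdir -p /user/hadoop
hdfs dfs -ls /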
With the NameNode and DataNode running, it’s time to set up and run MapReduce on YARN (Yet Another Resource Negotiator), Hadoop’s resource manager.
YARN Manager
To configure MapReduce on YARN in Pseudo-Distributed Mode, make changes to the mapred-site.xml and yarn-site.xml configuration files.
Open the mapred-site.xml file with a text editor:
sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Add the following lines to the file:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>
Save the file and exit the editor.
Open the yarn-site.xml file with a text editor:
sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
Modify the following configurations:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>
Save the file and exit the editor.
Start the YARN daemons by running the following command:
start-yarn.sh
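If you run jps again, the ResourceManager and NodeManager processes should now appear in the list as well.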
Access the Hadoop ResourceManager web interface by opening a web browser and visiting the server IP address followed by port 8088 (e.g., http://192.168.5.100:8088/). The interface allows you to monitor all running processes within the Hadoop cluster.
Click on the “Nodes” menu to see the currently running nodes on the Hadoop cluster.
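To smoke-test MapReduce on YARN, you can run one of the example jobs bundled with Hadoop, such as the Monte Carlo pi estimator (the jar path below assumes the default layout of the Hadoop 3.3.4 distribution):
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar pi 2 4
While it runs, the job appears as an application in the ResourceManager interface, and it prints an estimated value of pi when it finishes.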
Congratulations! You have successfully set up Apache Hadoop in Pseudo-Distributed Mode on your Ubuntu 22.04 server. Now you can process and store big data efficiently using Hadoop.
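When you are finished, you can stop the daemons in the reverse order of how they were started:
stop-yarn.sh
stop-dfs.sh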
Conclusion
In this guide, we walked you through the step-by-step process of installing Apache Hadoop on an Ubuntu 22.04 server. We covered the prerequisites, installation of Java OpenJDK, setting up user and password-less SSH authentication, downloading Hadoop, configuring Hadoop environment variables, and setting up the Hadoop cluster in Pseudo-Distributed Mode.
Remember, Pseudo-Distributed Mode is suitable for testing purposes, while Fully-Distributed Mode is recommended for large-scale deployments. With Hadoop’s distributed processing capabilities, you can handle medium to large datasets effectively.
If you’re looking for reliable cloud hosting solutions, consider Shape.host’s Cloud VPS services. Shape.host offers scalable and secure cloud hosting, empowering businesses with efficient data processing and storage capabilities.