Apache Hadoop is an open-source, Java-based software platform that manages data processing and storage for big data applications. It is widely used by businesses to handle large datasets and perform complex analytics tasks. In this article, we will guide you through the process of installing and configuring Apache Hadoop on Debian 11, step by step.
System Update and Java Installation
Before we begin with the installation of Apache Hadoop, it is essential to update the system packages to the latest version. To do so, open the terminal and run the following command:
sudo apt-get update -y
Once the system packages are updated, we need to install Java, as Apache Hadoop is a Java-based application. Run the following command to install the default JDK and JRE:
sudo apt-get install default-jdk default-jre -y
To verify the Java installation, run the following command:
java -version
The output should display the installed Java version.
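Hadoop needs a full JDK rather than just a JRE, so it is worth confirming the compiler is also available (a quick check):
javac -version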
Creating a Hadoop User and Setting up Passwordless SSH
To ensure secure access and management of Apache Hadoop, it is recommended to create a dedicated user. Run the following command to create a Hadoop user:
sudo adduser hadoop
After creating the user, switch to the Hadoop user by running the following command:
su - hadoop
Next, generate an SSH key for the Hadoop user by running the following command:
ssh-keygen -t rsa
This command generates a public and private key pair. Press Enter to accept the default file location, and press Enter again to leave the passphrase empty so that SSH logins do not prompt for one. The key pair will be saved under ~/.ssh/.
To enable passwordless SSH access, append the Hadoop user's public key to its own authorized_keys file and lock down the file's permissions. Run the following commands:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
To verify the passwordless SSH connection, run the following command:
ssh [Server's_IP_Address]
Replace [Server's_IP_Address] with the IP address of the server you are connecting to. You should be logged in without being prompted for a password.
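The Hadoop start-up scripts also connect to localhost over SSH, so on a single-node setup it is worth confirming that connection works too (type exit to return to your original shell):
ssh localhost
exit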
Installing Apache Hadoop
Switch back to the Hadoop user and download Apache Hadoop 3.3.0 (check the Apache downloads page for the current release; older versions move to archive.apache.org) using the following commands:
su - hadoop
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
Once the download is complete, extract the downloaded tar file using the following command:
tar -xvzf hadoop-3.3.0.tar.gz
Now, switch back to the root user for the remaining commands:
su root
Move the extracted Hadoop files into place by running the following commands:
cd /home/hadoop
mv hadoop-3.3.0 /usr/local/hadoop
Create a log directory to store Apache Hadoop logs:
mkdir /usr/local/hadoop/logs
Change the ownership of the /usr/local/hadoop directory to the Hadoop user:
chown -R hadoop:hadoop /usr/local/hadoop
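To confirm the ownership change took effect, you can list the directory (a quick check):
ls -ld /usr/local/hadoop
The listing should show hadoop as both the owner and the group.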
Switch back to the Hadoop user:
su - hadoop
Configuring Hadoop Environment Variables
To configure the Hadoop environment variables, open the .bashrc file in a text editor:
nano ~/.bashrc
Add the following configuration at the end of the file:
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Save the changes and exit the editor.
To activate the added environment variables, run the following command:
source ~/.bashrc
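To confirm the variables are active in the current shell, a quick sanity check:
echo $HADOOP_HOME
which hadoop
The first command should print /usr/local/hadoop and the second /usr/local/hadoop/bin/hadoop.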
Configuring Hadoop on a Single Node
If you are new to Hadoop and want to explore basic commands or test applications, you can configure Hadoop on a single node. To do so, follow the steps below:
Configure Java Environment Variables
Determine the path of the installed Java version by running the following command:
which javac
The output should display the path to the Java compiler.
Next, resolve the full path of the javac binary, which lives inside the OpenJDK directory, by running the following command:
readlink -f /usr/bin/javac
Make a note of the output; the value we need in the next step is this path without the trailing /bin/javac.
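If you prefer to derive the directory automatically, a one-liner along these lines (a sketch, assuming javac resolves to an OpenJDK installation) prints the value to use:
readlink -f /usr/bin/javac | sed 's:/bin/javac::'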
Edit the hadoop-env.sh file by running the following command:
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Add the following configuration at the end of the file:
export JAVA_HOME=[Java_Path]
export HADOOP_CLASSPATH+=" $HADOOP_HOME/lib/*.jar"
Replace [Java_Path] with the OpenJDK directory from the previous step (the readlink output without the trailing /bin/javac), for example /usr/lib/jvm/java-11-openjdk-amd64 on Debian 11 amd64.
Save the changes and exit the editor.
Download the Javax Activation File
To download the Javax Activation file into Hadoop's lib directory, run the following commands:
cd /usr/local/hadoop/lib
sudo wget https://jcenter.bintray.com/javax/activation/javax.activation-api/1.2.0/javax.activation-api-1.2.0.jar
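Note that the JCenter repository was sunset in 2021, so the URL above may no longer resolve. The same artifact is published on Maven Central (same version as above):
sudo wget https://repo1.maven.org/maven2/javax/activation/javax.activation-api/1.2.0/javax.activation-api-1.2.0.jar
Either way, confirm the jar landed with ls -l /usr/local/hadoop/lib/javax.activation-api-1.2.0.jar before moving on.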
Verify the Hadoop Version
To verify the installed Hadoop version, run the following command:
hadoop version
The output should display the installed Hadoop version.
Configuring Hadoop Files
To configure Hadoop, we need to modify several XML files. Follow the steps below to configure each file:
Configure core-site.xml File
Open the core-site.xml file using a text editor:
vi $HADOOP_HOME/etc/hadoop/core-site.xml
Add the following configuration inside the <configuration> tags:
<property>
<name>fs.default.name</name>
<value>hdfs://0.0.0.0:9000</value>
<description>The default file system URI</description>
</property>
Save the changes and exit the editor. Note that fs.default.name is the deprecated alias for fs.defaultFS; Hadoop 3 accepts both, but new configurations normally use fs.defaultFS.
Configure hdfs-site.xml File
Create a directory to store the node metadata:
mkdir -p /home/hadoop/hdfs/{namenode,datanode}
chown -R hadoop:hadoop /home/hadoop/hdfs
Open the hdfs-site.xml file:
vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Add the following configuration inside the <configuration> tags:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hdfs/datanode</value>
</property>
Save the changes and exit the editor. As with core-site.xml, dfs.name.dir and dfs.data.dir are deprecated aliases (for dfs.namenode.name.dir and dfs.datanode.data.dir); both forms still work in Hadoop 3.
Configure mapred-site.xml File
Open the mapred-site.xml file:
vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
Add the following configuration inside the <configuration> tags:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
Save the changes and exit the editor.
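On some Hadoop 3 setups, MapReduce jobs submitted to YARN fail with class-not-found errors unless the MapReduce classpath is also configured. If you hit that, the Apache single-node guide suggests a property along these lines inside the same <configuration> tags (an optional addition, not always required):
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>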
Configure yarn-site.xml File
Open the yarn-site.xml file:
vi $HADOOP_HOME/etc/hadoop/yarn-site.xml
Add the following configuration inside the <configuration> tags:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
Save the changes and exit the editor.
Formatting HDFS NameNode
Before starting the Hadoop services for the first time, it is important to format the NameNode. Run the following command:
hdfs namenode -format
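If the format succeeds, the NameNode metadata directory configured in hdfs-site.xml is populated; a quick way to confirm (assuming the paths used above):
ls /home/hadoop/hdfs/namenode/current
The listing should include a VERSION file and an initial fsimage.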
Starting the Hadoop Cluster
To start the Hadoop cluster, follow the steps below:
Start the NameNode and DataNode
Switch to the Hadoop user and start the NameNode and DataNode services by running the following command:
start-dfs.sh
Start the YARN Resource Manager and NodeManagers
Start the YARN ResourceManager and NodeManager services by running the following command:
start-yarn.sh
Verifying the Hadoop Cluster
To verify if all the Hadoop daemons are active and running as Java processes, run the following command:
jps
The output should list the running Java processes, including NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager.
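With all daemons up, you can run a quick smoke test using the example jobs bundled with the release (a sketch, assuming the 3.3.0 tarball layout installed above):
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar pi 2 10
The job estimates pi with 2 map tasks of 10 samples each; if it completes and prints an estimate, HDFS and YARN are working together.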
Accessing the Hadoop Web Interface
To access the Hadoop web interface, follow the steps below:
Hadoop NameNode
Open a web browser and navigate to the following URL:
http://your-server-ip:9870
Replace your-server-ip with the IP address of your server.
Individual DataNodes
To access individual DataNodes, navigate to the following URL:
http://your-server-ip:9864
Replace your-server-ip with the IP address of your server.
YARN Resource Manager
To access the YARN resource manager, navigate to the following URL:
http://your-server-ip:8088
Replace your-server-ip with the IP address of your server.
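When you are finished, the cluster can be shut down with the matching stop scripts, run as the hadoop user:
stop-yarn.sh
stop-dfs.sh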
Congratulations! You have successfully installed and configured Apache Hadoop on Debian 11. You can now leverage the power of Hadoop to process and analyze large datasets. If you are looking for reliable and scalable cloud hosting solutions to run your Hadoop clusters, consider Shape.host’s Linux SSD VPS services. Shape.host offers high-performance VPS hosting with excellent support and competitive pricing. Start unlocking the potential of big data with Apache Hadoop and Shape.host today!