Apache Hadoop is an open-source, Java-based software platform that manages data processing and storage for big data applications. It is widely used by businesses to handle large datasets and perform complex analytics tasks. In this article, we will guide you through the process of installing and configuring Apache Hadoop on Debian 11, step by step.
System Update and Java Installation
Before we begin with the installation of Apache Hadoop, it is essential to update the system packages to the latest version. To do so, open the terminal and run the following command:
sudo apt-get update -y
Once the system packages are updated, we need to install Java, as Apache Hadoop is a Java-based application. Run the following command to install the default JDK and JRE:
sudo apt-get install default-jdk default-jre -y
To verify the Java installation, run the following command:
java -version
The output should display the installed Java version.
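Hadoop needs a full JDK rather than just a JRE, so it is worth confirming the compiler is also available (a quick check):
javac -version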
Creating a Hadoop User and Setting up Passwordless SSH
To ensure secure access and management of Apache Hadoop, it is recommended to create a dedicated user. Run the following command to create a Hadoop user:
sudo adduser hadoop
After creating the user, switch to the Hadoop user by running the following command:
su - hadoop
Next, generate an SSH key for the Hadoop user by running the following command:
ssh-keygen -t rsa
This command generates a public and private key pair. Press Enter to accept the default file location, and press Enter again to leave the passphrase empty so that SSH logins do not prompt for one. The key pair will be saved under ~/.ssh/.
To enable passwordless SSH access, append the Hadoop user's public key to its own authorized_keys file and lock down the file's permissions. Run the following commands:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
To verify the passwordless SSH connection, run the following command:
ssh [Server's_IP_Address]
Replace [Server's_IP_Address] with the IP address of the server you are connecting to. You should be logged in without being prompted for a password.
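The Hadoop start-up scripts also connect to localhost over SSH, so on a single-node setup it is worth confirming that connection works too (type exit to return to your original shell):
ssh localhost
exit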
Installing Apache Hadoop
Switch back to the Hadoop user and download Apache Hadoop 3.3.0 (check the Apache downloads page for the current release; older versions move to archive.apache.org) using the following commands:
su - hadoop
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
Once the download is complete, extract the downloaded tar file using the following command:
tar -xvzf hadoop-3.3.0.tar.gz
Now, switch back to the root user for the remaining commands:
su root
Move the extracted Hadoop files into place by running the following commands:
cd /home/hadoop
mv hadoop-3.3.0 /usr/local/hadoop
Create a log directory to store Apache Hadoop logs:
mkdir /usr/local/hadoop/logs
Change the ownership of the /usr/local/hadoop directory to the Hadoop user:
chown -R hadoop:hadoop /usr/local/hadoop
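To confirm the ownership change took effect, you can list the directory (a quick check):
ls -ld /usr/local/hadoop
The listing should show hadoop as both the owner and the group.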
Switch back to the Hadoop user:
su - hadoop
Configuring Hadoop Environment Variables
To configure the Hadoop environment variables, open the .bashrc file in a text editor:
nano ~/.bashrc
Add the following configuration at the end of the file:
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Save the changes and exit the editor.
To activate the added environment variables, run the following command:
source ~/.bashrc
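To confirm the variables are active in the current shell, a quick sanity check:
echo $HADOOP_HOME
which hadoop
The first command should print /usr/local/hadoop and the second /usr/local/hadoop/bin/hadoop.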
Configuring Hadoop on a Single Node
If you are new to Hadoop and want to explore basic commands or test applications, you can configure Hadoop on a single node. To do so, follow the steps below:
Configure Java Environment Variables
Determine the path of the installed Java version by running the following command:
which javac
The output should display the path to the Java compiler.
Next, resolve the full path of the javac binary, which lives inside the OpenJDK directory, by running the following command:
readlink -f /usr/bin/javac
Make a note of the output; the value we need in the next step is this path without the trailing /bin/javac.
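If you prefer to derive the directory automatically, a one-liner along these lines (a sketch, assuming javac resolves to an OpenJDK installation) prints the value to use:
readlink -f /usr/bin/javac | sed 's:/bin/javac::'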
Edit the hadoop-env.sh file by running the following command:
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Add the following configuration at the end of the file:
export JAVA_HOME=[Java_Path]
export HADOOP_CLASSPATH+=" $HADOOP_HOME/lib/*.jar"
Replace [Java_Path] with the OpenJDK directory from the previous step (the readlink output without the trailing /bin/javac), for example /usr/lib/jvm/java-11-openjdk-amd64 on Debian 11 amd64.
Save the changes and exit the editor.
Download the Javax Activation File
To download the Javax Activation file into Hadoop's lib directory, run the following commands:
cd /usr/local/hadoop/lib
sudo wget https://jcenter.bintray.com/javax/activation/javax.activation-api/1.2.0/javax.activation-api-1.2.0.jar
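Note that the JCenter repository was sunset in 2021, so the URL above may no longer resolve. The same artifact is published on Maven Central (same version as above):
sudo wget https://repo1.maven.org/maven2/javax/activation/javax.activation-api/1.2.0/javax.activation-api-1.2.0.jar
Either way, confirm the jar landed with ls -l /usr/local/hadoop/lib/javax.activation-api-1.2.0.jar before moving on.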
Verify the Hadoop Version
To verify the installed Hadoop version, run the following command:
hadoop version
The output should display the installed Hadoop version.
Configuring Hadoop Files
To configure Hadoop, we need to modify several XML files. Follow the steps below to configure each file:
Configure core-site.xml File
Open the core-site.xml file using a text editor:
vi $HADOOP_HOME/etc/hadoop/core-site.xml
Add the following configuration inside the <configuration> tags:
<property>
<name>fs.default.name</name>
<value>hdfs://0.0.0.0:9000</value>
<description>The default file system URI</description>
</property>
Save the changes and exit the editor. Note that fs.default.name is the deprecated alias for fs.defaultFS; Hadoop 3 accepts both, but new configurations normally use fs.defaultFS.
Configure hdfs-site.xml File
Create a directory to store the node metadata:
mkdir -p /home/hadoop/hdfs/{namenode,datanode}
chown -R hadoop:hadoop /home/hadoop/hdfs
Open the hdfs-site.xml file:
vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Add the following configuration inside the <configuration> tags:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hdfs/datanode</value>
</property>
Save the changes and exit the editor. As with core-site.xml, dfs.name.dir and dfs.data.dir are deprecated aliases (for dfs.namenode.name.dir and dfs.datanode.data.dir); both forms still work in Hadoop 3.
Configure mapred-site.xml File
Open the mapred-site.xml file:
vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
Add the following configuration inside the <configuration> tags:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
Save the changes and exit the editor.
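On some Hadoop 3 setups, MapReduce jobs submitted to YARN fail with class-not-found errors unless the MapReduce classpath is also configured. If you hit that, the Apache single-node guide suggests a property along these lines inside the same <configuration> tags (an optional addition, not always required):
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>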
Configure yarn-site.xml File
Open the yarn-site.xml file:
vi $HADOOP_HOME/etc/hadoop/yarn-site.xml
Add the following configuration inside the <configuration> tags:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
Save the changes and exit the editor.
Formatting HDFS NameNode
Before starting the Hadoop services for the first time, it is important to format the NameNode. Run the following command:
hdfs namenode -format
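If the format succeeds, the NameNode metadata directory configured in hdfs-site.xml is populated; a quick way to confirm (assuming the paths used above):
ls /home/hadoop/hdfs/namenode/current
The listing should include a VERSION file and an initial fsimage.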
Starting the Hadoop Cluster
To start the Hadoop cluster, follow the steps below:
Start the NameNode and DataNode
Switch to the Hadoop user and start the NameNode and DataNode services by running the following command:
start-dfs.sh
Start the YARN Resource Manager and NodeManagers
Start the YARN ResourceManager and NodeManager services by running the following command:
start-yarn.sh
Verifying the Hadoop Cluster
To verify if all the Hadoop daemons are active and running as Java processes, run the following command:
jps
The output should list the running Java processes, including NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager.
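With all daemons up, you can run a quick smoke test using the example jobs bundled with the release (a sketch, assuming the 3.3.0 tarball layout installed above):
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar pi 2 10
The job estimates pi with 2 map tasks of 10 samples each; if it completes and prints an estimate, HDFS and YARN are working together.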
Accessing the Hadoop Web Interface
To access the Hadoop web interface, follow the steps below:
Hadoop NameNode
Open a web browser and navigate to the following URL:
http://your-server-ip:9870
Replace your-server-ip with the IP address of your server.
Individual DataNodes
To access individual DataNodes, navigate to the following URL:
http://your-server-ip:9864
Replace your-server-ip with the IP address of your server.
YARN Resource Manager
To access the YARN resource manager, navigate to the following URL:
http://your-server-ip:8088
Replace your-server-ip with the IP address of your server.
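When you are finished, the cluster can be shut down with the matching stop scripts, run as the hadoop user:
stop-yarn.sh
stop-dfs.sh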
Congratulations! You have successfully installed and configured Apache Hadoop on Debian 11. You can now leverage the power of Hadoop to process and analyze large datasets. If you are looking for reliable and scalable cloud hosting solutions to run your Hadoop clusters, consider Shape.host’s Linux SSD VPS services. Shape.host offers high-performance VPS hosting with excellent support and competitive pricing. Start unlocking the potential of big data with Apache Hadoop and Shape.host today!