
Installing a single-node cluster

/!\ tested on CentOS 6.6 /!\

  • Updating packages and installing prerequisites
yum -y update
yum -y install wget openssh openssh-clients nmap java-1.7.0-openjdk-devel.x86_64 java-1.7.0-openjdk.x86_64
  • Configuring /etc/hosts:

FQDNs are used. The file below already lists the other nodes that will be added later.

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.2.5    hadoop-namenode.localdomain hadoop-namenode
192.168.2.11    hadoop-datanode1.localdomain hadoop-datanode1
192.168.2.12    hadoop-datanode2.localdomain hadoop-datanode2
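A quick way to confirm that each entry resolves through /etc/hosts is `getent hosts`. The check below only uses `localhost` so it works anywhere; on the cluster, repeat it with the FQDNs declared above.

```shell
# Verify name resolution goes through /etc/hosts as expected
getent hosts localhost
# On the cluster nodes, also check the entries added above:
#   getent hosts hadoop-namenode.localdomain
#   getent hosts hadoop-datanode1.localdomain
```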
  • Disabling SELinux
sed -i "s%enforcing%disabled%g" /etc/selinux/config
reboot
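To validate the sed expression before touching the real file, it can be dry-run on a throwaway copy (the `SELINUX=enforcing` line below mimics the relevant line of /etc/selinux/config):

```shell
# Dry-run of the substitution on a copy of the file
printf 'SELINUX=enforcing\n' > /tmp/selinux.conf
sed -i 's%enforcing%disabled%g' /tmp/selinux.conf
cat /tmp/selinux.conf
# After the reboot, `getenforce` should report "Disabled"
```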
  • Creating the hadoop user and group
groupadd hadoop
useradd -g hadoop hadoop
passwd hadoop

cat >> /home/hadoop/.bashrc << "EOF"

export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64
PATH=$PATH:$JAVA_HOME/bin

export HADOOP_INSTALL=/opt/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export MAPRED_HOME=$HADOOP_INSTALL

export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin

EOF
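The exports can be sanity-checked without logging in as hadoop, by recreating the relevant lines in a temporary file and sourcing it in a subshell (the paths are the ones set in .bashrc above):

```shell
# Recreate the export block in a temp file and check variable expansion
cat > /tmp/hadoop-env << "EOF"
export HADOOP_INSTALL=/opt/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
EOF
( . /tmp/hadoop-env; echo "$HADOOP_MAPRED_HOME" )
```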
  • Downloading and extracting Hadoop
cd /opt/
wget http://wwwftp.ciril.fr/pub/apache/hadoop/common/hadoop-2.5.2/hadoop-2.5.2.tar.gz
tar vzxf hadoop-2.5.2.tar.gz
ln -s hadoop-2.5.2 hadoop
  • Configuring Hadoop
mv /opt/hadoop/etc/hadoop/core-site.xml /opt/hadoop/etc/hadoop/core-site.xml.bak

cat > /opt/hadoop/etc/hadoop/core-site.xml << "EOF"
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/HDFS/tmp</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
EOF
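Note: `fs.default.name` is the legacy key; Hadoop 2.x prefers `fs.defaultFS` (the old key still works but logs a deprecation warning at startup). An equivalent fragment with the current key:

```xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
```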
mv /opt/hadoop/etc/hadoop/mapred-site.xml /opt/hadoop/etc/hadoop/mapred-site.xml.bak

cat > /opt/hadoop/etc/hadoop/mapred-site.xml << "EOF"
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.jobtracker.address</name>
<value>local</value>
</property>
</configuration>
EOF
mv /opt/hadoop/etc/hadoop/yarn-site.xml /opt/hadoop/etc/hadoop/yarn-site.xml.bak
cat > /opt/hadoop/etc/hadoop/yarn-site.xml << "EOF"
<?xml version="1.0"?>
<configuration>
</configuration>
EOF
mv /opt/hadoop/etc/hadoop/hdfs-site.xml /opt/hadoop/etc/hadoop/hdfs-site.xml.bak

cat > /opt/hadoop/etc/hadoop/hdfs-site.xml << "EOF"
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/opt/HDFS/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/opt/HDFS/datanode</value>
</property>
</configuration>
EOF
  • Creating the directories and setting permissions
mkdir /opt/HDFS
mkdir -p /opt/HDFS/namenode
mkdir -p /opt/HDFS/datanode
chown -R hadoop:hadoop /opt/HDFS
chown -R hadoop:hadoop /opt/hadoop
chown -R hadoop:hadoop /opt/hadoop-2.5.2
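One step is easy to forget at this point: before the very first `start-dfs.sh`, the NameNode metadata directory created above must be formatted once (re-running the format later wipes HDFS metadata). The live command is shown as a comment; the runnable part only checks the directory:

```shell
# One-time initialization, as the hadoop user, before the first start:
#   su - hadoop -c "hdfs namenode -format"
# Afterwards /opt/HDFS/namenode/current/VERSION holds the new clusterID.
# Check that the tree exists and belongs to hadoop:
ls -ld /opt/HDFS 2>/dev/null || echo "/opt/HDFS not present on this machine"
```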
  • Creating a start/stop script (used as: /etc/init.d/starthadoop start|stop)
cat > /etc/init.d/starthadoop << "EOF"
#!/bin/bash

. /etc/init.d/functions

RETVAL=0

case "$1" in
start)
echo $"Starting Hadoop server"
/bin/su - hadoop -c start-dfs.sh
/bin/su - hadoop -c start-yarn.sh
;;

stop)
echo $"Stopping Hadoop server"
/bin/su - hadoop -c stop-dfs.sh
/bin/su - hadoop -c stop-yarn.sh
;;

*)
echo $"Usage: $0 {start|stop}"
exit 1
;;
esac

exit $RETVAL
EOF

chmod u+x /etc/init.d/starthadoop

Adding a datanode to an existing cluster

  • Exchange the hadoop users' SSH keys between all nodes
  • Install Java and Hadoop on every node
  • Copy /etc/hosts to every node
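The key exchange can be sketched as below. The demo generates a throwaway key under /tmp so it is harmless to run anywhere; on the real cluster, generate into ~/.ssh as the hadoop user and push the key with `ssh-copy-id` (hostnames are the ones declared in /etc/hosts):

```shell
# Generate a passwordless key pair (demo path, throwaway)
rm -f /tmp/demo_id_rsa /tmp/demo_id_rsa.pub
ssh-keygen -t rsa -N "" -q -f /tmp/demo_id_rsa
ls /tmp/demo_id_rsa.pub
# On the real cluster, as the hadoop user, generate into ~/.ssh and then:
#   ssh-copy-id hadoop@hadoop-datanode1.localdomain
#   ssh-copy-id hadoop@hadoop-datanode2.localdomain
```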

:!: Every change to the files below must be replicated on every node :!:

  • Files /opt/hadoop/etc/hadoop/masters and /opt/hadoop/etc/hadoop/slaves:
[hadoop@hadoop-namenode hadoop]$ cat masters
hadoop-namenode.localdomain

[hadoop@hadoop-namenode hadoop]$ cat slaves
hadoop-namenode.localdomain
hadoop-datanode1.localdomain
  • File /opt/hadoop/etc/hadoop/core-site.xml
[hadoop@hadoop-namenode hadoop]$ cat core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/HDFS/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-namenode.localdomain:9000</value>
  </property>
</configuration>
  • File /opt/hadoop/etc/hadoop/hdfs-site.xml
[hadoop@hadoop-namenode hadoop]$ cat hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/opt/HDFS/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/opt/HDFS/datanode</value>
  </property>
</configuration>
  • File /opt/hadoop/etc/hadoop/mapred-site.xml
[hadoop@hadoop-namenode hadoop]$ cat mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>hadoop-namenode:54311</value>
  </property>
</configuration>

Basic operations

[hadoop@hadoop-namenode /]$ start-dfs.sh
14/11/27 13:51:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [hadoop-namenode.localdomain]
hadoop-namenode.localdomain: starting namenode, logging to /opt/hadoop-2.5.2/logs/hadoop-hadoop-namenode-hadoop-namenode.localdomain.out
hadoop-namenode.localdomain: starting datanode, logging to /opt/hadoop-2.5.2/logs/hadoop-hadoop-datanode-hadoop-namenode.localdomain.out
hadoop-datanode1.localdomain: starting datanode, logging to /opt/hadoop-2.5.2/logs/hadoop-hadoop-datanode-hadoop-datanode1.localdomain.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /opt/hadoop-2.5.2/logs/hadoop-hadoop-secondarynamenode-hadoop-namenode.localdomain.out
14/11/27 13:51:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

note: the message "Unable to load native-hadoop library for your platform… using builtin-java classes where applicable" is not blocking. It only means Hadoop's native libraries could not be loaded (64-bit machine vs the 32-bit libraries shipped in the tarball).

[hadoop@hadoop-namenode /]$ jps
7720 DataNode
7898 SecondaryNameNode
7626 NameNode
6521 NodeManager
8006 Jps
6428 ResourceManager
[hadoop@hadoop-namenode /]$ ssh hadoop-datanode1 jps
11511 DataNode
11580 Jps
[hadoop@hadoop-namenode /]$ hdfs dfsadmin -report
Configured Capacity: 6438158336 (6.00 GB)
Present Capacity: 3252350976 (3.03 GB)
DFS Remaining: 3252289536 (3.03 GB)
DFS Used: 61440 (60 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Live datanodes (2):

Name: 192.168.2.11:50010 (hadoop-datanode1.localdomain)
Hostname: hadoop-datanode1.localdomain
Decommission Status : Normal
Configured Capacity: 3219079168 (3.00 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 1588588544 (1.48 GB)
DFS Remaining: 1630461952 (1.52 GB)
DFS Used%: 0.00%
DFS Remaining%: 50.65%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Thu Nov 27 13:54:47 CET 2014


Name: 192.168.2.5:50010 (hadoop-namenode.localdomain)
Hostname: hadoop-namenode.localdomain
Decommission Status : Normal
Configured Capacity: 3219079168 (3.00 GB)
DFS Used: 32768 (32 KB)
Non DFS Used: 1597218816 (1.49 GB)
DFS Remaining: 1621827584 (1.51 GB)
DFS Used%: 0.00%
DFS Remaining%: 50.38%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Thu Nov 27 13:54:47 CET 2014
hadoop fs -ls /
hadoop fs -mkdir /user
hadoop fs -mkdir /user/hduser
hadoop fs -chown hduser /user/hduser
hadoop fs -chown :hadoop /user/hduser
hadoop fs -ls /user
  • Stopping the hadoop-datanode1 datanode
ssh hadoop-datanode1 /opt/hadoop/sbin/hadoop-daemon.sh --config /opt/hadoop/etc/hadoop/ stop datanode
  • In /opt/hadoop/etc/hadoop/hdfs-site.xml, add:
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop-datanode2.localdomain:50090</value>
</property>
ssh hadoop-datanode1 /opt/hadoop/sbin/hadoop-daemon.sh --config /opt/hadoop/etc/hadoop/ stop datanode
ssh hadoop-datanode1 rm -rf /opt/HDFS
ssh hadoop-datanode1 /opt/hadoop/sbin/hadoop-daemon.sh --config /opt/hadoop/etc/hadoop/ start datanode

Running MapReduce jobs

[hadoop@hadoop-namenode ~]$ hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar pi 10 100
Number of Maps  = 10
Samples per Map = 100
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
14/11/27 16:23:42 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
14/11/27 16:23:42 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/11/27 16:23:42 INFO input.FileInputFormat: Total input paths to process : 10
14/11/27 16:23:42 INFO mapreduce.JobSubmitter: number of splits:10
14/11/27 16:23:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local930526865_0001
...
...
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=1180
        File Output Format Counters
                Bytes Written=97
Job Finished in 2.929 seconds
Estimated value of Pi is 3.14800000000000000000
[hadoop@hadoop-namenode ~]$ hadoop fs -mkdir /user/hadoop/books
[hadoop@hadoop-namenode ~]$ hadoop fs -put *.txt /user/hadoop/books

[hadoop@hadoop-namenode ~]$ hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar wordcount books output
14/11/27 16:27:16 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
14/11/27 16:27:16 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/11/27 16:27:16 INFO input.FileInputFormat: Total input paths to process : 3
14/11/27 16:27:16 INFO mapreduce.JobSubmitter: number of splits:3
14/11/27 16:27:16 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1050799219_0001
...
...
        File Input Format Counters
                Bytes Read=2856705
        File Output Format Counters
                Bytes Written=525222
  • Viewing the result
[hadoop@hadoop-namenode ~]$  hadoop fs -cat /user/hadoop/output/part-r-00000 |egrep  -w "^the|^house"
house   134
house!  1
house,  73
house,' 2
house--which    1
house-door      2
house-door,     3
house-door.     1
house-door.'    1
house-door;     1
house-keeping,  1
house-keeping.  1
house-top,      1
house.  38
house." 1
house.' 3
house:  1
house;  11
house?" 2
house?' 1
the     34498
the)    1
the---- 1
the.    2
the...  2
the.... 1
the.....        2
the..............       1
the:    1
the]    5
  • Given the sudoku below:
[hadoop@hadoop-namenode ~]$ cat puzzle1.dta
8 5 ? 3 9 ? ? ? ?
? ? 2 ? ? ? ? ? ?
? ? 6 ? 1 ? ? ? 2
? ? 4 ? ? 3 ? 5 9
? ? 8 9 ? 1 4 ? ?
3 2 ? 4 ? ? 8 ? ?
9 ? ? ? 8 ? 5 ? ?
? ? ? ? ? ? 2 ? ?
? ? ? ? 4 5 ? 7 8
[hadoop@hadoop-namenode ~]$ hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar sudoku puzzle1.dta
Solving puzzle1.dta
8 5 1 3 9 2 6 4 7
4 3 2 6 7 8 1 9 5
7 9 6 5 1 4 3 8 2
6 1 4 8 2 3 7 5 9
5 7 8 9 6 1 4 2 3
3 2 9 4 5 7 8 1 6
9 4 7 2 8 6 5 3 1
1 8 5 7 3 9 2 6 4
2 6 3 1 4 5 9 7 8

Found 1 solutions

Troubleshooting

Typical case: a datanode refuses to start after the NameNode was reformatted ("Incompatible clusterIDs" error). Read the namenode's clusterID:

cat /tmp/hadoop-hdfs/dfs/name/current/VERSION

⇒ Copy that clusterID into the datanode's data/VERSION file ⇒ Restart the datanode and the namenode
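The fix can be reproduced on throwaway files, as below; in this setup the real paths would be /opt/HDFS/namenode/current/VERSION and /opt/HDFS/datanode/current/VERSION (assumption based on the hdfs-site.xml values used above):

```shell
# Simulate a namenode/datanode VERSION pair with mismatched clusterIDs
printf 'clusterID=CID-aaa\n' > /tmp/nn_VERSION
printf 'clusterID=CID-bbb\n' > /tmp/dn_VERSION
# Copy the namenode's clusterID into the datanode VERSION file
cid=$(sed -n 's/^clusterID=//p' /tmp/nn_VERSION)
sed -i "s/^clusterID=.*/clusterID=$cid/" /tmp/dn_VERSION
cat /tmp/dn_VERSION
```

Wiping the datanode's data directory (as done in the decommission steps above) achieves the same result, at the cost of re-replicating all blocks.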
