{{ :informatique:nix:bigdata:hadoop_elephant.png?nolink&300 |}}
====== Installing a single-node cluster ======
/!\ Tested on CentOS 6.6 /!\
* Update the packages and install the prerequisites
yum -y update
yum -y install wget openssh openssh-clients nmap java-1.7.0-openjdk-devel.x86_64 java-1.7.0-openjdk.x86_64
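A quick sanity check that the JDK landed on the PATH (the exact build string will differ):
java -version
which java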
* Configure the **/etc/hosts** file:
We use FQDNs. The file below already lists the other nodes that will be added later.
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.2.5 hadoop-namenode.localdomain hadoop-namenode
192.168.2.11 hadoop-datanode1.localdomain hadoop-datanode1
192.168.2.12 hadoop-datanode2.localdomain hadoop-datanode2
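Name resolution can be checked right away, for example:
getent hosts hadoop-namenode.localdomain
ping -c 1 hadoop-datanode1.localdomain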
* Disable SELinux
sed -i "s%enforcing%disabled%g" /etc/selinux/config
reboot
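After the reboot, SELinux should report itself disabled:
getenforce   # expected output: Disabled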
* Create the hadoop user and group
groupadd hadoop
useradd -g hadoop hadoop
passwd hadoop
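The start-dfs.sh / start-yarn.sh scripts connect over SSH, even on a single node, so the **hadoop** user needs a passwordless key. A minimal setup, run as **hadoop**:
su - hadoop
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
ssh localhost true   # accept the host key once
exit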
cat >> /home/hadoop/.bashrc << "EOF"
export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64
PATH=$PATH:$JAVA_HOME/bin
export HADOOP_INSTALL=/opt/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export MAPRED_HOME=$HADOOP_INSTALL
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
EOF
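The new environment only takes effect in a fresh login shell; it can be verified with:
su - hadoop
echo $HADOOP_INSTALL   # should print /opt/hadoop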
* Download and unpack Hadoop
cd /opt/
wget http://wwwftp.ciril.fr/pub/apache/hadoop/common/hadoop-2.5.2/hadoop-2.5.2.tar.gz
tar vzxf hadoop-2.5.2.tar.gz
ln -s hadoop-2.5.2 hadoop
* Configure Hadoop
mv /opt/hadoop/etc/hadoop/core-site.xml /opt/hadoop/etc/hadoop/core-site.xml.bak
cat > /opt/hadoop/etc/hadoop/core-site.xml << "EOF"
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/HDFS/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
mv /opt/hadoop/etc/hadoop/mapred-site.xml /opt/hadoop/etc/hadoop/mapred-site.xml.bak
cat > /opt/hadoop/etc/hadoop/mapred-site.xml << "EOF"
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>local</value>
  </property>
</configuration>
EOF
mv /opt/hadoop/etc/hadoop/yarn-site.xml /opt/hadoop/etc/hadoop/yarn-site.xml.bak
cat > /opt/hadoop/etc/hadoop/yarn-site.xml << "EOF"
<?xml version="1.0"?>
<configuration>
</configuration>
EOF
mv /opt/hadoop/etc/hadoop/hdfs-site.xml /opt/hadoop/etc/hadoop/hdfs-site.xml.bak
cat > /opt/hadoop/etc/hadoop/hdfs-site.xml << "EOF"
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/opt/HDFS/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/opt/HDFS/datanode</value>
  </property>
</configuration>
EOF
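Depending on how the daemons are launched, they may not inherit JAVA_HOME from ~/.bashrc; if the startup scripts complain about it, the path can also be pinned in hadoop-env.sh (same JDK path as above):
echo 'export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64' >> /opt/hadoop/etc/hadoop/hadoop-env.sh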
* Create the directories and set ownership
mkdir -p /opt/HDFS/tmp /opt/HDFS/namenode /opt/HDFS/datanode
chown -R hadoop:hadoop /opt/HDFS
chown -R hadoop:hadoop /opt/hadoop
chown -R hadoop:hadoop /opt/hadoop-2.5.2
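Before the very first start, the namenode must be formatted (a one-time operation; it wipes any existing HDFS metadata):
su - hadoop -c "hdfs namenode -format"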
* Create a startup script
cat > /etc/init.d/starthadoop << "EOF"
#!/bin/bash
. /etc/init.d/functions
RETVAL=$?
case "$1" in
start)
echo $"Starting Hadoop server"
/bin/su - hadoop -c start-dfs.sh
/bin/su - hadoop -c start-yarn.sh
;;
stop)
echo $"Stopping Hadoop server"
/bin/su - hadoop -c stop-dfs.sh
/bin/su - hadoop -c stop-yarn.sh
;;
*)
echo $"Usage: $0 {start|stop}"
exit 1
;;
esac
exit $RETVAL
EOF
chmod u+x /etc/init.d/starthadoop
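The script is then used like any other init script:
/etc/init.d/starthadoop start
/etc/init.d/starthadoop stop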
====== Adding a datanode to an existing cluster ======
==== Prerequisites ====
* Exchange the SSH keys of the **hadoop** users between all nodes
* Install Java and Hadoop on every node
* Copy the /etc/hosts file to every node
:!: Every change to the files below must be replicated on every node :!:
==== Configuration ====
* **/opt/hadoop/etc/hadoop/masters** and **/opt/hadoop/etc/hadoop/slaves** files:
[hadoop@hadoop-namenode hadoop]$ cat masters
hadoop-namenode.localdomain
[hadoop@hadoop-namenode hadoop]$ cat slaves
hadoop-namenode.localdomain
hadoop-datanode1.localdomain
* **/opt/hadoop/etc/hadoop/core-site.xml** file:
[hadoop@hadoop-namenode hadoop]$ cat core-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/HDFS/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-namenode.localdomain:9000</value>
  </property>
</configuration>
* **/opt/hadoop/etc/hadoop/hdfs-site.xml** file:
[hadoop@hadoop-namenode hadoop]$ cat hdfs-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/opt/HDFS/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/opt/HDFS/datanode</value>
  </property>
</configuration>
* **/opt/hadoop/etc/hadoop/mapred-site.xml** file:
[hadoop@hadoop-namenode hadoop]$ cat mapred-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>hadoop-namenode:54311</value>
  </property>
</configuration>
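One way to push the modified files to the other nodes (a sketch; adjust the host list to your cluster):
for h in hadoop-datanode1 hadoop-datanode2; do
  scp /opt/hadoop/etc/hadoop/{masters,slaves,core-site.xml,hdfs-site.xml,mapred-site.xml} hadoop@$h:/opt/hadoop/etc/hadoop/
done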
====== Basic operations ======
==== Startup and checks ====
[hadoop@hadoop-namenode /]$ start-dfs.sh
14/11/27 13:51:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [hadoop-namenode.localdomain]
hadoop-namenode.localdomain: starting namenode, logging to /opt/hadoop-2.5.2/logs/hadoop-hadoop-namenode-hadoop-namenode.localdomain.out
hadoop-namenode.localdomain: starting datanode, logging to /opt/hadoop-2.5.2/logs/hadoop-hadoop-datanode-hadoop-namenode.localdomain.out
hadoop-datanode1.localdomain: starting datanode, logging to /opt/hadoop-2.5.2/logs/hadoop-hadoop-datanode-hadoop-datanode1.localdomain.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /opt/hadoop-2.5.2/logs/hadoop-hadoop-secondarynamenode-hadoop-namenode.localdomain.out
14/11/27 13:51:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
__note__: the message //Unable to load native-hadoop library for your platform... using builtin-java classes where applicable// is not blocking. It only means that Hadoop could not load native libraries for this platform (64-bit machine vs. bundled 32-bit libraries).
==== Checking the processes ====
[hadoop@hadoop-namenode /]$ jps
7720 DataNode
7898 SecondaryNameNode
7626 NameNode
6521 NodeManager
8006 Jps
6428 ResourceManager
[hadoop@hadoop-namenode /]$ ssh hadoop-datanode1 jps
11511 DataNode
11580 Jps
==== HDFS status ====
[hadoop@hadoop-namenode /]$ hdfs dfsadmin -report
Configured Capacity: 6438158336 (6.00 GB)
Present Capacity: 3252350976 (3.03 GB)
DFS Remaining: 3252289536 (3.03 GB)
DFS Used: 61440 (60 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Live datanodes (2):
Name: 192.168.2.11:50010 (hadoop-datanode1.localdomain)
Hostname: hadoop-datanode1.localdomain
Decommission Status : Normal
Configured Capacity: 3219079168 (3.00 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 1588588544 (1.48 GB)
DFS Remaining: 1630461952 (1.52 GB)
DFS Used%: 0.00%
DFS Remaining%: 50.65%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Thu Nov 27 13:54:47 CET 2014
Name: 192.168.2.5:50010 (hadoop-namenode.localdomain)
Hostname: hadoop-namenode.localdomain
Decommission Status : Normal
Configured Capacity: 3219079168 (3.00 GB)
DFS Used: 32768 (32 KB)
Non DFS Used: 1597218816 (1.49 GB)
DFS Remaining: 1621827584 (1.51 GB)
DFS Used%: 0.00%
DFS Remaining%: 50.38%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Thu Nov 27 13:54:47 CET 2014
==== File and directory manipulation ====
hadoop fs -ls /
hadoop fs -mkdir /user
hadoop fs -mkdir /user/hduser
hadoop fs -chown hduser /user/hduser
hadoop fs -chown :hadoop /user/hduser
hadoop fs -ls /user
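A few more everyday commands (the file names here are just examples):
hadoop fs -put report.txt /user/hduser/    # local FS -> HDFS
hadoop fs -get /user/hduser/report.txt .   # HDFS -> local FS
hadoop fs -cat /user/hduser/report.txt     # print a file stored in HDFS
hadoop fs -rm /user/hduser/report.txt      # delete a file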
==== Stopping a particular node ====
* Stop the **hadoop-datanode1** datanode
ssh hadoop-datanode1 /opt/hadoop/sbin/hadoop-daemon.sh --config /opt/hadoop/etc/hadoop/ stop datanode
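To bring it back up:
ssh hadoop-datanode1 /opt/hadoop/sbin/hadoop-daemon.sh --config /opt/hadoop/etc/hadoop/ start datanode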
==== Running the secondary namenode on an arbitrary node ====
* In **/opt/hadoop/etc/hadoop/hdfs-site.xml**, add:
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>hadoop-datanode2.localdomain:50090</value>
</property>
==== Simulating a datanode crash ====
ssh hadoop-datanode1 /opt/hadoop/sbin/hadoop-daemon.sh --config /opt/hadoop/etc/hadoop/ stop datanode
ssh hadoop-datanode1 rm -rf /opt/HDFS
ssh hadoop-datanode1 /opt/hadoop/sbin/hadoop-daemon.sh --config /opt/hadoop/etc/hadoop/ start datanode
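The impact on the cluster can then be watched from the namenode:
hdfs dfsadmin -report   # live/dead datanodes, remaining capacity
hdfs fsck /             # block replication health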
====== Running MapReduce jobs ======
==== Computing pi ====
[hadoop@hadoop-namenode ~]$ hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar pi 10 100
Number of Maps = 10
Samples per Map = 100
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
14/11/27 16:23:42 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
14/11/27 16:23:42 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/11/27 16:23:42 INFO input.FileInputFormat: Total input paths to process : 10
14/11/27 16:23:42 INFO mapreduce.JobSubmitter: number of splits:10
14/11/27 16:23:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local930526865_0001
...
...
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1180
File Output Format Counters
Bytes Written=97
Job Finished in 2.929 seconds
Estimated value of Pi is 3.14800000000000000000
==== Counting word occurrences in books ====
* Download a few books in .txt format from http://www.gutenberg.org
[hadoop@hadoop-namenode ~]$ hadoop fs -mkdir -p /user/hadoop/books
[hadoop@hadoop-namenode ~]$ hadoop fs -put *.txt /user/hadoop/books
[hadoop@hadoop-namenode ~]$ hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar wordcount books output
14/11/27 16:27:16 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
14/11/27 16:27:16 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/11/27 16:27:16 INFO input.FileInputFormat: Total input paths to process : 3
14/11/27 16:27:16 INFO mapreduce.JobSubmitter: number of splits:3
14/11/27 16:27:16 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1050799219_0001
...
...
File Input Format Counters
Bytes Read=2856705
File Output Format Counters
Bytes Written=525222
* View the result
[hadoop@hadoop-namenode ~]$ hadoop fs -cat /user/hadoop/output/part-r-00000 |egrep -w "^the|^house"
house 134
house! 1
house, 73
house,' 2
house--which 1
house-door 2
house-door, 3
house-door. 1
house-door.' 1
house-door; 1
house-keeping, 1
house-keeping. 1
house-top, 1
house. 38
house." 1
house.' 3
house: 1
house; 11
house?" 2
house?' 1
the 34498
the) 1
the---- 1
the. 2
the... 2
the.... 1
the..... 2
the.............. 1
the: 1
the] 5
==== Sudoku ====
* Take the sudoku below:
[hadoop@hadoop-namenode ~]$ cat puzzle1.dta
8 5 ? 3 9 ? ? ? ?
? ? 2 ? ? ? ? ? ?
? ? 6 ? 1 ? ? ? 2
? ? 4 ? ? 3 ? 5 9
? ? 8 9 ? 1 4 ? ?
3 2 ? 4 ? ? 8 ? ?
9 ? ? ? 8 ? 5 ? ?
? ? ? ? ? ? 2 ? ?
? ? ? ? 4 5 ? 7 8
[hadoop@hadoop-namenode ~]$ hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar sudoku puzzle1.dta
Solving puzzle1.dta
8 5 1 3 9 2 6 4 7
4 3 2 6 7 8 1 9 5
7 9 6 5 1 4 3 8 2
6 1 4 8 2 3 7 5 9
5 7 8 9 6 1 4 2 3
3 2 9 4 5 7 8 1 6
9 4 7 2 8 6 5 3 1
1 8 5 7 3 9 2 6 4
2 6 3 1 4 5 9 7 8
Found 1 solutions
====== Troubleshooting ======
==== ClusterID mismatch for namenode and datanodes ====
cat /tmp/hadoop-hdfs/dfs/name/current/VERSION
=> Copy this clusterID into the datanode's //data/VERSION// file
=> Restart the datanode and the namenode
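With the layout used in this guide, the two IDs can be compared directly (paths assume the /opt/HDFS directories configured above):
grep clusterID /opt/HDFS/namenode/current/VERSION
ssh hadoop-datanode1 grep clusterID /opt/HDFS/datanode/current/VERSION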