
Installing a single-node cluster

/!\ tested on CentOS 6.6 /!\

yum -y update
yum -y install wget openssh openssh-clients nmap java-1.7.0-openjdk-devel.x86_64 java-1.7.0-openjdk.x86_64

We use FQDNs. The /etc/hosts file below also lists the other nodes that will be added later.

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.2.5    hadoop-namenode.localdomain hadoop-namenode
192.168.2.11    hadoop-datanode1.localdomain hadoop-datanode1
192.168.2.12    hadoop-datanode2.localdomain hadoop-datanode2
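
Each machine's hostname should also match its /etc/hosts entry so the FQDNs resolve consistently. A minimal sketch for the namenode (CentOS 6 convention; adjust the name on each node):

sed -i "s/^HOSTNAME=.*/HOSTNAME=hadoop-namenode.localdomain/" /etc/sysconfig/network
hostname hadoop-namenode.localdomain
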
sed -i "s%enforcing%disabled%g" /etc/selinux/config
reboot
groupadd hadoop
useradd -g hadoop hadoop
passwd hadoop
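
The start-dfs.sh/start-yarn.sh scripts reach every node, including localhost, over SSH, so the hadoop user needs passwordless SSH. A minimal sketch (run as hadoop; copy the key to each node added later):

su - hadoop
mkdir -p ~/.ssh && chmod 700 ~/.ssh
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
exit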

cat >> /home/hadoop/.bashrc << "EOF"

export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64
PATH=$PATH:$JAVA_HOME/bin

export HADOOP_INSTALL=/opt/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export MAPRED_HOME=$HADOOP_INSTALL

export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin

EOF
cd /opt/
wget http://wwwftp.ciril.fr/pub/apache/hadoop/common/hadoop-2.5.2/hadoop-2.5.2.tar.gz
tar vzxf hadoop-2.5.2.tar.gz
ln -s hadoop-2.5.2 hadoop
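
A quick sanity check that the archive and the hadoop user's environment line up:

su - hadoop -c "hadoop version"
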
mv /opt/hadoop/etc/hadoop/core-site.xml /opt/hadoop/etc/hadoop/core-site.xml.bak

cat > /opt/hadoop/etc/hadoop/core-site.xml << "EOF"
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/HDFS/tmp</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
EOF
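
Note: fs.default.name is deprecated in Hadoop 2.x in favor of fs.defaultFS, but 2.5.2 still honors both.
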
mv /opt/hadoop/etc/hadoop/mapred-site.xml /opt/hadoop/etc/hadoop/mapred-site.xml.bak

cat > /opt/hadoop/etc/hadoop/mapred-site.xml << "EOF"
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.jobtracker.address</name>
<value>local</value>
</property>
</configuration>
EOF
mv /opt/hadoop/etc/hadoop/yarn-site.xml /opt/hadoop/etc/hadoop/yarn-site.xml.bak
cat > /opt/hadoop/etc/hadoop/yarn-site.xml << "EOF"
<?xml version="1.0"?>
<configuration>
</configuration>
EOF
mv /opt/hadoop/etc/hadoop/hdfs-site.xml /opt/hadoop/etc/hadoop/hdfs-site.xml.bak

cat > /opt/hadoop/etc/hadoop/hdfs-site.xml << "EOF"
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/opt/HDFS/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/opt/HDFS/datanode</value>
</property>
</configuration>
EOF
mkdir -p /opt/HDFS/tmp
mkdir -p /opt/HDFS/namenode
mkdir -p /opt/HDFS/datanode
chown -R hadoop:hadoop /opt/HDFS
chown -R hadoop:hadoop /opt/hadoop
chown -R hadoop:hadoop /opt/hadoop-2.5.2
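
Before the first startup, format the HDFS namespace (run once, as the hadoop user):

su - hadoop -c "hdfs namenode -format"
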
cat > /etc/init.d/starthadoop << "EOF"
#!/bin/bash
# chkconfig: 2345 90 10
# description: Starts and stops the Hadoop HDFS and YARN daemons

. /etc/init.d/functions

RETVAL=0

case "$1" in
start)
echo $"Starting Hadoop server"
/bin/su - hadoop -c start-dfs.sh
/bin/su - hadoop -c start-yarn.sh
;;

stop)
echo $"Stopping Hadoop server"
/bin/su - hadoop -c stop-dfs.sh
/bin/su - hadoop -c stop-yarn.sh
;;

*)
echo $"Usage: $0 {start|stop}"
exit 1
;;
esac

exit $RETVAL
EOF

chmod u+x /etc/init.d/starthadoop
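
With the chkconfig header in the script, it can be registered to start at boot:

chkconfig --add starthadoop
chkconfig starthadoop on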

Adding a datanode to an existing cluster

Prerequisites

:!: All changes to the files below must be replicated on every node :!:

Configuration

[hadoop@hadoop-namenode hadoop]$ cat masters
hadoop-namenode.localdomain

[hadoop@hadoop-namenode hadoop]$ cat slaves
hadoop-namenode.localdomain
hadoop-datanode1.localdomain
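
Note: slaves tells start-dfs.sh on which hosts to launch datanodes. In Hadoop 2.x the masters file is no longer read by the start scripts; the secondary namenode host comes from dfs.namenode.secondary.http-address (hence "Starting secondary namenodes [0.0.0.0]" in the output below).
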
[hadoop@hadoop-namenode hadoop]$ cat core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/HDFS/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-namenode.localdomain:9000</value>
  </property>
</configuration>
[hadoop@hadoop-namenode hadoop]$ cat hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/opt/HDFS/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/opt/HDFS/datanode</value>
  </property>
</configuration>
[hadoop@hadoop-namenode hadoop]$ cat mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>hadoop-namenode:54311</value>
  </property>
</configuration>

Basic operations

Startup and checks

[hadoop@hadoop-namenode /]$ start-dfs.sh
14/11/27 13:51:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [hadoop-namenode.localdomain]
hadoop-namenode.localdomain: starting namenode, logging to /opt/hadoop-2.5.2/logs/hadoop-hadoop-namenode-hadoop-namenode.localdomain.out
hadoop-namenode.localdomain: starting datanode, logging to /opt/hadoop-2.5.2/logs/hadoop-hadoop-datanode-hadoop-namenode.localdomain.out
hadoop-datanode1.localdomain: starting datanode, logging to /opt/hadoop-2.5.2/logs/hadoop-hadoop-datanode-hadoop-datanode1.localdomain.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /opt/hadoop-2.5.2/logs/hadoop-hadoop-secondarynamenode-hadoop-namenode.localdomain.out
14/11/27 13:51:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Note: the message "Unable to load native-hadoop library for your platform… using builtin-java classes where applicable" is not blocking. It simply means Hadoop does not have the native libraries available (64-bit machine vs. 32-bit libraries).
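
To see which native libraries Hadoop actually manages to load:

hadoop checknative -a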

Checking the processes

[hadoop@hadoop-namenode /]$ jps
7720 DataNode
7898 SecondaryNameNode
7626 NameNode
6521 NodeManager
8006 Jps
6428 ResourceManager
[hadoop@hadoop-namenode /]$ ssh hadoop-datanode1 jps
11511 DataNode
11580 Jps

HDFS status

[hadoop@hadoop-namenode /]$ hdfs dfsadmin -report
Configured Capacity: 6438158336 (6.00 GB)
Present Capacity: 3252350976 (3.03 GB)
DFS Remaining: 3252289536 (3.03 GB)
DFS Used: 61440 (60 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Live datanodes (2):

Name: 192.168.2.11:50010 (hadoop-datanode1.localdomain)
Hostname: hadoop-datanode1.localdomain
Decommission Status : Normal
Configured Capacity: 3219079168 (3.00 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 1588588544 (1.48 GB)
DFS Remaining: 1630461952 (1.52 GB)
DFS Used%: 0.00%
DFS Remaining%: 50.65%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Thu Nov 27 13:54:47 CET 2014


Name: 192.168.2.5:50010 (hadoop-namenode.localdomain)
Hostname: hadoop-namenode.localdomain
Decommission Status : Normal
Configured Capacity: 3219079168 (3.00 GB)
DFS Used: 32768 (32 KB)
Non DFS Used: 1597218816 (1.49 GB)
DFS Remaining: 1621827584 (1.51 GB)
DFS Used%: 0.00%
DFS Remaining%: 50.38%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Thu Nov 27 13:54:47 CET 2014

File and directory operations

hadoop fs -ls /
hadoop fs -mkdir /user
hadoop fs -mkdir /user/hduser
hadoop fs -chown hduser /user/hduser
hadoop fs -chown :hadoop /user/hduser
hadoop fs -ls /user
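
A quick end-to-end write/read check (hypothetical file name):

echo "hello hdfs" > /tmp/hello.txt
hadoop fs -put /tmp/hello.txt /user/hduser/
hadoop fs -cat /user/hduser/hello.txt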

Stopping a specific node

ssh hadoop-datanode1 /opt/hadoop/sbin/hadoop-daemon.sh --config /opt/hadoop/etc/hadoop/ stop datanode

Placing the secondary namenode on an arbitrary node

Add the following property to hdfs-site.xml on every node:

<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop-datanode2.localdomain:50090</value>
</property>
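
After propagating the change to every node, restart HDFS; the SecondaryNameNode process should then appear on the designated host:

stop-dfs.sh && start-dfs.sh
ssh hadoop-datanode2 jps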

Simulating a datanode crash

ssh hadoop-datanode1 /opt/hadoop/sbin/hadoop-daemon.sh --config /opt/hadoop/etc/hadoop/ stop datanode
ssh hadoop-datanode1 rm -rf /opt/HDFS
ssh hadoop-datanode1 /opt/hadoop/sbin/hadoop-daemon.sh --config /opt/hadoop/etc/hadoop/ start datanode
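
To watch the cluster react (the datanode re-registers with an empty disk and the namenode re-replicates blocks to satisfy dfs.replication):

hdfs dfsadmin -report
hdfs fsck /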

Running MapReduce jobs

Computing pi

[hadoop@hadoop-namenode ~]$ hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar pi 10 100
Number of Maps  = 10
Samples per Map = 100
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
14/11/27 16:23:42 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
14/11/27 16:23:42 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/11/27 16:23:42 INFO input.FileInputFormat: Total input paths to process : 10
14/11/27 16:23:42 INFO mapreduce.JobSubmitter: number of splits:10
14/11/27 16:23:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local930526865_0001
...
...
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=1180
        File Output Format Counters
                Bytes Written=97
Job Finished in 2.929 seconds
Estimated value of Pi is 3.14800000000000000000

Counting word occurrences in books

[hadoop@hadoop-namenode ~]$ hadoop fs -mkdir -p /user/hadoop/books
[hadoop@hadoop-namenode ~]$ hadoop fs -put *.txt /user/hadoop/books

[hadoop@hadoop-namenode ~]$ hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar wordcount books output
14/11/27 16:27:16 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
14/11/27 16:27:16 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/11/27 16:27:16 INFO input.FileInputFormat: Total input paths to process : 3
14/11/27 16:27:16 INFO mapreduce.JobSubmitter: number of splits:3
14/11/27 16:27:16 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1050799219_0001
...
...
        File Input Format Counters
                Bytes Read=2856705
        File Output Format Counters
                Bytes Written=525222
[hadoop@hadoop-namenode ~]$  hadoop fs -cat /user/hadoop/output/part-r-00000 |egrep  -w "^the|^house"
house   134
house!  1
house,  73
house,' 2
house--which    1
house-door      2
house-door,     3
house-door.     1
house-door.'    1
house-door;     1
house-keeping,  1
house-keeping.  1
house-top,      1
house.  38
house." 1
house.' 3
house:  1
house;  11
house?" 2
house?' 1
the     34498
the)    1
the---- 1
the.    2
the...  2
the.... 1
the.....        2
the..............       1
the:    1
the]    5

Sudoku

[hadoop@hadoop-namenode ~]$ cat puzzle1.dta
8 5 ? 3 9 ? ? ? ?
? ? 2 ? ? ? ? ? ?
? ? 6 ? 1 ? ? ? 2
? ? 4 ? ? 3 ? 5 9
? ? 8 9 ? 1 4 ? ?
3 2 ? 4 ? ? 8 ? ?
9 ? ? ? 8 ? 5 ? ?
? ? ? ? ? ? 2 ? ?
? ? ? ? 4 5 ? 7 8
[hadoop@hadoop-namenode ~]$ hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar sudoku puzzle1.dta
Solving puzzle1.dta
8 5 1 3 9 2 6 4 7
4 3 2 6 7 8 1 9 5
7 9 6 5 1 4 3 8 2
6 1 4 8 2 3 7 5 9
5 7 8 9 6 1 4 2 3
3 2 9 4 5 7 8 1 6
9 4 7 2 8 6 5 3 1
1 8 5 7 3 9 2 6 4
2 6 3 1 4 5 9 7 8

Found 1 solutions

Troubleshooting

ClusterID mismatch for namenode and datanodes

cat /opt/HDFS/namenode/current/VERSION

⇒ Copy the namenode's clusterID into the datanode's VERSION file (data dir, i.e. /opt/HDFS/datanode/current/VERSION here) ⇒ Restart the datanode and the namenode
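
A quick way to compare the IDs, assuming the directory layout configured above:

grep clusterID /opt/HDFS/namenode/current/VERSION
ssh hadoop-datanode1 grep clusterID /opt/HDFS/datanode/current/VERSION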