Hadoop Deployment

Installation

Environment

CentOS 7.3
java-1.8.0-openjdk
hadoop-3.0.0-beta1

Standalone Mode

yum install pdsh openssh
wget http://www.mirrorservice.org/sites/ftp.apache.org/hadoop/common/hadoop-3.0.0-beta1/hadoop-3.0.0-beta1.tar.gz
tar xvf hadoop-3.0.0-beta1.tar.gz
cd hadoop-3.0.0-beta1

Make sure JAVA_HOME is set:

echo $JAVA_HOME
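
If it prints nothing, set JAVA_HOME first. A minimal sketch, assuming an OpenJDK installed via yum; the nested commands resolve the real JDK path from the java binary, and /etc/profile.d/java.sh is just an arbitrary file name chosen here:

# bake the resolved JDK path into a profile script (path derived from the java binary)
echo "export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))" | sudo tee /etc/profile.d/java.sh
source /etc/profile.d/java.sh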

Set the environment variables:

echo "export PATH=`pwd`/bin:`pwd`/sbin:\$PATH" | sudo tee /etc/profile.d/hadoop.sh
source /etc/profile.d/hadoop.sh
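
Verify that the Hadoop binaries are now on the PATH:

hadoop version    # should report Hadoop 3.0.0-beta1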

Test the setup by running a grep example:

mkdir input
cp etc/hadoop/*.xml input
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-beta1.jar grep input output 'dfs[a-z.]+'
cat output/*

Expected output:

1	dfsadmin

Pseudo-Distributed Mode

Add to etc/hadoop/core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

Add to etc/hadoop/hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Make sure `ssh localhost` succeeds without a password; set it up as follows:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

# if sshd is not installed or running yet (e.g., a minimal install):
yum install openssh-server
ssh-keygen -A
/usr/sbin/sshd
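
A quick check that passwordless login now works:

ssh localhost exit && echo OK    # should print OK without asking for a password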

Running a MapReduce Job Locally

Start HDFS:

$ hdfs namenode -format

# do not run as root; logs are written under logs/
$ start-dfs.sh

# check that the three daemons are running
$ jps
... DataNode
... NameNode
... SecondaryNameNode

The NameNode web UI is now available at http://localhost:9870/.

Create the directories the MapReduce job needs in HDFS, then run the job:

hdfs dfs -mkdir -p /user/<username>  # HDFS home directory
hdfs dfs -mkdir input     # ~/input
hdfs dfs -put etc/hadoop/*.xml input
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-beta1.jar grep input output 'dfs[a-z.]+'
hdfs dfs -cat output/*
hdfs dfs -get output output

Expected output:

1	dfsadmin
1	dfs.replication

Stop HDFS:

stop-dfs.sh

Running a MapReduce Job on YARN

Edit etc/hadoop/mapred-site.xml:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

Edit etc/hadoop/yarn-site.xml:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>

Start the services:

start-dfs.sh
start-yarn.sh    # starts ResourceManager and NodeManager

The ResourceManager web UI is available at http://localhost:8088/.
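
To confirm the YARN daemons are up, check the running JVM processes again:

$ jps
... ResourceManager
... NodeManager
... NameNode
... DataNode
... SecondaryNameNode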

Run a MapReduce job:

yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-beta1.jar pi 5 10
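
The same examples jar bundles other demos as well; for instance, a wordcount run over the input directory uploaded in the previous section (wc-output is an arbitrary output path chosen here and must not already exist in HDFS):

yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-beta1.jar wordcount input wc-output
hdfs dfs -cat wc-output/*    # word counts for the XML files uploaded earlier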

Stop the services:

stop-dfs.sh
stop-yarn.sh

Cluster Mode

Operations

Installing Hadoop 2.8

The differences from the 3.0 setup above are as follows:

  • The NameNode web UI uses a different port: http://localhost:50070/
  • yarn.nodemanager.env-whitelist need not be set in yarn-site.xml

Changing the Default HDFS Storage Path

Edit hdfs-site.xml:

    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop-${user.name}</value>
    </property>

dfs.namenode.name.dir, dfs.namenode.checkpoint.dir, and dfs.datanode.data.dir do not need to be set individually; their defaults all derive from hadoop.tmp.dir.
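
After changing hadoop.tmp.dir, re-format and confirm data lands under the new path. A sketch, assuming the ${user.name} substitution above resolves to /opt/hadoop-<USER> and that /opt is writable by the Hadoop user:

hdfs namenode -format           # re-initialize under the new path
start-dfs.sh
ls /opt/hadoop-$(whoami)/dfs    # name/ appears after formatting, data/ once the DataNode starts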

Notes

Default HDFS storage path

/tmp/hadoop-<USER>/dfs
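
The name, checkpoint, and data directories all sit under it, per the defaults noted above:

/tmp/hadoop-<USER>/dfs/name             # dfs.namenode.name.dir
/tmp/hadoop-<USER>/dfs/namesecondary    # dfs.namenode.checkpoint.dir
/tmp/hadoop-<USER>/dfs/data             # dfs.datanode.data.dir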

Troubleshooting

DataNode fails to start

If `hdfs namenode -format` is run while the NameNode and DataNode are up, the NameNode is re-formatted but the DataNode still holds the old data, leaving the two inconsistent. Recover in this order:

rm -rf /tmp/hadoop-<USER>    # wipe the old data
hdfs namenode -format
start-dfs.sh

The HDFS data storage path is configured by dfs.datanode.data.dir in hdfs-site.xml; the default is /tmp/hadoop-<USER>/dfs/data/.
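
The usual root cause is a clusterID mismatch between the re-formatted NameNode and the stale DataNode data; it can be checked directly (paths assume the default storage location):

grep clusterID /tmp/hadoop-$(whoami)/dfs/name/current/VERSION
grep clusterID /tmp/hadoop-$(whoami)/dfs/data/current/VERSION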

Name node is in safe mode

hdfs dfsadmin -safemode leave
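
To inspect the current state before forcing it off:

hdfs dfsadmin -safemode get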

Failed to connect to server: localhost/127.0.0.1:9000

In standalone mode, fs.defaultFS must not be set in core-site.xml.
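
In other words, for standalone mode etc/hadoop/core-site.xml should keep its shipped, empty configuration:

<configuration>
</configuration>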
