This document describes how to build a high-performance computing (HPC) cluster on CentOS 7.
- At least two servers or PCs. VMware virtual machines work as a substitute; this document uses them for convenience.
- A CentOS 7 installation image
- The torque + maui job scheduling software
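Install CentOS 7 on both bare-metal servers. A minimal installation is enough and is the fastest option.
After the system is installed, configure the network.
Add hostname-to-IP mappings to /etc/hosts on both nodes. You can edit /etc/hosts on node1 first and then copy it to node2. For example: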
[root@localhost ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
172.22.123.66 node1
172.22.123.68 node2
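The /etc/hosts entries above can also be added by script. The following is a minimal sketch, assuming a bash shell; `add_host_entry` is a hypothetical helper name, not part of any tool. It appends an entry only when the hostname is not already present, so it is safe to run more than once.

```shell
#!/usr/bin/env bash
# add_host_entry FILE IP NAME  (hypothetical helper)
# Appends "IP NAME" to FILE unless NAME already appears as a hostname
# field on a non-comment line.
add_host_entry() {
    local hosts_file="$1" ip="$2" name="$3"
    if ! awk -v n="$name" '
            !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$hosts_file"; then
        printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
    fi
}
```

Possible usage as root on node1, followed by the copy to node2: `add_host_entry /etc/hosts 172.22.123.66 node1`, `add_host_entry /etc/hosts 172.22.123.68 node2`, then `scp /etc/hosts node2:/etc/hosts`.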
Disable the firewall and SELinux.
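A minimal sketch of this step, assuming the stock CentOS 7 firewalld service and the standard /etc/selinux/config file. Both function names are hypothetical. The block only defines functions; nothing changes until you call them as root.

```shell
#!/usr/bin/env bash
# set_selinux_disabled [FILE]  (hypothetical helper)
# Rewrites the SELINUX= line of an SELinux config file. SELINUXTYPE= is left alone.
set_selinux_disabled() {
    local cfg="${1:-/etc/selinux/config}"
    sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$cfg"
}

# disable_firewall_and_selinux  (hypothetical helper, run as root)
disable_firewall_and_selinux() {
    systemctl stop firewalld.service
    systemctl disable firewalld.service
    # Switch to permissive mode immediately; setenforce fails if SELinux is already disabled.
    setenforce 0 || true
    # The config file change takes effect at the next reboot.
    set_selinux_disabled /etc/selinux/config
}
```

Run `disable_firewall_and_selinux` on both node1 and node2, then check the result with `getenforce` and `systemctl status firewalld`.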
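The NIS service synchronizes account information across nodes. node1 acts as the NIS server; node2 acts as the NIS client and syncs its accounts from node1. For detailed NIS steps, see: https://gist.github.com/wangxianhe/d26b1c8a08ea7f324543728ca3d28c24
The NFS service shares files across nodes. node1 acts as the NFS server; node2 acts as the NFS client and mounts the directory shared by node1. For detailed NFS steps, see: https://gist.github.com/wangxianhe/d42c0b777287f215d5c18757fc0e0308
This step is fairly simple; see: https://gist.github.com/wangxianhe/d9bb9a4006bc0ec456c0ddb62d69a1a8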
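The full installation procedure is lengthy, so only a summary is given here. For details, see the official documentation:
TORQUE:http://docs.adaptivecomputing.com/torque/6-1-2/adminGuide/torque.htm
First download the software package from the official site:
TORQUE:http://www.adaptivecomputing.com/support/download-center/torque-download/
The version used here is torque-6.1.2.
Install the dependency packages on the Torque server host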
[root]# yum install libtool libcgroup-tools openssl-devel libxml2-devel boost-devel gcc gcc-c++
Install hwloc
When cgroups are enabled (recommended), hwloc version 1.9.1 or later is required.
Download hwloc-1.9.1.tar.gz from https://www.open-mpi.org/software/hwloc/v1.9.
yum install gcc make
tar -xzvf hwloc-1.9.1.tar.gz
cd hwloc-1.9.1
./configure
make
make install
echo /usr/local/lib >/etc/ld.so.conf.d/hwloc.conf
ldconfig
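Since cgroup support requires hwloc 1.9.1 or later, it can help to check the installed version before building Torque. The following is a small sketch, assuming GNU `sort` with the `-V` option (part of coreutils on CentOS 7); `version_ge` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# version_ge A B  (hypothetical helper)
# Succeeds when version A >= version B under GNU sort's version ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

One possible use, assuming that `hwloc-info --version` prints the version number as its last field: `version_ge "$(hwloc-info --version | awk '{print $NF}')" 1.9.1 && echo "hwloc is new enough"`.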
Download torque-6.1.2.tar.gz
[root]# yum install wget
[root]# wget http://www.adaptivecomputing.com/download/torque/torque-6.1.2.tar.gz -O torque-6.1.2.tar.gz
[root]# tar -xzvf torque-6.1.2.tar.gz
[root]# cd torque-6.1.2/
Build and install
[root]# ./configure --enable-cgroups --with-hwloc-path=/usr/local # add any other options you need
[root]# make
[root]# make install
Set up the PATH
[root]# . /etc/profile.d/torque.sh
Initialize serverdb
[root]# ./torque.setup root
On the Torque server host, create the packages
[root]# make packages
Building ./torque-package-clients-linux-x86_64.sh ...
Building ./torque-package-mom-linux-x86_64.sh ...
Building ./torque-package-server-linux-x86_64.sh ...
Building ./torque-package-gui-linux-x86_64.sh ...
Building ./torque-package-devel-linux-x86_64.sh ...
Done.
The package files are self-extracting packages that can be copied and executed
on your production machines. Use --help for options.
Copy the MOM package and the client package to the compute nodes. Copying them to the shared (NFS) directory is recommended.
[root]# scp torque-package-mom-linux-x86_64.sh <mom-node>:
[root]# scp torque-package-clients-linux-x86_64.sh <torque-client-host>:
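Copy the pbs_server, pbs_mom, and trqauthd startup files (systemd unit files) to the corresponding locations on the management node and the compute nodes. Copying them to the shared directory is recommended.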
cp contrib/systemd/pbs_mom.service /usr/lib/systemd/system/pbs_mom.service
cp contrib/systemd/pbs_server.service /usr/lib/systemd/system/pbs_server.service
cp contrib/systemd/trqauthd.service /usr/lib/systemd/system/trqauthd.service
scp contrib/systemd/pbs_mom.service <mom-node>:/usr/lib/systemd/system/
scp contrib/systemd/trqauthd.service <torque-client-host>:/usr/lib/systemd/system/
Enable and start the pbs_server, pbs_mom, and trqauthd services
qterm # shut down the pbs_server instance left running by torque.setup
systemctl enable pbs_server.service
systemctl restart pbs_server.service
systemctl enable pbs_mom.service
systemctl restart pbs_mom.service
systemctl enable trqauthd.service
systemctl restart trqauthd.service
Edit /var/spool/torque/server_priv/nodes and add the compute nodes. For example:
node006 np=2
node007 np=2
node008 np=4
systemctl restart pbs_server.service
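The nodes file above has one line per compute node in the form `<hostname> np=<cores>`. As a sketch, it can be generated from a list of `host:cores` pairs; `render_nodes` is a hypothetical helper name, and the pairs in the usage note are the example values from this document.

```shell
#!/usr/bin/env bash
# render_nodes HOST:CORES...  (hypothetical helper)
# Prints one "<host> np=<cores>" line for each argument.
render_nodes() {
    local spec
    for spec in "$@"; do
        printf '%s np=%s\n' "${spec%%:*}" "${spec##*:}"
    done
}
```

Possible usage on the server host: `render_nodes node006:2 node007:2 node008:4 > /var/spool/torque/server_priv/nodes`, followed by the pbs_server restart. The core count of a node can be read with `nproc` on that node.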
Install the dependency packages on each compute node
[root]# yum install libcgroup-tools
Install hwloc
When cgroups are enabled (recommended), hwloc version 1.9.1 or later is required.
Download hwloc-1.9.1.tar.gz from https://www.open-mpi.org/software/hwloc/v1.9.
yum install gcc make
tar -xzvf hwloc-1.9.1.tar.gz
cd hwloc-1.9.1
./configure
make
make install
echo /usr/local/lib >/etc/ld.so.conf.d/hwloc.conf
ldconfig
Install the MOM package and the client package
./torque-package-mom-linux-x86_64.sh --install
./torque-package-clients-linux-x86_64.sh --install
Enable and start the pbs_mom and trqauthd services
systemctl enable pbs_mom.service
systemctl start pbs_mom.service
systemctl enable trqauthd.service
systemctl start trqauthd.service
vi /var/spool/torque/mom_priv/config
$pbsserver headnode # hostname running pbs server
$logevent 225 # bitmap of which events to log
systemctl restart pbs_mom.service
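The two MOM settings above can be written without an interactive editor. This is a sketch only: `render_mom_config` is a hypothetical helper name, and using node1 as the pbs_server host is an assumption based on node1 being the server node elsewhere in this document.

```shell
#!/usr/bin/env bash
# render_mom_config SERVER_HOST  (hypothetical helper)
# Prints a minimal mom_priv/config that points the MOM at SERVER_HOST.
render_mom_config() {
    local server_host="$1"
    # Single quotes keep $pbsserver and $logevent literal.
    printf '$pbsserver %s\n' "$server_host"
    printf '$logevent 225\n'
}
```

Possible usage on a compute node: `render_mom_config node1 > /var/spool/torque/mom_priv/config`, then restart pbs_mom.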
Example (verifying the installation):
# verify all queues are properly configured
> qstat -q
server:kmn
Queue Memory CPU Time Walltime Node Run Que Lm State
----- ------ -------- -------- ---- --- --- -- -----
batch -- -- -- -- 0 0 -- ER
--- ---
0 0
# view additional server configuration
> qmgr -c 'p s'
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = kmn
set server managers = user1@kmn
set server operators = user1@kmn
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 45
set server poll_jobs = True
set server mom_job_sync = True
set server keep_completed = 300
set server next_job_number = 0
# verify all nodes are correctly reporting
> pbsnodes -a
node001
state=free
np=2
properties=bigmem,fast,ia64,smp
ntype=cluster
status=rectime=1328810402,varattr=,jobs=,state=free,netload=6814326158,gres=,loadave=0.21,ncpus=6,physmem=8193724kb,availmem=13922548kb,totmem=16581304kb,idletime=3,nusers=3,nsessions=18,sessions=1876 1120 1912 1926 1937 1951 2019 2057 28399 2126 2140 2323 5419 17948 19356 27726 22254 29569,uname=Linux kmn 2.6.38-11-generic #48-Ubuntu SMP Fri Jul 29 19:02:55 UTC 2011 x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
gpus = 0
# submit a basic job - DO NOT RUN AS ROOT
> su - testuser
> echo "sleep 30" | qsub
# verify jobs display
> qstat
Job id Name User Time Use S Queue
------ ----- ---- -------- -- -----
0.kmn STDIN knielson 0 Q batch
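Beyond the one-line stdin test above, jobs are usually submitted as a script with `#PBS` directives. The following sketch generates such a script; `render_job_script` is a hypothetical helper name, and the job name, walltime, and resource request are illustrative assumptions. The `batch` queue is the one shown in the qmgr output above.

```shell
#!/usr/bin/env bash
# render_job_script NAME WALLTIME  (hypothetical helper)
# Prints a minimal Torque job script that requests one core on one node.
render_job_script() {
    local name="$1" walltime="$2"
    cat <<EOF
#!/bin/bash
#PBS -N ${name}
#PBS -q batch
#PBS -l nodes=1:ppn=1
#PBS -l walltime=${walltime}
# PBS_O_WORKDIR is set by Torque to the directory qsub was run from.
cd "\$PBS_O_WORKDIR"
sleep 30
EOF
}
```

Possible usage as a regular user (not root): `render_job_script demo 00:05:00 > demo.pbs`, then `qsub demo.pbs`.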
At this point the job stays queued and does not run, because no scheduler is running yet. Next, install the Maui scheduler.
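The full installation procedure is lengthy, so only a summary is given here. For details, see the official documentation:
MAUI:http://docs.adaptivecomputing.com/maui/index.php
First download the software package from the official site:
maui:http://www.adaptivecomputing.com/support/download-center/maui-cluster-scheduler/
The version used here is maui-3.3.1.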
Build and install
> tar -xzvf maui-3.3.1.tar.gz
> cd maui-3.3.1
> ./configure
> make
> make install
Add Maui to the PATH
[root@node1 ~]# vi /etc/profile
Add the following line:
export PATH=/usr/local/maui/bin/:/usr/local/maui/sbin/:$PATH
[root@node1 ~]# source /etc/profile
Configure maui.cfg. The default configuration is enough for now; edit this file later if you have further requirements.
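For orientation, these are the maui.cfg settings that tie Maui to Torque. The sketch below is not a complete configuration: `render_maui_cfg` is a hypothetical helper name, the `base` label inside `RMCFG[...]` is an arbitrary name, and node1 and root are assumptions taken from this document's setup. The file generated by `./configure` normally already contains equivalent lines.

```shell
#!/usr/bin/env bash
# render_maui_cfg SERVER_HOST [ADMIN_USER]  (hypothetical helper)
# Prints the core maui.cfg lines for a Torque (PBS) resource manager.
render_maui_cfg() {
    local server_host="$1" admin_user="${2:-root}"
    cat <<EOF
SERVERHOST      ${server_host}
ADMIN1          ${admin_user}
RMCFG[base]     TYPE=PBS
RMPOLLINTERVAL  00:00:30
EOF
}
```

Running `render_maui_cfg node1` prints the lines so they can be compared with the existing maui.cfg, which is expected under /usr/local/maui when the default prefix is used.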
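Start Maui
To start Maui at boot, add this line to /etc/rc.local: /usr/local/maui/sbin/maui
On CentOS 7, /etc/rc.d/rc.local must be executable (chmod +x /etc/rc.d/rc.local) or it will not run at boot.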
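Once Maui is running, the test job submitted earlier should leave the queued state. The following sketch reads the state column from `qstat` output; `job_state` is a hypothetical helper name, and it assumes the column layout shown in the qstat example in this document, where the state is the second-to-last field.

```shell
#!/usr/bin/env bash
# job_state JOB_ID  (hypothetical helper)
# Reads `qstat` output on stdin and prints the state letter of JOB_ID.
job_state() {
    local job_id="$1"
    awk -v id="$job_id" '$1 == id { print $(NF - 1) }'
}
```

Possible usage: `qstat | job_state 0.kmn`. A state of Q means queued, R means running, and C means completed.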