
The following installation methods were tried out.

Working instances of Ceph OSD service

Container-based solution (partially working)

With some trial and error, I managed to get a standalone Ceph Object Storage Device (OSD) service working. The instructions to replicate the installation are as follows.

  1. Start a vagrant box
    vagrant init ubuntu/bionic64
    vagrant up
  2. Install docker
    sudo su
    apt-get install apt-transport-https ca-certificates curl software-properties-common
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
    add-apt-repository    "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
    apt-get update
    apt-get install -y docker-ce
    docker pull ceph/daemon:latest
  3. Install Ceph Monitor, Manager and OSD components
    docker network create --subnet=175.20.0.0/16 ceph
    docker run -d -h ceph-mon --name ceph-mon --net ceph --ip 175.20.0.12 -e MON_IP=175.20.0.12 -e CEPH_PUBLIC_NETWORK=175.20.0.0/16  -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ ceph/daemon mon
    # update /etc/ceph/ceph.conf file with the contents given below
    docker run -d -h ceph-mgr --name ceph-mgr --net=ceph -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ ceph/daemon mgr
    docker run -d -h ceph-osd --name ceph-osd --net=ceph --pid=host --privileged=true -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ -v /dev/:/dev/ ceph/daemon osd
  4. Install librados library.
    sudo apt-get install librados-dev
  5. Use librados to communicate with Ceph OSD cluster.

NOTE: The container-based installation runs through the ceph diagnostic commands correctly, but the OSD is unresponsive. The object store and fetch operations do not work. The reason is unknown.
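
For reference, the store and fetch operations can be exercised from the rados CLI, which sits on top of librados; the pool and object names below are only illustrative and are not the exact commands recorded during the attempt.

echo "hello ceph" > /tmp/hello.txt
docker cp /tmp/hello.txt ceph-osd:/tmp/hello.txt
docker exec ceph-osd ceph osd pool create testpool 8
docker exec ceph-osd rados -p testpool put hello /tmp/hello.txt
docker exec ceph-osd rados -p testpool get hello /tmp/fetched.txt
docker exec ceph-osd cat /tmp/fetched.txt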

Container-based solution

This solution was tried on Ubuntu MATE 18.04 with Docker 18.06 CE.

The primary instructions followed are the ones on https://hub.docker.com/r/ceph/daemon/ (see also the References below).

The SELinux-specific instructions (shown below) were ignored, as they are not applicable on a kernel that does not run with SELinux enabled.

sudo chcon -Rt svirt_sandbox_file_t /etc/ceph
sudo chcon -Rt svirt_sandbox_file_t /var/lib/ceph

The first error encountered is with respect to the etcd KV store.

For the command,

docker run -d -h ceph-kv --name ceph-kv -e KV_TYPE=etcd -e KV_IP=127.0.0.1 -e KV_PORT=2379 ceph/daemon populate_kvstore

I see the following error log.

2018-10-16T10:03:51.514221915Z Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 175.20.0.12:2379: getsockopt: connection refused
2018-10-16T10:03:51.514243724Z 
2018-10-16T10:03:51.514256484Z error #0: dial tcp 175.20.0.12:2379: getsockopt: connection refused
2018-10-16T10:03:51.514259320Z 
2018-10-16T10:03:51.519791527Z 2018-10-16 10:03:51  /entrypoint.sh: Value is already set
2018-10-16T10:03:51.662737927Z Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 175.20.0.12:2379: getsockopt: connection refused
2018-10-16T10:03:51.662755244Z 
2018-10-16T10:03:51.662795795Z error #0: dial tcp 175.20.0.12:2379: getsockopt: connection refused
2018-10-16T10:03:51.662801028Z 
2018-10-16T10:03:51.667302541Z 2018-10-16 10:03:51  /entrypoint.sh: client_host already exists

Just to try a dedicated docker network, I modified the KV command as follows.

docker network create --subnet=175.20.0.0/16 ceph
docker run -d -h ceph-kv --name ceph-kv --net=ceph -e KV_TYPE=etcd -e KV_IP=175.20.0.50 -e KV_PORT=2379 -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ ceph/daemon populate_kvstore

The command produces the following Docker container log.

2018-10-16T09:58:47.677911525Z 2018-10-16 09:58:47  /entrypoint.sh: Adding key /osd/osd_journal_size with value 100 to KV store.
2018-10-16T09:58:50.857882777Z Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 175.20.0.50:2379: getsockopt: no route to host
2018-10-16T09:58:50.858023921Z 
2018-10-16T09:58:50.858041017Z error #0: dial tcp 175.20.0.50:2379: getsockopt: no route to host
2018-10-16T09:58:50.858043718Z 
2018-10-16T09:58:50.861988822Z 2018-10-16 09:58:50  /entrypoint.sh: Value is already set
2018-10-16T09:58:53.857533335Z Error:  dial tcp 175.20.0.50:2379: getsockopt: no route to host
2018-10-16T09:58:53.862213259Z 2018-10-16 09:58:53  /entrypoint.sh: client_host already exists

Irrespective of whether I choose a new IP or an existing one, the KV container throws an error.
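
One likely cause is that nothing is actually listening on port 2379: populate_kvstore only writes keys into an existing etcd, and neither command above starts an etcd server. A hedged sketch of starting etcd on the dedicated network first (untested here; it assumes the quay.io/coreos/etcd image, and the v3.3.9 tag is only illustrative):

# start etcd on the ceph network first, then point populate_kvstore at it
docker run -d -h ceph-etcd --name ceph-etcd --net=ceph --ip 175.20.0.50 quay.io/coreos/etcd:v3.3.9 /usr/local/bin/etcd --name ceph-etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://175.20.0.50:2379
docker run -d -h ceph-kv --name ceph-kv --net=ceph -e KV_TYPE=etcd -e KV_IP=175.20.0.50 -e KV_PORT=2379 -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ ceph/daemon populate_kvstore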

Hence, I decided to try the Ceph installation without the KV store.

The given instructions (on https://hub.docker.com/r/ceph/daemon/) fail for the Ceph monitor. The following command,

docker run -d  --net host --ip 192.168.0.20 -e MON_IP=192.168.0.20 -e CEPH_PUBLIC_NETWORK=192.168.0.0/16 -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ ceph/daemon mon

gives the error shown below.

-1 unable to find any IP address in networks: 192.168.0.20/24

The Ceph OSD component comes up without the monitor, but the rest of the component containers fail to start.
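
The monitor failure above looks like an addressing problem rather than a Ceph bug: with --net host the container uses the host's own interfaces, and the Vagrant host has no address in 192.168.0.0/16, so the monitor cannot find an IP to bind to (my reading of the error; note also that --ip is only honoured on user-defined networks). A hedged variant that uses an address the host actually owns (10.0.2.15 is only an example, the usual Vagrant NAT address):

ip -4 addr show    # confirm which addresses the host actually has
docker run -d --net host -e MON_IP=10.0.2.15 -e CEPH_PUBLIC_NETWORK=10.0.2.0/24 -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ ceph/daemon mon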

Container with dedicated network

Based on the information provided in ceph/ceph-container#496, I tried a different Ceph deployment with a dedicated Docker network. This time, the KV and radosgw components come up but soon crash, while the rest of the components remain online. The commands used for this version of the installation are:

sudo su
apt-get install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
add-apt-repository    "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
apt-get update
apt-get install -y docker-ce

docker pull ceph/daemon:latest

docker network create --subnet=175.20.0.0/16 ceph
docker run -d -h ceph-mon --name ceph-mon --net ceph --ip 175.20.0.12 -e MON_IP=175.20.0.12 -e CEPH_PUBLIC_NETWORK=175.20.0.0/16  -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ ceph/daemon mon
docker run -d -h ceph-mgr --name ceph-mgr --net=ceph -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ ceph/daemon mgr
# update /etc/ceph/ceph.conf file with the contents given below
docker run -d -h ceph-osd --name ceph-osd --net=ceph --pid=host --privileged=true -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ -v /dev/:/dev/ ceph/daemon osd
docker run -d -h ceph-mds --name ceph-mds --net=ceph -v /var/lib/ceph/:/var/lib/ceph/ -v /etc/ceph:/etc/ceph -e CEPHFS_CREATE=1 ceph/daemon mds
docker run -d -h ceph-rgw --name ceph-rgw --net=ceph -v /var/lib/ceph/:/var/lib/ceph/ -v /etc/ceph:/etc/ceph ceph/daemon rgw
docker run -d -h ceph-restapi --name ceph-restapi --net=ceph -v /var/lib/ceph/:/var/lib/ceph/ -v /etc/ceph:/etc/ceph ceph/daemon restapi
docker run -d -h ceph-rbd --name ceph-rbd --net=ceph -v /var/lib/ceph/:/var/lib/ceph/ -v /etc/ceph:/etc/ceph ceph/daemon rbd_mirror
# /etc/ceph/ceph.conf file
[global]
fsid = 8bc58859-905e-4b58-9742-09209bca6116
mon initial members = ceph-mon
mon host = 175.20.0.12
public network = 175.20.0.0/16
cluster network = 175.20.0.0/16
osd journal size = 100
log file = /dev/null

keyring = /etc/ceph/ceph.client.admin.keyring
auth cluster required = none
auth service required = none
auth client required = none

osd max object name len = 256
osd max object namespace len = 64

[client.restapi]
  public addr = 0.0.0.0:5000
  restapi base url = /api/v0.1
  restapi log level = warning
  log file = /var/log/ceph/ceph-restapi.log

This setup with a dedicated network comes closest to a complete Ceph installation. All the containers except ceph-restapi (REST API) and ceph-rgw (RADOS Gateway) remain alive. The ceph-restapi container fails with the following error log.

2018-10-16T14:05:40.543135966Z exec: PID 119: spawning /usr/bin/ceph-rest-api --cluster ceph -n client.admin
2018-10-16T14:05:40.543145120Z exec: Waiting 119 to quit
2018-10-16T14:05:40.543396149Z docker_exec.sh: line 143: /usr/bin/ceph-rest-api: No such file or directory
2018-10-16T14:05:40.543673917Z teardown: managing teardown after SIGCHLD
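
The missing /usr/bin/ceph-rest-api is not surprising: the ceph-rest-api tool was removed from recent Ceph releases, and in Mimic its role is filled by the manager's restful module, so the restapi entrypoint of this image cannot work (my understanding). A possible substitute, untested here:

docker exec ceph-mgr ceph mgr module enable restful
docker exec ceph-mgr ceph restful create-self-signed-cert
docker exec ceph-mgr ceph mgr services    # lists the restful endpoint if the module started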

The ceph-rgw (radosgw) container fails with the following error log.

2018-10-16T14:23:09.290515772Z 2018-10-16 14:23:09  /entrypoint.sh: static: does not generate config
2018-10-16T14:23:09.291261550Z 2018-10-16 14:23:09  /entrypoint.sh: SUCCESS
2018-10-16T14:23:09.291516838Z exec: PID 124: spawning /usr/bin/radosgw --cluster ceph --setuser ceph --setgroup ceph -d -n client.rgw.ceph-rgw -k /var/lib/ceph/radosgw/ceph-rgw.ceph-rgw/keyring
2018-10-16T14:23:09.291568646Z exec: Waiting 124 to quit
2018-10-16T14:23:09.327688131Z 2018-10-16 14:23:09.321 7fd4bde7f8c0  0 framework: civetweb
2018-10-16T14:23:09.327704972Z 2018-10-16 14:23:09.321 7fd4bde7f8c0  0 framework conf key: port, val: 7480
2018-10-16T14:23:09.328618629Z 2018-10-16 14:23:09.325 7fd4bde7f8c0  0 deferred set uid:gid to 167:167 (ceph:ceph)
2018-10-16T14:23:09.328628223Z 2018-10-16 14:23:09.325 7fd4bde7f8c0  0 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable), process radosgw, pid 124
2018-10-16T14:28:09.329078701Z 2018-10-16 14:28:09.327 7fd4ab959700 -1 Initialization timeout, failed to initialize
2018-10-16T14:28:09.331421746Z teardown: managing teardown after SIGCHLD
2018-10-16T14:28:09.331435382Z teardown: Waiting PID 124 to terminate 
2018-10-16T14:28:09.331438278Z teardown: Process 124 is terminated
2018-10-16T14:28:09.331442149Z teardown: Bye Bye, container will die with return code -1

The final status of the Ceph system is captured in the following logs.

root@ubuntu-bionic:/home/vagrant# docker ps -a
docker ps -a
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS                           PORTS               NAMES
f6f93a4a1666        ceph/daemon         "/entrypoint.sh rest…"   4 seconds ago       Exited (255) 3 seconds ago                           ceph-restapi
898f41f69033        ceph/daemon         "/entrypoint.sh rbd_…"   About an hour ago   Up About an hour                                     ceph-rbd
c102deb3f253        ceph/daemon         "/entrypoint.sh rgw"     About an hour ago   Exited (255) About an hour ago                       ceph-rgw
7b057aedeb5f        ceph/daemon         "/entrypoint.sh mds"     About an hour ago   Up About an hour                                     ceph-mds
a8ccd5cb9472        ceph/daemon         "/entrypoint.sh osd"     2 hours ago         Up About an hour                                     ceph-osd
88f567eac9f3        ceph/daemon         "/entrypoint.sh mgr"     2 hours ago         Up About an hour                                     ceph-mgr
b03064033c0f        ceph/daemon         "/entrypoint.sh mon"     2 hours ago         Up About an hour                                     ceph-mon

root@ubuntu-bionic:/home/vagrant# docker exec ceph-osd ceph -s
  cluster:
    id:     0e6a5ae3-1678-494f-9dcc-939c50473107
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            Reduced data availability: 24 pgs inactive
            Degraded data redundancy: 24 pgs undersized
            too few PGs per OSD (24 < min 30)
 
  services:
    mon:        1 daemons, quorum ceph-mon
    mgr:        ceph-mgr(active)
    mds:        cephfs-1/1/1 up  {0=ceph-mds=up:creating}
    osd:        1 osds: 1 up, 1 in
    rbd-mirror: 1 daemon active
 
  data:
    pools:   3 pools, 24 pgs
    objects: 0  objects, 0 B
    usage:   2.2 GiB used, 7.4 GiB / 9.6 GiB avail
    pgs:     100.000% pgs not active
             24 undersized+peered
 
root@ubuntu-bionic:/home/vagrant# docker exec ceph-osd ceph -s
  cluster:
    id:     0e6a5ae3-1678-494f-9dcc-939c50473107
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            Reduced data availability: 24 pgs inactive
            Degraded data redundancy: 24 pgs undersized
            too few PGs per OSD (24 < min 30)
 
  services:
    mon:        1 daemons, quorum ceph-mon
    mgr:        ceph-mgr(active)
    mds:        cephfs-1/1/1 up  {0=ceph-mds=up:creating}
    osd:        1 osds: 1 up, 1 in
    rbd-mirror: 1 daemon active
 
  data:
    pools:   3 pools, 24 pgs
    objects: 0  objects, 0 B
    usage:   2.3 GiB used, 7.4 GiB / 9.6 GiB avail
    pgs:     100.000% pgs not active
             24 undersized+peered
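
The health output above also hints at why the OSD appears unresponsive and why radosgw times out: all 24 PGs stay undersized+peered, most likely because the default pool size of 3 replicas can never be met by a single OSD (my interpretation, not verified). A hedged addition for a one-OSD test cluster:

# extra settings for the [global] section of /etc/ceph/ceph.conf (single-OSD test cluster only)
osd pool default size = 1
osd pool default min size = 1

Pools created after this change should be able to become active with one OSD; pools that already exist would need their size lowered with ceph osd pool set <pool> size 1.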

Vagrant-based setup

Just to make sure that the results are repeatable, all of the above procedures were also executed on two different Vagrant machines. The results did not change; the errors received are as given above.

I tried the following Vagrant boxes.

vagrant init ubuntu/bionic64
vagrant init ubuntu/trusty64
vagrant init fedora/28-cloud-base --box-version 20180425

ceph-ansible

The Ceph engineers recommend using ceph-ansible for easy installation. I tried this suggestion; the commands used are:

sudo add-apt-repository ppa:ansible/ansible-2.6
sudo apt-get update
sudo apt-get install ansible
sudo apt-get install python-pip
sudo pip install notario
sudo apt-get install systemd
git clone https://github.com/ceph/ceph-ansible.git -b master    # supports Ansible v2.6
cd ceph-ansible	
cp site.yml.sample site.yml
ansible-playbook site.yml -i inventory
# inventory file for Ansible
[mons]
localhost

[osds]
localhost

[mdss]
localhost

[rgws]
localhost

[restapis]
localhost

[mgrs]
localhost

[all:vars]
ansible_connection=local
# group_vars/all.yml file

ceph_origin: repository
ceph_repository: community
ceph_stable_release: jewel
public_network: "10.0.2.15/24"
cluster_network: "10.0.2.15/24"
radosgw_address: 10.0.2.15
monitor_interface: eth0
devices:
  - '/dev/sda'
osd_scenario: collocated

This fails, saying there are no sensible defaults in /etc/ceph/ceph.conf. Even after creating a ceph.conf file, the playbook reports an error saying the Ceph Monitor does not deploy correctly.

I even tried installing just the Ceph OSD and Ceph RadosGW components, but both need the Ceph Monitor to work. Hence, the Ansible installation fails as well.
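
A reasonable next debugging step (not attempted in these notes) would be to deploy only the monitor group with verbose output, to see exactly where the monitor deployment breaks:

ansible-playbook site.yml -i inventory --limit mons -vvv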

References

  1. ceph docker daemon all-in-one container installation
  2. Install Ceph Object Storage and Rados GW
  3. ceph-ansible
  4. ceph documentation