A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working.
A quorum is the minimum number of votes that a distributed transaction has to obtain in order to be allowed to perform an operation in a distributed system.
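With majority-based quorum, for example, a 3-node cluster needs floor(3/2) + 1 = 2 votes, so it keeps operating after losing a single member but stops if two members become unreachable.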
Arbiter - a mediator; a tie-breaking voter that helps reach quorum.
Split-brain (typically the result of a network partition) indicates data or availability inconsistencies originating from the maintenance of two separate data sets with overlapping scope.
A computer cluster consists of a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system.
A set of policies and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster.
The continuation of a service after the failure of one or more of its components.
The distribution of workloads across multiple computing resources.
Load balancing is often used to implement failover.
- DNS based
- Routing Policy
- Weighted, Latency, Failover set, Geo-location
- AWS Route53 (see the sketch below)
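A minimal sketch of the Weighted routing policy with boto3 (boto3 and the zone id/IPs are assumptions, not from the original notes): two records for the same name, with traffic split roughly 70/30.

```python
import boto3  # assumption: boto3 and AWS credentials are configured

route53 = boto3.client("route53")

# Two weighted A records for the same name: ~70% of resolutions return 192.0.2.10, ~30% return 192.0.2.20.
for identifier, ip, weight in [("lb-a", "192.0.2.10", 70), ("lb-b", "192.0.2.20", 30)]:
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000EXAMPLE",  # hypothetical hosted zone id
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "www.example.com.",
                "Type": "A",
                "SetIdentifier": identifier,  # distinguishes records sharing the same name/type
                "Weight": weight,
                "TTL": 60,
                "ResourceRecords": [{"Value": ip}],
            },
        }]},
    )
```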
*.example.com
- virtual hosts
- Host HTTP/1.1 header
- Multiple host names bound to one IP address (see the example below).
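A concrete view of name-based virtual hosting: the client connects to a single IP and the server picks the site from the HTTP/1.1 Host header. A minimal sketch (the IP and host names are illustrative):

```python
import http.client

# Two host names resolve to the same IP (203.0.113.10 here, illustrative);
# the server selects the virtual host from the HTTP/1.1 Host header.
for host in ("foo.example.com", "bar.example.com"):
    conn = http.client.HTTPConnection("203.0.113.10", 80, timeout=5)
    conn.request("GET", "/", headers={"Host": host})
    print(host, conn.getresponse().status)
    conn.close()
```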
A data storage virtualization technology that combines multiple physical disk drive components into a single logical unit for the purposes of data redundancy, performance improvement, or both.
DM(device mapper)-Multipathing provides input-output (I/O) fail-over and load-balancing by using multipath I/O within Linux for block devices.
Object storage on a single distributed computer cluster. Ceph aims primarily for completely distributed operation without a single point of failure, scalable to the exabyte level.
Combining (aggregating) multiple network connections in parallel in order to increase throughput beyond what a single connection could sustain, and to provide redundancy in case one of the links should fail.
- Service Registration
- Service Discovery
- DNS support: Consul only
- Consistent and durable general-purpose K/V store
- Leader Election
- Consul
  - ready to use (DNS, health check), commercially supported.
- Zookeeper
  - Java
  - old but stable
- Etcd
  - new and just a simple K/V store
  - TTL
  - atomic operations: compareAndDelete, compareAndSwap (see the example below)
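A quick illustration of the etcd TTL and atomic compare-and-* operations against the v2 HTTP API (a minimal sketch, not from the original notes; the endpoint and key names are illustrative):

```python
import requests  # assumption: an etcd v2 endpoint is reachable at this illustrative address

ETCD = "http://127.0.0.1:2379/v2/keys"

# TTL: etcd deletes the key automatically after 30 seconds.
requests.put(ETCD + "/service/lock", data={"value": "node-01", "ttl": 30})

# compareAndSwap: update only if the current value is still "node-01".
r = requests.put(ETCD + "/service/lock", params={"prevValue": "node-01"}, data={"value": "node-02"})
print(r.status_code)  # 200 on success, 412 (Precondition Failed) if the compare fails

# compareAndDelete: delete only if the current value matches.
requests.delete(ETCD + "/service/lock", params={"prevValue": "node-02"})
```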
- Data is managed in units of znodes.
- znodes are organized in a filesystem-like directory structure (path).
- Data can be stored in a znode.
- Changes to a znode can be watched.
- Automatically deleted when the connection between Zookeeper and the client is lost:
  - EPHEMERAL
  - EPHEMERAL_SEQUENTIAL : ticketing (leader election; see the kazoo sketch below)
- No TTL
var zk = require('zkHelper'),
options = {
basePath: '/myapps',
configPath: '/myapps/config',
node: require('os').hostname(),
servers: ['zk0:2181', 'zk1:2181', 'zk2:2181'], // zk servers
clientOptions: { sessionTimeout: 10000, retries: 3 }
};
zk.init(options, function (err, zkClient) {
if (zk.isMaster()) {
console.info('I am master')
} else {
console.info('master', zk.getMaster() && zk.getMaster().master)
}
var config = zk.getConfig();
});
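The same master election can be built directly on the EPHEMERAL_SEQUENTIAL ("ticketing") recipe above. A minimal sketch using the Python kazoo client (kazoo is an assumption, not part of the original notes; it connects to the zk0..zk2 ensemble shown above):

```python
from kazoo.client import KazooClient  # assumption: the kazoo library is available

zk = KazooClient(hosts="zk0:2181,zk1:2181,zk2:2181")
zk.start()

# Each candidate takes a "ticket": an EPHEMERAL_SEQUENTIAL znode under the election path.
me = zk.create("/myapps/election/n_", b"", ephemeral=True, sequence=True, makepath=True)

# The candidate holding the lowest sequence number is the master; if it disconnects,
# its ephemeral znode vanishes and the next ticket holder takes over.
children = sorted(zk.get_children("/myapps/election"))
if me.endswith(children[0]):
    print("I am master")
else:
    print("master ticket:", children[0])
```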
HAProxy is free, open source software
that provides a high-availability load balancer and proxy server for TCP (L4) and HTTP-based (L7) applications,
spreading requests across multiple servers.
- manpage: fast and reliable http reverse proxy and load balancer
- LVS (Linux Virtual Server): an L4-only alternative; kernel-space implementation.
frontend mqtt_fe
option tcplog
bind :1883
mode tcp
timeout client 90s
default_backend mqtt_be
backend mqtt_be
mode tcp
timeout server 90s
server mqtt.1 10.0.0.11:1883 maxconn 50000 check inter 2000 rise 2 fall 3
server mqtt.2 10.0.0.22:1883 maxconn 50000 check inter 2000 rise 2 fall 3
server mqtt.3 10.0.0.13:1883 maxconn 50000 check inter 2000 rise 2 fall 3
frontend mqttwss_fe
option httplog
bind :8083
bind :8483 ssl crt mqtt.pem
redirect scheme https if !{ ssl_fc }
mode http
default_backend mqttwss_be
backend mqttwss_be
mode http
cookie SRV insert indirect nocache
server mqtt.1 10.0.0.11:80 cookie mqtt.1 maxconn 50000 check inter 2000 rise 2 fall 3
server mqtt.2 10.0.0.22:80 cookie mqtt.2 maxconn 50000 check inter 2000 rise 2 fall 3
server mqtt.3 10.0.0.13:80 cookie mqtt.3 maxconn 50000 check inter 2000 rise 2 fall 3
frontend HttpFrontend
bind *:80
mode http
acl fooBackend hdr_beg(host) -i foo.
acl barBackend hdr_beg(host) -i bar.
use_backend fooBackend if fooBackend
use_backend barBackend if barBackend
default_backend bazBackend
<...>
- Use haproxy instead of AWS ELB
- Update haproxy to use all instances running in a security group.
update-haproxy.py [-h] --security-group SECURITY_GROUP [SECURITY_GROUP ...] --access-key ACCESS_KEY --secret-key SECRET_KEY [--output OUTPUT] [--template TEMPLATE] [--haproxy HAPROXY] [--pid PID] [--eip EIP] [--health-check-url HEALTH_CHECK_URL]
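A minimal sketch of the script's core idea using boto3 (boto3, the region, and the group name are assumptions; the real script takes explicit keys and renders a template): list the running instances of a security group and emit one HAProxy server line per instance.

```python
import boto3  # assumption: boto3; the original script passes --access-key/--secret-key explicitly

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is illustrative

# Running instances that belong to the given security group (name is illustrative).
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "instance.group-name", "Values": ["my-app-sg"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

# One HAProxy "server" line per instance; the real script writes these into the
# config (via --template/--output) and reloads haproxy.
for res in reservations:
    for inst in res["Instances"]:
        print(f"server {inst['InstanceId']} {inst['PrivateIpAddress']}:80 check inter 2000 rise 2 fall 3")
```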
- -D : daemon
- -f : config file(/etc/haproxy/haproxy.cfg)
- -p : pid file in which to write its children's pids
- -sf pidlist
- Send FINISH signal to the pids in pidlist after startup.
- The processes which receive this signal will wait for all sessions to finish before exiting.
- -st pidlist
- Send TERMINATE signal to the pids in pidlist after startup.
- The processes which receive this signal immediately terminate, closing all active sessions.
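Together these options support zero-downtime reloads; a typical invocation (paths illustrative) is `haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid)`: the new process starts with the new configuration and then signals the old pids to finish their remaining sessions and exit.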
- use GSLB
- Active/Standby
- sysctl tuning
ulimit -a
- multi-process
- ssl offloading
- dedicate a process to a task
- dedicate a processor to IRQ handling
global
nbproc 4 # number of processes
frontend access_http
bind 0.0.0.0:80
bind-process 1 # dedicate one process to http
mode http
default_backend backend_nodes
frontend access_https
bind 0.0.0.0:443 ssl crt /etc/yourdomain.pem
bind-process 2 3 4 # dedicate the other processes to https
mode http
option forwardfor
option accept-invalid-http-request
reqadd X-Forwarded-Proto:\ https
default_backend backend_nodes
- VRRP(Virtual Router Redundancy Protocol)
- LVS(Linux Virtual Server)
vrrp_instance E1 {
    interface eth0
    state BACKUP
    virtual_router_id 61
    priority 100
    advert_int 1            # advertise every 1 sec to multicast 224.0.0.18
    virtual_ipaddress {
        10.15.7.100/24 dev eth0
        2001:db8:15:7::100/64 dev eth0
    }
    nopreempt
    garp_master_delay 1     # 1 sec delay for gratuitous ARP after transition to MASTER
}
Sync {
    Mode FTFW {
        DisableExternalCache Off
    }
    UDP {
        IPv4_address 10.0.0.1
        IPv4_Destination_Address 10.0.0.2
        Port 3780
        Interface eth2
        SndSocketBuffer 1249280
        RcvSocketBuffer 1249280
        Checksum on
    }
}
- FW1: fails and stops sending VRRP advertisement packets.
- FW2: takes over the VIP and sends a gratuitous ARP packet.
  - The switch updates the MAC address learned on its port.
  - The nodes update their ARP tables.
- FW2: promotes the external cache (FW1's conntrack info) into its internal (kernel) cache.
Corosync provides clustering infrastructure such as membership, messaging and quorum.
# quorum configuration
quorum {
    provider: corosync_votequorum
    two_node: 0
}
# totem protocol settings
totem {
    version: 2
    token: 3000                              # time (ms) without receiving the token before the node is considered failed
    token_retransmits_before_loss_const: 10
    join: 60
    consensus: 3600                          # time (ms) to wait before starting to form a new quorum membership
    ...
}
It is open-source, high-availability resource manager software used on computer clusters since 2004. Its preferred API for this purpose is the OCF resource agent API.
- OCF(Open Cluster Framework)
  - A shell script similar to an LSB (Linux Standard Base) init script.
  - Used to implement a Resource Agent.
  - Pacemaker acts differently depending on the exit code.
Two-node Active/Passive clusters using Pacemaker and DRBD are a cost-effective solution for many High Availability situations
By supporting many nodes, Pacemaker can dramatically reduce hardware costs by allowing several active/passive clusters to be combined and share a common backup node.
When shared storage is available, every node can potentially be used for failover. Pacemaker can even run multiple copies of services to spread out the workload.
node 13: node-03
node 2: node-01
node 9: node-02
property cib-bootstrap-options: \
dc-version=1.1.12-561c4cf \
cluster-infrastructure=corosync \
no-quorum-policy=stop \
stonith-enabled=false \
start-failure-is-fatal=false \
symmetric-cluster=false \
last-lrm-refresh=1490016772
- The remaining two nodes keep operating normally.
- The node in trouble cannot form a quorum, and all resources on that node are stopped!
  - no-quorum-policy=stop
conntrackd: one master on one node, with slaves on the other nodes. vip public: a single resource on one node.
primitive p_conntrackd ocf:fuel:ns_conntrackd \
op monitor interval=30 timeout=60 \
op monitor interval=27 role=Master timeout=60 \
params bridge=br-mgmt \
meta migration-threshold=INFINITY failure-timeout=180s
primitive vip__vrouter_pub ocf:fuel:ns_IPaddr2 \
op monitor interval=5 timeout=20 \
op start interval=0 timeout=30 \
op stop interval=0 timeout=30 \
meta migration-threshold=3 failure-timeout=60 resource-stickiness=1
ms master_p_conntrackd p_conntrackd \
meta notify=true ordered=false interleave=true clone-node-max=1 master-max=1 master-node-max=1
vip internal, vip public and conntrackd run on the same node.
location master_p_conntrackd-on-node-01 master_p_conntrackd 100: node-01
location master_p_conntrackd-on-node-02 master_p_conntrackd 100: node-02
location master_p_conntrackd-on-node-03 master_p_conntrackd 100: node-03
colocation conntrackd-with-pub-vip inf: vip__vrouter_pub:Started master_p_conntrackd:Master
colocation vip__vrouter-with-vip__vrouter_pub inf: vip__vrouter vip__vrouter_pub
- WSRep(Write set replication)
- GTID(Global Transaction ID) : {uuid}:{sequence number}
- State: INIT -> JOINER -> JOINED -> SYNCED
- SST(State Snapshot Transfers)
- IST(Incremental State Transfers)
- IST trigger conditions (see the example below):
  - The cluster group's state UUID and the joiner node's state UUID must be the same.
  - All missing write-sets must still exist in the donor's write-set cache.
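For example, if the joiner left at seqno 1000 and the group (same state UUID) is now at seqno 1500, IST transfers only write-sets 1001-1500 from the donor's cache; if any of those have already been purged from the cache, the joiner falls back to a full SST.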
- zookeeper, etcd, consul: building blocks for your own coordinator; loosely connected.
- haproxy: failover and load balancing for microservices.
- Pacemaker: really only designed to do membership and failure detection at small scale (<50 nodes); tightly connected.
- Technology : ...
- Process : documented, clear ownership.
- People : skill set, attitudes, leadership, role & responsibility.