- P2P. No concept of ZooKeeper or named nodes, so there is no single point of failure. Every node in the cluster has the same responsibilities and no node gets any special treatment.
- Adding more machines means a higher number of operations/sec.
- NTP for synchronized clocks across nodes
- Disable swap via
sudo swapoff --all
- Ensure that the cluster name and seeds are set correctly. Seed information is used when joining a node to the cluster.
- cassandra.yaml - most of the configuration for the node/cluster is contained in this file
- cassandra-env.sh - properties for setting up Java/JVM environment variables
- logback.xml - C* logging levels
- cassandra-rackdc.properties - useful when installing C* across DCs. Contains topology information: how a node knows which DC and rack it is on.
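A minimal sketch of the cassandra.yaml settings mentioned above (cluster name, seeds, snitch); the addresses and names are illustrative:

```yaml
cluster_name: 'ProdCluster'            # must match on every node in the cluster
num_tokens: 256                        # tokens per node (see token assignment below)
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.0.1,10.1.0.1"     # one seed per DC in this example
listen_address: 10.0.0.5               # this node's address
endpoint_snitch: GossipingPropertyFileSnitch   # reads cassandra-rackdc.properties
```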
Location of C* configurations:
Name | Description | Default location |
---|---|---|
commit log | The commit log is an append-only log to which all writes are written in addition to memtables. If you are using spinning disks it is better to have the commit log mounted on a separate disk so this append-only log can be written to as quickly as possible. **This is the first place where writes become durable.** | /var/lib/cassandra/commitlog or $CASSANDRA_HOME/data/commitlog |
data file directories | This is where the SSTables live. SSTables are read-optimized. | /var/lib/cassandra/data or $CASSANDRA_HOME/data/data |
saved caches | Storage directory for commonly read key and row caches. | /var/lib/cassandra/saved_caches or $CASSANDRA_HOME/data/saved_caches |
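The default locations above map to these cassandra.yaml settings (paths shown are the package-install defaults):

```yaml
commitlog_directory: /var/lib/cassandra/commitlog
data_file_directories:
    - /var/lib/cassandra/data
saved_caches_directory: /var/lib/cassandra/saved_caches
```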
System memory | MAX_HEAP_SIZE |
---|---|
< 2GB | 1/2 of system memory |
> 2GB but < 4GB | 1GB |
> 4GB (should always be the case) | 1/4 of system memory, but not greater than 8GB |
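To override the calculated heap size, both values can be set explicitly in cassandra-env.sh (if you set one, set the other as well; the numbers below are illustrative):

```bash
# cassandra-env.sh - explicit heap settings override the automatic calculation above
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="800M"
```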
A C* cluster is 1..n nodes. All nodes in a cluster have the same cluster name in cassandra.yaml and should be able to identify the cluster using the seeds provided.
- Node - 1 cassandra instance
- Rack - Logical set of nodes
- Data center - Logical set of racks (see the cassandra-rackdc.properties sketch below)
[Best practice] One node per DC should be used when defining the seeds for a cluster.
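Each node declares which DC and rack it belongs to in cassandra-rackdc.properties (used e.g. by the GossipingPropertyFileSnitch); a minimal sketch with illustrative names:

```properties
# cassandra-rackdc.properties
dc=east-coast
rack=rack1
```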
TODO
There are two ways of assigning tokens to a Cassandra ring:
- Single token per node: each node gets one token and is responsible for the data between that token and the previous token.
  - Hard to add/remove nodes with this strategy, as adding a node leaves the cluster unbalanced.
- Multiple tokens assigned per node (vnodes): each node is assigned many tokens (256 by default) spread around the ring.
  - Makes it easier to add/remove nodes without making the cluster unbalanced.
  - num_tokens in cassandra.yaml controls how many tokens each node gets. Default value is 256.
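To see how tokens ended up distributed, a couple of standard nodetool commands (run against any node) help; output columns vary slightly by C* version:

```sh
nodetool status   # per-node state, load, number of tokens, and ownership percentage
nodetool ring     # every token in the ring and which node owns it
```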
RF (replication factor) is the total number of times a write will be replicated in the cluster. An RF of 3 means there will be a total of 3 copies of each write in the cluster.
- RF is defined at a **per keyspace** level and ideally should not be changed.
- If a cluster is spread across DCs then RF can be defined per keyspace per DC, e.g. replicate this keyspace 2 times in the east coast DC and 3 times in the west coast DC.
- Each node has a cassandra-rackdc.properties which is used to identify what rack/DC a node belongs to.
- RF is set per keyspace per DC and is defined at the time of keyspace creation.
- Replication strategy is set per keyspace and is defined at the time of keyspace creation (see the CQL sketch after this list).
  - SimpleStrategy - if you are only going to use 1 DC.
  - NetworkTopologyStrategy
    - When using this and replicating data within a DC across nodes, make sure the replicas are not all in the same rack.
    - The endpoint snitch / cassandra-rackdc.properties can be used to identify the rack and DC location of a node.
- **Also notice that there is no concept of primary and slave when writing and replicating data. All nodes are treated equally.**
- When replicating to a remote DC, similar to the local DC, a coordinator is picked in that DC and it is responsible for replicating the data.
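A sketch of setting the replication strategy and per-DC RF at keyspace creation time; keyspace and DC names are illustrative (the DC names must match what the snitch reports, e.g. from cassandra-rackdc.properties):

```sql
-- single-DC cluster: SimpleStrategy with a cluster-wide RF
CREATE KEYSPACE single_dc_ks
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- multi-DC cluster: NetworkTopologyStrategy with RF per DC
CREATE KEYSPACE multi_dc_ks
  WITH replication = {'class': 'NetworkTopologyStrategy',
                      'east-coast': 2,
                      'west-coast': 3};
```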
Consistency level (CL) defines how many replicas will have to agree for a given read/write operation to succeed.
- Can be defined per operation.
- Directly affects the speed of the read/write operation.
- A CL = 1 means that just one of the replicas needs to provide a copy of the data.
- A CL = QUORUM means that a majority (RF/2 + 1) of the replicas will have to agree.
- A CL = LOCAL_QUORUM means that a majority (RF/2 + 1) of the replicas in the local DC will have to agree.
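In cqlsh the consistency level is set per session with the CONSISTENCY command (drivers expose the same choice per statement): with no argument it shows the current level, with an argument it switches subsequent reads/writes in that session to that level.

```
cqlsh> CONSISTENCY
cqlsh> CONSISTENCY QUORUM
cqlsh> CONSISTENCY LOCAL_QUORUM
```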
Assume RF = 3 and CL = 2. Once 2 nodes ack the write, the client will be told success. But the coordinator still needs to replicate the data to the third node. If that 3rd node is dead or not responding, the coordinator will write a copy of this data to the system.hints table and will try to stream it over once the node is back.
- Same concept when DC goes offline
- [Best practice] Need to actively monitor this statistic
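A couple of nodetool commands that help with this; exact pool and command names vary across C* versions, so treat these as a starting point:

```sh
nodetool statushandoff   # whether hinted handoff is currently enabled on this node
nodetool tpstats         # thread pool stats; pending/blocked hint delivery tasks show up here
```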
If you are reading/writing at a CL < RF there is a chance that not all the replicas have caught up and you might get stale data. But eventually the replicas will catch up and all of them will have the same view of the data.
A read immediately followed by a write is considered an anti-pattern in C*, but there are ways to achieve it.
- **Using the below-mentioned CL levels for these reads and writes**
WRITE | READ | Comments |
---|---|---|
CL = ALL | CL = 1 | If you are writing with CL = ALL, all the replicas must already have the data for the write to succeed. So the read can use CL = 1 and can read from any replica. |
CL = QUORUM | CL = QUORUM | You are always reading and writing to majority |
CL = 1 | CL = ALL | You write to one but you read from all to make sure that all the nodes agree. |
**Pick one depending on whether it's a read-heavy or a write-heavy operation. But always check whether the value of immediate consistency is worth the latency.**
[**Best practice**] CL=1 is your friend but use it carefully.
- Lightweight transactions with conditional queries
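A sketch of conditional queries (lightweight transactions, i.e. compare-and-set at the cost of a Paxos round); the table and columns are illustrative:

```sql
-- only insert if the row does not already exist
INSERT INTO users (id, email) VALUES (42, 'a@example.com') IF NOT EXISTS;

-- only apply the update if the current value matches
UPDATE users SET email = 'b@example.com' WHERE id = 42 IF email = 'a@example.com';
```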
**Use cases for read immediately followed by write**
CL = 1 | CL = QUORUM | CL = ALL | Notes |
---|---|---|---|
Lowest latency | Higher latency | Highest latency | Lower is better |
Highest throughput | Lower throughput | Lowest throughput | Higher is better |
Highest availability | Lower availability | Lowest availability | High means the cluster will keep operating even if nodes keep going down |
Stale reads possible | No stale reads if both read and write are at QUORUM | No stale reads if either read or write is at CL = ALL | |
- What is a commit log?
- What are memtables?
- What are SSTables?
- Common nodetool commands
- Common cqlsh commands
- Operations
- How does read repair work?
- How to take down a node?
  - What metrics to observe?
  - What can possibly go wrong?
- How to bring up a node to join the cluster
  - What metrics to observe?
  - What can possibly go wrong?
- What to do when a node goes down?
  - What metrics to look out for?