@jgeek
Last active October 5, 2020 16:03
Cassandra notes
Cassandra uses a special type of primary key called a composite key (or compound key) to represent groups of related rows, also called partitions. The composite key consists of a partition key, plus an optional set of clustering columns. The partition key is used to determine the nodes on which rows are stored and can itself consist of multiple columns. The clustering columns are used to control how data is sorted for storage within a partition. Cassandra also supports an additional construct called a static column, which is for storing data that is not part of the primary key but is shared by every row in a partition.
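As a sketch of these concepts, a hypothetical table (names invented here, not from the book) might combine a multi-column partition key, a clustering column, and a static column:

    create table room_occupancy_by_hotel_date (
        hotel_id text,
        date date,
        room_number smallint,
        guest_name text,
        hotel_name text static,    -- shared by every row in the partition
        PRIMARY KEY ((hotel_id, date), room_number)
    );

Here (hotel_id, date) is the partition key, room_number is the clustering column, and hotel_name is a static column stored once per partition rather than once per row.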
So the outermost structure in Cassandra is the cluster, sometimes called
the ring, because Cassandra assigns data to nodes in the cluster by arranging them in
a ring.
A table is a container for an ordered collection of rows, each of which is itself an
ordered collection of columns. Rows are organized in partitions and assigned to
nodes in a Cassandra cluster according to the column(s) designated as the partition
key. The ordering of data within a partition is determined by the clustering columns.
Remember that TTL is stored on a per-column level for nonprimary key columns. There is currently no mechanism for setting TTL at a row level directly after the initial insert; you would instead need to reinsert the row, taking advantage of Cassandra's upsert behavior. As with the timestamp, there is no way to obtain or set the TTL value of a primary key column, and the TTL can only be set for a column when you provide a value for the column.
If you want to set TTL across an entire row, you must provide a value for every nonprimary key column in your INSERT or UPDATE command.
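To illustrate, setting a TTL for every nonprimary key column via a single INSERT looks something like this (table and values are hypothetical):

    INSERT INTO user_sessions (session_id, user_name, last_page)
    VALUES ('abc123', 'alice', '/home')
    USING TTL 3600;    -- both nonprimary key values expire in one hour

The primary key column session_id carries no TTL, but because every nonprimary key column receives a value, the whole row effectively expires together.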
Primary Keys Are Forever
After you create a table, there is no way to modify the primary key,
because this controls how data is distributed within the cluster, and
even more importantly, how it is stored on disk.
create table stackoverflow_composite (
    key_part_one text,
    key_part_two int,
    data text,
    PRIMARY KEY (key_part_one, key_part_two)
);
With a COMPOSITE primary key, the first part of the key is the PARTITION KEY (in this example, key_part_one) and the second part is the CLUSTERING KEY (in this example, key_part_two).
The partition key is responsible for data distribution across your nodes.
The clustering key is responsible for data sorting within the partition.
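Using the table above, a typical query fixes the partition key with an equality predicate and may apply a range to the clustering key (the literal values here are just placeholders):

    SELECT * FROM stackoverflow_composite
    WHERE key_part_one = 'some_key'
      AND key_part_two >= 10 AND key_part_two < 100;

The partition key locates the single partition to read, and the clustering-key range is served by the sorted storage order within that partition.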
Cassandra is one of the few databases that provides race-free increments across data centers.
The counter type has some special restrictions. It cannot be
used as part of a primary key. If a counter is used, all of the columns other than
primary key columns must be counters.
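A minimal counter table sketch (names invented) that respects these restrictions — the counter is the only nonprimary key column:

    create table page_views (
        page text PRIMARY KEY,
        views counter
    );

    UPDATE page_views SET views = views + 1 WHERE page = '/home';

Note that counter values cannot be set with INSERT; they can only be incremented or decremented relative to their current value via UPDATE.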
Freezing is a concept that was introduced as a forward compatibility mechanism. For now, you can nest a collection within another collection by marking it as frozen, which means that Cassandra will store that value as a blob of binary data. In the future, when nested collections are fully supported, there will be a mechanism to "unfreeze" the nested collections, allowing the individual attributes to be accessed.
You can also use a collection as a primary key if it is frozen.
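For example, a hypothetical table nesting one collection inside another by freezing the inner collection:

    create table user_tags (
        user_id text PRIMARY KEY,
        tags_by_category map<text, frozen<set<text>>>    -- inner set stored as a blob
    );

Individual elements of the frozen inner set cannot be updated in place; the whole frozen value must be rewritten.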
Design Differences Between RDBMS and Cassandra:
No joins
No referential integrity
Denormalization
Query-first design:
In Cassandra you don't start with the data model; you start with the query model. Instead of modeling the data first and then writing queries, with Cassandra you model the queries and let the data be organized around them. Think of the most common query paths your application will use, and then create the tables that you need to support them.
Designing for optimal storage:
Because Cassandra tables are each stored in separate files on disk, it’s
important to keep related columns defined together in the same table.
A key goal as you begin creating data models in Cassandra is to minimize the number of partitions that must be searched in order to satisfy a given query. Because the partition is a unit of storage that does not get divided across nodes, a query that searches a single partition will typically yield the best performance.
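As a sketch of a table designed around a single-partition query, here is one plausible shape for the available_rooms_by_hotel_date table referenced later in these notes (the exact column definitions are an assumption, not the book's verbatim schema):

    create table available_rooms_by_hotel_date (
        hotel_id text,
        date date,
        room_number smallint,
        is_available boolean,
        PRIMARY KEY ((hotel_id), date, room_number)
    );

A query for one hotel's availability over a date range then touches only the single partition identified by hotel_id.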
Sorting is a design decision:
In Cassandra, however, sorting is treated differently; it is a design decision. The sort order available on queries is fixed, and is determined entirely by the selection of clustering columns you supply in the CREATE TABLE command. The CQL SELECT statement does support ORDER BY semantics, but only in the order specified by the clustering columns (ascending or descending).
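For example (table name and columns invented here), the clustering order is fixed at table creation, and a SELECT can only follow that order or fully reverse it:

    create table posts_by_user (
        user_name text,
        created_at timestamp,
        title text,
        PRIMARY KEY ((user_name), created_at)
    ) WITH CLUSTERING ORDER BY (created_at DESC);

    SELECT * FROM posts_by_user
    WHERE user_name = 'alice'
    ORDER BY created_at ASC;    -- reversing the stored order is allowed

Ordering by a nonclustering column such as title, however, is not supported.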
An important consideration in designing your table’s primary key
is making sure that it defines a unique data element. Otherwise you
run the risk of accidentally overwriting data.
Use clustering columns to store attributes that you need to access
in a range query. Remember that the order of the clustering columns is important.
You’ll learn more about range queries in Chapter 9.
The Wide Partition Pattern
The design of the available_rooms_by_hotel_date table is an instance of the wide partition pattern. This pattern is sometimes called the wide row pattern when discussing databases that support similar models, but wide partition is a more accurate description from a Cassandra perspective. The essence of the pattern is to group multiple related rows in a partition in order to support fast access to multiple rows within the partition in a single query.
The time series pattern is an extension of the wide partition pattern. In this pattern, a
series of measurements at specific time intervals are stored in a wide partition, where
the measurement time is used as part of the partition key. This pattern is frequently
used in domains including business analysis, sensor data management, and scientific
experiments.
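A sketch of the time series pattern, using a hypothetical sensor-readings table: the date portion of the measurement time is pulled into the partition key, so each partition holds one sensor's readings for one day:

    create table readings_by_sensor_date (
        sensor_id text,
        date date,               -- time bucket, part of the partition key
        reading_time timestamp,
        value double,
        PRIMARY KEY ((sensor_id, date), reading_time)
    );

This bounds partition growth over time while keeping each day's readings sorted and readable with a single-partition query.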
One design trap that many new users fall into is attempting to use Cassandra as a queue. Each item in the queue is stored with a timestamp in a wide partition. Items are appended to the end of the queue and read from the front, being deleted after they are read. This is a design that seems attractive, especially given its apparent similarity to the time series pattern. The problem with this approach is that the deleted items are now tombstones that Cassandra must scan past in order to read from the front of the queue. Over time, a growing number of tombstones begins to degrade read performance. We'll discuss tombstones in Chapter 6.
The queue anti-pattern serves as a reminder that any design that relies on the deletion
of data is potentially a poorly performing design.
Taking Advantage of User-Defined Types
User-defined types are frequently used to create logical groupings
of nonprimary key columns, as you have done with the address
user-defined type. UDTs can also be stored in collections to further
reduce complexity in the design.
Remember that the scope of a UDT is the keyspace in which it is
defined. To use address in the reservation keyspace you’re about
to design, you’ll have to declare it again.
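A sketch of an address UDT and a table that embeds it (the field names here are assumptions, not the book's exact definition):

    CREATE TYPE address (
        street text,
        city text,
        postal_code text
    );

    create table hotels (
        hotel_id text PRIMARY KEY,
        name text,
        address frozen<address>
    );

Because the UDT is scoped to the keyspace, the same CREATE TYPE statement would have to be executed again in any other keyspace that needs it.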
Evaluating and Refining
Calculating Partition Size:
The first thing that you want to look for is whether your tables will have partitions
that will be overly large, or to put it another way, too wide. Partition size is measured
by the number of cells (values) that are stored in the partition. Cassandra’s hard limit
is two billion cells per partition, but you’ll likely run into performance issues before
reaching that limit. The recommended size of a partition is not more than 100,000
cells.
The number of values (or cells) in the partition (Nv) is equal to the number of static columns (Ns) plus the product of the number of rows (Nr) and the number of values per row. The number of values per row is defined as the number of columns (Nc) minus the number of primary key columns (Npk) and static columns (Ns):
Nv = Ns + Nr × (Nc − Npk − Ns)
For example, a table with 9 columns, of which 3 are primary key columns and 1 is static, holding 5,000 rows per partition, would have Nv = 1 + 5,000 × (9 − 3 − 1) = 25,001 cells.
Breaking Up Large Partitions:
The technique for splitting a large partition is straightforward: add an additional column to the partition key. In most cases, moving one of the existing columns into the partition key will be sufficient. Another option is to introduce an additional column to the table to act as a sharding key, but this requires additional application logic.
Another technique known as bucketing is often used to break the data into moderate-size partitions. For example, you could bucketize the available_rooms_by_hotel_date table by adding a month column to the partition key, perhaps represented as an integer.
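Continuing that example, a bucketized version of the table might move a month column into the partition key (this schema is a sketch, not the book's exact definition):

    create table available_rooms_by_hotel_date (
        hotel_id text,
        month int,               -- e.g., 202010; the bucketing column
        date date,
        room_number smallint,
        is_available boolean,
        PRIMARY KEY ((hotel_id, month), date, room_number)
    );

Each partition now covers one hotel-month rather than an unbounded date range, at the cost of the application computing the month bucket for both reads and writes.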