Disclaimer: I'm super new to DynamoDB, so if you find I wrote something incorrect or stupid, just kindly send me a comment :)
- database -> tables -> items -> attributes
- db = a collection of tables
- table = a collection of items
- item = a collection of attributes
- use HTTP + JSON to talk with DynamoDB server
- use Amazon Elastic MapReduce to join/export/import/query tables.
- attribute value cannot be null, empty, or empty set
- data types:
- string, number, binary
- set of string, number, binary e.g. ['ruby', 'rails']
- set = no duplicate value, unordered
- each item max size 64KB
- size = sum of lengths of attribute names + values (binary/UTF-8 length)
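  - e.g. a (made-up) item with attributes Id="42" and Msg="hello" is (2+2) + (3+5) = 12 bytes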
- 1 required attribute = primary key
- GetItem, PutItem, UpdateItem (ADD/PUT/DELETE attribute actions), DeleteItem
- BatchGetItem, BatchWriteItem (put/delete)
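A minimal sketch of these item operations using boto3, the Python SDK (my addition, not from the notes; table and attribute names are made up):

```python
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('ProductCatalog')  # hypothetical table

# PutItem - create the item, or replace it wholesale if it exists
table.put_item(Item={'Id': 42, 'Title': 'DynamoDB notes'})

# GetItem - fetch one item by primary key
item = table.get_item(Key={'Id': 42}).get('Item')

# UpdateItem - modify individual attributes in place
table.update_item(Key={'Id': 42},
                  UpdateExpression='SET Title = :t',
                  ExpressionAttributeValues={':t': 'Updated title'})

# DeleteItem - remove the item by primary key
table.delete_item(Key={'Id': 42})

# BatchWriteItem - batch_writer buffers puts/deletes into batch requests
with table.batch_writer() as batch:
    for i in range(5):
        batch.put_item(Item={'Id': i, 'Title': 'item %d' % i})
```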
- put - 1) create if not exists, 2) replace if exists (can turn off with `Exists`)
- conditional write - e.g. put item.amount=10 only if item.amount=8
- atomic counter - increment/decrement a number attribute; not idempotent! (retrying a failed request may apply the change twice)
- return values before write - e.g. a delete can return the item's old attribute values in the same request
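A sketch of the conditional write, atomic counter, and return-values ideas with boto3 (assumed names; the amount=8 condition mirrors the example above):

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource('dynamodb').Table('Accounts')  # hypothetical

# conditional write: set amount=10 only if amount is currently 8
try:
    table.update_item(Key={'Id': 1},
                      UpdateExpression='SET amount = :new',
                      ConditionExpression='amount = :old',
                      ExpressionAttributeValues={':new': 10, ':old': 8})
except ClientError as e:
    # ConditionalCheckFailedException means the write was rejected
    if e.response['Error']['Code'] != 'ConditionalCheckFailedException':
        raise

# atomic counter: ADD increments server-side, but a retried request
# may increment twice - hence "not idempotent"
table.update_item(Key={'Id': 1},
                  UpdateExpression='ADD hits :one',
                  ExpressionAttributeValues={':one': 1})

# return values before write: delete and get the old attributes back
old = table.delete_item(Key={'Id': 1},
                        ReturnValues='ALL_OLD').get('Attributes')
```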
- when read, can choose between 1) strong consistency, 2) eventual consistency
- Strong Read costs 2x read capacity unit of Eventual Read
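In boto3 (again my assumption, not the notes'), the choice is a per-request flag:

```python
import boto3

table = boto3.resource('dynamodb').Table('ProductCatalog')  # hypothetical

# eventually consistent read - the cheaper default
item = table.get_item(Key={'Id': 42}).get('Item')

# strongly consistent read - costs 2x the read capacity
item = table.get_item(Key={'Id': 42}, ConsistentRead=True).get('Item')
```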
- primary key
  - hash key
  - hash key + range key
- calculate throughput - capacity unit
  - 1 read capacity unit
    - = 1 strongly consistent read operation (max 4KB) / second
    - = 2 eventually consistent read operations (max 4KB) / second
  - 1 write capacity unit = 1 write operation (max 1KB) / second
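A back-of-envelope helper for this arithmetic (my own sketch of the rounding rules stated above):

```python
import math

def read_capacity_units(item_size_kb, reads_per_sec, strongly_consistent=True):
    """Reads are billed in 4KB blocks; eventually consistent reads cost half."""
    units = math.ceil(item_size_kb / 4.0) * reads_per_sec
    return units if strongly_consistent else math.ceil(units / 2.0)

def write_capacity_units(item_size_kb, writes_per_sec):
    """Writes are billed in 1KB blocks."""
    return math.ceil(item_size_kb) * writes_per_sec

# e.g. 500 eventually consistent reads/sec of 3KB items -> 250 units
print(read_capacity_units(3, 500, strongly_consistent=False))
print(write_capacity_units(1.5, 100))  # 1.5KB rounds up to 2KB -> 200 units
```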
- table hash key selection does affect actual throughput
- why: dynamo distributes items across partitions based on hash key
- good: random hash key
- bad: hash key with few possible values (e.g. status code)
- bad: hash key with certain popular values (e.g. 50% of items have the same device code)
- e.g. events, logs, messages
- access pattern: new data has more frequent access than old data.
- tips: split data by year/month/week into multiple tables e.g. events_2013jan
- then adjust read/write throughput for each table based on access pattern
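One way to route writes to such per-period tables (a sketch; only the events_2013jan naming comes from these notes):

```python
import boto3
from datetime import datetime

def events_table_name(ts):
    # e.g. datetime(2013, 1, 15) -> 'events_2013jan'
    return 'events_' + ts.strftime('%Y%b').lower()

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(events_table_name(datetime.utcnow()))
table.put_item(Item={'EventId': 42, 'Payload': '...'})  # hypothetical schema
# older months' tables can then have their throughput dialed down
```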
- The cost of updating a single attribute of an item is based on the full size of the item.
- e.g. split ProductCatalog into ProductCatalog (frequent read), ProductPricingAvailability (frequent write), ExtendedProductCatalog (seldom read)
- gzip and store as binary e.g. a long review message (see the sketch after this list)
- store in Amazon S3 e.g. images
- split a large item into multiple items stored in separate table
- not recommended
- need to handle cross-item transaction yourself
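A sketch of the gzip idea mentioned above (boto3; attribute names are my own):

```python
import gzip
import boto3
from boto3.dynamodb.types import Binary

table = boto3.resource('dynamodb').Table('ProductReviews')  # hypothetical

# write: compress the long text so it counts far fewer bytes toward item size
message = 'a very long review message ...' * 100
table.put_item(Item={'Id': 1,
                     'Message': Binary(gzip.compress(message.encode('utf-8')))})

# read: decompress on the way out
raw = table.get_item(Key={'Id': 1})['Item']['Message']
print(gzip.decompress(raw.value).decode('utf-8')[:30])
```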
- a single `Scan` request can consume (1 MB page size / 4 KB item size) / 2 (eventually consistent reads) = 128 read operations.
- a larger number of smaller `Scan` or `Query` operations would allow your other critical requests to succeed without throttling.
- some applications handle `Scan` load by rotating traffic hourly between two tables - one for critical traffic, and one for bookkeeping. Other applications can do this by performing every write on two tables: a "mission-critical" table, and a "shadow" table.
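A sketch of the smaller-scans idea: cap the page size with Limit and pace the pages so each request consumes only a few read units (handler and numbers are assumptions):

```python
import time
import boto3

table = boto3.resource('dynamodb').Table('Bookkeeping')  # hypothetical

kwargs = {'Limit': 100}  # small pages instead of full 1MB pages
while True:
    page = table.scan(**kwargs)
    for item in page['Items']:
        handle(item)  # hypothetical handler
    if 'LastEvaluatedKey' not in page:
        break  # no more pages
    kwargs['ExclusiveStartKey'] = page['LastEvaluatedKey']
    time.sleep(0.1)  # leave capacity for critical traffic between pages
```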
- 2 types: local, global
- for each table - max 5 local secondary indexes, 5 global secondary indexes
- cannot add/modify/delete a secondary index on an existing table.
- when creating several tables with secondary indexes, must create them one by one, otherwise `LimitExceededException`.
- when issuing a `Query`, must specify the index name to use.
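A sketch of a Query naming its index (boto3; table and index names assumed):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('Orders')  # hypothetical

# the index to read from must be named explicitly
resp = table.query(
    IndexName='OrderDateIndex',  # hypothetical secondary index
    KeyConditionExpression=(Key('CustomerId').eq(42) &
                            Key('OrderDate').begins_with('2013-01')))
for item in resp['Items']:
    print(item)
```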
- primary key = hash key + range key
- hash key = table's hash key, range key = any scalar attribute in table
- for each hash key value, total size of indexed items <= 10GB
- query single partition only
- can choose: strong or eventual consistency read.
- when query index, consume read capacity units from the table.
- when write table, write to index too, charge extra write capacity units to table.
- when query index, can select any attributes (costs extra fetches to table if select attributes not included in index)
- take advantage of sparse index
- an item only appears in the index when its index range-key attribute exists
- e.g. to search active users, add attribute IsActive="Y" to active users, and delete the "IsActive" attribute from inactive users. A query on the index then reads fewer items and costs less.
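A sketch of maintaining that sparse attribute (boto3; sketched here with a global index keyed on IsActive, since querying by the attribute alone needs it as an index hash key - all names invented):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('Users')  # hypothetical

# activate: adding IsActive makes the item appear in the sparse index
table.update_item(Key={'UserId': 42},
                  UpdateExpression='SET IsActive = :y',
                  ExpressionAttributeValues={':y': 'Y'})

# deactivate: removing the attribute drops the item from the index entirely
table.update_item(Key={'UserId': 7},
                  UpdateExpression='REMOVE IsActive')

# the index holds only active users, so this query reads fewer items
active = table.query(IndexName='IsActiveIndex',  # hypothetical sparse index
                     KeyConditionExpression=Key('IsActive').eq('Y'))
```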
- primary key = hash key OR (hash key + range key)
- free to select any hash key, range key
- no max size restrictions
- query across all partitions
- supports eventually consistent reads only.
- has its own read/write throughput capacity (charged separately)
- when query index, can only select attributes included in the index
- the key values in a global secondary index do not need to be unique
- select key for better distribution across partitions
- take advantage of sparse index
- tips: for quick lookup - create a global index with the same primary key as the table, but a subset of attributes.
- tips: global index as read replica - to deal with high traffic, clone the table into a global index, then use it for high-volume reads; the original table continues to handle writes.
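A sketch of creating a table with a global index that carries its own throughput (boto3; all names and numbers invented - a KEYS_ONLY projection gives the quick-lookup pattern, while a larger projection with high read capacity gives the read-replica pattern):

```python
import boto3

client = boto3.client('dynamodb')

client.create_table(
    TableName='ProductCatalog',  # hypothetical
    AttributeDefinitions=[
        {'AttributeName': 'Id', 'AttributeType': 'N'},
        {'AttributeName': 'Category', 'AttributeType': 'S'}],
    KeySchema=[{'AttributeName': 'Id', 'KeyType': 'HASH'}],
    ProvisionedThroughput={'ReadCapacityUnits': 10, 'WriteCapacityUnits': 10},
    GlobalSecondaryIndexes=[{
        'IndexName': 'CategoryIndex',
        'KeySchema': [{'AttributeName': 'Category', 'KeyType': 'HASH'}],
        # KEYS_ONLY keeps the index small; queries can only select these keys
        'Projection': {'ProjectionType': 'KEYS_ONLY'},
        # charged and provisioned separately from the table itself
        'ProvisionedThroughput': {'ReadCapacityUnits': 50,
                                  'WriteCapacityUnits': 5}}])
```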
See http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/BestPractices.html
See http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html