- This talk was really an info session for DataStax Enterprise, which explains why the event was free.
- academy.datastax.com for free online classes
- It can be difficult to go from open-source Cassandra to the enterprise edition.
- Hadoop Integration - He doesn't see as much of this now, given the growing popularity of Spark.
- Spark Integration - Leverages Cassandra: it is location aware and knows the partition keys.
- Q&A: Why only use Oracle Java? Performance issues with OpenJDK.
- When looking for advice/solutions, make sure the advice is from the last 6 months to a year (e.g., don't use Hector for anything).
- It was a good intro to Spark, but he ran out of time before really talking about Cassandra and Spark working together :(
- Erich Ess, CTO of SimpleRelevance
- Cassandra for recommendation data (Sounded like they only had a few months of operational experience). Spark to "augment ETL process", "post analysis on recommendation results", "interactive analysis on raw data"
- Spark knows where the Cassandra data is. Will prefer local nodes. It seems like with DataStax Enterprise, Spark will just work.
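A minimal sketch of what that looks like from the DSE Spark shell (which provides `sc` with the connector already configured); the keyspace and table names here are made up:

```scala
import com.datastax.spark.connector._

// Each partition of this RDD maps to a Cassandra token range, so Spark can
// prefer scheduling tasks on the nodes that actually own that slice of data.
val recs = sc.cassandraTable("simple_relevance", "recommendations")
println(recs.count())
```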
- Spark is a distributed compute platform. In-memory computation. Batch/stream processing. Can hook into Kafka.
- Spark Context - The connection to the cluster. The Spark Shell automatically creates one. It also connects to data sources.
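Outside the shell you create the context yourself; a minimal sketch (app name is hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Name the app (shows up in the Spark UI) and point at a master.
val conf = new SparkConf()
  .setAppName("notes-demo")   // hypothetical app name
  .setMaster("local[*]")      // use a real cluster URL in production
val sc = new SparkContext(conf)
```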
- Resilient Distributed Dataset (RDD) - Abstracts distributed data. The core of Spark. Think of it as a list or enumerable object that you can step across, writing a chain of transformations over it. A functional approach.
- Functional Transformations - How you interact with RDDs
- Map, reduce, filter, etc. Lazily evaluated - don't expect anything to run until you perform an action (see the sketch below).
- You get another RDD that has been transformed.
- Can cache RDD into memory.
- `groupBy`s can be problematic because of shuffling.
- Action functions - unwrap the RDD and return a value of a non-RDD type. They force the transformation chain to be evaluated.
- .count may be the easiest way to get an entire dataset into a cache
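A toy sketch of the transformation/action split and caching:

```scala
// Transformations (filter, map) are lazy: each just returns a new RDD.
val nums    = sc.parallelize(1 to 1000000)
val evens   = nums.filter(_ % 2 == 0)
val squared = evens.map(n => n.toLong * n)

squared.cache() // mark for in-memory caching; nothing is computed yet

// Actions force the chain to run; the first one also fills the cache.
println(squared.count())
println(squared.reduce(_ + _)) // served from the cached partitions
```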
- Fault tolerance: Spark keeps a lineage (a family tree) of every RDD, so if something fails, Spark replays the source data through the transformation chain. No periodic snapshots.
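Continuing the sketch above, you can inspect that lineage directly:

```scala
// Prints the chain of parent RDDs Spark would replay to rebuild a lost partition.
println(squared.toDebugString)
```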
- Spark Shell: Scala or Python Shell. Interactive distributed computing.
- A pretty basic history of their experience with Cassandra. Nothing super informative, but they are a Chicago company with a lot of Cassandra experience on AWS.
- Oo these guys are in our old office. Ad-tech. Not currently using Enterprise.
- 3 1/2 years of Cassandra experience.
- Stats: 140+ nodes (i2.xlarge), 15B+ rows, 45B rows in index table, 30+ TB of data on disk
- Use cases: matching service ("unified customer view", "measurement and activation"), metrics (time series data), audit log
- Started with a keyspace that replicates 2 copies to every region.
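That presumably looks something like the keyspace below (a sketch using the DataStax Java driver; the keyspace and datacenter names are invented):

```scala
import com.datastax.driver.core.Cluster

// Connect and create a keyspace with two replicas in each region/datacenter.
val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()
session.execute(
  """CREATE KEYSPACE IF NOT EXISTS adtech
    |WITH replication = {
    |  'class': 'NetworkTopologyStrategy',
    |  'us-east': 2,
    |  'us-west': 2
    |}""".stripMargin)
cluster.close()
```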
- Data migrations: sometimes it may make more sense to copy everything over to a new cluster than to alter tables in place. The example he gave was the move to vnodes, but he also vaguely mentioned some limitations of altering tables.
- Adding a new region - They added a read-only region copy to their cluster.
- His advice: SSDs, stay current but not too current, keep up on repairs, simulate production load and test configurations
- Repairing the entire token space can take several days to a week.
- An interesting idea they are toying with is creating their own snitch to be smarter about data replication in AWS.
- Super basic talk on the differences between the feature sets of SQL and CQL.
- An interesting (but basic) breakdown of how to choose physical hardware, from a sales engineering manager at DataStax.
- Know your workload, know your processor, and saturate it.
- CPU - Lots of cache; >= 20 MB.
- RAM - 4 DIMMs per socket, 8-16 GB DIMMs. A theoretical ~7% performance increase going from single rank to dual rank.
- Storage - Use flash.
- CQL Sizer
- OpsCenter Capacity Planner - monitors data growth rate.
- A pretty interesting talk.
- Use Case: Top Gamer Scores
- Users generate events that are stored in the cluster. Time series data. Cassandra is good at arbitrarily scaling this kind of data.
- The partition key will be (userId, gameId); the clustering key will be scoreTimestamp, ordered descending.
- How do you get to a leaderboard of that data? Tough with vanilla Cassandra.
- Spark! Apply some map/reduce functions to the data nightly (or more often) and dump the results into a new Cassandra table, which will also hold time series data (see the sketch below).
- Specifying a compaction strategy of `DateTieredCompactionStrategy` and a time to live of 7 hours.
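A sketch of how that might look end to end with the spark-cassandra-connector. Every name here is invented (and a `games` keyspace is assumed to exist), but the clustering order, `DateTieredCompactionStrategy`, and 7-hour TTL follow the talk:

```scala
import org.apache.spark.SparkContext._
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector

CassandraConnector(sc.getConf).withSessionDo { session =>
  // Raw time series of score events, newest first within a (user, game) partition.
  session.execute(
    """CREATE TABLE IF NOT EXISTS games.scores (
      |  user_id uuid, game_id text, score_ts timestamp, score bigint,
      |  PRIMARY KEY ((user_id, game_id), score_ts)
      |) WITH CLUSTERING ORDER BY (score_ts DESC)""".stripMargin)

  // Rollup table the nightly job writes into; DTCS plus a 7-hour TTL
  // (25200 seconds) so old rollups age out on their own.
  session.execute(
    """CREATE TABLE IF NOT EXISTS games.leaderboard (
      |  game_id text, run_ts timestamp, score bigint, user_id uuid,
      |  PRIMARY KEY (game_id, run_ts, score, user_id)
      |) WITH CLUSTERING ORDER BY (run_ts DESC, score DESC, user_id ASC)
      |  AND compaction = {'class': 'DateTieredCompactionStrategy'}
      |  AND default_time_to_live = 25200""".stripMargin)
}

// The rollup itself: best score per (user, game), dumped into the leaderboard.
val runTs = new java.util.Date()
sc.cassandraTable("games", "scores")
  .map(r => ((r.getUUID("user_id"), r.getString("game_id")), r.getLong("score")))
  .reduceByKey(_ max _)
  .map { case ((user, game), best) => (game, runTs, best, user) }
  .saveToCassandra("games", "leaderboard",
    SomeColumns("game_id", "run_ts", "score", "user_id"))
```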
- Performance impact of compaction: non-zero, depends on the system. Experiment and benchmark. Make sure your measurements take it into account.
- Use Case: Social image storage
- User creates an account, uploads images, image is distributed across datacenters, user can check access patterns.
- Access Patterns: Recall a single image, all images in a given time range, specific images over a time range, times each image was accessed.
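A sketch of a table shape that serves those reads (all names hypothetical):

```scala
import com.datastax.driver.core.Cluster

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()

// One partition per user, clustered newest-first by upload time, so "one
// image", "a time range", and "specific images in a range" are all slices
// of a single partition.
session.execute(
  """CREATE TABLE IF NOT EXISTS media.images_by_user (
    |  user_id uuid, upload_ts timestamp, image_id uuid, tags set<text>,
    |  PRIMARY KEY (user_id, upload_ts, image_id)
    |) WITH CLUSTERING ORDER BY (upload_ts DESC, image_id ASC)""".stripMargin)

// "All images in a given time range" is then a single-partition slice:
//   SELECT image_id FROM media.images_by_user
//    WHERE user_id = ? AND upload_ts >= ? AND upload_ts < ?;
// "Times each image was accessed" would live in a separate counter table,
// since counters can't share a table with regular columns.
```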
- Denormalizing data isn't always bad (the user table has a list of emails, the image table has a list of tags - which you could search with the Solr integration).
- Images are `blob`s and are stored in `blob_chunks`.
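Presumably something like this table (a sketch, reusing the `session` from the block above; column names are guesses):

```scala
// Fixed-size pieces of each image, one row per chunk, so no single cell
// has to hold a multi-megabyte blob.
session.execute(
  """CREATE TABLE IF NOT EXISTS media.blob_chunks (
    |  image_id uuid, chunk_ix int, data blob,
    |  PRIMARY KEY (image_id, chunk_ix)
    |)""".stripMargin)
```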
- Use Case: User Registration
- What if two users are being created at the same time?
- No ACID transactions.
- Lightweight Transactions, which use Paxos - "IF NOT EXISTS":
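A sketch of what that looks like, reusing `session` from the earlier block (the `auth.users` table is hypothetical, with `user_id` assumed to be a timeuuid; the driver exposes the Paxos outcome in an `[applied]` column):

```scala
// Conditional insert: only succeeds if no row exists for this username.
val rs = session.execute(
  """INSERT INTO auth.users (username, user_id)
    |VALUES ('erich', now()) IF NOT EXISTS""".stripMargin)
println(rs.one().getBool("[applied]")) // false means the name was already taken
```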
- How big can a row get? How big can a column get? (A few MB.) How big can a partition get? (Rule of thumb: 100 MB.)
- A 10-minute allegory about squirrels hiding their nuts in multiple trees at different school campuses.
- Spent a lot of time defending eventual consistency.
- Clients are the best judges of how valuable their time is.
- Applications can demand consistent operations.
- If you are doing lightweight transaction writes on a table, you need to do all writes to that table as lightweight transactions, or else the plain writes will squash them.
- CASSANDRA-5797