(Spark + Cassandra) * Docker = <3

How to set up a Spark + Cassandra cluster using Docker?

Spark is hype, Cassandra is cool and Docker is awesome. Let's have some "fun" with all of this to try machine learning without the pain of installing C* and Spark on your computer.

NOTE: Before reading, you should know this was my first attempt at creating this kind of cluster. I have since created a GitHub project to set up a cluster more easily, here: https://github.com/clakech/sparkassandra-dockerized

Install Docker and git
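If you don't have them yet, here is one way to get both on an Ubuntu host, for example (adapt to your distribution):

# install docker via the official convenience script, then git
curl -fsSL https://get.docker.com | sh
sudo apt-get install -y git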

Run a Cassandra 2.1 cluster

Thanks to the official docker image of C*, running a Cassandra cluster is really straightforward: https://registry.hub.docker.com/_/cassandra/

# run your first cassandra node
docker run --name some-cassandra -d cassandra:2.1

# (optional) run some other nodes if you wish
docker run --name some-cassandra2 -d -e CASSANDRA_SEEDS="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' some-cassandra)" cassandra:2.1

Here you have a Cassandra cluster running without installing anything but docker.
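To check that every node joined the ring, you can run nodetool inside the first container (the official image ships with it):

# each node should report status UN (Up/Normal)
docker exec -it some-cassandra nodetool status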

To test your cluster, you can run a cqlsh console:

# run a cqlsh console to test your cluster
docker run -it --link some-cassandra:cassandra --rm cassandra:2.1 cqlsh cassandra

And now, create some data and retrieve it:

cqlsh> CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
cqlsh> CREATE TABLE test.kv(key text PRIMARY KEY, value int);
cqlsh> INSERT INTO test.kv(key, value) VALUES ('key1', 1);
cqlsh> INSERT INTO test.kv(key, value) VALUES ('key2', 2);
cqlsh> SELECT * FROM test.kv;

 key  | value
------+-------
 key1 |     1
 key2 |     2

(2 rows)
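Since key is the partition key, you can also fetch a single row directly; a quick extra example (not part of the original session):

cqlsh> SELECT value FROM test.kv WHERE key = 'key1';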

Here you have a running and functional C* cluster! #nice

Run a Spark 1.3 cluster

Thanks to epahomov, running a Spark cluster with the spark-cassandra-connector 1.3.0-RC1 is blazing fast too: https://github.com/epahomov/docker-spark

I just forked his repository and added the fat jar assembly of spark-cassandra-connector to the image: https://github.com/clakech/docker-spark

# clone the fork
git clone https://github.com/clakech/docker-spark.git
cd docker-spark

# run a master spark node
./start-master.sh

# run some Spark worker nodes (1 is enough)
./start-worker.sh

# run a spark shell console to test your cluster
./spark-shell.sh

# check you can retrieve your Cassandra data using Spark

scala> import com.datastax.spark.connector._
...
scala> val rdd = sc.cassandraTable("test", "kv")
rdd: com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow] = CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:15

scala> println(rdd.count)
2

scala> println(rdd.first)
CassandraRow{key: key1, value: 1}

scala> println(rdd.map(_.getInt("value")).sum)
3.0

scala> val collection = sc.parallelize(Seq(("key3", 3), ("key4", 4)))
collection: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[4] at parallelize at <console>:24

scala> collection.saveToCassandra("test", "kv", SomeColumns("key", "value"))
...

scala> println(rdd.map(_.getInt("value")).sum)
10.0
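The connector can also push a column projection down to Cassandra with select, so only the needed columns are fetched from the cluster; a small extra sketch (not part of the original session):

scala> println(sc.cassandraTable("test", "kv").select("value").map(_.getInt("value")).sum)
10.0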

THE END of the boring installation part; now eat and digest data to extract value!

PS: This is not a recommended architecture for using Spark & Cassandra, because each Spark worker/slave should run on a Cassandra node in order to have very reactive behavior when Spark interacts with C*. #notProductionReady #youHaveBeenWarned => another (better?) way to install a Spark + C* cluster is described here: https://github.com/clakech/sparkassandra-dockerized
