This Gist assumes you already followed the instructions to install Cassandra, created a keyspace and table, and added some data.
Install Apache Spark with Homebrew:
brew install apache-spark
Clone the installer script from its GitHub Gist:
git clone https://gist.github.com/b700fe70f0025a519171.git
Rename the cloned directory:
mv b700fe70f0025a519171 connector
Run the script:
bash install_connector.sh
Start the Spark master and worker:
/usr/local/Cellar/apache-spark/1.0.2/libexec/sbin/start-all.sh
Make a note of the path to your connector directory.
Open the Spark Shell with the connector:
spark-shell --driver-class-path $(echo path/to/connector/*.jar | sed 's/ /:/g')
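A note on the --driver-class-path value: spark-shell expects a single colon-separated list of jars, but the shell glob expands to space-separated filenames, so the sed call swaps the spaces for colons. Here is a minimal sketch of the idiom against a throwaway directory (the directory and jar names are made up for illustration):

```shell
# Create a throwaway directory with two fake jars.
mkdir -p /tmp/connector-demo
touch /tmp/connector-demo/a.jar /tmp/connector-demo/b.jar

# The glob expands to "a.jar b.jar" (space-separated);
# sed rejoins the names with colons, the classpath separator.
CLASSPATH=$(echo /tmp/connector-demo/*.jar | sed 's/ /:/g')
echo "$CLASSPATH"
# prints: /tmp/connector-demo/a.jar:/tmp/connector-demo/b.jar
```

Note this simple substitution assumes none of the jar paths contain spaces, which holds for the connector directory used here.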
Wait for everything to load. Once it is finished, you'll see the Scala prompt:
scala>
You'll need to stop the default SparkContext, since you'll create your own with the script.
scala> sc.stop
Once that is finished, get ready to paste the script in:
scala> :paste
Paste in the script below, making sure to change the path to the connector, and to replace "keyspace" and "table" with the names of your own keyspace and table:
import com.datastax.spark.connector._
import org.apache.spark._
val conf = new SparkConf()
conf.set("spark.cassandra.connection.host", "127.0.0.1")
conf.set("spark.home","/usr/local/Cellar/apache-spark/1.0.2/libexec")
// You may not need these two settings if you haven't set up password authentication in Cassandra
conf.set("spark.cassandra.auth.username", "cassandra")
conf.set("spark.cassandra.auth.password", "cassandra")
val sc = new SparkContext("spark://localhost:7077", "Cassandra Connector Test", conf)
sc.addJar("path/to/connector/cassandra-driver-core-2.0.3.jar")
sc.addJar("path/to/connector/cassandra-thrift-2.0.9.jar")
sc.addJar("path/to/connector/commons-codec-1.2.jar")
sc.addJar("path/to/connector/commons-lang3-3.1.jar")
sc.addJar("path/to/connector/commons-logging-1.1.1.jar")
sc.addJar("path/to/connector/guava-16.0.1.jar")
sc.addJar("path/to/connector/httpclient-4.2.5.jar")
sc.addJar("path/to/connector/httpcore-4.2.4.jar")
sc.addJar("path/to/connector/joda-convert-1.6.jar")
sc.addJar("path/to/connector/joda-time-2.3.jar")
sc.addJar("path/to/connector/libthrift-0.9.1.jar")
sc.addJar("path/to/connector/lz4-1.2.0.jar")
sc.addJar("path/to/connector/metrics-core-3.0.2.jar")
sc.addJar("path/to/connector/netty-3.9.0.Final.jar")
sc.addJar("path/to/connector/slf4j-api-1.7.5.jar")
sc.addJar("path/to/connector/snappy-java-1.0.5.jar")
sc.addJar("path/to/connector/spark-cassandra-connector_2.10-1.0.0-rc2.jar")
val table = sc.cassandraTable("keyspace", "table")
table.count
Make sure you are on a new line after 'table.count', then press Ctrl-D to exit paste mode.
If everything is set up correctly, the script will run and finish by printing the number of rows in your Cassandra table.
Thanks to Al Toby, Open Source Mechanic at DataStax, for the connector installation script and for the blog post that helped me write this guide.
Have fun with Spark and Cassandra!
I've updated install_connector.sh to use the latest Ivy jar and the latest spark-cassandra-connector:
[master][~/Downloads/tmp/connector]$ cat install_connector.sh
#!/bin/bash
# Installs the spark-cassandra-connector and support libs
mkdir /opt/connector
cd /opt/connector
rm *.jar
curl -o ivy-2.4.0.jar 'https://repo1.maven.org/maven2/org/apache/ivy/ivy/2.4.0/ivy-2.4.0.jar'
curl -o spark-cassandra-connector_2.11-1.5.0-M1.jar 'https://repo1.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.11/1.5.0-M1/spark-cassandra-connector_2.11-1.5.0-M1.jar'
java -jar ivy-2.4.0.jar -dependency org.apache.cassandra cassandra-thrift 2.2.1 -retrieve "[artifact]-[revision].[ext]"
java -jar ivy-2.4.0.jar -dependency com.datastax.cassandra cassandra-driver-core 2.1.7.1 -retrieve "[artifact]-[revision].[ext]"
java -jar ivy-2.4.0.jar -dependency joda-time joda-time 2.8.2 -retrieve "[artifact]-[revision].[ext]"
java -jar ivy-2.4.0.jar -dependency org.joda joda-convert 1.7 -retrieve "[artifact]-[revision].[ext]"
rm -f *-{sources,javadoc}.jar
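After the script finishes, it can be worth confirming that the jars you expect actually landed before launching spark-shell. Below is a small helper of my own, not part of Al's script; the jar patterns and directories are assumptions, so adjust them to your install. It is demonstrated against a throwaway directory:

```shell
# Hypothetical sanity check (my addition): verify that each expected jar
# is present in a connector directory. Patterns are globs so the exact
# version numbers can vary.
check_jars() {
  dir="$1"; shift
  missing=0
  for pattern in "$@"; do
    # ls succeeds only if at least one file matches the glob
    if ! ls "$dir"/$pattern >/dev/null 2>&1; then
      echo "missing: $pattern"
      missing=1
    fi
  done
  return $missing
}

# In real use you would point it at /opt/connector. Demo against a
# throwaway directory seeded with fake jars:
mkdir -p /tmp/connector-check
touch /tmp/connector-check/spark-cassandra-connector_2.11-1.5.0-M1.jar
touch /tmp/connector-check/cassandra-driver-core-2.1.7.1.jar
check_jars /tmp/connector-check \
  'spark-cassandra-connector_*.jar' \
  'cassandra-driver-core-*.jar' && echo "all expected jars present"
```

If a jar is missing, the helper names it and returns a non-zero status, so you can fix the download before wasting a spark-shell session on a ClassNotFoundException.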
However, I had to delete /opt/connector/log4j-over-slf4j-1.7.7.jar manually, since the latest Spark now uses slf4j-log4j instead.