- Google's TensorFlow is an open-source machine learning library for deep learning (neural networks).
- Successor to DistBelief (effectively "DistBelief v2"), developed by the Google Brain project.
- Goal: a system that simplifies deploying large-scale machine learning models to a variety of hardware (thousands of servers in datacenters, smartphones, GPUs).
- Much like Theano - a popular deep learning framework.
- Data Flow Graph (aka Computational Graph or TensorFlow Graph of Computation): nodes represent data or operations, and edges represent the data flowing between nodes. That data is called a tensor.
- A tensor is a multi-dimensional array that flows between nodes.
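
To make the graph idea concrete, here is a minimal sketch in plain Scala. It is not TensorFlow's API: `Tensor`, `Const`, `Add`, `Mul` and `run` are hypothetical names invented for illustration, and the "tensor" is simplified to a flat vector of doubles plus a shape.

```scala
object DataFlowGraphSketch {
  // Simplified tensor: a shape plus the values laid out flat (an assumption, not TensorFlow's type).
  final case class Tensor(shape: Seq[Int], values: Vector[Double])

  // Nodes are either data (Const) or operations (Add, Mul) on incoming tensors.
  sealed trait Node
  final case class Const(value: Tensor) extends Node
  final case class Add(left: Node, right: Node) extends Node
  final case class Mul(left: Node, right: Node) extends Node

  // Evaluate a node by pulling tensors along its incoming edges.
  def run(node: Node): Tensor = node match {
    case Const(t)  => t
    case Add(l, r) => zipWith(run(l), run(r))(_ + _)
    case Mul(l, r) => zipWith(run(l), run(r))(_ * _)
  }

  // Element-wise combination of two tensors with the same shape.
  private def zipWith(a: Tensor, b: Tensor)(f: (Double, Double) => Double): Tensor = {
    require(a.shape == b.shape, "element-wise operations need equal shapes")
    Tensor(a.shape, a.values.zip(b.values).map { case (x, y) => f(x, y) })
  }
}
```

With `x = Const(Tensor(Seq(2), Vector(1.0, 2.0)))` and `y = Const(Tensor(Seq(2), Vector(3.0, 4.0)))`, calling `run(Mul(Add(x, y), y))` yields a tensor with values `Vector(12.0, 24.0)`. Nothing is computed until `run` is called, which mirrors the "build the graph first, execute later" style.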
- Create a Scala/sbt project
- Use IntelliJ IDEA
- Add libraryDependencies for Spark 2.0.0 (RC2)
- Create class `mf.DefaultSource` (or similar)
- `publishLocal` (or similar)
- `./bin/spark-shell --packages organization:spark-mf-format_2.11:1.0.0`
- `spark.read.format("mf").load("mojFormat.mf")`
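
The `mf.DefaultSource` step above could look roughly like the sketch below. It assumes `spark-sql` 2.0.x is on the classpath and uses the `RelationProvider`, `BaseRelation` and `TableScan` traits from `org.apache.spark.sql.sources`. The `MfRelation` class and the single `line` column are hypothetical: the notes do not say what the `mf` format contains, so the sketch simply exposes each line of the file as one string.

```scala
package mf

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// format("mf") makes Spark look for a class named mf.DefaultSource.
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    // load("mojFormat.mf") hands the file name over as the "path" option.
    val path = parameters.getOrElse("path", sys.error("'path' must be specified"))
    new MfRelation(path)(sqlContext)
  }
}

// Hypothetical relation: every line of the file becomes a row with one string column.
class MfRelation(path: String)(@transient val sqlContext: SQLContext)
    extends BaseRelation with TableScan {

  override def schema: StructType =
    StructType(StructField("line", StringType, nullable = true) :: Nil)

  // Full scan: read the file as text and wrap each line in a Row.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.textFile(path).map(line => Row(line))
}
```

Once the project is published locally and pulled in with `--packages`, `spark.read.format("mf").load("mojFormat.mf")` should return a DataFrame with that single `line` column.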
For the bravest:
- Deep Dive: Apache Spark Memory Management - An excellent talk about Spark's memory management in past releases and the upcoming 2.0. No code. The slides were awesome with a superb presentation style. Very informative.
- A Deep Dive Into Structured Streaming -- a superb talk about the upcoming Structured Streaming in Spark 2.0.
- Structuring Spark: Dataframes, Datasets And Streaming -- another superb talk about the reasons for structuring Spark using Datasets by the one and only Michael Armbrust.
- Large-Scale Deep Learning with TensorFlow by Jeff Dean (Google) -- just yesterday I was thinking about feature vectors and how close they map to the real objects (they are supposed to represent) and that gave me the Aha moment that the more features the better but you need to be careful with over-featuring the m
Grouped by topic
Discussion: https://groups.google.com/forum/#!topic/scalania/JV8bELXNgC4
- (git) Cloning repo + local build
- Fixing compilation warnings
- Improving scaladoc
- Being a Spark contributor: JIRA + pull requests - mastering the flow
- What else? ...
From https://github.com/spark-jobserver/spark-jobserver#getting-started-with-spark-job-server:
The easiest way to get started is to try the Docker container which prepackages a Spark distribution with the job server and lets you start and deploy it.
➜ spark-jobserver git:(master) docker-machine version
docker-machine version 0.7.0, build a650a40
// https://gist.github.com/radekg/ec5a1575c450a48e5cba
Warsaw Scala Enthusiasts meetup about Apache Spark, themed "Let's Scala few Apache Spark apps together!", and the follow-up "Let's Scala few Apache Spark apps together - part 2!".
Many, many people answered the question:
EN: What and how would you like to learn at the meetup (about Apache Spark)?
The answers are as follows (and are going to be the foundation for the agenda):
- Set up a cluster using many laptops and see how much it could handle.
- MLlib with a simple classification like logistic regression.
From http://stackoverflow.com/a/32393044/1305344:
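object size extends App {
  // Allocate a million (String, Unit) pairs so there is something to measure
  (1 to 1000000).map(i => ("foo" + i, ()))
  // Block on input so the JVM stays alive while you inspect it
  // (Predef.readLine is deprecated since Scala 2.11; StdIn.readLine replaces it)
  val input = scala.io.StdIn.readLine("prompt> ")
}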
Run it with `sbt 'runMain size'` and then use `jps` (to find the pids), `jstat -gc pid` (to query GC statistics) and `jmap` (similar to `jstat`) to analyze resource allocation.
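
As a complement to the external tools above, the JVM can also report a rough number from inside the process. The sketch below is an assumption-laden approximation: `HeapUsageSketch`, `usedHeap` and `measure` are hypothetical helper names, `System.gc()` is only a hint to the JVM, and the measured delta includes whatever else happens to be allocated in the meantime.

```scala
object HeapUsageSketch {
  // Approximate number of heap bytes currently in use by this JVM.
  def usedHeap(): Long = {
    val rt = Runtime.getRuntime
    rt.totalMemory() - rt.freeMemory()
  }

  // Evaluates `alloc` and returns its result together with the approximate heap growth in bytes.
  def measure[A](alloc: => A): (A, Long) = {
    System.gc() // a hint only; the JVM is free to ignore it
    val before = usedHeap()
    val result = alloc // by-name parameter, so the allocation happens here
    val after = usedHeap()
    (result, after - before)
  }
}
```

For example, `val (data, bytes) = HeapUsageSketch.measure((1 to 1000000).map(i => ("foo" + i, ())))` keeps a reference to the collection in `data`, so it cannot be collected before the second reading. Treat `bytes` as an estimate and cross-check it with `jstat -gc`.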
- What use cases are a good fit for Apache Spark? How to work with Spark?
- create RDDs, transform them, and execute actions to get the result of a computation
- All computations in memory = "memory is cheap" (we do need enough memory to fit all the data in)
- the fewer disk operations, the faster (you do know it, don't you?)
- You develop such computation flows or pipelines using a programming language - Scala, Python or Java <-- that's where the ability to write code is paramount
- Data is usually on a distributed file system like Hadoop HDFS or NoSQL databases like Cassandra
- Data mining = analysis / insights / analytics
- log mining
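
The "create RDDs, transform them, execute actions" flow and the log mining use case above can be sketched as follows. This assumes Spark 2.0.x (`spark-sql`) on the classpath and a local master. The `LogMiningSketch` object, the `errorCount` helper and the sample log lines are made up for illustration; real input would usually come from HDFS or a similar store via `textFile`.

```scala
import org.apache.spark.sql.SparkSession

object LogMiningSketch {
  // Counts the lines that contain "ERROR".
  def errorCount(spark: SparkSession, lines: Seq[String]): Long = {
    val logs = spark.sparkContext.parallelize(lines) // create an RDD
    val errors = logs.filter(_.contains("ERROR"))    // transformation: lazy, nothing runs yet
    errors.count()                                   // action: triggers the computation
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("log-mining-sketch")
      .getOrCreate()

    // Made-up sample data standing in for a real log file.
    val sample = Seq("INFO started", "ERROR disk full", "WARN slow", "ERROR timeout")
    println(errorCount(spark, sample)) // prints 2

    spark.stop()
  }
}
```

The filter is only a description of work until `count()` is called, which is why chaining more transformations costs nothing until an action asks for a result.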
Steps:
- Build a Docker image and install Sphinx inside.
- Run the image to have a complete working environment to create docs.
See https://github.com/subuser-security/subuser/blob/master/docs/Makefile.
# Sphinx doc system containerized