- Move to the directory where Spark is installed.
In my case:
pochi 3:34:58 % cd /opt/local/repos/incubator-spark-1.0.0-beta/incubator-spark
I use the latest Spark, cloned on 2014/02/24.
You need to create at least two files:
- simple.sbt
- SimpleApp.scala (src/main/scala/SimpleApp.scala)
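The working tree under the Spark directory then looks roughly like this (only these two files matter for the example):
./simple.sbt
./src/main/scala/SimpleApp.scala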
A minimal sbt build file looks like this:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.3"
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating"
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
I wrote an example that counts the lines containing 'a' and the lines containing 'b'.
/*** SimpleApp.scala ***/
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "/tmp/simple_log.md" // Should be some file on your system
    val sc = new SparkContext("local", "Simple App", System.getenv("SPARK_HOME"),
      List("target/scala-2.10/simple-project_2.10-1.0.jar"))
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
    System.exit(0)
  }
}
Note that you need to put an input file at '/tmp/simple_log.md'. Of course, you can point logFile at another local file or an HDFS path instead.
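If you just need something to test with, you can create the file from the shell like this (the contents are arbitrary; these lines match the sample file shown later):
% printf 'adddcc\ncvvvvbbbbbbbbb\nbbbbb\nddddd\nbbbb\n' > /tmp/simple_log.md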
First, package the project. This produces target/scala-2.10/simple-project_2.10-1.0.jar, the jar that SimpleApp.scala references:
pochi 3:34:58 % sbt/sbt package
Finally, you can execute the Spark job as below:
pochi 3:34:58 % sbt/sbt run
My sample input file looks like this:
/tmp/simple_log.md
adddcc
cvvvvbbbbbbbbb
bbbbb
ddddd
bbbb
As a result, it displays the following in my terminal:
pochi 3:28:37 % sbt/sbt run /opt/local/repos/incubator-spark-1.0.0-beta/incubator-spark [git incubator-spark master]
Launching sbt from sbt/sbt-launch-0.13.1.jar
[info] Loading project definition from /opt/local/repos/incubator-spark-1.0.0-beta/incubator-spark/project/project
[info] Loading project definition from /opt/local/repos/incubator-spark-1.0.0-beta/incubator-spark/project
[info] Set current project to Simple Project (in build file:/opt/local/repos/incubator-spark-1.0.0-beta/incubator-spark/)
[info] Running SimpleApp
[error] 14/02/24 03:29:38 INFO slf4j.Slf4jLogger: Slf4jLogger started
[error] 14/02/24 03:29:38 INFO Remoting: Starting remoting
[error] 14/02/24 03:29:39 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:55255]
[error] 14/02/24 03:29:39 INFO Remoting: Remoting now listens on addresses: [akka.tcp://[email protected]:55255]
[error] 14/02/24 03:29:39 INFO spark.SparkEnv: Registering BlockManagerMaster
[error] 14/02/24 03:29:39 INFO storage.DiskBlockManager: Created local directory at /var/folders/vf/wjh62mkn1lj7bc5g4hh0nf340000gn/T/spark-local-20140224032939-eb7f
[error] 14/02/24 03:29:39 INFO storage.MemoryStore: MemoryStore started with capacity 1838.2 MB.
[error] 14/02/24 03:29:39 INFO network.ConnectionManager: Bound socket to port 55256 with id = ConnectionManagerId(192.168.0.4,55256)
[error] 14/02/24 03:29:39 INFO storage.BlockManagerMaster: Trying to register BlockManager
[error] 14/02/24 03:29:39 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Registering block manager 192.168.0.4:55256 with 1838.2 MB RAM
[error] 14/02/24 03:29:39 INFO storage.BlockManagerMaster: Registered BlockManager
[error] 14/02/24 03:29:39 INFO spark.HttpServer: Starting HTTP Server
[error] 14/02/24 03:29:39 INFO server.Server: jetty-7.6.8.v20121106
[error] 14/02/24 03:29:39 INFO server.AbstractConnector: Started [email protected]:55257
[error] 14/02/24 03:29:39 INFO broadcast.HttpBroadcast: Broadcast server started at http://192.168.0.4:55257
[error] 14/02/24 03:29:39 INFO spark.SparkEnv: Registering MapOutputTracker
[error] 14/02/24 03:29:39 INFO spark.HttpFileServer: HTTP File server directory is /var/folders/vf/wjh62mkn1lj7bc5g4hh0nf340000gn/T/spark-000557cc-dcc9-4f89-9f85-0765459678c2
[error] 14/02/24 03:29:39 INFO spark.HttpServer: Starting HTTP Server
[error] 14/02/24 03:29:39 INFO server.Server: jetty-7.6.8.v20121106
[error] 14/02/24 03:29:39 INFO server.AbstractConnector: Started [email protected]:55258
[error] 14/02/24 03:29:39 INFO server.Server: jetty-7.6.8.v20121106
[error] 14/02/24 03:29:39 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/storage/rdd,null}
[error] 14/02/24 03:29:39 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/storage,null}
[error] 14/02/24 03:29:39 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages/stage,null}
[error] 14/02/24 03:29:39 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages/pool,null}
[error] 14/02/24 03:29:39 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages,null}
[error] 14/02/24 03:29:39 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/environment,null}
[error] 14/02/24 03:29:39 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/executors,null}
[error] 14/02/24 03:29:39 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/metrics/json,null}
[error] 14/02/24 03:29:39 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/static,null}
[error] 14/02/24 03:29:39 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/,null}
[error] 14/02/24 03:29:39 INFO server.AbstractConnector: Started [email protected]:4040
[error] 14/02/24 03:29:39 INFO ui.SparkUI: Started Spark Web UI at http://192.168.0.4:4040
[error] 2014-02-24 03:29:39.822 java[52351:1003] Unable to load realm info from SCDynamicStore
[error] 14/02/24 03:29:39 INFO spark.SparkContext: Added JAR target/scala-2.10/simple-project_2.10-1.0.jar at http://192.168.0.4:55258/jars/simple-project_2.10-1.0.jar with timestamp 1393180179960
[error] 14/02/24 03:29:40 INFO storage.MemoryStore: ensureFreeSpace(35456) called with curMem=0, maxMem=1927505510
[error] 14/02/24 03:29:40 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 34.6 KB, free 1838.2 MB)
[error] 14/02/24 03:29:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[error] 14/02/24 03:29:40 WARN snappy.LoadSnappy: Snappy native library not loaded
[error] 14/02/24 03:29:40 INFO mapred.FileInputFormat: Total input paths to process : 1
[error] 14/02/24 03:29:40 INFO spark.SparkContext: Starting job: count at SimpleApp.scala:11
[error] 14/02/24 03:29:40 INFO scheduler.DAGScheduler: Got job 0 (count at SimpleApp.scala:11) with 2 output partitions (allowLocal=false)
[error] 14/02/24 03:29:40 INFO scheduler.DAGScheduler: Final stage: Stage 0 (count at SimpleApp.scala:11)
[error] 14/02/24 03:29:40 INFO scheduler.DAGScheduler: Parents of final stage: List()
[error] 14/02/24 03:29:40 INFO scheduler.DAGScheduler: Missing parents: List()
[error] 14/02/24 03:29:40 INFO scheduler.DAGScheduler: Submitting Stage 0 (FilteredRDD[2] at filter at SimpleApp.scala:11), which has no missing parents
[error] 14/02/24 03:29:40 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 0 (FilteredRDD[2] at filter at SimpleApp.scala:11)
[error] 14/02/24 03:29:40 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
[error] 14/02/24 03:29:40 INFO scheduler.TaskSetManager: Starting task 0.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL)
[error] 14/02/24 03:29:40 INFO scheduler.TaskSetManager: Serialized task 0.0:0 as 1675 bytes in 9 ms
[error] 14/02/24 03:29:40 INFO executor.Executor: Running task ID 0
[error] 14/02/24 03:29:40 INFO executor.Executor: Fetching http://192.168.0.4:55258/jars/simple-project_2.10-1.0.jar with timestamp 1393180179960
[error] 14/02/24 03:29:40 INFO util.Utils: Fetching http://192.168.0.4:55258/jars/simple-project_2.10-1.0.jar to /var/folders/vf/wjh62mkn1lj7bc5g4hh0nf340000gn/T/fetchFileTemp9189837294489295385.tmp
[error] 14/02/24 03:29:41 INFO executor.Executor: Adding file:/var/folders/vf/wjh62mkn1lj7bc5g4hh0nf340000gn/T/spark-93e2e3df-b2cf-42ad-9df2-97c0c7ced3c0/simple-project_2.10-1.0.jar to class loader
[error] 14/02/24 03:29:41 INFO storage.BlockManager: Found block broadcast_0 locally
[error] 14/02/24 03:29:41 INFO spark.CacheManager: Partition rdd_1_0 not found, computing it
[error] 14/02/24 03:29:41 INFO rdd.HadoopRDD: Input split: file:/tmp/simple_log.md:0+19
[error] 14/02/24 03:29:41 INFO storage.MemoryStore: ensureFreeSpace(256) called with curMem=35456, maxMem=1927505510
[error] 14/02/24 03:29:41 INFO storage.MemoryStore: Block rdd_1_0 stored as values to memory (estimated size 256.0 B, free 1838.2 MB)
[error] 14/02/24 03:29:41 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Added rdd_1_0 in memory on 192.168.0.4:55256 (size: 256.0 B, free: 1838.2 MB)
[error] 14/02/24 03:29:41 INFO storage.BlockManagerMaster: Updated info of block rdd_1_0
[error] 14/02/24 03:29:41 INFO executor.Executor: Serialized size of result for 0 is 563
[error] 14/02/24 03:29:41 INFO executor.Executor: Sending result for 0 directly to driver
[error] 14/02/24 03:29:41 INFO executor.Executor: Finished task ID 0
[error] 14/02/24 03:29:41 INFO scheduler.TaskSetManager: Starting task 0.0:1 as TID 1 on executor localhost: localhost (PROCESS_LOCAL)
[error] 14/02/24 03:29:41 INFO scheduler.TaskSetManager: Serialized task 0.0:1 as 1675 bytes in 0 ms
[error] 14/02/24 03:29:41 INFO executor.Executor: Running task ID 1
[error] 14/02/24 03:29:41 INFO scheduler.TaskSetManager: Finished TID 0 in 295 ms on localhost (progress: 0/2)
[error] 14/02/24 03:29:41 INFO storage.BlockManager: Found block broadcast_0 locally
[error] 14/02/24 03:29:41 INFO scheduler.DAGScheduler: Completed ResultTask(0, 0)
[error] 14/02/24 03:29:41 INFO spark.CacheManager: Partition rdd_1_1 not found, computing it
[error] 14/02/24 03:29:41 INFO rdd.HadoopRDD: Input split: file:/tmp/simple_log.md:19+20
[error] 14/02/24 03:29:41 INFO storage.MemoryStore: ensureFreeSpace(296) called with curMem=35712, maxMem=1927505510
[error] 14/02/24 03:29:41 INFO storage.MemoryStore: Block rdd_1_1 stored as values to memory (estimated size 296.0 B, free 1838.2 MB)
[error] 14/02/24 03:29:41 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Added rdd_1_1 in memory on 192.168.0.4:55256 (size: 296.0 B, free: 1838.2 MB)
[error] 14/02/24 03:29:41 INFO storage.BlockManagerMaster: Updated info of block rdd_1_1
[error] 14/02/24 03:29:41 INFO executor.Executor: Serialized size of result for 1 is 563
[error] 14/02/24 03:29:41 INFO executor.Executor: Sending result for 1 directly to driver
[error] 14/02/24 03:29:41 INFO executor.Executor: Finished task ID 1
[error] 14/02/24 03:29:41 INFO scheduler.TaskSetManager: Finished TID 1 in 35 ms on localhost (progress: 1/2)
[error] 14/02/24 03:29:41 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 0.0 from pool
[error] 14/02/24 03:29:41 INFO scheduler.DAGScheduler: Completed ResultTask(0, 1)
[error] 14/02/24 03:29:41 INFO scheduler.DAGScheduler: Stage 0 (count at SimpleApp.scala:11) finished in 0.339 s
[error] 14/02/24 03:29:41 INFO spark.SparkContext: Job finished: count at SimpleApp.scala:11, took 0.496443 s
[error] 14/02/24 03:29:41 INFO spark.SparkContext: Starting job: count at SimpleApp.scala:12
[error] 14/02/24 03:29:41 INFO scheduler.DAGScheduler: Got job 1 (count at SimpleApp.scala:12) with 2 output partitions (allowLocal=false)
[error] 14/02/24 03:29:41 INFO scheduler.DAGScheduler: Final stage: Stage 1 (count at SimpleApp.scala:12)
[error] 14/02/24 03:29:41 INFO scheduler.DAGScheduler: Parents of final stage: List()
[error] 14/02/24 03:29:41 INFO scheduler.DAGScheduler: Missing parents: List()
[error] 14/02/24 03:29:41 INFO scheduler.DAGScheduler: Submitting Stage 1 (FilteredRDD[3] at filter at SimpleApp.scala:12), which has no missing parents
[error] 14/02/24 03:29:41 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 1 (FilteredRDD[3] at filter at SimpleApp.scala:12)
[error] 14/02/24 03:29:41 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
[error] 14/02/24 03:29:41 INFO scheduler.TaskSetManager: Starting task 1.0:0 as TID 2 on executor localhost: localhost (PROCESS_LOCAL)
[error] 14/02/24 03:29:41 INFO scheduler.TaskSetManager: Serialized task 1.0:0 as 1680 bytes in 0 ms
[error] 14/02/24 03:29:41 INFO executor.Executor: Running task ID 2
[error] 14/02/24 03:29:41 INFO storage.BlockManager: Found block broadcast_0 locally
[error] 14/02/24 03:29:41 INFO storage.BlockManager: Found block rdd_1_0 locally
[error] 14/02/24 03:29:41 INFO executor.Executor: Serialized size of result for 2 is 563
[error] 14/02/24 03:29:41 INFO executor.Executor: Sending result for 2 directly to driver
[error] 14/02/24 03:29:41 INFO executor.Executor: Finished task ID 2
[error] 14/02/24 03:29:41 INFO scheduler.TaskSetManager: Starting task 1.0:1 as TID 3 on executor localhost: localhost (PROCESS_LOCAL)
[error] 14/02/24 03:29:41 INFO scheduler.TaskSetManager: Serialized task 1.0:1 as 1680 bytes in 1 ms
[error] 14/02/24 03:29:41 INFO scheduler.TaskSetManager: Finished TID 2 in 13 ms on localhost (progress: 0/2)
[error] 14/02/24 03:29:41 INFO executor.Executor: Running task ID 3
[error] 14/02/24 03:29:41 INFO scheduler.DAGScheduler: Completed ResultTask(1, 0)
[error] 14/02/24 03:29:41 INFO storage.BlockManager: Found block broadcast_0 locally
[error] 14/02/24 03:29:41 INFO storage.BlockManager: Found block rdd_1_1 locally
[error] 14/02/24 03:29:41 INFO executor.Executor: Serialized size of result for 3 is 563
[error] 14/02/24 03:29:41 INFO executor.Executor: Sending result for 3 directly to driver
[error] 14/02/24 03:29:41 INFO executor.Executor: Finished task ID 3
[error] 14/02/24 03:29:41 INFO scheduler.TaskSetManager: Finished TID 3 in 11 ms on localhost (progress: 1/2)
[error] 14/02/24 03:29:41 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 1.0 from pool
[error] 14/02/24 03:29:41 INFO scheduler.DAGScheduler: Completed ResultTask(1, 1)
[error] 14/02/24 03:29:41 INFO scheduler.DAGScheduler: Stage 1 (count at SimpleApp.scala:12) finished in 0.024 s
[error] 14/02/24 03:29:41 INFO spark.SparkContext: Job finished: count at SimpleApp.scala:12, took 0.035547 s
[info] Lines with a: 1, Lines with b: 3
[success] Total time: 35 s, completed 2014/02/24 3:29:41
sbt/sbt run 31.55s user 1.33s system 64% cpu 50.636 total
pochi 3:29:41 %
The result seems correct, but I wonder why every log line is tagged '[error]'... Most likely sbt labels anything the application writes to stderr as [error], and the logging output here goes to stderr.
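One way to check this (a sketch, not part of the original run; the pattern below only approximates Spark's default log format) is to put a log4j.properties on the application classpath, e.g. at src/main/resources/log4j.properties, that routes log output to stdout, so sbt should report it as [info] instead:
# Route log events to stdout so sbt tags them [info] rather than [error]
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.out
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n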