➜ dev SPARK_REPL_OPTS="-XX:MaxPermSize=256m" spark-1.3.1-bin-hadoop2.6/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3 --driver-memory 4g --executor-memory 4g
Ivy Default Cache set to: /Users/sim/.ivy2/cache
The jars for the packages stored in: /Users/sim/.ivy2/jars
:: loading settings :: url = jar:file:/Users/sim/dev/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found com.databricks#spark-csv_2.10;1.0.3 in central
    found org.apache.commons#commons-csv;1.1 in central
:: resolution report :: resolve 195ms :: artifacts dl 5ms
    :: modules in use:
    com.databricks#spark-csv_2.10;1.0.3 from central in [default]
    org.apache.commons#commons-csv;1.1 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
    confs: [default]
    0 artifacts copied, 2 already retrieved (0kB/5ms)
2015-07-02 15:29:33.242 java[45393:7905252] Unable to load realm info from SCDynamicStore
15/07/02 15:29:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/07/02 15:29:33 INFO spark.SecurityManager: Changing view acls to: sim
15/07/02 15:29:33 INFO spark.SecurityManager: Changing modify acls to: sim
15/07/02 15:29:33 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sim); users with modify permissions: Set(sim)
15/07/02 15:29:33 INFO spark.HttpServer: Starting HTTP Server
15/07/02 15:29:33 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/07/02 15:29:33 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:62083
15/07/02 15:29:33 INFO util.Utils: Successfully started service 'HTTP class server' on port 62083.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
15/07/02 15:29:36 INFO spark.SparkContext: Running Spark version 1.3.1
15/07/02 15:29:36 INFO spark.SecurityManager: Changing view acls to: sim
15/07/02 15:29:36 INFO spark.SecurityManager: Changing modify acls to: sim
15/07/02 15:29:36 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sim); users with modify permissions: Set(sim)
15/07/02 15:29:36 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/07/02 15:29:36 INFO Remoting: Starting remoting
15/07/02 15:29:36 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.12:62084]
15/07/02 15:29:36 INFO util.Utils: Successfully started service 'sparkDriver' on port 62084.
15/07/02 15:29:36 INFO spark.SparkEnv: Registering MapOutputTracker
15/07/02 15:29:36 INFO spark.SparkEnv: Registering BlockManagerMaster
15/07/02 15:29:36 INFO storage.DiskBlockManager: Created local directory at /var/folders/ln/j4dkd3bd07d_7tzqc843y2jw0000gn/T/spark-0de5dce8-23bf-4dab-849e-f3e55e083747/blockmgr-55d47ebf-9987-4f9b-ac3b-02537c0e86ba
15/07/02 15:29:36 INFO storage.MemoryStore: MemoryStore started with capacity 2.1 GB
15/07/02 15:29:36 INFO spark.HttpFileServer: HTTP File server directory is /var/folders/ln/j4dkd3bd07d_7tzqc843y2jw0000gn/T/spark-b8b6bbb8-13cc-4c7e-9696-2f3be90b54c6/httpd-9730dc96-ceb4-410a-aba0-967216cec688
15/07/02 15:29:36 INFO spark.HttpServer: Starting HTTP Server
15/07/02 15:29:36 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/07/02 15:29:36 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:62085
15/07/02 15:29:36 INFO util.Utils: Successfully started service 'HTTP file server' on port 62085.
15/07/02 15:29:36 INFO spark.SparkEnv: Registering OutputCommitCoordinator
15/07/02 15:29:36 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/07/02 15:29:36 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/07/02 15:29:36 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
15/07/02 15:29:36 INFO ui.SparkUI: Started SparkUI at http://192.168.1.12:4040
15/07/02 15:29:36 INFO spark.SparkContext: Added JAR file:/Users/sim/.ivy2/jars/spark-csv_2.10.jar at http://192.168.1.12:62085/jars/spark-csv_2.10.jar with timestamp 1435865376837
15/07/02 15:29:36 INFO spark.SparkContext: Added JAR file:/Users/sim/.ivy2/jars/commons-csv.jar at http://192.168.1.12:62085/jars/commons-csv.jar with timestamp 1435865376838
15/07/02 15:29:36 INFO executor.Executor: Starting executor ID <driver> on host localhost
15/07/02 15:29:36 INFO executor.Executor: Using REPL class URI: http://192.168.1.12:62083
15/07/02 15:29:36 INFO util.AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@192.168.1.12:62084/user/HeartbeatReceiver
15/07/02 15:29:36 INFO netty.NettyBlockTransferService: Server created on 62086
15/07/02 15:29:36 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/07/02 15:29:36 INFO storage.BlockManagerMasterActor: Registering block manager localhost:62086 with 2.1 GB RAM, BlockManagerId(<driver>, localhost, 62086)
15/07/02 15:29:36 INFO storage.BlockManagerMaster: Registered BlockManager
15/07/02 15:29:37 INFO repl.SparkILoop: Created spark context..
Spark context available as sc.
15/07/02 15:29:37 INFO repl.SparkILoop: Created sql context (with Hive support)..
SQL context available as sqlContext.
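The SPARK_REPL_OPTS="-XX:MaxPermSize=256m" prefix on the launch command raises the JVM's permanent generation for the shell. On Java 7 (this session runs 1.7.0_51) the default PermGen is often too small once HiveContext pulls in the Hive and DataNucleus classes, and the REPL can die with java.lang.OutOfMemoryError: PermGen space. To confirm the flag actually reached the driver JVM, the runtime MX bean can be queried from the prompt; a minimal sketch using only standard JDK/Scala APIs, not part of the original session:

    import java.lang.management.ManagementFactory
    import scala.collection.JavaConverters._

    // Print every MaxPermSize argument the driver JVM was started with;
    // an empty result would mean SPARK_REPL_OPTS was not picked up.
    ManagementFactory.getRuntimeMXBean.getInputArguments.asScala
      .filter(_.contains("MaxPermSize"))
      .foreach(println)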
scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext

scala>

scala> val ctx = new HiveContext(sc)
ctx: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@2e46890e

scala> import ctx.implicits._
import ctx.implicits._

scala>

scala> val df = ctx.jsonFile("file:///Users/sim/dev/spx/data/view-clicks-training/2015/06/18/part-00000.gz")
15/07/02 15:29:52 INFO storage.MemoryStore: ensureFreeSpace(183601) called with curMem=0, maxMem=2223023063
15/07/02 15:29:52 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 179.3 KB, free 2.1 GB)
15/07/02 15:29:52 INFO storage.MemoryStore: ensureFreeSpace(26218) called with curMem=183601, maxMem=2223023063
15/07/02 15:29:52 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 25.6 KB, free 2.1 GB)
15/07/02 15:29:52 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:62086 (size: 25.6 KB, free: 2.1 GB)
15/07/02 15:29:52 INFO storage.BlockManagerMaster: Updated info of block broadcast_0_piece0
15/07/02 15:29:52 INFO spark.SparkContext: Created broadcast 0 from textFile at JSONRelation.scala:114
15/07/02 15:29:52 INFO mapred.FileInputFormat: Total input paths to process : 1
15/07/02 15:29:52 INFO spark.SparkContext: Starting job: isEmpty at JsonRDD.scala:51
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Got job 0 (isEmpty at JsonRDD.scala:51) with 1 output partitions (allowLocal=true)
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Final stage: Stage 0(isEmpty at JsonRDD.scala:51)
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Missing parents: List()
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Submitting Stage 0 (file:///Users/sim/dev/spx/data/view-clicks-training/2015/06/18/part-00000.gz MapPartitionsRDD[1] at textFile at JSONRelation.scala:114), which has no missing parents
15/07/02 15:29:52 INFO storage.MemoryStore: ensureFreeSpace(2728) called with curMem=209819, maxMem=2223023063
15/07/02 15:29:52 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.7 KB, free 2.1 GB)
15/07/02 15:29:52 INFO storage.MemoryStore: ensureFreeSpace(2031) called with curMem=212547, maxMem=2223023063
15/07/02 15:29:52 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2031.0 B, free 2.1 GB)
15/07/02 15:29:52 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:62086 (size: 2031.0 B, free: 2.1 GB)
15/07/02 15:29:52 INFO storage.BlockManagerMaster: Updated info of block broadcast_1_piece0
15/07/02 15:29:52 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:839
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 0 (file:///Users/sim/dev/spx/data/view-clicks-training/2015/06/18/part-00000.gz MapPartitionsRDD[1] at textFile at JSONRelation.scala:114)
15/07/02 15:29:52 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/07/02 15:29:52 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1453 bytes)
15/07/02 15:29:52 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
15/07/02 15:29:52 INFO executor.Executor: Fetching http://192.168.1.12:62085/jars/commons-csv.jar with timestamp 1435865376838
15/07/02 15:29:52 INFO util.Utils: Fetching http://192.168.1.12:62085/jars/commons-csv.jar to /var/folders/ln/j4dkd3bd07d_7tzqc843y2jw0000gn/T/spark-fd72e62c-1adf-4bad-8c3d-5b3899545675/userFiles-c8f3949e-7f5e-43c3-b1ef-8f22523bdbcc/fetchFileTemp2464132617259806671.tmp
15/07/02 15:29:52 INFO executor.Executor: Adding file:/var/folders/ln/j4dkd3bd07d_7tzqc843y2jw0000gn/T/spark-fd72e62c-1adf-4bad-8c3d-5b3899545675/userFiles-c8f3949e-7f5e-43c3-b1ef-8f22523bdbcc/commons-csv.jar to class loader
15/07/02 15:29:52 INFO executor.Executor: Fetching http://192.168.1.12:62085/jars/spark-csv_2.10.jar with timestamp 1435865376837
15/07/02 15:29:52 INFO util.Utils: Fetching http://192.168.1.12:62085/jars/spark-csv_2.10.jar to /var/folders/ln/j4dkd3bd07d_7tzqc843y2jw0000gn/T/spark-fd72e62c-1adf-4bad-8c3d-5b3899545675/userFiles-c8f3949e-7f5e-43c3-b1ef-8f22523bdbcc/fetchFileTemp3554212928556694314.tmp
15/07/02 15:29:52 INFO executor.Executor: Adding file:/var/folders/ln/j4dkd3bd07d_7tzqc843y2jw0000gn/T/spark-fd72e62c-1adf-4bad-8c3d-5b3899545675/userFiles-c8f3949e-7f5e-43c3-b1ef-8f22523bdbcc/spark-csv_2.10.jar to class loader
15/07/02 15:29:52 INFO rdd.HadoopRDD: Input split: file:/Users/sim/dev/spx/data/view-clicks-training/2015/06/18/part-00000.gz:0+22597095
15/07/02 15:29:52 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/07/02 15:29:52 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/07/02 15:29:52 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/07/02 15:29:52 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/07/02 15:29:52 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/07/02 15:29:52 INFO compress.CodecPool: Got brand-new decompressor [.gz]
15/07/02 15:29:52 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 3741 bytes result sent to driver
15/07/02 15:29:52 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 129 ms on localhost (1/1)
15/07/02 15:29:52 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Stage 0 (isEmpty at JsonRDD.scala:51) finished in 0.139 s
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Job 0 finished: isEmpty at JsonRDD.scala:51, took 0.172693 s
15/07/02 15:29:52 INFO spark.SparkContext: Starting job: reduce at JsonRDD.scala:54
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Got job 1 (reduce at JsonRDD.scala:54) with 1 output partitions (allowLocal=false)
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Final stage: Stage 1(reduce at JsonRDD.scala:54)
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Missing parents: List()
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[3] at map at JsonRDD.scala:54), which has no missing parents
15/07/02 15:29:52 INFO storage.MemoryStore: ensureFreeSpace(3240) called with curMem=214578, maxMem=2223023063
15/07/02 15:29:52 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.2 KB, free 2.1 GB)
15/07/02 15:29:52 INFO storage.MemoryStore: ensureFreeSpace(2338) called with curMem=217818, maxMem=2223023063
15/07/02 15:29:52 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.3 KB, free 2.1 GB)
15/07/02 15:29:52 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:62086 (size: 2.3 KB, free: 2.1 GB)
15/07/02 15:29:52 INFO storage.BlockManagerMaster: Updated info of block broadcast_2_piece0
15/07/02 15:29:52 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:839
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 1 (MapPartitionsRDD[3] at map at JsonRDD.scala:54)
15/07/02 15:29:52 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/07/02 15:29:52 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1453 bytes)
15/07/02 15:29:52 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 1)
15/07/02 15:29:52 INFO rdd.HadoopRDD: Input split: file:/Users/sim/dev/spx/data/view-clicks-training/2015/06/18/part-00000.gz:0+22597095
15/07/02 15:29:52 INFO compress.CodecPool: Got brand-new decompressor [.gz]
15/07/02 15:29:54 INFO storage.BlockManager: Removing broadcast 1
15/07/02 15:29:54 INFO storage.BlockManager: Removing block broadcast_1_piece0
15/07/02 15:29:54 INFO storage.MemoryStore: Block broadcast_1_piece0 of size 2031 dropped from memory (free 2222804938)
15/07/02 15:29:54 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on localhost:62086 in memory (size: 2031.0 B, free: 2.1 GB)
15/07/02 15:29:54 INFO storage.BlockManagerMaster: Updated info of block broadcast_1_piece0
15/07/02 15:29:54 INFO storage.BlockManager: Removing block broadcast_1
15/07/02 15:29:54 INFO storage.MemoryStore: Block broadcast_1 of size 2728 dropped from memory (free 2222807666)
15/07/02 15:29:54 INFO spark.ContextCleaner: Cleaned broadcast 1
15/07/02 15:30:06 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 1). 6638 bytes result sent to driver
15/07/02 15:30:06 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 13740 ms on localhost (1/1)
15/07/02 15:30:06 INFO scheduler.DAGScheduler: Stage 1 (reduce at JsonRDD.scala:54) finished in 13.744 s
15/07/02 15:30:06 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/07/02 15:30:06 INFO scheduler.DAGScheduler: Job 1 finished: reduce at JsonRDD.scala:54, took 13.753319 s
15/07/02 15:30:06 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/07/02 15:30:06 INFO metastore.ObjectStore: ObjectStore, initialize called
15/07/02 15:30:06 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/07/02 15:30:06 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/07/02 15:30:06 WARN DataNucleus.Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/07/02 15:30:07 WARN DataNucleus.Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/07/02 15:30:07 INFO metastore.ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
15/07/02 15:30:07 INFO metastore.MetaStoreDirectSql: MySQL check failed, assuming we are not on mysql: Lexical error at line 1, column 5. Encountered: "@" (64), after : "".
15/07/02 15:30:08 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/07/02 15:30:08 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/07/02 15:30:08 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/07/02 15:30:08 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/07/02 15:30:08 INFO DataNucleus.Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/07/02 15:30:08 INFO metastore.ObjectStore: Initialized ObjectStore
15/07/02 15:30:08 INFO metastore.HiveMetaStore: Added admin role in metastore
15/07/02 15:30:08 INFO metastore.HiveMetaStore: Added public role in metastore
15/07/02 15:30:08 INFO metastore.HiveMetaStore: No user is added in admin role, since config is empty
15/07/02 15:30:08 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/07/02 15:30:08 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
df: org.apache.spark.sql.DataFrame = [aac_brand: string, aag__id: bigint, aag_weight: bigint, aca_brand: string, aca_conversion_integration: boolean, aca_daily_budget: bigint, aca_hide_brand_from_publishers: boolean, aca_is_remnant: boolean, aca_short_name: string, accid: string, acr__id: bigint, acr_choices: array<struct<cta:string,headline:string,img:string,target:string>>, acr_cta: string, acr_description1: string, acr_description2: string, acr_destination: string, acr_displayUrl: string, acr_headline: string, acr_img: string, acr_isiUrl: string, acr_paramCTA: string, acr_paramName: string, acr_paramPlaceholder: string, acr_target: string, acr_type: string, acr_weight: bigint, agid: string, akw__id: bigint, akw_canonical_id: bigint, akw_criterion_type: string, akw_destination_url: st...
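The two single-task jobs above (isEmpty at JsonRDD.scala:51 and reduce at JsonRDD.scala:54) are jsonFile inferring a schema by reading the whole file; a lone .gz input is not splittable, so the ~22 MB file is scanned in one task (about 14 s here). The DataNucleus and metastore chatter that follows is the HiveContext lazily standing up its local (by default Derby-backed) metastore on first use. If the layout of the data is already known, the inference pass can be skipped by passing an explicit schema. A minimal sketch, with a hypothetical trainingSchema covering only three of the many fields echoed just above:

    import org.apache.spark.sql.types._

    // Hypothetical hand-written schema: only the fields a job actually
    // needs have to be declared; fields not listed are dropped at load time.
    val trainingSchema = StructType(Seq(
      StructField("aac_brand", StringType, nullable = true),
      StructField("aag__id", LongType, nullable = true),
      StructField("aag_weight", LongType, nullable = true)
    ))

    // Same call as in the session but with the schema supplied, so Spark
    // skips the full-file inference scan (the isEmpty/reduce jobs above).
    val dfFast = ctx.jsonFile(
      "file:///Users/sim/dev/spx/data/view-clicks-training/2015/06/18/part-00000.gz",
      trainingSchema)

A middle ground is the samplingRatio overload, e.g. ctx.jsonFile(path, 0.1), which infers the schema from roughly a tenth of the records instead of all of them.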
scala> df.registerTempTable("training")

scala>

scala> val dfCount = ctx.sql("select count(*) as cnt from training")
15/07/02 15:30:09 INFO parse.ParseDriver: Parsing command: select count(*) as cnt from training
15/07/02 15:30:09 INFO parse.ParseDriver: Parse Completed
dfCount: org.apache.spark.sql.DataFrame = [cnt: bigint]

scala> println(dfCount.first.getLong(0))
15/07/02 15:30:09 INFO storage.MemoryStore: ensureFreeSpace(90479) called with curMem=215397, maxMem=2223023063
15/07/02 15:30:09 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 88.4 KB, free 2.1 GB)
15/07/02 15:30:09 INFO storage.MemoryStore: ensureFreeSpace(36868) called with curMem=305876, maxMem=2223023063
15/07/02 15:30:09 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 36.0 KB, free 2.1 GB)
15/07/02 15:30:09 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:62086 (size: 36.0 KB, free: 2.1 GB)
15/07/02 15:30:09 INFO storage.BlockManagerMaster: Updated info of block broadcast_3_piece0
15/07/02 15:30:09 INFO spark.SparkContext: Created broadcast 3 from textFile at JSONRelation.scala:114
15/07/02 15:30:09 INFO spark.SparkContext: Starting job: runJob at SparkPlan.scala:122
15/07/02 15:30:09 INFO mapred.FileInputFormat: Total input paths to process : 1
15/07/02 15:30:09 INFO scheduler.DAGScheduler: Registering RDD 10 (mapPartitions at Exchange.scala:101)
15/07/02 15:30:09 INFO scheduler.DAGScheduler: Got job 2 (runJob at SparkPlan.scala:122) with 1 output partitions (allowLocal=false)
15/07/02 15:30:09 INFO scheduler.DAGScheduler: Final stage: Stage 3(runJob at SparkPlan.scala:122)
15/07/02 15:30:09 INFO scheduler.DAGScheduler: Parents of final stage: List(Stage 2)
15/07/02 15:30:09 INFO scheduler.DAGScheduler: Missing parents: List(Stage 2)
15/07/02 15:30:09 INFO scheduler.DAGScheduler: Submitting Stage 2 (MapPartitionsRDD[10] at mapPartitions at Exchange.scala:101), which has no missing parents
15/07/02 15:30:09 INFO storage.MemoryStore: ensureFreeSpace(17448) called with curMem=342744, maxMem=2223023063
15/07/02 15:30:09 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 17.0 KB, free 2.1 GB)
15/07/02 15:30:09 INFO storage.MemoryStore: ensureFreeSpace(9310) called with curMem=360192, maxMem=2223023063
15/07/02 15:30:09 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 9.1 KB, free 2.1 GB)
15/07/02 15:30:09 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:62086 (size: 9.1 KB, free: 2.1 GB)
15/07/02 15:30:09 INFO storage.BlockManagerMaster: Updated info of block broadcast_4_piece0
15/07/02 15:30:09 INFO spark.SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:839
15/07/02 15:30:09 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 2 (MapPartitionsRDD[10] at mapPartitions at Exchange.scala:101)
15/07/02 15:30:09 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
15/07/02 15:30:09 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, PROCESS_LOCAL, 1442 bytes)
15/07/02 15:30:09 INFO executor.Executor: Running task 0.0 in stage 2.0 (TID 2)
15/07/02 15:30:09 INFO rdd.HadoopRDD: Input split: file:/Users/sim/dev/spx/data/view-clicks-training/2015/06/18/part-00000.gz:0+22597095
15/07/02 15:30:09 INFO compress.CodecPool: Got brand-new decompressor [.gz]
15/07/02 15:30:15 INFO executor.Executor: Finished task 0.0 in stage 2.0 (TID 2). 2003 bytes result sent to driver
15/07/02 15:30:15 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 5081 ms on localhost (1/1)
15/07/02 15:30:15 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
15/07/02 15:30:15 INFO scheduler.DAGScheduler: Stage 2 (mapPartitions at Exchange.scala:101) finished in 5.081 s
15/07/02 15:30:15 INFO scheduler.DAGScheduler: looking for newly runnable stages
15/07/02 15:30:15 INFO scheduler.DAGScheduler: running: Set()
15/07/02 15:30:15 INFO scheduler.DAGScheduler: waiting: Set(Stage 3)
15/07/02 15:30:15 INFO scheduler.DAGScheduler: failed: Set()
15/07/02 15:30:15 INFO scheduler.DAGScheduler: Missing parents for Stage 3: List()
15/07/02 15:30:15 INFO scheduler.DAGScheduler: Submitting Stage 3 (MapPartitionsRDD[14] at map at SparkPlan.scala:97), which is now runnable
15/07/02 15:30:15 INFO storage.MemoryStore: ensureFreeSpace(18920) called with curMem=369502, maxMem=2223023063
15/07/02 15:30:15 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 18.5 KB, free 2.1 GB)
15/07/02 15:30:15 INFO storage.MemoryStore: ensureFreeSpace(10501) called with curMem=388422, maxMem=2223023063
15/07/02 15:30:15 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 10.3 KB, free 2.1 GB)
15/07/02 15:30:15 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:62086 (size: 10.3 KB, free: 2.1 GB)
15/07/02 15:30:15 INFO storage.BlockManagerMaster: Updated info of block broadcast_5_piece0
15/07/02 15:30:15 INFO spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:839
15/07/02 15:30:15 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 3 (MapPartitionsRDD[14] at map at SparkPlan.scala:97)
15/07/02 15:30:15 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
15/07/02 15:30:15 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3, localhost, PROCESS_LOCAL, 1171 bytes)
15/07/02 15:30:15 INFO executor.Executor: Running task 0.0 in stage 3.0 (TID 3)
15/07/02 15:30:15 INFO storage.ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/07/02 15:30:15 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 2 ms
15/07/02 15:30:15 INFO executor.Executor: Finished task 0.0 in stage 3.0 (TID 3). 1115 bytes result sent to driver
15/07/02 15:30:15 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 52 ms on localhost (1/1)
15/07/02 15:30:15 INFO scheduler.DAGScheduler: Stage 3 (runJob at SparkPlan.scala:122) finished in 0.052 s
15/07/02 15:30:15 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
15/07/02 15:30:15 INFO scheduler.DAGScheduler: Job 2 finished: runJob at SparkPlan.scala:122, took 5.168569 s
88283

scala>
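For a plain row count the temp-table detour is optional; the DataFrame API gives the same answer directly, and first() on the aggregated frame ships only the single result row to the driver. A minimal equivalent sketch, reusing the session's df:

    // DataFrame API equivalent of "select count(*) as cnt from training";
    // count() runs an equivalent aggregation and returns a Long.
    val total: Long = df.count()
    println(total)  // printed 88283 for this file

Either way the work has the same shape as Job 2 above: one stage that scans the gzipped file and pre-aggregates per partition, then a second stage that combines the partial counts after the exchange.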