$ sbt/sbt assembly/assembly
$ sbt/sbt examples/assembly
$ SPARK_CLASSPATH=examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop1.0.4.jar IPYTHON=1 ./bin/pyspark
...
14/06/03 15:34:11 INFO SparkUI: Started SparkUI at http://10.0.0.4:4040
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.0.0-SNAPSHOT
      /_/

Using Python version 2.7.6 (default, Jan 10 2014 11:23:15)
SparkContext available as sc.

In [1]: conf = {"hbase.zookeeper.quorum": "localhost", "hbase.mapreduce.inputtable": "data"}

In [2]: rdd = sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat", "org.apache.hadoop.hbase.io.ImmutableBytesWritable", "org.apache.hadoop.hbase.client.Result", conf=conf)
14/06/03 15:34:54 INFO MemoryStore: ensureFreeSpace(33603) called with curMem=0, maxMem=309225062
14/06/03 15:34:54 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 32.8 KB, free 294.9 MB)

In [3]: rdd.collect()
14/06/03 15:35:07 INFO ZooKeeper: Client environment:zookeeper.version=3.4.5-1392090, built on 09/30/2012 17:52 GMT
14/06/03 15:35:07 INFO ZooKeeper: Client environment:host.name=localhost
14/06/03 15:35:07 INFO ZooKeeper: Client environment:java.version=1.7.0_60
...
14/06/03 16:38:40 INFO NewHadoopRDD: Input split: localhost:,
14/06/03 16:38:40 WARN SerDeUtil: Failed to pickle Java object as key: ImmutableBytesWritable;
Error: couldn't pickle object of type class org.apache.hadoop.hbase.io.ImmutableBytesWritable
14/06/03 16:38:40 WARN SerDeUtil: Failed to pickle Java object as value: Result;
Error: couldn't pickle object of type class org.apache.hadoop.hbase.client.Result
14/06/03 16:38:40 INFO Executor: Serialized size of result for 0 is 738
14/06/03 16:38:40 INFO Executor: Sending result for 0 directly to driver
14/06/03 16:38:40 INFO Executor: Finished task ID 0
14/06/03 16:38:40 INFO TaskSetManager: Finished TID 0 in 80 ms on localhost (progress: 1/1)
14/06/03 16:38:40 INFO DAGScheduler: Completed ResultTask(0, 0)
14/06/03 16:38:40 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/06/03 16:38:40 INFO DAGScheduler: Stage 0 (collect at <ipython-input-3-20868699513c>:1) finished in 0.093 s
14/06/03 16:38:41 INFO SparkContext: Job finished: collect at <ipython-input-3-20868699513c>:1, took 0.197537 s
Out[3]:
[(u'72 6f 77 31', u'keyvalues={row1/f1:/1401639141180/Put/vlen=5/ts=0}'),
 (u'72 6f 77 32', u'keyvalues={row2/f2:/1401639169638/Put/vlen=6/ts=0}')]
======
I created a test table in HBase (0.94.6, to match the Spark examples):

hbase(main):002:0> scan 'data'
ROW                   COLUMN+CELL
 row1                 column=f1:, timestamp=1401639141180, value=value
 row2                 column=f2:, timestamp=1401639169638, value=value2
2 row(s) in 0.4190 seconds
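For reference, the session above boils down to the standalone snippet below. This is a minimal sketch that assumes the same local HBase, the test table 'data', and the examples assembly jar already on the classpath (e.g. via SPARK_CLASSPATH as in the transcript); adjust those for your own cluster.

    from pyspark import SparkContext

    # Assumes the Spark examples assembly jar (with the HBase classes) is on
    # the classpath, as in the transcript above.
    sc = SparkContext(appName="HBaseRead")

    # ZooKeeper quorum and input table as used in the session above.
    conf = {"hbase.zookeeper.quorum": "localhost",
            "hbase.mapreduce.inputtable": "data"}

    # Read the table through the new-API TableInputFormat.
    rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        conf=conf)

    print(rdd.collect())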
@Raider06 this was more of a sketch for new functionality that will be released in Spark 1.1 in a few weeks' time. It is currently in the Spark master branch.
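For context, the Spark 1.1 version of this adds pluggable key/value converters so the HBase classes never have to be pickled on the Python side. The rough sketch below uses the converter classes that ship with the Spark examples project; the class names are assumptions based on that project, so verify them against your release.

    # Converter classes bundled with the Spark examples jar (names assumed;
    # check the pythonconverters package in your Spark release).
    keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"

    conf = {"hbase.zookeeper.quorum": "localhost",
            "hbase.mapreduce.inputtable": "data"}

    # With converters, keys and values arrive in Python as plain strings.
    rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter=keyConv,
        valueConverter=valueConv,
        conf=conf)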
@Raider06 I tried the example above; however, I am getting the following exception. (I am a beginner in Spark.)
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 14.0 in stage 6.0 (TID 23) had a not serializable result: org.apache.hadoop.hbase.io.ImmutableBytesWritable
Serialization stack:
- object not serializable (class: org.apache.hadoop.hbase.io.ImmutableBytesWritable, value: 6c 61 73 74 5f 65 6e 74 69 74 79 5f 62 61 74 63 68)
- field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
- object (class scala.Tuple2, (6c 61 73 74 5f 65 6e 74 69 74 79 5f 62 61 74 63 68,keyvalues={last_entity_batch/c:d/1441414881172/Put/vlen=5092/mvcc=0}))
- element of array (index: 0)
- array (class [Lscala.Tuple2;, size 1)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
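The usual workaround for this "not serializable result" failure is to convert the HBase key and value objects to strings on the JVM side before they are returned to the driver, by passing key/value converters to the call (see the converter sketch earlier in this thread). The converter class names below come from the Spark examples jar and are assumptions to verify for your release.

    # Same call as above, but with JVM-side converters so ImmutableBytesWritable
    # and Result are turned into strings before being collected.
    rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
        valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
        conf=conf)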
@MLnick: I hope to use hadoopRDD (not newAPIHadoopRDD) to read an HBase table from Python.
Can you give me some suggestions? Thanks very much.
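In case it helps, sc.hadoopRDD has the same shape as sc.newAPIHadoopRDD but takes an old-API (mapred) input format. The sketch below is untested; the old mapred TableInputFormat and its key/value classes are real, but the configuration keys used to name the table and columns are assumptions, so treat this only as a starting point.

    conf = {
        # Assumption: the old mapred TableInputFormat takes the table name from
        # the input path and the column list from hbase.mapred.tablecolumns.
        "mapred.input.dir": "data",
        "hbase.mapred.tablecolumns": "f1: f2:",
        "hbase.zookeeper.quorum": "localhost",
    }

    # Old-API equivalent of the newAPIHadoopRDD call above; converters from the
    # examples jar would still be needed to avoid the pickling problem.
    rdd = sc.hadoopRDD(
        "org.apache.hadoop.hbase.mapred.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        conf=conf)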
We're also struggling with the same issue on a Kerberized cluster. @MLnick, are you also facing it on a Kerberized cluster?
Excuse me, but I have a problem with the line rdd = sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat", "org.apache.hadoop.hbase.io.ImmutableBytesWritable", "org.apache.hadoop.hbase.client.Result", conf=conf). When I execute it, Spark gives me "AttributeError: 'SparkContext' object has no attribute 'newAPIHadoopRDD'". Can you help me?
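For anyone hitting that AttributeError: newAPIHadoopRDD was only added to the PySpark SparkContext in Spark 1.1.0, so it is worth confirming which release the shell is actually running. A quick check, assuming nothing beyond the standard PySpark shell:

    # newAPIHadoopRDD exists on SparkContext only from Spark 1.1.0 onwards, so
    # an AttributeError usually just means an older release is on the path.
    print(getattr(sc, "version", "pre-1.1 release (no sc.version attribute)"))
    print(hasattr(sc, "newAPIHadoopRDD"))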