Sivabalan Narayanan (nsivabalan)
21/07/05 20:25:50 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
21/07/05 20:26:28 WARN HoodieSparkSqlWriter$: hoodie table at s3a://siva-test-bucket-june-16/hudi_testing/hudi_base_2 already exists. Deleting existing data & overwriting with new data.
21/07/05 22:27:52 WARN TaskSetManager: Lost task 438.0 in stage 12.0 (TID 10022, ip-172-31-37-233.us-east-2.compute.internal, executor 2): java.lang.RuntimeException: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieRemoteException: Server Error
at org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
at org.apache.spark.storage.memory.Mem
21/07/05 15:59:07 INFO view.RemoteHoodieTableFileSystemView: Sending request : (http://ip-172-31-45-49.us-east-2.compute.internal:37481/v1/hoodie/view/marker/create?markername=17518.0%2F273309f4-674f-41fe-80c9-bb0693d38b77-375_196-12-9655_20210705142539.parquet.marker.CREATE&markerdirpath=s3a%3A%2F%2Fsiva-test-bucket-june-16%2Fhudi_testing%2Fhudi_base_2%2F.hoodie%2F.temp%2F20210705142539&timelinehash=e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855)
21/07/05 15:59:07 INFO marker.MarkerFilesFactory: Instantiated MarkerFiles with mode: TIMELINE_BASED
21/07/05 15:59:07 INFO view.FileSystemViewManager: Creating remote view for basePath s3a://siva-test-bucket-june-16/hudi_testing/hudi_base_2. Server=ip-172-31-45-49.us-east-2.compute.internal:37481, Timeout=300
21/07/05 15:59:07 INFO marker.TimelineBasedMarkerFiles: ^^^ [timeline-based] Create marker file : 185647.0 787af4e9-70fd-4f1b-aaad-5ea4e1f305e8-374_224-12-9683_20210705142539.parquet
21/07/05 15:59:07 INFO view.RemoteHoodieTableFileSystemView
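The TIMELINE_BASED marker mode in these logs routes marker creation through the embedded timeline server (the /v1/hoodie/view/marker/create requests above) instead of writing marker files directly to S3. In released Hudi (0.9.0+) the equivalent switch is a write option; a minimal sketch, assuming the hoodie.write.markers.type config key:

// assumption: Hudi 0.9.0+ config key; TIMELINE_SERVER_BASED batches marker
// creation on the timeline server instead of issuing per-marker S3 writes
df1.write.format("hudi").
  option("hoodie.write.markers.type", "TIMELINE_SERVER_BASED").
  option(TABLE_NAME, "hudi_base_2").
  mode(Overwrite).
  save("s3a://siva-test-bucket-june-16/hudi_testing/hudi_base_2")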
21/06/24 19:38:17 DEBUG HttpConnection: HttpConnection@f0ca2d2::SocketChannelEndPoint@1c0d1758{/172.31.40.111:49150<->/172.31.35.161:45185,OPEN,fill=-,flush=-,to=4/30000}{io=0/0,kio=0,kro=1}->HttpConnection@f0ca2d2[p=HttpParser{s=START,0 of -1},g=HttpGenerator@58d1601e{s=START}]=>HttpChannelOverHttp@2a2e8dd8{r=0,c=false,c=false/false,a=IDLE,uri=null,age=0} parse HeapByteBuffer@4ff2cf3e[p=0,l=534,c=8192,r=534]={<<<POST /v1/hoodie/v...zip,deflate\r\n\r\n>>>\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00} {}
21/06/24 19:38:17 DEBUG HttpParser: parseNext s=START HeapByteBuffer@4ff2cf3e[p=0,l=534,c=8192,r=534]={<<<POST /v1/hoodie/v...zip,deflate\r\n\r\n>>>\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00}
21/06/24 19:38:17 DEBUG HttpParser: START --> SPACE1
21/06/24 19:38:17 DEBUG HttpParser: SPACE1 --> URI
21/06/24 19:38:17 DEBUG HttpParser: URI -->
21/06/23 22:45:09 DEBUG AbstractEndPoint: Ignored idle endpoint SocketChannelEndPoint@3b772bac{/172.31.35.212:46962<->/172.31.35.80:37099,OPEN,fill=-,flush=-,to=30001/30000}{io=0/0,kio=0,kro=1}->HttpConnection@21fc48c2[p=HttpParser{s=END,0 of 0},g=HttpGenerator@7dbf8533{s=START}]=>HttpChannelOverHttp@596f6a9a{r=1,c=false,c=false/false,a=ASYNC_WAIT,uri=//ip-172-31-35-80.us-east-2.compute.internal:37099/v1/hoodie/view/marker/create?markername=2020-02-01%2F69e5b4da-e883-4568-8aea-80a17594f56a-0_0-5-59_20210623221959.parquet.marker.CREATE&markerdirpath=s3a%3A%2F%2Fsiva-test-bucket-june-16%2Fhudi_testing%2Fhudi_base_105%2F.hoodie%2F.temp%2F20210623221959&timelinehash=e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855,age=1470050}
21/06/23 22:45:09 DEBUG HttpChannelState: onTimeout HttpChannelState@96666b9{s=ASYNC_WAIT a=STARTED i=false r=IDLE w=false}
21/06/23 22:45:09 DEBUG HttpChannelState: Dispatch after async timeout HttpChannelState@96666b9{s=ASYNC_WOKEN a=EXPIRED i=false r=IDLE w=false}
21/06/2
scala> spark.time(df1.write.format("hudi").
     |   option("hoodie.bulkinsert.shuffle.parallelism", "170").
     |   option(PRECOMBINE_FIELD_OPT_KEY, "created_at").
     |   option(RECORDKEY_FIELD_OPT_KEY, "id").
     |   option(PARTITIONPATH_FIELD_OPT_KEY, "date_col").
     |   option("hoodie.parquet.compression.codec", "SNAPPY").
     |   option(OPERATION_OPT_KEY, "bulk_insert").
     |   option(TABLE_NAME, "hudi_1").
     |   mode(Overwrite).
     |   save("s3a://siva-test-bucket-june-16/hudi_testing/hudi_base_1/"))
21/06/23 18:34:57 WARN hudi.HoodieSparkSqlWriter$: hoodie table at s3a://siva-test-bucket-june-16/hudi_testing/hudi_base_1 already exists. Deleting existing data & overwriting with new data.
21/06/23 18:41:57 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 6.0 (TID 597, ip-172-31-35-212.us-east-2.compute.internal, executor 4): java.lang.RuntimeException: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieRemoteException: Read timed out
at org.apache.hudi.client.utils.Lazy
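The "Read timed out" above is the remote filesystem view client giving up on the timeline server; the RemoteHoodieTableFileSystemView logs earlier show its Timeout=300. If the server is merely slow under load, the client timeout can be raised; a sketch, assuming the hoodie.filesystem.view.remote.timeout.secs key from FileSystemViewStorageConfig:

// assumption: config key hoodie.filesystem.view.remote.timeout.secs (default 300s,
// matching the Timeout=300 seen in the logs above)
df1.write.format("hudi").
  option("hoodie.filesystem.view.remote.timeout.secs", "1200").
  option(TABLE_NAME, "hudi_1").
  mode(Overwrite).
  save("s3a://siva-test-bucket-june-16/hudi_testing/hudi_base_1/")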
21/06/17 22:02:31 WARN scheduler.TaskSetManager: Lost task 13.1 in stage 7.2 (TID 660, ip-172-31-32-128.us-east-2.compute.internal, executor 16): FetchFailed(null, shuffleId=2, mapIndex=-1, mapId=-1, reduceId=13, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 2
at org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$2(MapOutputTracker.scala:1010)
at org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$2$adapted(MapOutputTracker.scala:1006)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at org.apache.spark.MapOutputTracker$.convertMapStatuses(MapOutputTracker.scala:1006)
at org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorId(MapOutputTracker.scala:811)
at org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:128)
hudi schema evolution
./spark-shell --packages org.apache.spark:spark-avro_2.12:3.0.0 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --jars /Users/sivabala/Documents/personal/projects/siva_hudi/apache_hudi_feb_2021/hudi/packaging/hudi-spark-bundle/target/hudi-spark3-bundle_2.12-0.8.0-SNAPSHOT.jar
// spark-shell
val inserts = convertToStringList(dataGen.generateInserts(100))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
df.write.format("hudi").
21/06/10 21:50:27 ERROR HoodieTestSuiteJob: Failed to run Test Suite
java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.PartitionedFile.<init>(Lorg/apache/spark/sql/catalyst/InternalRow;Ljava/lang/String;JJ[Ljava/lang/String;)V
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:206)
at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.execute(DagScheduler.java:113)
at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.schedule(DagScheduler.java:68)
at org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.runTestSuite(HoodieTestSuiteJob.java:203)
at org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.main(HoodieTestSuiteJob.java:170)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
// output after 1st compaction (i.e. after the first 2 delta commits)
scala> spark.sql("select typeId, eventTime, preComb from hudi_trips_snapshot").show()
+------+-------------------+-------+
|typeId| eventTime|preComb|
+------+-------------------+-------+
| 1| null| 2|
| 2|2021-12-29 09:54:00| 2|
| 3|2021-12-30 09:54:00| 2|
| 4| null| 2|
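For reference, the hudi_trips_snapshot view queried above is presumably registered the usual quickstart way; a sketch, with the base path assumed from the CLI command below:

// snapshot read; the /*/*/*/* glob matches the quickstart partition depth
val tripsDF = spark.read.format("hudi").load("/tmp/hudi_trips_cow" + "/*/*/*/*")
tripsDF.createOrReplaceTempView("hudi_trips_snapshot")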
show logfile metadata --logFilePathPattern /tmp/hudi_trips_cow/.a5619a76-5896-4c48-b568-f63212bab228-0_20210513102613.log.1_0-41-440
[table output truncated in the gist preview: hudi-cli renders the log file's block metadata here]
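show logfile metadata is a hudi-cli command; a typical session around it (prompt text approximate) would look like:

$ ./hudi-cli/hudi-cli.sh
hudi->connect --path /tmp/hudi_trips_cow
hudi:hudi_trips_cow->show logfile metadata --logFilePathPattern /tmp/hudi_trips_cow/.a5619a76-5896-4c48-b568-f63212bab228-0_20210513102613.log.1_0-41-440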