scala> dfFromData7.write.format("hudi").
| options(getQuickstartWriteConfigs).
| option(PRECOMBINE_FIELD_OPT_KEY, "preComb").
| option(RECORDKEY_FIELD_OPT_KEY, "rowId").
| option(PARTITIONPATH_FIELD_OPT_KEY, "partitionId").
| option("hoodie.index.type","SIMPLE").
| option(TABLE_NAME, tableName).
| mode(Append).
| save(basePath)
21/02/27 16:01:29 ERROR BoundedInMemoryExecutor: error producing records
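For context, a minimal sketch of how a DataFrame like dfFromData7 could be constructed; the column names mirror the write options above (rowId as record key, partitionId as partition path, preComb as the precombine field), but the schema and values here are assumptions, not the original data.

// Hypothetical sketch, not the original data: columns match the
// write options above (record key, partition path, precombine field).
val data7 = Seq(
  ("row_1", "part_0", 0L, "bob", "v_0"),
  ("row_2", "part_0", 0L, "john", "v_0"),
  ("row_3", "part_1", 0L, "tom", "v_0"))
val dfFromData7 = spark.createDataFrame(data7).
  toDF("rowId", "partitionId", "preComb", "name", "versionId")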
docker run test_hudi py.test -s --verbose test_hudi.py
============================= test session starts ==============================
platform linux -- Python 3.7.9, pytest-6.1.1, py-1.10.0, pluggy-0.13.1 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /
collecting ... collected 1 item
# NB: We use this base image for leveraging Docker support on EMR 6.x
FROM amazoncorretto:8
RUN yum -y update
RUN yum -y install yum-utils
RUN yum -y groupinstall development
RUN yum -y install python3 python3-devel python3-pip python3-virtualenv
RUN yum -y install lzo-devel lzo
17483 [main] WARN org.apache.spark.sql.SparkSession$Builder - Using an existing SparkSession; some configuration may not take effect.
18361 [main] WARN org.apache.spark.sql.SparkSession$Builder - Using an existing SparkSession; some configuration may not take effect.
18927 [main] WARN org.apache.kafka.clients.consumer.ConsumerConfig - The configuration 'hoodie.datasource.hive_sync.partition_fields' was supplied but isn't a known config.
18927 [main] WARN org.apache.kafka.clients.consumer.ConsumerConfig - The configuration 'hoodie.deltastreamer.ingestion.short_trip_db.dummy_table_short_trip.configFile' was supplied but isn't a known config.
18927 [main] WARN org.apache.kafka.clients.consumer.ConsumerConfig - The configuration 'hoodie.deltastreamer.ingestion.uber_db.dummy_table_uber.configFile' was supplied but isn't a known config.
18927 [main] WARN org.apache.kafka.clients.consumer.ConsumerConfig - The configuration 'hoodie.deltastreamer.schemaprovider.source.schema.file' was supplied but isn't a known config.
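These ConsumerConfig warnings are generally benign: DeltaStreamer passes its full property set through to the Kafka consumer, and Kafka logs a warning for every Hudi-specific key it does not recognize. They can be ignored.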
[INFO] ------------------------< org.apache.hudi:hudi >------------------------
[INFO] Building Hudi 0.8.0-SNAPSHOT [1/42]
[INFO] --------------------------------[ pom ]---------------------------------
[INFO]
[INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) @ hudi ---
[INFO]
[INFO] --------------------< org.apache.hudi:hudi-common >---------------------
[INFO] Building hudi-common 0.8.0-SNAPSHOT [2/42]
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
nsivabalan / spark script
MOR read optimized query
spark-shell --packages org.apache.hudi:hudi-spark-bundle_2.11:0.7.0,org.apache.spark:spark-avro_2.11:2.4.4 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
import java.io.File
import java.nio.file.Paths
import java.sql.Timestamp
import org.apache.commons.io.FileUtils
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.config.{HoodieCompactionConfig, HoodieIndexConfig, HoodieStorageConfig, HoodieWriteConfig}
import org.apache.hudi.hive.MultiPartKeysValueExtractor
import org.apache.hudi.index.HoodieIndex
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
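With these imports in place, a read-optimized query against the MOR table might look like the sketch below; basePath and the partition glob are assumptions carried over from the write example earlier, not part of the original gist.

// Sketch of a MOR read-optimized query (Hudi 0.7 API): read_optimized
// serves only the compacted base files, while a snapshot query would
// also merge any pending log files.
val roDF = spark.read.format("hudi").
  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_READ_OPTIMIZED_OPT_VAL).
  load(basePath + "/*") // glob depth depends on the partition layout
roDF.createOrReplaceTempView("hudi_ro_table")
spark.sql("select count(*) from hudi_ro_table").show()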
spark-submit \
--packages org.apache.spark:spark-avro_2.11:2.4.0 \
--conf spark.task.cpus=1 \
--conf spark.executor.cores=1 \
--conf spark.task.maxFailures=100 \
--conf spark.memory.fraction=0.4 \
--conf spark.rdd.compress=true \
--conf spark.kryoserializer.buffer.max=2000m \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.memory.storageFraction=0.1 \