@nsivabalan
Created March 12, 2022 00:17
diff --git a/website/docs/basic_configurations.md b/website/docs/basic_configurations.md
index 3eb8cc2ba..47fd991da 100644
--- a/website/docs/basic_configurations.md
+++ b/website/docs/basic_configurations.md
@@ -55,6 +55,20 @@ Options useful for writing tables via `write.format.option(...)`
---
+> #### hoodie.datasource.write.table.type
+> The table type for the underlying data, for this write. This can’t change between writes.<br></br>
+> **Default Value**: COPY_ON_WRITE (Optional)<br></br>
+> `Config Param: TABLE_TYPE`<br></br>
+
+---
+
+> #### hoodie.datasource.write.table.name
+> Table name for the datasource write. Also used to register the table into meta stores.<br></br>
+> **Default Value**: N/A (Required)<br></br>
+> `Config Param: TABLE_NAME`<br></br>
+
+---
+
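+For example, a minimal Spark datasource write using these two options might look like
+the following (a sketch; the DataFrame `df`, table name, and path are hypothetical):
+
+```scala
+// Write a DataFrame as a Hudi COPY_ON_WRITE table (illustrative only).
+df.write.format("hudi").
+  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
+  option("hoodie.datasource.write.table.name", "hudi_trips").
+  mode("append").
+  save("file:///tmp/hudi_trips")
+```
+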
> #### hoodie.datasource.write.recordkey.field
> Record key field. Value to be used as the `recordKey` component of `HoodieKey`.
Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using
@@ -64,13 +78,6 @@ the dot notation eg: `a.b.c`<br></br>
---
-> #### hoodie.datasource.write.precombine.field
-> Field used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..)<br></br>
-> **Default Value**: ts (Optional)<br></br>
-> `Config Param: PRECOMBINE_FIELD`<br></br>
-
----
-
> #### hoodie.datasource.write.partitionpath.field
> Partition path field. Value to be used as the partitionPath component of HoodieKey. Actual value obtained by invoking .toString()<br></br>
> **Default Value**: N/A (Required)<br></br>
@@ -78,13 +85,6 @@ the dot notation eg: `a.b.c`<br></br>
---
-> #### hoodie.datasource.hive_sync.partition_fields
-> Field in the table to use for determining hive partition columns.<br></br>
-> **Default Value**: (Optional)<br></br>
-> `Config Param: HIVE_PARTITION_FIELDS`<br></br>
-
----
-
> #### hoodie.datasource.write.keygenerator.class
> Key generator class, that implements `org.apache.hudi.keygen.KeyGenerator`<br></br>
> **Default Value**: org.apache.hudi.keygen.SimpleKeyGenerator (Optional)<br></br>
@@ -92,6 +92,13 @@ the dot notation eg: `a.b.c`<br></br>
---
+> #### hoodie.datasource.write.precombine.field
+> Field used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..)<br></br>
+> **Default Value**: ts (Optional)<br></br>
+> `Config Param: PRECOMBINE_FIELD`<br></br>
+
+---
+
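+When two upserted rows share the same record key, the precombine field decides which
+one survives. A sketch combining the record key and precombine options (the column
+`uuid` and the path are hypothetical; `ts` is the documented default):
+
+```scala
+// If two incoming rows share the same `uuid`, the row with the larger `ts`
+// wins (illustrative only).
+df.write.format("hudi").
+  option("hoodie.datasource.write.recordkey.field", "uuid").
+  option("hoodie.datasource.write.precombine.field", "ts").
+  mode("append").
+  save("file:///tmp/hudi_trips")
+```
+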
> #### hoodie.datasource.write.payload.class
> Payload class used. Override this if you want to implement your own merge logic when upserting/inserting. This will render any value set for PRECOMBINE_FIELD_OPT_VAL ineffective<br></br>
> **Default Value**: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload (Optional)<br></br>
@@ -99,6 +106,13 @@ the dot notation eg: `a.b.c`<br></br>
---
+> #### hoodie.datasource.write.partitionpath.urlencode
+> Should we URL-encode the partition path value before creating the folder structure.<br></br>
+> **Default Value**: false (Optional)<br></br>
+> `Config Param: URL_ENCODE_PARTITIONING`<br></br>
+
+---
+
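+This option combines naturally with the partition path field documented above. A
+sketch (the partition column `datestr` and the path are hypothetical):
+
+```scala
+// Partition by a hypothetical `datestr` column; URL-encode values so characters
+// like '/' or ':' cannot break the folder layout (illustrative only).
+df.write.format("hudi").
+  option("hoodie.datasource.write.partitionpath.field", "datestr").
+  option("hoodie.datasource.write.partitionpath.urlencode", "true").
+  mode("append").
+  save("file:///tmp/hudi_trips")
+```
+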
> #### hoodie.datasource.hive_sync.enable
> When set to true, register/sync the table to Apache Hive metastore<br></br>
> **Default Value**: false (Optional)<br></br>
@@ -113,20 +127,6 @@ the dot notation eg: `a.b.c`<br></br>
---
-> #### hoodie.datasource.write.table.type
-> The table type for the underlying data, for this write. This can’t change between writes.<br></br>
-> **Default Value**: COPY_ON_WRITE (Optional)<br></br>
-> `Config Param: TABLE_TYPE`<br></br>
-
----
-
-> #### hoodie.datasource.write.table.name
-> Table name for the datasource write. Also used to register the table into meta stores.<br></br>
-> **Default Value**: N/A (Required)<br></br>
-> `Config Param: TABLE_NAME`<br></br>
-
----
-
> #### hoodie.datasource.write.hive_style_partitioning
> Flag to indicate whether to use Hive style partitioning.
If set true, the names of partition folders follow <partition_column_name>=<partition_value> format.
@@ -136,10 +136,10 @@ By default false (the names of partition folders are only partition values)<br><
---
-> #### hoodie.datasource.write.partitionpath.urlencode
-> Should we url encode the partition path value, before creating the folder structure.<br></br>
-> **Default Value**: false (Optional)<br></br>
-> `Config Param: URL_ENCODE_PARTITIONING`<br></br>
+> #### hoodie.datasource.hive_sync.partition_fields
+> Field in the table to use for determining hive partition columns.<br></br>
+> **Default Value**: (Optional)<br></br>
+> `Config Param: HIVE_PARTITION_FIELDS`<br></br>
---
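+
+A sketch combining the Hive sync options above with Hive style partitioning (the
+partition column `datestr` and the path are hypothetical):
+
+```scala
+// Register/sync the written table to the Hive metastore (illustrative only).
+df.write.format("hudi").
+  option("hoodie.datasource.write.hive_style_partitioning", "true").
+  option("hoodie.datasource.hive_sync.enable", "true").
+  option("hoodie.datasource.hive_sync.partition_fields", "datestr").
+  mode("append").
+  save("file:///tmp/hudi_trips")
+```
+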
@@ -157,6 +157,137 @@ By default false (the names of partition folders are only partition values)<br><
---
+
+## Flink Options {#Flink-Options}
+
+> #### path
+> Base path for the target Hudi table.
+The path is created if it does not exist;
+otherwise, a Hudi table is expected to have been initialized successfully<br></br>
+> **Default Value**: N/A (Required)<br></br>
+> `Config Param: PATH`<br></br>
+
+---
+
+> #### hoodie.table.name
+> Table name to register to Hive metastore<br></br>
+> **Default Value**: N/A (Required)<br></br>
+> `Config Param: TABLE_NAME`<br></br>
+
+---
+
+
+> #### table.type
+> Type of table to write: COPY_ON_WRITE or MERGE_ON_READ<br></br>
+> **Default Value**: COPY_ON_WRITE (Optional)<br></br>
+> `Config Param: TABLE_TYPE`<br></br>
+
+---
+
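+A minimal Flink SQL table definition using these options might look like the
+following (a sketch; the schema, table name, and path are hypothetical):
+
+```scala
+import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
+import org.apache.flink.table.api.bridge.java.StreamTableEnvironment
+
+// Register a Hudi-backed table with Flink SQL (illustrative only).
+val env = StreamExecutionEnvironment.getExecutionEnvironment
+val tableEnv = StreamTableEnvironment.create(env)
+tableEnv.executeSql(
+  """CREATE TABLE hudi_trips (
+    |  uuid STRING,
+    |  ts BIGINT,
+    |  PRIMARY KEY (uuid) NOT ENFORCED
+    |) WITH (
+    |  'connector' = 'hudi',
+    |  'path' = 'file:///tmp/hudi_trips',
+    |  'table.type' = 'MERGE_ON_READ'
+    |)""".stripMargin)
+```
+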
+> #### write.operation
+> The write operation that this write should perform<br></br>
+> **Default Value**: upsert (Optional)<br></br>
+> `Config Param: OPERATION`<br></br>
+
+---
+
+> #### write.tasks
+> Parallelism of tasks that do actual write, default is 4<br></br>
+> **Default Value**: 4 (Optional)<br></br>
+> `Config Param: WRITE_TASKS`<br></br>
+
+---
+
+> #### write.bucket_assign.tasks
+> Parallelism of tasks that do bucket assign, default is the parallelism of the execution environment<br></br>
+> **Default Value**: N/A (Required)<br></br>
+> `Config Param: BUCKET_ASSIGN_TASKS`<br></br>
+
+---
+
+> #### write.precombine
+> Flag to indicate whether to drop duplicates before insert/upsert.
+By default, the following cases accept duplicates to gain extra performance:
+1) insert operation;
+2) upsert for a MOR table, since MOR tables deduplicate on read<br></br>
+> **Default Value**: false (Optional)<br></br>
+> `Config Param: PRE_COMBINE`<br></br>
+
+---
+
+> #### read.tasks
+> Parallelism of tasks that do actual read, default is 4<br></br>
+> **Default Value**: 4 (Optional)<br></br>
+> `Config Param: READ_TASKS`<br></br>
+
+---
+
+> #### read.start-commit
+> Start commit instant for reading. The commit time format should be 'yyyyMMddHHmmss'; by default, streaming reads start from the latest instant<br></br>
+> **Default Value**: N/A (Required)<br></br>
+> `Config Param: READ_START_COMMIT`<br></br>
+
+---
+
+> #### read.streaming.enabled
+> Whether to read as streaming source, default false<br></br>
+> **Default Value**: false (Optional)<br></br>
+> `Config Param: READ_AS_STREAMING`<br></br>
+
+---
+
+
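+To consume a table as a streaming source, the read options combine in the DDL, for
+example (a sketch reusing the `tableEnv` from the earlier snippet; the schema and
+start instant are hypothetical):
+
+```scala
+// Stream changes from a Hudi table starting at a given instant (illustrative only).
+tableEnv.executeSql(
+  """CREATE TABLE hudi_trips_stream (
+    |  uuid STRING,
+    |  ts BIGINT
+    |) WITH (
+    |  'connector' = 'hudi',
+    |  'path' = 'file:///tmp/hudi_trips',
+    |  'read.streaming.enabled' = 'true',
+    |  'read.start-commit' = '20220101000000'
+    |)""".stripMargin)
+tableEnv.executeSql("SELECT * FROM hudi_trips_stream").print()
+```
+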
+> #### compaction.tasks
+> Parallelism of tasks that do actual compaction, default is 4<br></br>
+> **Default Value**: 4 (Optional)<br></br>
+> `Config Param: COMPACTION_TASKS`<br></br>
+
+---
+
+> #### hoodie.datasource.write.hive_style_partitioning
+> Whether to use Hive style partitioning.
+If set true, the names of partition folders follow &lt;partition_column_name&gt;=&lt;partition_value&gt; format.
+By default false (the names of partition folders are only partition values)<br></br>
+> **Default Value**: false (Optional)<br></br>
+> `Config Param: HIVE_STYLE_PARTITIONING`<br></br>
+
+---
+
+> #### hive_sync.enable
+> Asynchronously sync Hive metadata to the Hive metastore (HMS), default false<br></br>
+> **Default Value**: false (Optional)<br></br>
+> `Config Param: HIVE_SYNC_ENABLED`<br></br>
+
+---
+
+> #### hive_sync.mode
+> Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql, default 'jdbc'<br></br>
+> **Default Value**: jdbc (Optional)<br></br>
+> `Config Param: HIVE_SYNC_MODE`<br></br>
+
+---
+
+> #### hive_sync.table
+> Table name for hive sync, default 'unknown'<br></br>
+> **Default Value**: unknown (Optional)<br></br>
+> `Config Param: HIVE_SYNC_TABLE`<br></br>
+
+---
+
+> #### hive_sync.db
+> Database name for hive sync, default 'default'<br></br>
+> **Default Value**: default (Optional)<br></br>
+> `Config Param: HIVE_SYNC_DB`<br></br>
+
+---
+
+> #### hive_sync.partition_extractor_class
+> Tool to extract the partition value from the HDFS path, default 'SlashEncodedDayPartitionValueExtractor'<br></br>
+> **Default Value**: org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor (Optional)<br></br>
+> `Config Param: HIVE_SYNC_PARTITION_EXTRACTOR_CLASS_NAME`<br></br>
+
+---
+
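+A sketch enabling Hive sync from the Flink writer (reusing the `tableEnv` from the
+earlier snippet; in 'hms' mode a metastore URI would also need to be configured,
+which is omitted here):
+
+```scala
+// Sync the Flink-written table to the Hive metastore (illustrative only).
+tableEnv.executeSql(
+  """CREATE TABLE hudi_trips_hive (
+    |  uuid STRING,
+    |  ts BIGINT
+    |) WITH (
+    |  'connector' = 'hudi',
+    |  'path' = 'file:///tmp/hudi_trips',
+    |  'hive_sync.enable' = 'true',
+    |  'hive_sync.mode' = 'hms',
+    |  'hive_sync.db' = 'default',
+    |  'hive_sync.table' = 'hudi_trips'
+    |)""".stripMargin)
+```
+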
## Write Client Configs {#WRITE_CLIENT}
Internally, the Hudi datasource uses an RDD-based HoodieWriteClient API to actually perform writes to storage. These configs provide deep control over lower-level aspects like file sizing, compression, parallelism, compaction, write schema, cleaning, etc. Although Hudi provides sane defaults, these configs may need to be tweaked from time to time to optimize for specific workloads.