Djellel Difallah dedcode

tzachz / CombineMaps.scala

Last active January 26, 2023 04:31

Apache Spark UserDefinedAggregateFunction combining maps

	import org.apache.spark.SparkContext
	import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
	import org.apache.spark.sql.types._
	import org.apache.spark.sql.{Column, Row, SQLContext}

	/***
	* UDAF combining maps, overriding any duplicate key with "latest" value
	* @param keyType DataType of Map key
	* @param valueType DataType of Value key
	* @param merge function to merge values of identical keys

MLnick / HyperLogLogStoreUDAF.scala

Last active March 16, 2022 05:31

Experimenting with Spark SQL UDAF - HyperLogLog UDAF for distinct counts, that stores the actual HLL for each row to allow further aggregation

	class HyperLogLogStoreUDAF extends UserDefinedAggregateFunction {

	override def inputSchema = new StructType()
	.add("stringInput", BinaryType)

	override def update(buffer: MutableAggregationBuffer, input: Row) = {
	// This input Row only has a single column storing the input value in String (or other Binary data).
	// We only update the buffer when the input value is not null.
	if (!input.isNullAt(0)) {
	if (buffer.isNullAt(0)) {