Last active
December 11, 2020 16:05
-
-
Save umbertogriffo/40a698244e468fdcb0e14a911bf028d3 to your computer and use it in GitHub Desktop.
broadcast_join_medium_size
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
import org.apache.spark.sql.functions._ | |
val mediumDf = Seq((0, "zero"), (4, "one")).toDF("id", "value") | |
val largeDf = Seq((0, "zero"), (2, "two"), (3, "three"), (4, "four"), (5, "five")).toDF("id", "value") | |
mediumDf.show() | |
largeDf.show() | |
/* | |
+---+-----+ | |
| id|value| | |
+---+-----+ | |
| 0| zero| | |
| 4| one| | |
+---+-----+ | |
+---+-----+ | |
| id|value| | |
+---+-----+ | |
| 0| zero| | |
| 2| two| | |
| 3|three| | |
| 4| four| | |
| 5| five| | |
+---+-----+ | |
*/ | |
val keys = mediumDf.select("id").as[Int].collect().toSeq | |
print(keys) | |
/* | |
keys: Seq[Int] = WrappedArray(0, 4) | |
*/ | |
val reducedDataFrame = largeDf.filter(col("id").isin(keys:_*)) | |
reducedDataFrame.show() | |
/* | |
+---+-----+ | |
| id|value| | |
+---+-----+ | |
| 0| zero| | |
| 4| four| | |
+---+-----+ | |
*/ | |
val result = reducedDataFrame.join(mediumDf, Seq("id")) | |
result.explain() | |
result.show() | |
/* | |
== Physical Plan == | |
*(1) Project [id#246, value#247, value#238] | |
+- *(1) BroadcastHashJoin [id#246], [id#237], Inner, BuildRight | |
:- LocalTableScan [id#246, value#247] | |
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#234] | |
+- LocalTableScan [id#237, value#238] | |
+---+-----+-----+ | |
| id|value|value| | |
+---+-----+-----+ | |
| 0| zero| zero| | |
| 4| four| one| | |
+---+-----+-----+ | |
*/ |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment