Adding indexes to a dataframe with row_num if your data is NOT sortable
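>>> from pyspark.sql import Window
>>> from pyspark.sql import functions as F
# First, add a column using F.monotonically_increasing_id().
# This adds monotonically increasing (but not necessarily consecutive) 64-bit integers:
>>> df_final = df_final.withColumn("monotonically_increasing_id", F.monotonically_increasing_id())
>>> df_final.show()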
+--------+---+-----+-------+-------+----------+---------------------------+
|      _1| _2|index|column1|column2|row_number|monotonically_increasing_id|
+--------+---+-----+-------+-------+----------+---------------------------+
|  [1, 2]|  0|    0|      1|      2|         1|                          0|
|[15, 21]|  1|    1|     15|     21|         2|                          1|
+--------+---+-----+-------+-------+----------+---------------------------+
# In this case the ids are consecutive, but with more data (spread over more partitions) that would not hold.
# Now order the rows by monotonically_increasing_id and number them one by one.
# The window has no partitionBy, so it spans the whole dataframe (2 rows here),
# and Spark moves all rows into a single partition to evaluate it.
>>> window = Window.orderBy(F.col('monotonically_increasing_id'))
>>> df_final = df_final.withColumn('row_number2', F.row_number().over(window))
>>> df_final.select('index', 'row_number', 'row_number2', 'column1', 'column2').show()
# we see that row_number and row_number2 are the same and in general the result is consistent
+-----+----------+-----------+-------+-------+
|index|row_number|row_number2|column1|column2|
+-----+----------+-----------+-------+-------+
|    0|         1|          1|      1|      2|
|    1|         2|          2|     15|     21|
+-----+----------+-----------+-------+-------+
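
To see why the ids stop being consecutive with more data, here is a minimal pure-Python sketch. It is not Spark code. It assumes the layout described in the Spark documentation for the current implementation (partition ID in the upper 31 bits, record number within the partition in the lower 33 bits). The helper `simulated_monotonic_id` and the two sample partitions are hypothetical and exist only for illustration.

```python
# Pure-Python illustration; no Spark session is needed to run it.
# Assumption (from the Spark docs): partition ID in the upper 31 bits,
# record number within the partition in the lower 33 bits.

def simulated_monotonic_id(partition_id: int, record_number: int) -> int:
    """Hypothetical helper that mimics the documented id layout."""
    return (partition_id << 33) + record_number

# Two hypothetical partitions holding three rows in total.
partitions = [["a", "b"], ["c"]]

ids = [
    simulated_monotonic_id(p, r)
    for p, rows in enumerate(partitions)
    for r, _ in enumerate(rows)
]
print(ids)  # [0, 1, 8589934592]

# Sorting by the id and counting from 1 mirrors what row_number()
# over a window ordered by that id produces: consecutive numbers.
row_numbers = {id_: n for n, id_ in enumerate(sorted(ids), start=1)}
print(row_numbers)  # {0: 1, 1: 2, 8589934592: 3}
```

The ids jump at the partition boundary yet remain strictly increasing, which is all the ordered window needs to yield consecutive row numbers.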
@monocongo

This helped me out today. Thank you!

@mkaranasou
Author

Glad to know :)
