@jeanmidevacc
Created August 2, 2024 18:26
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

# PySpark DataFrame containing the items to be indexed (identified by "itemid").
dfs_items = ...

# 0 for a first run; for an incremental process, set this to the max
# "itemid_indexed" produced by the previous run, so new indices continue from it.
previous_max_itemid_indexed = 0

# Window ordered by itemid. Note: with no partitionBy, Spark collects all rows
# into a single partition to compute the row numbers.
windowspec = Window.orderBy(F.col("itemid"))

# Assign a contiguous 1-based row number to each item, then shift it by the
# previous run's max index.
dfs_items = dfs_items.withColumn("itemid_indexed", F.row_number().over(windowspec))
dfs_items = dfs_items.withColumn("itemid_indexed", F.col("itemid_indexed") + previous_max_itemid_indexed)
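The window-function approach above boils down to: sort by `itemid`, rank from 1, and add the previous run's max index. A minimal pure-Python sketch of that same logic, without a Spark session (the item ids and the `index_items` helper are made up for illustration):

```python
def index_items(itemids, previous_max_indexed=0):
    """Return {itemid: index}, with indices continuing after previous_max_indexed."""
    ordered = sorted(itemids)  # mirrors Window.orderBy("itemid")
    # enumerate(..., start=1) plays the role of F.row_number()
    return {itemid: rank + previous_max_indexed
            for rank, itemid in enumerate(ordered, start=1)}

# First (full) run: indices start at 1.
first_run = index_items(["itm_c", "itm_a", "itm_b"])
# → {"itm_a": 1, "itm_b": 2, "itm_c": 3}

# Incremental run: new items continue after the previous max index (3 here).
second_run = index_items(["itm_e", "itm_d"], previous_max_indexed=3)
# → {"itm_d": 4, "itm_e": 5}
```

One caveat this sketch shares with the Spark version: the indices are only stable across runs if previously indexed items are excluded from later runs, since `row_number` re-ranks whatever rows it is given.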