@jamiekt
Last active July 17, 2019 11:26

An example of how to count the distinct values in each column of a DataFrame using PySpark

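from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct, col

# Reuse the active session if one exists (e.g. in a notebook or the pyspark shell)
spark = SparkSession.builder.getOrCreate()

# One row per product placed in a basket
data = spark.createDataFrame([
    ('001', 'bananas', 'John Doe', 'Stratford'),
    ('001', 'apples', 'John Doe', 'Stratford'),
    ('002', 'apples', 'Jane Doe', 'Aberdeen'),
    ('002', 'baked beans', 'Jane Doe', 'Aberdeen'),
    ('002', 'cornflakes', 'Jane Doe', 'Aberdeen'),
    ('003', 'chocolate', 'John Doe', 'Stratford')
], ['basket', 'product', 'customer', 'store'])

# groupBy() with no keys aggregates over the whole DataFrame,
# producing a single row with one distinct count per column
data.groupBy().agg(
    countDistinct(col('basket')),
    countDistinct(col('product')),
    countDistinct(col('customer')),
    countDistinct(col('store'))
).toPandas()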
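As a sanity check on what the aggregation above should return, the same distinct counts can be worked out in plain Python with sets. This is only an illustrative sketch, not part of the original gist and not a substitute for Spark: `rows`, `columns`, and the helper `distinct_counts` are names made up for this example, and the helper skips `None` on the assumption that it should mirror how `countDistinct` ignores nulls.

```python
# Same sample rows as the DataFrame above (copied by hand for this check)
rows = [
    ('001', 'bananas', 'John Doe', 'Stratford'),
    ('001', 'apples', 'John Doe', 'Stratford'),
    ('002', 'apples', 'Jane Doe', 'Aberdeen'),
    ('002', 'baked beans', 'Jane Doe', 'Aberdeen'),
    ('002', 'cornflakes', 'Jane Doe', 'Aberdeen'),
    ('003', 'chocolate', 'John Doe', 'Stratford'),
]
columns = ['basket', 'product', 'customer', 'store']


def distinct_counts(rows, columns):
    """Hypothetical helper: number of distinct non-None values per column."""
    counts = {}
    for i, name in enumerate(columns):
        # A set keeps each value once; None is skipped to mirror null handling
        counts[name] = len({row[i] for row in rows if row[i] is not None})
    return counts


counts = distinct_counts(rows, columns)
print(counts)
# {'basket': 3, 'product': 5, 'customer': 2, 'store': 2}
```

The Spark result should therefore be a single row holding 3, 5, 2 and 2 for basket, product, customer and store respectively.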