Loris Cro kristoff-it

vp of community @ziglang • curator of software you can love (https://sycl.it)

kristoff-it / spark-redis-snippet-10

Created January 30, 2019 15:42

	>>> no_pol = result.where(col("occupation") != "POLITICIAN")
	>>> no_pol.write.format("org.apache.spark.sql.redis").option("table", "occupation").option("key.column", "countryCode").save()

kristoff-it / spark-redis-snippet-9

Created January 30, 2019 15:41

	>>> result.where(col("occupation") == "POLITICIAN").count()
	150

kristoff-it / spark-redis-snippet-8

Created January 30, 2019 15:18

	>>> from pyspark.sql.window import Window
	>>> from pyspark.sql.functions import col, row_number
	>>> # Rank occupations within each country by descending count
	>>> w = Window.partitionBy("countryCode").orderBy(col("count(en_curid)").desc())
	>>> result = counts.withColumn("rn", row_number().over(w)).where(col("rn") == 1).select("countryCode", "occupation")
	>>> result.show(5)
	+-----------+-------------+
	|countryCode|   occupation|
	+-----------+-------------+
	|         DZ|   POLITICIAN|
	|         LT|   POLITICIAN|
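
The `groupby` plus `row_number()` window used in these snippets can be hard to picture without a running Spark cluster. Below is a minimal plain-Python sketch of the same idea: count rows per (countryCode, occupation) pair, then keep the most frequent occupation in each country. The sample rows and the helper name `top_occupation_per_country` are hypothetical, made up for illustration, and are not taken from pantheon.tsv or from the Spark API.

```python
from collections import Counter

# Hypothetical sample rows (countryCode, occupation); not real dataset values.
rows = [
    ("IT", "SOCCER PLAYER"),
    ("IT", "SOCCER PLAYER"),
    ("IT", "POLITICIAN"),
    ("FR", "MATHEMATICIAN"),
    ("FR", "POLITICIAN"),
    ("FR", "POLITICIAN"),
]


def top_occupation_per_country(rows):
    """Return {countryCode: most frequent occupation} (hypothetical helper)."""
    # Step 1: count rows per (countryCode, occupation),
    # analogous to groupby("countryCode", "occupation") with a count aggregate.
    counts = Counter(rows)

    # Step 2: keep the pair with the highest count in each country,
    # analogous to filtering row_number() == 1 over a window partitioned
    # by countryCode and ordered by count descending.
    best = {}
    for (country, occupation), n in counts.items():
        if country not in best or n > best[country][1]:
            best[country] = (occupation, n)

    return {country: occupation for country, (occupation, _) in best.items()}


result = top_occupation_per_country(rows)
```

On ties this sketch keeps whichever occupation it saw first. With `row_number()` in Spark, the winner among equal counts is not guaranteed unless the window ordering includes a tiebreaker column.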

kristoff-it / spark-redis-snippet-7

Created January 30, 2019 15:17

	>>> counts = df.groupby("countryCode", "occupation").agg({"en_curid": "count"})
	>>> counts.show(2)
	+-----------+-------------+---------------+
	|countryCode|   occupation|count(en_curid)|
	+-----------+-------------+---------------+
	|         FR|MATHEMATICIAN|             34|
	|         IT|SOCCER PLAYER|             81|
	+-----------+-------------+---------------+
	only showing top 2 rows

kristoff-it / spark-redis-snippet-6

Created January 30, 2019 15:16

	>>> df = spark.read.format("org.apache.spark.sql.redis").option("table", "people").option("key.column", "en_curid").load()
	>>> df.show(2)
	+--------+-----------+----------+
	|en_curid|countryCode|occupation|
	+--------+-----------+----------+
	|  915950|         ZW|   SWIMMER|
	|  726159|         UY|POLITICIAN|
	+--------+-----------+----------+
	only showing top 2 rows

kristoff-it / spark-redis-snippet-5

Created January 30, 2019 15:16

kristoff-it / spark-redis-snippet-4

Last active January 30, 2019 15:22

kristoff-it / spark-redis-snippet-3

Created January 30, 2019 15:13

	>>> data.write.format("org.apache.spark.sql.redis").option("table", "people").option("key.column", "en_curid").save()

kristoff-it / spark-redis-snippet-2

Created January 30, 2019 15:12

	>>> data = full_df.select("en_curid", "countryCode", "occupation")
	>>> data.show(2)
	+--------+-----------+-----------+
	|en_curid|countryCode| occupation|
	+--------+-----------+-----------+
	|     307|         US| POLITICIAN|
	|     308|         GR|PHILOSOPHER|
	+--------+-----------+-----------+
	only showing top 2 rows

kristoff-it / spark-redis-snippet-1

Created January 30, 2019 15:11

	>>> full_df = spark.read.csv("pantheon.tsv", sep="\t", quote="", header=True, inferSchema=True)
	>>> full_df.dtypes
	[('en_curid', 'int'), ('name', 'string'), ('numlangs', 'int'), ('birthcity', 'string'), ('birthstate', 'string'), ('countryName', 'string'), ('countryCode', 'string'), ('countryCode3', 'string'), ('LAT', 'double'), ('LON', 'double'), ('continentName', 'string'), ('birthyear', 'string'), ('gender', 'string'), ('occupation', 'string'), ('industry', 'string'), ('domain', 'string'), ('TotalPageViews', 'int'), ('L_star', 'double'), ('StdDevPageViews', 'double'), ('PageViewsEnglish', 'int'), ('PageViewsNonEnglish', 'int'), ('AverageViews', 'double'), ('HPI', 'double')]