@tanish-kr
Last active March 22, 2018 06:27
Spark dataframe and dataset

Creating a Dataset and a DataFrame from a List

  • create dataframe
import spark.implicits._  // provides toDS / toDF; already in scope in spark-shell

val ds = List(1, 2).toDS
// ds: org.apache.spark.sql.Dataset[Int] = [value: int]

val df = List(1, 2).toDF
// df: org.apache.spark.sql.DataFrame = [value: int]

Set difference of DataFrames and RDDs

  • dataframe diff

Use except. It returns the rows of the left-hand side (df1) that are not in df2.

df1.except(df2)
  • rdd

Use subtract.

rdd.subtract(rdd2)
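
A minimal sketch of both calls side by side, assuming a local SparkSession and made-up sample data (the values and the `id` column name are illustrative, not from the original notes). It also shows one difference worth knowing: `except` behaves like SQL `EXCEPT DISTINCT`, while `subtract` keeps duplicates of the remaining elements.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("diff-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val df1 = Seq(1, 2, 3, 3).toDF("id")
val df2 = Seq(2).toDF("id")

// Rows of df1 that are not in df2; duplicates are removed (EXCEPT DISTINCT).
val dfDiff = df1.except(df2)
val dfDiffIds = dfDiff.collect().map(_.getInt(0)).sorted   // Array(1, 3)

// The RDD counterpart; duplicates of the remaining elements are kept.
val rdd1 = spark.sparkContext.parallelize(Seq(1, 2, 3, 3))
val rdd2 = spark.sparkContext.parallelize(Seq(2))
val rddDiff = rdd1.subtract(rdd2).collect().sorted         // Array(1, 3, 3)
```

Neither call is symmetric: swapping the operands gives the rows that are only on the other side.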

Dropping NaN

df.na.drop()
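
As a rough illustration, assuming a local SparkSession and a hypothetical `ratings` table: `na.drop()` removes rows that contain a null or NaN in any column, and passing column names limits the check to those columns.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("na-drop-sketch").getOrCreate()
import spark.implicits._

// Hypothetical ratings: one value is NaN and one is null.
val ratings = Seq[(Int, Option[Double])](
  (1, Some(4.0)),
  (2, Some(Double.NaN)),
  (3, None)
).toDF("id", "rating")

// Drops every row that has a null or NaN in any column.
val cleaned = ratings.na.drop()

// Only inspect the listed columns, in case other columns may legitimately be null.
val cleanedByColumn = ratings.na.drop(Seq("rating"))

val keptIds = cleaned.collect().map(_.getInt(0))   // Array(1)
```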

NaN appears in ALS predictions

This happens because the user or the item (product) being scored does not exist in the training data in the first place.
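
The sketch below reproduces the situation with toy data and shows one way to deal with it. It assumes Spark 2.2 or later with spark-mllib on the classpath; the column names, ratings, and hyperparameters are hypothetical. `setColdStartStrategy("drop")` is an option of `org.apache.spark.ml.recommendation.ALS` that removes rows whose prediction is NaN; it is not something the original notes mention.

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

// Local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("als-sketch").getOrCreate()
import spark.implicits._

// Hypothetical toy ratings: (userId, itemId, rating).
val training = Seq(
  (0, 0, 4.0f), (0, 1, 2.0f),
  (1, 0, 3.0f), (1, 1, 5.0f)
).toDF("userId", "itemId", "rating")

// User 9 and item 9 never appear in the training data.
val test = Seq((0, 1, 2.0f), (9, 0, 3.0f), (1, 9, 1.0f)).toDF("userId", "itemId", "rating")

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("rating")
  .setRank(2)
  .setMaxIter(5)
  .setSeed(42L)

// Default strategy ("nan"): unseen users or items get a NaN prediction.
val withNaN = als.fit(training).transform(test)
val nanCount = withNaN.filter($"prediction".isNaN).count()

// "drop" removes those rows from the output instead.
val dropped = als.setColdStartStrategy("drop").fit(training).transform(test)
val droppedCount = dropped.count()
```

Dropping is mostly useful for evaluation, where a single NaN would otherwise turn a metric such as RMSE into NaN.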

DataFrame JOIN

df.join(df2, df("id") === df2("id"), "left_outer")
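
A small sketch of a left outer join, assuming a local SparkSession and two hypothetical tables that share an `id` column. Spark accepts `"left"`, `"leftouter"`, and `"left_outer"` as names for this join type, and the `$"alias.column"` syntax only resolves when the DataFrames have been given aliases with `as`.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("join-sketch").getOrCreate()
import spark.implicits._

// Hypothetical tables sharing an `id` column.
val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
val df2 = Seq((1, 10.0)).toDF("id", "score")

// Explicit condition: both `id` columns are kept in the result.
val joined = df.join(df2, df("id") === df2("id"), "left_outer")

// Aliases make the $"alias.column" syntax resolvable.
val joinedAliased = df.as("l").join(df2.as("r"), $"l.id" === $"r.id", "left_outer")

// Joining on a column name keeps a single `id` column.
val joinedUsing = df.join(df2, Seq("id"), "left_outer")
```

The `Seq("id")` form is often the more convenient one, since it avoids an ambiguous `id` column in later selects.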

Casting to handle digit overflow

import org.apache.spark.sql.types.DecimalType
df.select($"rating".cast(DecimalType(18, 5)))
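
For reference, a self-contained sketch of the cast, assuming a local SparkSession and a hypothetical `rating` column of doubles. `DecimalType(18, 5)` means up to 18 digits in total, 5 of them after the decimal point.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DecimalType

// Local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("cast-sketch").getOrCreate()
import spark.implicits._

// Hypothetical ratings stored as doubles.
val df = Seq(4.123456789, 3.5).toDF("rating")

// Cast to a fixed precision and scale; `as` keeps the original column name.
val casted = df.select($"rating".cast(DecimalType(18, 5)).as("rating"))

// Inspect the resulting column type.
val castedType = casted.schema("rating").dataType
```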