Turning column values into rows
# Frequency count of each (auth_method, auth_result) combination
from pyspark.sql import functions as F

ntfLog.groupBy("auth_method", "auth_result") \
    .agg(F.count("*").alias("cnt")) \
    .sort("auth_method", "auth_result") \
    .show(20, False)
# Used to flatten a nested JSON object while working with pandas
from pandas import json_normalize  # formerly pandas.io.json.json_normalize

def flatten_json(y):
    out = {}
    def __flatten(x, name=''):
        # Recurse into dicts and lists, building compound key names
        if type(x) is dict:
            for a in x:
                __flatten(x[a], name + a + '_')
        elif type(x) is list:
            for i, a in enumerate(x):
                __flatten(a, name + str(i) + '_')
        else:
            out[name[:-1]] = x  # leaf value: store under the built-up key
    __flatten(y)
    return out
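A typical use (a sketch; records is a hypothetical list of nested dicts) is to flatten each record and hand the result to pandas:

import pandas as pd

flat = [flatten_json(record) for record in records]  # records: assumed input
df = pd.DataFrame(flat)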
%matplotlib inline

buckets = [-87.0, -15, 0, 30, 120]
rdd_histogram_data = ml_bucketized_features\
  .select("ArrDelay")\
  .rdd\
  .flatMap(lambda x: x)\
  .histogram(buckets)
create_hist(rdd_histogram_data)
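create_hist is not defined in this snippet (and ml_bucketized_features is assumed to be an existing DataFrame with an ArrDelay column). A minimal sketch, assuming create_hist should draw a bar chart from the (bucket boundaries, counts) pair that RDD.histogram() returns:

import matplotlib.pyplot as plt

def create_hist(rdd_histogram_data):
    # RDD.histogram() returns (bucket_boundaries, counts);
    # there is one more boundary than there are counts
    bins, counts = rdd_histogram_data
    widths = [bins[i + 1] - bins[i] for i in range(len(bins) - 1)]
    plt.bar(bins[:-1], counts, width=widths, align="edge", edgecolor="black")
    plt.xlabel("ArrDelay")
    plt.ylabel("count")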
# A handy way to pull data back from Spark to the driver for plotting
import numpy as np

def toArr(df, col, dtype=np.int32):
    """
    Collect one column of a DataFrame back to the driver and convert
    it to a numpy.ndarray.
    df: the source DataFrame
    col: the name of the target column
    dtype: the dtype of the target column
    return: the column data as an np.ndarray
    """
    # The original body is missing; one straightforward implementation is to
    # select the column, collect it, and let numpy handle the conversion.
    return np.array(df.select(col).rdd.flatMap(lambda x: x).collect(), dtype=dtype)
Suppose you need to apply the same function to multiple columns in one DataFrame. One straightforward way is to chain withColumn calls (assuming func maps a Column to a Column, so each name is wrapped in col()):

val newDF = oldDF
  .withColumn("colA", func(col("colA")))
  .withColumn("colB", func(col("colB")))
  .withColumn("colC", func(col("colC")))

If you want to save some typing, you can fold over the column names, or rebuild every column in a single select with varargs (including *); both are sketched below.
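A minimal sketch of both patterns, assuming func: Column => Column and org.apache.spark.sql.functions.col in scope:

// Fold the transformation over a list of column names
val columns = Seq("colA", "colB", "colC")
val newDF = columns.foldLeft(oldDF)((df, c) => df.withColumn(c, func(col(c))))

// Or transform every column in one select, splatting the list with : _*
import spark.implicits._
val newDF2 = oldDF.select(oldDF.columns.map(c => func(col(c)).alias(c)): _*)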
A UDF can return only a single column at a time. There are two different ways you can overcome this limitation:
The most general solution is a StructType but you can consider ArrayType or MapType as well.
import org.apache.spark.sql.functions.udf
import spark.implicits._

val df = Seq(
  (1L, 3.0, "a"), (2L, -1.0, "b"), (3L, 0.0, "c")
).toDF("x", "y", "z")
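The snippet stops after creating df; a minimal sketch of the StructType route, assuming a case class for the result (Foobar and its fields are illustrative names, and the derived values are arbitrary):

// Packing two derived values into one struct-typed column
case class Foobar(foo: Double, bar: Double)

val foobarUdf = udf((x: Long, y: Double, z: String) =>
  Foobar(x * y, z.head.toInt * y))

val df1 = df.withColumn("foobar", foobarUdf($"x", $"y", $"z"))

// The struct's fields can then be selected as ordinary columns
df1.select($"foobar.foo", $"foobar.bar").show()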
Turn this

location  name    Jan-2010  Feb-2010  March-2010
A         "test"  12        20        30
B         "foo"   18        20        25

into this

location  name    Date        Value
A         "test"  Jan-2010    12
A         "test"  Feb-2010    20
A         "test"  March-2010  30
B         "foo"   Jan-2010    18
B         "foo"   Feb-2010    20
B         "foo"   March-2010  25
Turn this

days       name
[1,3,5,7]  John

into this

days  name
1     John
3     John
5     John
7     John
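This one is a job for explode(); a minimal PySpark sketch:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([1, 3, 5, 7], "John")], ["days", "name"])

# explode() emits one row per element of the array column
df.select(explode("days").alias("days"), "name").show()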