Turning column values into rows
from pyspark.sql import functions as F
ntfLog.groupby("auth_method","auth_result").agg(F.count("*").alias("cnt"))\
    .sort("auth_method","auth_result").show(20,False)
# Copyright (c) 2017 Cary Kempston
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in all
# Used to flatten json object while using pandas
from pandas.io.json import json_normalize
def flatten_json(y):
    out = {}
    def __flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                __flatten(x[a], name + a + '_')
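For illustration, here is a self-contained sketch of the same idea: recursively walking nested dicts and lists and joining the keys with underscores. The name `flatten_nested`, the `sep` parameter, and the choice to key list items by position are assumptions made for this example, not taken from the snippet above.

```python
def flatten_nested(obj, sep="_"):
    """Flatten nested dicts/lists into a single-level dict (hypothetical helper)."""
    out = {}

    def _walk(value, prefix):
        if isinstance(value, dict):
            for key, child in value.items():
                _walk(child, f"{prefix}{key}{sep}")
        elif isinstance(value, list):
            # Assumption: list items are keyed by their position.
            for index, child in enumerate(value):
                _walk(child, f"{prefix}{index}{sep}")
        else:
            # Leaf value: drop the trailing separator from the accumulated key.
            out[prefix[: len(prefix) - len(sep)]] = value

    _walk(obj, "")
    return out


record = {"id": 1, "user": {"name": "a", "tags": ["x", "y"]}}
flat = flatten_nested(record)
print(flat)
```

Each flattened dict can then be passed as one row to a DataFrame constructor, so every nested field becomes its own column.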
%matplotlib inline
buckets = [-87.0, -15, 0, 30, 120]
rdd_histogram_data = ml_bucketized_features\
    .select("ArrDelay")\
    .rdd\
    .flatMap(lambda x: x)\
    .histogram(buckets)
create_hist(rdd_histogram_data)
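When given an explicit list of bucket edges, `RDD.histogram` returns a `(buckets, counts)` pair. Every bucket is closed on the left and open on the right, except the last, which is closed on both sides; values outside the range are ignored. The sketch below mimics that counting rule in plain Python so the bucket semantics can be checked without a cluster. The helper name `bucket_counts` and the sample delays are made up for this example.

```python
from bisect import bisect_right


def bucket_counts(values, buckets):
    """Count values per bucket: left-closed, right-open, last bucket closed."""
    counts = [0] * (len(buckets) - 1)
    for v in values:
        if v < buckets[0] or v > buckets[-1]:
            continue  # values outside the bucket range are ignored
        # bisect_right - 1 is the index of the last edge that is <= v;
        # min() folds the top edge itself into the final bucket.
        i = min(bisect_right(buckets, v) - 1, len(counts) - 1)
        counts[i] += 1
    return buckets, counts


buckets = [-87.0, -15, 0, 30, 120]
delays = [-20.0, -15.0, -3.0, 0.0, 12.5, 30.0, 120.0, 500.0]  # made-up sample
edges, counts = bucket_counts(delays, buckets)
print(edges, counts)
```

With these edges there are four buckets: `[-87, -15)`, `[-15, 0)`, `[0, 30)` and `[30, 120]`.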
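# Also included: a helper that uses Spark to bring data back to the local machine for plotting
def toArr(df, col, dtype=np.int32):
    """
    Fetch one column of a DataFrame to the local machine and convert it to a numpy.ndarray.
    df: the target DataFrame
    col: the target column name
    dtype: the data type of the target column
    return: the column's data, as an np.ndarray
    """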
curl http://www.centos.org
Suppose you need to apply the same function to multiple columns in one DataFrame. A straightforward way is this:
val newDF = oldDF.withColumn("colA", func("colA")).withColumn("colB", func("colB")).withColumn("colC", func("colC"))
If you want to save some typing, you can try this:
select with varargs including *:
import spark.implicits._
A UDF can return only a single column at a time. There are two different ways you can overcome this limitation:
The most general solution is a StructType but you can consider ArrayType or MapType as well.
import org.apache.spark.sql.functions.udf
val df = Seq(
(1L, 3.0, "a"), (2L, -1.0, "b"), (3L, 0.0, "c")
""" | |
Turn this
location  name    Jan-2010  Feb-2010  March-2010
A         "test"  12        20        30
B         "foo"   18        20        25
into this
location  name  Date  Value
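The wide-to-long reshape described above can be sketched in plain Python over a list of dicts. In pandas the same reshape is what `DataFrame.melt` does. The helper name `melt_rows` and its parameters are assumptions for this example; the sample rows are the ones shown in the table.

```python
def melt_rows(rows, id_cols, var_name="Date", value_name="Value"):
    """Turn every non-id column of each row into its own (var, value) row."""
    long_rows = []
    for row in rows:
        ids = {c: row[c] for c in id_cols}  # columns repeated on every output row
        for col, val in row.items():
            if col not in id_cols:
                long_rows.append({**ids, var_name: col, value_name: val})
    return long_rows


wide = [
    {"location": "A", "name": "test", "Jan-2010": 12, "Feb-2010": 20, "March-2010": 30},
    {"location": "B", "name": "foo", "Jan-2010": 18, "Feb-2010": 20, "March-2010": 25},
]
long_rows = melt_rows(wide, id_cols=["location", "name"])
for r in long_rows:
    print(r)
```

Each input row yields one output row per month column, so two wide rows with three months become six long rows.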
""" | |
Turn this
days       name
[1,3,5,7]  John
into this
days  name
1     John
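Splitting a list-valued column into one row per element can be sketched in plain Python as follows. pandas offers `DataFrame.explode` and Spark offers `explode` in `pyspark.sql.functions` for this kind of reshape. The helper name `explode_rows` is an assumption for this example, and in this sketch a row whose list is empty simply produces no output rows.

```python
def explode_rows(rows, list_col):
    """Emit one output row per element of row[list_col], copying the other columns."""
    exploded = []
    for row in rows:
        for item in row[list_col]:
            # Replace the list with a single element; keep everything else as-is.
            exploded.append({**row, list_col: item})
    return exploded


rows = [{"days": [1, 3, 5, 7], "name": "John"}]
result = explode_rows(rows, "days")
for r in result:
    print(r)
```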