@rjurney
Created December 21, 2016 21:23
Plot a pyspark.RDD.histogram as a pyplot histogram (via bar)
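%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np


def create_hist(rdd_histogram_data):
    """Given the (buckets, counts) tuple from RDD.histogram, plot a pyplot bar chart."""
    full_bins, counts = rdd_histogram_data
    heights = np.array(counts)
    left_edges = full_bins[:-1]
    # Each bar spans one bucket, so its width is the gap between adjacent edges.
    widths = np.diff(full_bins)
    # align='edge' starts each bar at its left bin edge; the default 'center'
    # would shift every bar left by half its width.
    bar = plt.bar(left_edges, heights, width=widths, color='b', align='edge')
    return bar


# ml_bucketized_features is assumed to be an existing Spark DataFrame.
buckets = [-87.0, -15, 0, 30, 120]
rdd_histogram_data = ml_bucketized_features\
    .select("ArrDelay")\
    .rdd\
    .flatMap(lambda x: x)\
    .histogram(buckets)
create_hist(rdd_histogram_data)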
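
For readers without a Spark session at hand, the shape of the data that create_hist receives can be sketched in plain Python. This is only an approximation of what RDD.histogram returns when given explicit buckets: a (buckets, counts) tuple in which every bucket is open on the right except the last, which is closed, and out-of-range values are not counted. The name local_histogram and the sample delays are hypothetical, and None/NaN handling is left out.

```python
from bisect import bisect_right


def local_histogram(values, buckets):
    """Approximate RDD.histogram(buckets) for an in-memory list (hypothetical helper)."""
    counts = [0] * (len(buckets) - 1)
    for v in values:
        if v < buckets[0] or v > buckets[-1]:
            continue  # values outside the bucket range are not counted
        # bisect_right locates the bucket whose left edge is <= v; the final
        # edge is folded into the last bucket so that bucket is closed on the right.
        i = min(bisect_right(buckets, v) - 1, len(counts) - 1)
        counts[i] += 1
    return buckets, counts


# Made-up arrival delays, including two values that fall outside the buckets.
sample_delays = [-100, -87, -20, -15, 0, 29, 30, 120, 121]
edges, counts = local_histogram(sample_delays, [-87.0, -15, 0, 30, 120])
print(counts)
```

The returned tuple has the same layout that create_hist unpacks: element 0 holds the bucket edges and element 1 holds one count per bucket, so there is always one more edge than there are counts.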
@rainsunny

Thanks for the code. Note that plt.bar needs align='edge' here: the default is align='center', which centers each bar on its x value, whereas the x values in this case are the left bin edges.
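
To make the difference concrete, here is a small sketch that computes the horizontal span each bar would cover under the two align settings. It does not call matplotlib; it only models the placement rule (with 'center' a bar runs from x - width/2 to x + width/2, with 'edge' from x to x + width). The helper name bar_spans is hypothetical.

```python
def bar_spans(edges, align="edge"):
    """Return the (start, end) x-range of each bar drawn at the left bin edges."""
    spans = []
    for left, right in zip(edges[:-1], edges[1:]):
        width = right - left
        # 'edge' starts the bar at x; 'center' straddles x by half the width.
        start = left if align == "edge" else left - width / 2
        spans.append((start, start + width))
    return spans


buckets = [-87.0, -15, 0, 30, 120]
print(bar_spans(buckets, "edge"))
# [(-87.0, -15.0), (-15, 0), (0, 30), (30, 120)]
print(bar_spans(buckets, "center"))
# [(-123.0, -51.0), (-22.5, -7.5), (-15.0, 15.0), (-15.0, 75.0)]
```

With 'edge' the spans tile the bucket range exactly. With 'center' they are shifted left by half a width each, so with unequal bucket widths the bars overlap and leave gaps.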