@chathurawidanage
Created October 21, 2018 00:48
K Means Harp
from harp.applications import KMeansApplication
import numpy
my_kmeans = KMeansApplication('My Harp KMeans with Harp')
my_kmeans.args("1000 10 100 5 2 2 10", "/kmeans", "/kmeans", "allreduce")
# Sets the following arguments, described in the docs: https://dsc-spidal.github.io/harp/docs/getting-started/
# 1st: "<num of points> <num of centroids> <vector size> <num of point files per worker> <num of map tasks> <num of threads> <num of iterations>"
# 2nd: <work dir>
# 3rd: <local points dir>
# 4th: "allreduce", which appears to select the edu.iu.kmeans.allreduce launcher used in the command below
my_kmeans.run()
# Invokes the following shell command programmatically, using the arguments defined above:
# hadoop jar harp-java-0.1.0.jar edu.iu.kmeans.allreduce.KMeansLauncher 1000 10 100 5 2 2 10 /kmeans /kmeans
my_kmeans.print_result('/kmeans/centroids/out/output')
# Reads the output that the previous command wrote to HDFS and prints it to the console by programmatically running:
# hadoop fs -cat file_path
arr = my_kmeans.result_to_array('/kmeans/centroids/out/output')
# Reads the stdout of the above shell command and parses it into a numpy array by calling numpy.loadtxt(cat.stdout)
print(arr)
# The results are now in Python memory, so you can plot them or process them further.
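# numpy.sort sorts along the last axis by default, so for a 2-D result each row is sorted independently
sorted_arr = numpy.sort(arr)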
print(sorted_arr)
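
The comments above describe what the wrapper does behind the scenes: build a hadoop command line, run it, then read the HDFS output back with hadoop fs -cat and hand it to numpy.loadtxt. The following is a minimal sketch of those steps, not Harp's actual implementation. The function names (build_launch_command, cat_hdfs_file, parse_centroids) are hypothetical, the mapping from the pattern name to the launcher package is assumed from the single command quoted above, and the output file is assumed to hold whitespace-separated numeric rows.

```python
import io
import shlex
import subprocess


def build_launch_command(params, work_dir, local_points_dir,
                         pattern="allreduce", jar="harp-java-0.1.0.jar"):
    """Assemble the hadoop command quoted in the comments above.

    Assumption: the pattern name maps directly onto the launcher package,
    as in edu.iu.kmeans.allreduce.KMeansLauncher.
    """
    launcher = "edu.iu.kmeans.{}.KMeansLauncher".format(pattern)
    return (["hadoop", "jar", jar, launcher]
            + shlex.split(params)
            + [work_dir, local_points_dir])


def cat_hdfs_file(file_path):
    """Return the text of an HDFS file via `hadoop fs -cat`.

    Requires a Hadoop client on PATH; raises CalledProcessError on failure.
    """
    completed = subprocess.run(["hadoop", "fs", "-cat", file_path],
                               stdout=subprocess.PIPE,
                               universal_newlines=True,
                               check=True)
    return completed.stdout


def parse_centroids(text):
    """Parse whitespace-separated numeric rows into a 2-D numpy array.

    Assumption: one centroid per line, values separated by whitespace.
    """
    # Imported here so the command-building helper works without numpy.
    import numpy
    return numpy.loadtxt(io.StringIO(text), ndmin=2)
```

Under these assumptions, build_launch_command("1000 10 100 5 2 2 10", "/kmeans", "/kmeans") yields the argument list for the command shown earlier, and parse_centroids(cat_hdfs_file('/kmeans/centroids/out/output')) would play the role of result_to_array. Passing the command as a list rather than a single shell string avoids quoting problems with paths that contain spaces.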