alias VegaLite, as: Vl

# a helper to plot labeled data
mk_data_layer = fn labeled_data ->
  Vl.new()
  |> Vl.data_from_series(
    x: Nx.to_flat_list(labeled_data[y: 0]),
    y: Nx.to_flat_list(labeled_data[y: 1]),
    label: Nx.to_flat_list(labeled_data[y: 2])
  )
  |> Vl.mark(:point)
  |> Vl.encode_field(:x, "x", type: :quantitative, title: "X")
  |> Vl.encode_field(:y, "y", type: :quantitative, title: "Y")
  |> Vl.encode_field(:color, "label", type: :nominal)
end

Vl.new(title: "Raw Data w/ True Labels", width: 700, height: 700)
|> Vl.layers([mk_data_layer.(labeled)])
Clustering - Initialization
k = 2

# the unlabeled data
data = labeled[y: 0..1]

# draw the initial centroids uniformly at random within the range that the data spans
initial_centroids =
  0..(n_dims - 1)
  |> Enum.reduce(nil, fn ix, acc ->
    pos = [x_min + (x_max - x_min) * :rand.uniform(), y_min + (y_max - y_min) * :rand.uniform()]

    case acc do
      nil -> Nx.tensor([pos ++ [ix]], names: [:x, :y])
      _ -> Nx.concatenate([acc, Nx.tensor([pos ++ [ix]])])
    end
  end)
# helper function to calculate the distance from data to centroids (unlabeled)
dist_fn = fn d, centroids ->
  c = Nx.new_axis(centroids, 1)

  Nx.subtract(d, c)
  |> Nx.power(2)
  |> Nx.sum(axes: [2])
  |> Nx.sqrt()
end

# helper function to find labels
find_labels = fn d, centroids ->
  dist_fn.(d, centroids)
  |> Nx.argmin(axis: 0)
end

new_labels = find_labels.(data, initial_centroids[y: 0..(n_dims - 1)])
alg_labeled = Nx.concatenate([data, Nx.new_axis(new_labels, 1)], axis: 1)
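To make the shapes in dist_fn and find_labels concrete: adding an axis turns the {k, d} centroids into {k, 1, d}, subtracting them from the {n, d} data broadcasts to {k, n, d}, summing over the last axis leaves a {k, n} distance matrix, and the argmin over axis 0 picks one centroid index per point. The sketch below does the same assignment step with plain Elixir lists instead of Nx tensors. The module name and the sample points are hypothetical and only serve as illustration.

```elixir
defmodule NearestCentroidSketch do
  # Euclidean distance between two points given as equal-length lists.
  def distance(p, q) do
    Enum.zip(p, q)
    |> Enum.map(fn {a, b} -> (a - b) * (a - b) end)
    |> Enum.sum()
    |> :math.sqrt()
  end

  # For each point, return the index of the closest centroid.
  # On a tie, the lowest index wins because Enum.min_by keeps the first minimum.
  def find_labels(points, centroids) do
    Enum.map(points, fn point ->
      centroids
      |> Enum.with_index()
      |> Enum.min_by(fn {centroid, _ix} -> distance(point, centroid) end)
      |> elem(1)
    end)
  end
end

# Hypothetical sample data: two points near each centroid.
points = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [4.8, 5.3]]
centroids = [[0.0, 0.5], [5.0, 4.5]]

IO.inspect(NearestCentroidSketch.find_labels(points, centroids))
# prints [0, 0, 1, 1]
```

The list version does k * n distance computations one at a time, whereas the Nx version expresses the same work as a single broadcasted tensor operation.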
Vl.new(title: "Result of First Iteration", width: 700, height: 700)
|> Vl.layers([mk_data_layer.(alg_labeled), mk_centroid_layer.(new_centroids)])
Clustering - N Iterations
n_iters = 10

# rename some variables
centroids = new_centroids
labels = new_labels

{final_centroids, final_labels} =
  Enum.reduce(1..n_iters, {centroids, labels}, fn _ix, {pvs_centroids, pvs_labels} ->
    new_centroids = calc_centroids_map.(data, pvs_labels, pvs_centroids)
    new_centroids = label_centroids.(new_centroids)
    new_labels = find_labels.(data, new_centroids[y: 0..(n_dims - 1)])
    {new_centroids, new_labels}
  end)
alg_labeled = Nx.concatenate([data, Nx.new_axis(final_labels, 1)], axis: 1)

true_labels_layer =
  Vl.new()
  |> Vl.data_from_series(
    x: Nx.to_flat_list(labeled[y: 0]),
    y: Nx.to_flat_list(labeled[y: 1]),
    label: Nx.to_flat_list(labeled[y: 2])
  )
  |> Vl.mark(:point, size: 200)
  |> Vl.encode_field(:x, "x", type: :quantitative, title: "X")
  |> Vl.encode_field(:y, "y", type: :quantitative, title: "Y")
  |> Vl.encode_field(:color, "label", type: :nominal)

Vl.new(title: "Result of N Iterations", width: 700, height: 700)
|> Vl.layers([
  mk_data_layer.(alg_labeled),
  true_labels_layer,
  mk_centroid_layer.(final_centroids)
])
1. It requires reshaping, but since reshape is O(1), this is acceptable :)
2. calc_centroids_map does not need to return a map. Actually, for maps of size > 32 you would end up swapping centroid labels this way. You just need to change it to Enum.map and instead of Map.put, you just return the corresponding Nx.take/Nx.divide directly
3. The last visualization was kind of confusing. The outer circle is the initial label and the inner, the final label? Perhaps you could've incorporated x for the initial and o for the final one, and then added the new 2 values to the legend. Also, keep in mind that since you initialize the centroids randomly, the labels can switch between each other, so "true" labels is perhaps not the best terminology
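The point about calc_centroids_map can be sketched as follows. Erlang maps with more than 32 keys are not stored in key order, so walking such a map can return centroids in a different order than their labels, while Enum.map over the cluster indices always yields them in index order. The gist's actual calc_centroids_map is not shown here, so this is only a hedged illustration using plain lists rather than Nx.take/Nx.divide. The module name, the function names, and the choice to keep the previous centroid for an empty cluster are all assumptions.

```elixir
defmodule CentroidUpdateSketch do
  # Per-coordinate mean of a non-empty list of points.
  def mean(points) do
    n = length(points)
    Enum.zip_with(points, fn coords -> Enum.sum(coords) / n end)
  end

  # Returns a list of centroids where position i holds the centroid of
  # cluster i, so no map (and no key ordering) is involved.
  # Assumption: a cluster with no assigned points keeps its previous centroid.
  def update(points, labels, previous_centroids) do
    previous_centroids
    |> Enum.with_index()
    |> Enum.map(fn {previous, cluster} ->
      # Keep only the points whose label matches this cluster index.
      members = for {point, ^cluster} <- Enum.zip(points, labels), do: point

      case members do
        [] -> previous
        _ -> mean(members)
      end
    end)
  end
end

# Hypothetical sample data: two points in cluster 0, one in cluster 1.
points = [[0.0, 0.0], [2.0, 2.0], [10.0, 10.0]]
labels = [0, 0, 1]
previous = [[1.0, 1.0], [9.0, 9.0]]

IO.inspect(CentroidUpdateSketch.update(points, labels, previous))
# prints [[1.0, 1.0], [10.0, 10.0]]
```

Because the result is a list indexed by cluster, it can be turned into a tensor directly, without a separate step to restore label order.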
Thanks @polvalente ! This all makes sense. Re: (3), the outer circle is actually a "true" label since I generated the data at the outset from two distributions and the color corresponds to which distribution. It's a little contrived, but it was a helpful comparison for me to see if the algorithm was doing what I thought it should.
Other than this, I liked your code a lot!