Last active
September 25, 2015 19:31
import sys

import numpy as np
import pandas as pd
from sklearn.neighbors import KDTree


def nearest_neighbor(df1, df2):
    # Build a KD-tree on the haystack (df1), then look up the single
    # closest haystack row for every needle (df2).
    kdt = KDTree(df1.values)
    distances, indexes = kdt.query(df2.values, k=1, return_distance=True)
    # Label each needle's distance with the GEOM_ID of its matched haystack row.
    return pd.Series(distances.flatten(), index=df1.index.values[indexes.flatten()])


args = sys.argv[1:]
df1 = pd.read_csv(args[0], index_col="GEOM_ID")
# Square-root the area so it is in linear units, like the other columns.
df1["area"] = df1["area"].apply(np.sqrt)
df2 = pd.read_csv(args[1], index_col="GEOM_ID")
df2["area"] = df2["area"].apply(np.sqrt)
# Sort ascending so the worst matches end up in the tail.
s = nearest_neighbor(df1, df2).sort_values()
print(s.describe())
print(s.tail())
GEOM_ID | minx | miny | maxx | maxy | centroidx | centroidy | length | area
---|---|---|---|---|---|---|---|---
9720406908141 | 610557.748247 | 4201444.69009 | 610580.952298 | 4201468.10975 | 610562.999503 | 4201462.7687 | 82.9033338922 | 115.961904597
9720406908140 | 610557.748242 | 4201444.69005 | 610580.977218 | 4201468.10993 | 610562.999506 | 4201462.76869 | 82.9035472875 | 115.962321392
This is another try at creating geom ids, but with a different strategy. Here I'm just trying to match one list of geometries against another. I do this by building a dataframe that contains centroidx, centroidy, length, area, and the four bounding-box coordinates (minx, miny, maxx, maxy) for each geometry: one dataframe for the "haystack" and one for the "needles."
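The gist does not show how the feature columns were produced, so the following is only a minimal sketch of one way to do it. It assumes each geometry is a simple polygon given as a plain list of (x, y) vertices; the function name `geometry_features` and the sample square are hypothetical, and only the column names come from the CSV.

```python
import math


def geometry_features(ring):
    """Summarize a simple polygon (list of (x, y) vertices) as the eight
    feature columns used for matching. Hypothetical helper, not from the gist."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    # Pair each vertex with the next one, wrapping around to close the ring.
    edges = list(zip(ring, ring[1:] + ring[:1]))
    # Shoelace formula: per-edge cross products give the signed area
    # and the area-weighted centroid.
    cross = [x0 * y1 - x1 * y0 for (x0, y0), (x1, y1) in edges]
    signed_area = sum(cross) / 2.0
    cx = sum((e[0][0] + e[1][0]) * c for e, c in zip(edges, cross)) / (6.0 * signed_area)
    cy = sum((e[0][1] + e[1][1]) * c for e, c in zip(edges, cross)) / (6.0 * signed_area)
    # Perimeter: sum of the edge lengths.
    length = sum(math.hypot(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in edges)
    return {
        "minx": min(xs), "miny": min(ys), "maxx": max(xs), "maxy": max(ys),
        "centroidx": cx, "centroidy": cy,
        "length": length, "area": abs(signed_area),
    }


# A 2 x 2 square as a stand-in for a parcel outline.
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
features = geometry_features(square)
print(features)
```

One such dictionary per geometry, collected into a dataframe indexed by GEOM_ID, would give a table shaped like the CSV in this gist.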
In my test, I'm pretty sure all the needles are present in the haystack, but several operations have been applied along the way that introduce some "noise." For my example (the 2M parcels in the Bay Area), the describe() output for the distances to the nearest neighbor looks like this:
I added a CSV with the two rows for the geometries that were furthest apart (that .0249222 number); they still look pretty close.
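As a sanity check, the distance between those two rows can be recomputed by hand. This is a small sketch, not part of the original script: it copies the two CSV rows, square-roots the area the same way the script does, and takes the plain Euclidean distance that a KD-tree query reports by default. The names `row_a`, `row_b`, and `feature_distance` are made up for the illustration.

```python
import math

# The two CSV rows, in column order:
# minx, miny, maxx, maxy, centroidx, centroidy, length, area
row_a = [610557.748247, 4201444.69009, 610580.952298, 4201468.10975,
         610562.999503, 4201462.7687, 82.9033338922, 115.961904597]
row_b = [610557.748242, 4201444.69005, 610580.977218, 4201468.10993,
         610562.999506, 4201462.76869, 82.9035472875, 115.962321392]


def feature_distance(a, b):
    # Replace the area (last column) with its square root, as the script does.
    a = a[:-1] + [math.sqrt(a[-1])]
    b = b[:-1] + [math.sqrt(b[-1])]
    # Euclidean distance over all eight features.
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


distance = feature_distance(row_a, row_b)
# Absolute difference per column, to see which feature drives the distance.
per_column = [abs(p - q) for p, q in zip(row_a, row_b)]
print(distance)
print(per_column)
```

With the rounded values shown in the CSV this comes out at roughly 0.0249, and nearly all of it comes from the maxx column, which differs by about 2.5 cm if the coordinates are in meters (an assumption; the gist does not state the units).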