@jeffdeville
Created May 6, 2015 17:28
Thoughts on Scraping real estate data
// Each parent key with a sub-hash maps to a Ruby class, named in CamelCase:
// geo -> Geo, property_details -> PropertyDetails
// Property
{
"property":
{
"sources": [
{
"source": ["zillow","trulia","redfin","realtor"],
"geo": {
"street_address": "_",
"city": "_",
"state": "",
"zip": "",
"lat": -1, // Decimal
"lon": -1, // Decimal
"county": "",
"mls": ""
},
"property_details": { // PropertyDetails
"beds": -1, // Decimal
"baths": -1, // Decimal
"sq_ft": -1, // Integer
"lot_size_sq_ft": -1, // Integer (43,560 square feet per acre)
"school_district": "blah blah", //text
"year_built": 1920, // Integer
"property_type": ["single_family", "duplex", "triplex", "etc"],
"features": [
{
"category": "", // "Interior Features, Parking / Garage, etc",
"value": "", // String
}
],
"zillow_id": "123456", // String
"images": {
"caption": "any description",
"url": "full url"
},
},
"comps": [
// an array of properties. Same class and fields
],
"finance": {
"property_taxes": {
"year": 2014, // Integer,
"taxes": 123.23, // Decimal
"tax_assessment": 123456
},
"last_sold_on": "1/2/3", // Date
"last_sold_for": -1, // Decimal
}
}
]
},
"neighborhood": {
"walk_score": 58, // Integer, redfin.com
"walk_score_desc": "", // String,
"dollar_per_sq_ft_image": "", // URL trulia.com,
"median_house_values": [ // Array because order matters
{
"location": "string",
"list_price": 123231, // Decimal (dollars)
"dollar_per_sq_ft": 123.33, // Decimal (dollars)
"sale_list_ratio": 0.95, // Decimal (percentage)
}
]
}
}
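The naming rule in the comments at the top of the schema (geo -> Geo, property_details -> PropertyDetails) can be sketched in Ruby roughly as below. This is only an illustration: `class_name_for`, `build`, the two `Struct` definitions, and the sample values are hypothetical and not part of any existing library, and the structs list only a subset of the schema's fields.

```ruby
# Hypothetical helper: turn a snake_case schema key into a Ruby class name,
# e.g. "property_details" -> "PropertyDetails".
def class_name_for(key)
  key.to_s.split("_").map(&:capitalize).join
end

# Hypothetical value classes for two of the sub-hashes (subset of fields).
# keyword_init lets them be built directly from a hash with symbol keys.
Geo = Struct.new(:street_address, :city, :state, :zip, :lat, :lon, :county, :mls,
                 keyword_init: true)
PropertyDetails = Struct.new(:beds, :baths, :sq_ft, :year_built, keyword_init: true)

# Look up the class that matches a schema key and build it from its fields.
def build(key, attrs)
  Object.const_get(class_name_for(key)).new(**attrs)
end

# Placeholder values, for illustration only.
geo = build("geo", city: "Anytown", state: "PA")
details = build(:property_details, beds: 3, baths: 1.5)
```

Fields that a source does not provide are simply left as `nil` in this sketch, rather than using the `-1` and `""` placeholders shown in the schema.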

Each source will have a lot of the same data, but presented in different ways. However, they don't always agree. So there are a few steps involved:

  1. Gather all of the data from all of the sources
  2. When disagreements exist, figure out which data is accurate
  3. Expose the agreed-upon data on the top-level object, but show the 'choices' when disagreements exist.
property = SherlockHomes.get("ADDRESS")

# If everyone agrees there are 3 beds
print property.beds
# 3

# If some people disagree
print property.beds
# UncertaintyError - Zillow: 3, Trulia: 3, Redfin: 4
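
One possible way to get the behavior shown above is sketched here, assuming the per-source values for a field have already been gathered into a hash keyed by source name. The `UncertaintyError` class body and the `reconcile` helper are hypothetical; the notes only describe the desired outcome, not how `SherlockHomes` would implement it.

```ruby
# Hypothetical error raised when sources report different values for a field.
# It keeps the per-source "choices" so callers can inspect them.
class UncertaintyError < StandardError
  attr_reader :field, :choices

  def initialize(field, choices)
    @field = field
    @choices = choices
    super(choices.map { |source, value| "#{source.capitalize}: #{value}" }.join(", "))
  end
end

# Hypothetical reconciler: `readings` maps a source name to the value it reported.
# Returns the value when every reporting source agrees, raises otherwise.
def reconcile(field, readings)
  reported = readings.reject { |_source, value| value.nil? }
  return reported.values.first if reported.values.uniq.size <= 1

  raise UncertaintyError.new(field, reported)
end

# Everyone agrees: the shared value is returned.
reconcile(:beds, { "zillow" => 3, "trulia" => 3, "redfin" => 3 })

# Sources disagree: the error message lists each source's value.
begin
  reconcile(:beds, { "zillow" => 3, "trulia" => 3, "redfin" => 4 })
rescue UncertaintyError => e
  puts "#{e.class} - #{e.message}"
end
```

This sketch treats any difference as a disagreement and does not attempt step 2 (deciding which source is accurate); a source that reports nothing for a field is ignored rather than counted as a dissenting vote.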