{
  title: "title of data",              // eg: 'the xyz murder', 'blizzard of 2011', 'chelmsburry bike path', 'MBTA green line', 'bus stop schedule event' etc
  type: "type of data",                // eg: event, physical feature
  set: "what data set data is from if any", // if data is from national data set e.g. 2000 census
  category: [category, category],      // develop this taxonomy
  description: "description of data",  // text description of data
  start: RFC3339DateTime,              // optional start date
  end: RFC3339DateTime,                // optional end date
  geometry: {}                         // optional geojson location feature (http://geojson.org/geojson-spec.html)
}
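For illustration, here is what a filled-in document against that schema might look like; every value below is invented, just to show the shape:

// hypothetical example document; all values are made up for illustration
{
  title: "blizzard of 2011",
  type: "event",
  set: "",                              // left empty: not from a national data set
  category: ["weather", "emergency"],   // placeholder taxonomy terms
  description: "Heavy snowfall closing roads across the metro area.",
  start: "2011-02-01T18:00:00-05:00",   // RFC 3339
  end: "2011-02-02T09:00:00-05:00",
  geometry: {                           // GeoJSON Point: [longitude, latitude]
    type: "Point",
    coordinates: [-71.0589, 42.3601]
  }
}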
boazsender commented Feb 6, 2011
- -1 to the ID being a hash of all that; let's use built-in Couch IDs, no?
- +1 to the use of RFC3339DateTime over epoch time
- +1 to the geometry property and the use of the GeoJSON spec
Well, the ID is deterministic this way... not random. One question I've struggled with is how to reconcile two copies of the same dataset.
Say we both download a dataset consisting of 10 fire hydrants. I add one, you add one, and then we both upload our new sets of 11. There are 12 unique fire hydrants in our little universe at this point, but we need a way of de-duplicating the overlapping 10. By setting the ID as a hash of the location, timestamp, and category, we are effectively saying that if a fire hydrant moves location, it is now a new, unique fire hydrant.
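Roughly, the deterministic ID I have in mind is something like this; just a sketch, assuming Node's built-in crypto module, and the choice of fields to hash is only a suggestion:

var crypto = require('crypto');

// Deterministic ID: hash of the geometry, start timestamp, and categories.
// If any of those change, the hash (and therefore the identity) changes,
// which is exactly the "fire hydrant moved, so it's new" behavior described above.
function deterministicId(doc) {
  var material = JSON.stringify(doc.geometry || {}) + '|' +
                 (doc.start || '') + '|' +
                 (doc.category || []).join(',');
  return crypto.createHash('sha1').update(material).digest('hex');
}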
Say, for instance, that we are not actually adding a new fire hydrant but instead updating a hydrant's location to be more accurate. In this case we would want to reconcile our new hydrant object as a new revision of the old instance of the hydrant, not as a completely new hydrant.
Aha! Actually, at this point I suppose we don't need a deterministic ID; instead we could use Couch's built-in random UUIDs, as you stated. The key is really the reconciliation API: making sure that we can add revisions to existing objects in certain cases and create new objects in others.
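Still thinking out loud, one possible shape for that reconciliation step. findCandidate here is a hypothetical lookup (say, a view keyed on category and rough location), and db.insert stands in for whatever write call the Couch client exposes; none of this is settled API:

// If a matching doc exists, write the incoming record as a new revision of it;
// otherwise let Couch assign a fresh UUID and create a new object.
function reconcile(db, incoming, findCandidate, callback) {
  findCandidate(incoming, function (err, existing) {
    if (err) return callback(err);
    if (existing) {
      incoming._id  = existing._id;   // same logical object
      incoming._rev = existing._rev;  // so this write becomes a revision
    }
    db.insert(incoming, callback);
  });
}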
/typing out loud
This is a really good point. I suppose that hashing IDs to represent the data is a good way to prevent conflicts. I would think, however, that our validators would have to evaluate the data in a more holistic way.
While not wildly complicated, this International Open Civic Data Service will have several complex points, the biggest of which, I believe, is the one you raise: dealing with and merging duplicates. Consider, for example, two very similar data points from two different data sets. What is the desired behavior?
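To make that concrete, here is the rough kind of holistic check I mean; the thresholds and the fields compared are completely made up, only meant to show that a validator would look at more than an ID:

// Rough sketch of a holistic duplicate check between two documents.
// The distance threshold and the title/category rules are placeholders.
function looksLikeDuplicate(a, b) {
  var sharedCategory = (a.category || []).some(function (c) {
    return (b.category || []).indexOf(c) !== -1;
  });
  var closeEnough = false;
  if (a.geometry && b.geometry &&
      a.geometry.type === 'Point' && b.geometry.type === 'Point') {
    var dx = a.geometry.coordinates[0] - b.geometry.coordinates[0];
    var dy = a.geometry.coordinates[1] - b.geometry.coordinates[1];
    closeEnough = Math.sqrt(dx * dx + dy * dy) < 0.0005; // roughly 50m in degrees
  }
  var sameTitle = !!a.title && !!b.title &&
      a.title.toLowerCase() === b.title.toLowerCase();
  return sharedCategory && (closeEnough || sameTitle);
}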