The reader is advised to check out the official docs first, which are more comprehensive and accurate: https://udst.github.io/urbansim/examples.html
One of the features of UrbanSim is regression. Here we provide an example of how to write the configuration that allows you to tinker with the regression for residential property values in UrbanSim for the SF Bay Area.

At the end you should be able to change the specification of the regression for residential housing (a hedonic regression). While it's not required that you have used R before, we occasionally refer to R's lm function, because it's a standard that many people are familiar with.
In order to follow along with this example, you will need to read along with the bayarea_urbansim repository and the urbansim_defaults repository. When we refer to these, we'll use the repository name as the base directory for any file we refer to (e.g. bayarea_urbansim/README.md or urbansim_defaults/setup.py).
How to:
- Change the Regression Specification
- Run the Regression
- Change/Find the Housing Data for the Regression
- Joining Neighborhood Data to the Main Data Table in the Regression
- How the YAML model type Gets Converted to an UrbanSim Model
- Using Transformed Data in your Model Specifications
- Where Results are Written
## How to Change the Regression Specification

You may be accustomed to a regression function to which you pass a specification of the regression as a string, along with data whose columns correspond to the specification. A simple example for housing prices in R:
```r
specification <- "sale_price ~ square_feet + bathrooms"
results <- lm(specification, data)
```
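For readers who'd rather see the same idea in Python, here is a minimal by-hand version of that fit using NumPy's least squares (the column names mirror the R snippet and the data is made up):

```python
import numpy as np

# Made-up data standing in for the R example's `data` argument.
square_feet = np.array([800.0, 1200.0, 1500.0, 2000.0])
bathrooms = np.array([1.0, 2.0, 2.0, 3.0])
# Prices generated from known coefficients so we can check the fit.
sale_price = 50_000 + 150.0 * square_feet + 20_000.0 * bathrooms

# Design matrix for "sale_price ~ square_feet + bathrooms":
# an intercept column plus one column per predictor.
X = np.column_stack([np.ones_like(square_feet), square_feet, bathrooms])
coef, *_ = np.linalg.lstsq(X, sale_price, rcond=None)
print(coef)  # ≈ [50000, 150, 20000]
```

Real regression libraries (R's lm, or the statsmodels machinery UrbanSim builds on) do this design-matrix construction for you from the formula string.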
For the Bay Area UrbanSim model, you have to dig around a bit to change this specification. For example, to change the specification of the housing regression, you need to change the following lines of bayarea_urbansim/configs/rsh.yaml:
```yaml
model_expression: np.log(price_per_sqft) ~ I(year_built < 1940) + I(year_built > 2000)
  + np.log1p(sqft_per_unit) + ave_income + stories + poor + renters + sfdu + autoPeakTotal
  + transitPeakTotal + autoOffPeakRetail + ave_lot_size_per_unit + sum_nonresidential_units
  + sum_residential_units
```
## How to Run the Regression

Mechanically, the way that you run this regression is to run python Estimation.py in the root directory of the bayarea_urbansim repository. Below, we copy the contents of that script:
```python
import time

import models
import pandas as pd
import urbansim.sim.simulation as sim

sim.run([
    "neighborhood_vars",  # accessibility variables
    "rsh_estimate",       # residential sales hedonic
])
```
As you can see, it imports the models, pandas, and sim modules. It then passes sim.run a list of steps: neighborhood_vars runs first, followed by rsh_estimate.
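The pattern behind this script — named steps registered in a table and then run in order — can be sketched without the real libraries (this is NOT the orca/urbansim API, just an illustration of the idea):

```python
# Registry mapping step names to functions.
_steps = {}

def step(name):
    """Register a function as a named pipeline step."""
    def decorator(func):
        _steps[name] = func
        return func
    return decorator

def run(step_names):
    """Run the registered steps in the order given."""
    for name in step_names:
        _steps[name]()

log = []

@step("neighborhood_vars")
def neighborhood_vars():
    log.append("computed accessibility variables")

@step("rsh_estimate")
def rsh_estimate():
    log.append("estimated residential sales hedonic")

run(["neighborhood_vars", "rsh_estimate"])
print(log)
```

The ordering matters: neighborhood_vars computes aggregated variables that the hedonic step then consumes, which is why it appears first in the list.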
## How to Change/Find the Housing Data for the Regression

We'll just focus on changing out the data source for now, so let's take a closer look at rsh_estimate. It is defined in urbansim_defaults/urbansim_defaults/models.py; below we copy the snippet of code that defines it. The function it calls, utils.hedonic_estimate(), looks a bit like lm in R:
```python
@orca.step('rsh_estimate')
def rsh_estimate(homesales, aggregations):
    return utils.hedonic_estimate("rsh.yaml", homesales, aggregations)
```
We pass a specification (rsh.yaml) and data (homesales) to utils.hedonic_estimate(), along with aggregations. We'll explore aggregations a bit below, but let's just focus on the homesales data for now.

We've already seen that we can change the regression specification string in rsh.yaml. And now we can see that changing the value taken by homesales changes the data passed to the model. So where is homesales defined?
homesales is defined in bayarea_urbansim/datasources.py (1):
```python
@orca.table('homesales', cache=True)
def homesales(store):
    # we need to read directly from the store here. Why? The buildings
    # table itself drops a bunch of columns we need - most notably the
    # redfin_sale_price column. Why? Because the developer model will
    # append rows (new buildings) to the buildings table and we don't want
    # the developer model to know about redfin_sale_price (which is
    # meaningless for forecast buildings)
    df = store['buildings']
    df = df.dropna(subset=["redfin_sale_price"])
    df["price_per_sqft"] = df.eval('redfin_sale_price / sqft_per_unit')
    df = df.query("sqft_per_unit > 200")
    df = df.dropna(subset=["price_per_sqft"])
    return df
```
Here we can see that 'buildings' is a DataFrame read from the HDF5 store for Bay Area UrbanSim, which is pointed to in bayarea_urbansim/configs/settings.yaml:

```yaml
store: 2015_06_01_bayarea_v3.h5
```
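To make the filtering in homesales concrete, here is the same dropna/eval/query chain run on a tiny synthetic buildings table (the data is made up):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for store['buildings'].
df = pd.DataFrame({
    "redfin_sale_price": [500_000.0, np.nan, 900_000.0, 120_000.0],
    "sqft_per_unit": [1_000.0, 800.0, 1_500.0, 150.0],
})

df = df.dropna(subset=["redfin_sale_price"])   # drop rows with no sale price
df["price_per_sqft"] = df.eval("redfin_sale_price / sqft_per_unit")
df = df.query("sqft_per_unit > 200")           # drop implausibly small units
df = df.dropna(subset=["price_per_sqft"])

print(df["price_per_sqft"].tolist())  # [500.0, 600.0]
```

Two of the four synthetic rows survive: the NaN-priced building and the 150 sqft unit are filtered out, and the remainder get a price_per_sqft column to serve as the regression's dependent variable.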
## Joining Neighborhood Data to the Main Data Table in the Regression

Below we take a look at the function that is called to estimate the regression. A quick look reveals that in addition to passing the main table to the regression, we can also pass join_tbls. For those familiar with hedonic regression for real estate, the usefulness of passing join_tbls that contain neighborhood information is immediately apparent:
```python
def hedonic_estimate(cfg, tbl, join_tbls):
    """
    Estimate the hedonic model for the specified table

    Parameters
    ----------
    cfg : string
        The name of the yaml config file from which to read the hedonic model
    tbl : DataFrameWrapper
        A dataframe for which to estimate the hedonic
    join_tbls : list of strings
        A list of land use dataframes to give neighborhood info around the
        buildings - will be joined to the buildings using existing broadcasts
    """
    cfg = misc.config(cfg)
    df = to_frame(tbl, join_tbls, cfg)
    return yaml_to_class(cfg).fit_from_cfg(df, cfg)
```
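The actual join uses existing broadcasts, but the effect is essentially a left merge of zone-level (neighborhood) columns onto building-level rows. A hypothetical sketch in pandas (the table and column names here are invented for illustration):

```python
import pandas as pd

# Building-level observations (invented columns).
buildings = pd.DataFrame({
    "building_id": [1, 2, 3],
    "zone_id": [10, 10, 20],
    "sqft_per_unit": [900, 1100, 1400],
})

# Zone-level neighborhood info, as a join_tbl might supply.
zones = pd.DataFrame({
    "zone_id": [10, 20],
    "ave_income": [11.0, 11.5],
})

# Left-join the neighborhood columns onto each building row.
df = buildings.merge(zones, on="zone_id", how="left")
print(df["ave_income"].tolist())  # [11.0, 11.0, 11.5]
```

After the join, each building row carries its neighborhood's ave_income, so a zone-level variable can appear directly in a building-level model_expression.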
## How the YAML model type Gets Converted to an UrbanSim Model

Below is the function that converts the YAML file that you write into a model class for use in UrbanSim. For example, if your config file contains the key/value pair model_type: regression, yaml_to_class returns the RegressionModel class:
```python
def yaml_to_class(cfg):
    """
    Convert the name of a yaml file and get the Python class of the model
    associated with the configuration

    Parameters
    ----------
    cfg : str
        The name of the yaml configuration file

    Returns
    -------
    The model class associated with the config's model_type
    """
    import yaml
    model_type = yaml.load(open(cfg))["model_type"]
    return {
        "regression": RegressionModel,
        "segmented_regression": SegmentedRegressionModel,
        "discretechoice": MNLDiscreteChoiceModel,
        "segmented_discretechoice": SegmentedMNLDiscreteChoiceModel
    }[model_type]
```
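The dispatch itself is just a dictionary lookup keyed on model_type. A minimal self-contained version of the same pattern, with placeholder classes standing in for the real UrbanSim model classes:

```python
# Placeholder classes standing in for the real UrbanSim model classes.
class RegressionModel:
    pass

class MNLDiscreteChoiceModel:
    pass

def config_to_class(cfg_dict):
    """Map an already-parsed config dict to a model class by model_type."""
    return {
        "regression": RegressionModel,
        "discretechoice": MNLDiscreteChoiceModel,
    }[cfg_dict["model_type"]]

# A config declaring `model_type: regression` maps to RegressionModel.
print(config_to_class({"model_type": "regression"}).__name__)  # RegressionModel
```

An unknown model_type raises a KeyError at lookup time, which is how a typo in your YAML surfaces.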
The model class returned then does the actual fitting. For a segmented model, for example, fit estimates one regression per segment:

```python
def fit(self, data, debug=False):
    """
    Fit each of the models in the group.

    Parameters
    ----------
    data : pandas.DataFrame
        Must have a column with the same name as `segmentation_col`.
    debug : bool
        If set to true (default false) will pass the debug parameter
        to model estimation.

    Returns
    -------
    fits : dict of statsmodels.regression.linear_model.OLSResults
        Keys are the segment names.
    """
    with log_start_finish(
            'fitting models in group {}'.format(self.name), logger):
        return {name: self.models[name].fit(df, debug=debug)
                for name, df in self._iter_groups(data)}
```
So now it should be clear that the stories variable specified in the regression above is a column of observations in the homesales table.
## Using Transformed Data in your Model Specifications

However, other variables that are specified in the regression are created by another model (2), and not simply imported in datasources.py as we described above. For example, sum_residential_units is the outcome of the model described by the neighborhood_vars.yaml config, which is passed to sim.run before the residential housing estimation model.
Here's an example of a single variable definition from neighborhood_vars.yaml:

```yaml
variable_definitions:
  - name: sum_residential_units
    dataframe: buildings
    varname: residential_units
    radius: 1500
    apply: np.log1p
```
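This says: for every location, sum residential_units over the buildings within a 1500-unit radius, then apply np.log1p. The real aggregation is done by pandana along the street network; the sketch below uses straight-line distance and made-up coordinates just to illustrate the semantics:

```python
import numpy as np
import pandas as pd

# Hypothetical building locations (meters from some origin) and unit counts.
buildings = pd.DataFrame({
    "x": [0.0, 1000.0, 5000.0],
    "y": [0.0, 0.0, 0.0],
    "residential_units": [10, 20, 40],
})

def sum_within_radius(df, x, y, radius, varname):
    """Brute-force stand-in for a network aggregation: sum `varname`
    over rows within `radius` of the point (x, y)."""
    dist = np.hypot(df["x"] - x, df["y"] - y)
    return df.loc[dist <= radius, varname].sum()

# sum_residential_units at the origin with radius 1500, then the
# `apply: np.log1p` transform from the config.
total = sum_within_radius(buildings, 0.0, 0.0, 1500.0, "residential_units")
value = np.log1p(total)
print(total)  # 30 (the building at x=5000 is outside the radius)
```

The log1p transform keeps the variable well-behaved when a location has zero units nearby, since log1p(0) is 0 rather than negative infinity.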
## Where Results are Written

rsh.yaml ends up being both where we read the model specification from and where we write the fit_parameters out to. Below we have an example of the fit parameters as written out for Bay Area UrbanSim:
```yaml
fit_parameters:
  Coefficient:
    I(year_built < 1940)[T.True]: 0.1033682688375832
    I(year_built > 2000)[T.True]: -0.006005524050255999
    Intercept: 2.57048634107115
    autoOffPeakRetail: -0.281197051832964
    autoPeakTotal: 0.3307628135825661
    ave_income: 0.4389761139847244
    ave_lot_size_per_unit: 0.0776791124037567
    np.log1p(sqft_per_unit): -0.45907905026951035
    poor: -0.04725225454921351
    renters: 0.16895038525397293
    sfdu: -0.08556828228865788
    stories: 0.08567483620381938
    sum_nonresidential_units: 0.0053824618073246115
    sum_residential_units: -0.03159639166738088
    transitPeakTotal: 0.012124801890625493
```
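Because the dependent variable is np.log(price_per_sqft), each coefficient acts multiplicatively on price. As one illustration (our reading of the output, not something the docs state), the dummy for pre-1940 construction implies roughly an 11% price-per-sqft premium, all else equal:

```python
import numpy as np

# Coefficient on I(year_built < 1940) copied from the fit above.
b_pre1940 = 0.1033682688375832

# With a log-price dependent variable, a dummy's coefficient converts to a
# multiplicative effect on price_per_sqft via exp().
effect = np.exp(b_pre1940)
print(round((effect - 1) * 100, 1))  # ≈ 10.9 (% premium, all else equal)
```

The same exp() reading applies to the continuous variables, per unit change in the regressor.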
(1) datasources.py is imported by models.py, which, as we saw, is imported by the script we use to run the regression above (Estimation.py).
This bayarea_urbansim/datasources.py file is not to be confused with urbansim_defaults/datasources.py. The same goes for the models.py files in the two repositories.