The reader is advised to check out the official docs first, which are more comprehensive and accurate: https://udst.github.io/urbansim/examples.html
One of the features of UrbanSim is regression. Here we provide an example of how to write the configuration that allows you to tinker with the regression for residential property values in UrbanSim for the SF Bay Area.

At the end you should be able to change the specification of the regression for residential housing (a hedonic regression). While it's not required that you have used R before, we occasionally refer to R's lm function, because it's a standard that many people are familiar with.
In order to follow along with this example, you will need to read along with the bayarea_urbansim repository and the urbansim_defaults repository. When we refer to these, we'll use the repository name as the base directory for any file we refer to (e.g. bayarea_urbansim/README.md or urbansim_defaults/setup.py).
How to:
- Change the Regression Specification
- Run the Regression
- Change/Find the Housing Data for the Regression
- Joining Neighborhood Data to the Main Data Table in the Regression
- How the YAML model type Gets Converted to an UrbanSim Model
- Using Transformed Data in your Model Specifications
- Where Results are Written
## How to Change the Regression Specification

You may be accustomed to a regression function to which you pass a specification of the regression as a string, along with data whose columns correspond to the specification. A simple example for housing prices in R:
```r
specification <- "sale_price ~ square_feet + bathrooms"
results <- lm(specification, data)
```
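For readers who'd rather see the same idea in Python, here is a minimal by-hand version of that fit using NumPy's least squares (the column names mirror the R snippet and the data is made up):

```python
import numpy as np

# Made-up data standing in for the R example's `data` argument.
square_feet = np.array([800.0, 1200.0, 1500.0, 2000.0])
bathrooms = np.array([1.0, 2.0, 2.0, 3.0])
# Prices generated from known coefficients so we can check the fit.
sale_price = 50_000 + 150.0 * square_feet + 20_000.0 * bathrooms

# Design matrix for "sale_price ~ square_feet + bathrooms":
# an intercept column plus one column per predictor.
X = np.column_stack([np.ones_like(square_feet), square_feet, bathrooms])
coef, *_ = np.linalg.lstsq(X, sale_price, rcond=None)
print(coef)  # ≈ [50000, 150, 20000]
```

Real regression libraries (R's lm, or the statsmodels machinery UrbanSim builds on) do this design-matrix construction for you from the formula string.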
For the Bay Area UrbanSim model, you have to dig around a bit to change this specification. For example, to change the specification of the housing regression, you need to change the following lines of bayarea_urbansim/configs/rsh.yaml:
```yaml
model_expression: np.log(price_per_sqft) ~ I(year_built < 1940) + I(year_built > 2000)
  + np.log1p(sqft_per_unit) + ave_income + stories + poor + renters + sfdu + autoPeakTotal
  + transitPeakTotal + autoOffPeakRetail + ave_lot_size_per_unit + sum_nonresidential_units
  + sum_residential_units
```
## How to Run the Regression

Mechanically, the way that you run this regression is to run python Estimation.py in the root directory of the bayarea_urbansim repository. Below, we copy the contents of that script:
```python
import time

import models
import pandas as pd
import urbansim.sim.simulation as sim

sim.run([
    "neighborhood_vars",  # accessibility variables
    "rsh_estimate",       # residential sales hedonic
])
```
As you can see, it imports the models, pandas, and sim modules. It then passes sim.run a list of steps: neighborhood_vars runs first, followed by rsh_estimate.
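The pattern behind this script — named steps registered in a table and then run in order — can be sketched without the real libraries (this is NOT the orca/urbansim API, just an illustration of the idea):

```python
# Registry mapping step names to functions.
_steps = {}

def step(name):
    """Register a function as a named pipeline step."""
    def decorator(func):
        _steps[name] = func
        return func
    return decorator

def run(step_names):
    """Run the registered steps in the order given."""
    for name in step_names:
        _steps[name]()

log = []

@step("neighborhood_vars")
def neighborhood_vars():
    log.append("computed accessibility variables")

@step("rsh_estimate")
def rsh_estimate():
    log.append("estimated residential sales hedonic")

run(["neighborhood_vars", "rsh_estimate"])
print(log)
```

The ordering matters: neighborhood_vars computes aggregated variables that the hedonic step then consumes, which is why it appears first in the list.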
## How to Change/Find the Housing Data for the Regression

We'll just focus on changing out the data source for now, so let's take a closer look at rsh_estimate. It is defined in urbansim_defaults/urbansim_defaults/models.py; below we copy the snippet of code that defines it. The function it calls, utils.hedonic_estimate(), looks a bit like lm in R:
```python
@orca.step('rsh_estimate')
def rsh_estimate(homesales, aggregations):
    return utils.hedonic_estimate("rsh.yaml", homesales, aggregations)
```
We pass a specification (rsh.yaml) and data (homesales) to utils.hedonic_estimate(), along with aggregations. We'll explore aggregations a bit below, but let's just focus on the homesales data for now.

We've already seen that we can change the regression specification string in rsh.yaml. And now we can see that changing the value taken by homesales changes the data passed to the model. So where is homesales defined?
homesales is defined in bayarea_urbansim/datasources.py (1):
```python
@orca.table('homesales', cache=True)
def homesales(store):
    # we need to read directly from the store here. Why? The buildings
    # table itself drops a bunch of columns we need - most notably the
    # redfin_sale_price column. Why? Because the developer model will
    # append rows (new buildings) to the buildings table and we don't want
    # the developer model to know about redfin_sale_price (which is
    # meaningless for forecast buildings)
    df = store['buildings']
    df = df.dropna(subset=["redfin_sale_price"])
    df["price_per_sqft"] = df.eval('redfin_sale_price / sqft_per_unit')
    df = df.query("sqft_per_unit > 200")
    df = df.dropna(subset=["price_per_sqft"])
    return df
```
Here we can see that 'buildings' is a DataFrame read from the HDF5 store for Bay Area UrbanSim, which is pointed to in bayarea_urbansim/configs/settings.yaml:

```yaml
store: 2015_06_01_bayarea_v3.h5
```
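To make the filtering in homesales concrete, here is the same dropna/eval/query chain run on a tiny synthetic buildings table (the data is made up):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for store['buildings'].
df = pd.DataFrame({
    "redfin_sale_price": [500_000.0, np.nan, 900_000.0, 120_000.0],
    "sqft_per_unit": [1_000.0, 800.0, 1_500.0, 150.0],
})

df = df.dropna(subset=["redfin_sale_price"])   # drop rows with no sale price
df["price_per_sqft"] = df.eval("redfin_sale_price / sqft_per_unit")
df = df.query("sqft_per_unit > 200")           # drop implausibly small units
df = df.dropna(subset=["price_per_sqft"])

print(df["price_per_sqft"].tolist())  # [500.0, 600.0]
```

Two of the four synthetic rows survive: the NaN-priced building and the 150 sqft unit are filtered out, and the remainder get a price_per_sqft column to serve as the regression's dependent variable.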
## Joining Neighborhood Data to the Main Data Table in the Regression

Below we take a look at the function that is called to estimate the regression. A quick look reveals that in addition to passing the main table to the regression, we can also pass join_tbls. For those familiar with hedonic regression for real estate, the usefulness of passing join_tbls that contain neighborhood information is immediately apparent:
```python
def hedonic_estimate(cfg, tbl, join_tbls):
    """
    Estimate the hedonic model for the specified table

    Parameters
    ----------
    cfg : string
        The name of the yaml config file from which to read the hedonic model
    tbl : DataFrameWrapper
        A dataframe for which to estimate the hedonic
    join_tbls : list of strings
        A list of land use dataframes to give neighborhood info around the
        buildings - will be joined to the buildings using existing broadcasts
    """
    cfg = misc.config(cfg)
    df = to_frame(tbl, join_tbls, cfg)
    return yaml_to_class(cfg).fit_from_cfg(df, cfg)
```
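The actual join uses existing broadcasts, but the effect is essentially a left merge of zone-level (neighborhood) columns onto building-level rows. A hypothetical sketch in pandas (the table and column names here are invented for illustration):

```python
import pandas as pd

# Building-level observations (invented columns).
buildings = pd.DataFrame({
    "building_id": [1, 2, 3],
    "zone_id": [10, 10, 20],
    "sqft_per_unit": [900, 1100, 1400],
})

# Zone-level neighborhood info, as a join_tbl might supply.
zones = pd.DataFrame({
    "zone_id": [10, 20],
    "ave_income": [11.0, 11.5],
})

# Left-join the neighborhood columns onto each building row.
df = buildings.merge(zones, on="zone_id", how="left")
print(df["ave_income"].tolist())  # [11.0, 11.0, 11.5]
```

After the join, each building row carries its neighborhood's ave_income, so a zone-level variable can appear directly in a building-level model_expression.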
## How the YAML model type Gets Converted to an UrbanSim Model

Below is the function that converts the YAML file that you write into a model class for use in UrbanSim. For example, if your config file contains the key/value pair model_type: regression, yaml_to_class returns the RegressionModel class:
```python
def yaml_to_class(cfg):
    """
    Convert the name of a yaml file and get the Python class of the model
    associated with the configuration

    Parameters
    ----------
    cfg : str
        The name of the yaml configuration file

    Returns
    -------
    The model class associated with the config's model_type
    """
    import yaml
    model_type = yaml.load(open(cfg))["model_type"]
    return {
        "regression": RegressionModel,
        "segmented_regression": SegmentedRegressionModel,
        "discretechoice": MNLDiscreteChoiceModel,
        "segmented_discretechoice": SegmentedMNLDiscreteChoiceModel
    }[model_type]
```
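The dispatch itself is just a dictionary lookup keyed on model_type. A minimal self-contained version of the same pattern, with placeholder classes standing in for the real UrbanSim model classes:

```python
# Placeholder classes standing in for the real UrbanSim model classes.
class RegressionModel:
    pass

class MNLDiscreteChoiceModel:
    pass

def config_to_class(cfg_dict):
    """Map an already-parsed config dict to a model class by model_type."""
    return {
        "regression": RegressionModel,
        "discretechoice": MNLDiscreteChoiceModel,
    }[cfg_dict["model_type"]]

# A config declaring `model_type: regression` maps to RegressionModel.
print(config_to_class({"model_type": "regression"}).__name__)  # RegressionModel
```

An unknown model_type raises a KeyError at lookup time, which is how a typo in your YAML surfaces.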
The model class returned then does the actual fitting. For a segmented model, for example, fit estimates one regression per segment:

```python
def fit(self, data, debug=False):
    """
    Fit each of the models in the group.

    Parameters
    ----------
    data : pandas.DataFrame
        Must have a column with the same name as `segmentation_col`.
    debug : bool
        If set to true (default false) will pass the debug parameter
        to model estimation.

    Returns
    -------
    fits : dict of statsmodels.regression.linear_model.OLSResults
        Keys are the segment names.
    """
    with log_start_finish(
            'fitting models in group {}'.format(self.name), logger):
        return {name: self.models[name].fit(df, debug=debug)
                for name, df in self._iter_groups(data)}
```
So now it should be clear that the stories variable specified in the regression above is a column of observations in the homesales table.
## Using Transformed Data in your Model Specifications

However, other variables that are specified in the regression are created by another model (2), and not simply imported in datasources.py as we described above. For example, sum_residential_units is the outcome of the model described by the neighborhood_vars.yaml config, which is passed to sim.run before the residential housing estimation model.
Here's an example of a single variable definition from neighborhood_vars.yaml:

```yaml
variable_definitions:
  - name: sum_residential_units
    dataframe: buildings
    varname: residential_units
    radius: 1500
    apply: np.log1p
```
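This says: for every location, sum residential_units over the buildings within a 1500-unit radius, then apply np.log1p. The real aggregation is done by pandana along the street network; the sketch below uses straight-line distance and made-up coordinates just to illustrate the semantics:

```python
import numpy as np
import pandas as pd

# Hypothetical building locations (meters from some origin) and unit counts.
buildings = pd.DataFrame({
    "x": [0.0, 1000.0, 5000.0],
    "y": [0.0, 0.0, 0.0],
    "residential_units": [10, 20, 40],
})

def sum_within_radius(df, x, y, radius, varname):
    """Brute-force stand-in for a network aggregation: sum `varname`
    over rows within `radius` of the point (x, y)."""
    dist = np.hypot(df["x"] - x, df["y"] - y)
    return df.loc[dist <= radius, varname].sum()

# sum_residential_units at the origin with radius 1500, then the
# `apply: np.log1p` transform from the config.
total = sum_within_radius(buildings, 0.0, 0.0, 1500.0, "residential_units")
value = np.log1p(total)
print(total)  # 30 (the building at x=5000 is outside the radius)
```

The log1p transform keeps the variable well-behaved when a location has zero units nearby, since log1p(0) is 0 rather than negative infinity.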
## Where Results are Written

rsh.yaml ends up being both where we read the model specification from and where we write the fit_parameters out to. Below we have an example of the fit parameters as written out for Bay Area UrbanSim:
```yaml
fit_parameters:
  Coefficient:
    I(year_built < 1940)[T.True]: 0.1033682688375832
    I(year_built > 2000)[T.True]: -0.006005524050255999
    Intercept: 2.57048634107115
    autoOffPeakRetail: -0.281197051832964
    autoPeakTotal: 0.3307628135825661
    ave_income: 0.4389761139847244
    ave_lot_size_per_unit: 0.0776791124037567
    np.log1p(sqft_per_unit): -0.45907905026951035
    poor: -0.04725225454921351
    renters: 0.16895038525397293
    sfdu: -0.08556828228865788
    stories: 0.08567483620381938
    sum_nonresidential_units: 0.0053824618073246115
    sum_residential_units: -0.03159639166738088
    transitPeakTotal: 0.012124801890625493
```
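Because the dependent variable is np.log(price_per_sqft), each coefficient acts multiplicatively on price. As one illustration (our reading of the output, not something the docs state), the dummy for pre-1940 construction implies roughly an 11% price-per-sqft premium, all else equal:

```python
import numpy as np

# Coefficient on I(year_built < 1940) copied from the fit above.
b_pre1940 = 0.1033682688375832

# With a log-price dependent variable, a dummy's coefficient converts to a
# multiplicative effect on price_per_sqft via exp().
effect = np.exp(b_pre1940)
print(round((effect - 1) * 100, 1))  # ≈ 10.9 (% premium, all else equal)
```

The same exp() reading applies to the continuous variables, per unit change in the regressor.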
(1) datasources.py is imported by models.py, which, as we saw, is imported by the script we use to run the regression above (Estimation.py).
This bayarea_urbansim/datasources.py file is not to be confused with urbansim_defaults/datasources.py. The same goes for the models.py files in the two repositories.