Skip to content

Instantly share code, notes, and snippets.

import pandas
def get_hourly_entries(df):
'''
The data in the MTA Subway Turnstile data reports on the cumulative
number of entries and exits per row. Assume that you have a dataframe
called df that contains only the rows for a particular turnstile machine
(i.e., unique SCP, C/A, and UNIT). This function should change
these cumulative entry numbers to a count of entries since the last reading
(i.e., entries since the last row in the dataframe).
import pandas
def get_hourly_exits(df):
'''
The data in the MTA Subway Turnstile data reports on the cumulative
number of entries and exits per row. Assume that you have a dataframe
called df that contains only the rows for a particular turnstile machine
(i.e., unique SCP, C/A, and UNIT). This function should change
these cumulative exit numbers to a count of exits since the last reading
(i.e., exits since the last row in the dataframe).
import pandas
def time_to_hour(time):
'''
Given an input variable time that represents time in the format of:
"00:00:00" (hour:minutes:seconds)
Write a function to extract the hour part from the input variable time
and return it as an integer. For example:
1) if hour is 00, your code should return 0
import datetime
def reformat_subway_dates(date):
'''
The dates in our subway data are formatted in the format month-day-year.
The dates in our weather underground data are formatted year-month-day.
In order to join these two data sets together, we'll want the dates formatted
the same way. Write a function that takes as its input a date in the MTA Subway
data format, and returns a date in the weather underground format.
import numpy as np
import pandas
import matplotlib.pyplot as plt
def entries_histogram(turnstile_weather):
'''
Before we perform any analysis, it might be useful to take a
look at the data we're hoping to analyze. More specifically, let's
examine the hourly entries in our NYC subway data and determine what
distribution the data follows. This data is stored in a dataframe
No
No. Because the data size of rain and not rain are not the same.
import numpy as np
import scipy
import scipy.stats
import pandas
def mann_whitney_plus_means(turnstile_weather):
'''
This function will consume the turnstile_weather dataframe containing
our final turnstile weather data.
Yes
From the results in step 3 we can see that the mean of with_rain and without_rain are quite close. And the P-value of the scipy's Mann-Whitney implementation is small, less than 5%.
import numpy as np
import pandas
import statsmodels.api as sm
"""
In this question, you need to:
1) implement the linear_regression() procedure
2) Select features (in the predictions procedure) and make predictions.
"""
import numpy as np
import scipy
import matplotlib.pyplot as plt
def plot_residuals(turnstile_weather, predictions):
'''
Using the same methods that we used to plot a histogram of entries
per hour for our data, why don't you make a histogram of the residuals
(that is, the difference between the original hourly entry data and the predicted values).
Try different binwidths for your histogram.