Mengyuz’s gists

Mengyuz / Problem set 2-8 - Get Hourly Entries

Created June 5, 2015 15:11

	import pandas

	def get_hourly_entries(df):
		'''
		The MTA Subway Turnstile data reports the cumulative
		number of entries and exits per row. Assume that you have a dataframe
		called df that contains only the rows for a particular turnstile machine
		(i.e., unique SCP, C/A, and UNIT). This function should change
		these cumulative entry numbers to a count of entries since the last reading
		(i.e., entries since the last row in the dataframe).
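
The preview ends before the function body. As a minimal sketch of the cumulative-to-per-reading conversion the docstring describes, the version below uses `Series.diff()`. The column names `ENTRIESn` and `ENTRIESn_hourly`, the `_sketch` suffix, and filling the first row with 0 are assumptions made for illustration; none of them appear in the preview.

```python
import pandas


def get_hourly_entries_sketch(df, cumulative_col="ENTRIESn",
                              hourly_col="ENTRIESn_hourly"):
    # Column names are hypothetical defaults, not taken from the gist.
    out = df.copy()
    # diff() subtracts the previous row's value from each row.
    # The first row has no previous reading, so diff() yields NaN there;
    # replacing it with 0 is an assumption of this sketch.
    out[hourly_col] = out[cumulative_col].diff().fillna(0)
    return out


# Made-up cumulative readings for a single turnstile.
readings = pandas.DataFrame({"ENTRIESn": [100, 130, 130, 175]})
hourly = get_hourly_entries_sketch(readings)
```

The input frame is copied so the caller's data is not modified in place.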

Mengyuz / Problem set 2-9 - Get Hourly Exits

Created June 5, 2015 15:12

	import pandas

	def get_hourly_exits(df):
		'''
		The MTA Subway Turnstile data reports the cumulative
		number of entries and exits per row. Assume that you have a dataframe
		called df that contains only the rows for a particular turnstile machine
		(i.e., unique SCP, C/A, and UNIT). This function should change
		these cumulative exit numbers to a count of exits since the last reading
		(i.e., exits since the last row in the dataframe).
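
The body of this function is also cut off in the preview. One way to express "exits since the last row" is to subtract a copy of the column shifted down by one row, as sketched here. The column names `EXITSn` and `EXITSn_hourly` and the fill value of 0 for the first row are assumptions, not details from the gist.

```python
import pandas


def get_hourly_exits_sketch(df, cumulative_col="EXITSn",
                            hourly_col="EXITSn_hourly"):
    # Column names are hypothetical defaults, not taken from the gist.
    out = df.copy()
    # shift(1) moves every value down one row, so each row lines up
    # with the cumulative count from the previous reading.
    previous = out[cumulative_col].shift(1)
    # The first row has no previous reading (NaN); use 0 as an assumed default.
    out[hourly_col] = (out[cumulative_col] - previous).fillna(0)
    return out


# Made-up cumulative readings for a single turnstile.
exit_readings = pandas.DataFrame({"EXITSn": [50, 55, 80]})
hourly_exits = get_hourly_exits_sketch(exit_readings)
```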

Mengyuz / Problem set 2-10 - Time to Hour

Created June 5, 2015 15:12

	import pandas

	def time_to_hour(time):
		'''
		Given an input variable time that represents time in the format of:
		"00:00:00" (hour:minutes:seconds)

		Write a function to extract the hour part from the input variable time
		and return it as an integer. For example:
		1) if hour is 00, your code should return 0
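
The preview stops partway through the examples. A minimal sketch of the extraction it describes, assuming the input is always a string in the "hour:minutes:seconds" layout shown above (the `_sketch` name is illustrative):

```python
def time_to_hour_sketch(time):
    # Take the text before the first colon and convert it to an integer.
    # int() drops leading zeros, so "00" becomes 0.
    return int(time.split(":")[0])


midnight = time_to_hour_sketch("00:00:00")
late_evening = time_to_hour_sketch("23:15:09")
```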

Mengyuz / Problem set 2-11 - Reformat Subway Dates

Created June 5, 2015 15:13

	import datetime

	def reformat_subway_dates(date):
		'''
		The dates in our subway data use the format month-day-year.
		The dates in our weather underground data use the format year-month-day.

		In order to join these two data sets together, we'll want the dates formatted
		the same way. Write a function that takes as its input a date in the MTA Subway
		data format, and returns a date in the weather underground format.
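
The preview does not show the exact date strings involved. The sketch below assumes a two-digit year on the subway side (`%m-%d-%y`) and a four-digit year on the weather side (`%Y-%m-%d`); both format strings are assumptions and are exposed as parameters so they can be changed.

```python
import datetime


def reformat_subway_dates_sketch(date, in_fmt="%m-%d-%y", out_fmt="%Y-%m-%d"):
    # in_fmt and out_fmt are assumed formats, not confirmed by the gist.
    # strptime parses the month-day-year string into a datetime object,
    # and strftime writes it back out as year-month-day.
    parsed = datetime.datetime.strptime(date, in_fmt)
    return parsed.strftime(out_fmt)


converted = reformat_subway_dates_sketch("05-21-11")
```

Parsing through `datetime` also rejects impossible dates, which plain string slicing would let through.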

Mengyuz / Problem set 3-1 - Exploratory Data Analysis

Created June 5, 2015 15:14

	import numpy as np
	import pandas
	import matplotlib.pyplot as plt

	def entries_histogram(turnstile_weather):
		'''
		Before we perform any analysis, it might be useful to take a
		look at the data we're hoping to analyze. More specifically, let's
		examine the hourly entries in our NYC subway data and determine what
		distribution the data follows. This data is stored in a dataframe
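
The docstring is truncated before it names the dataframe columns. As a rough sketch of looking at the distribution of hourly entries, the function below draws a single histogram. The column name `ENTRIESn_hourly`, the bin count, and the sample values are all assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas


def entries_histogram_sketch(turnstile_weather, column="ENTRIESn_hourly", bins=20):
    # The column name is a hypothetical default, not taken from the gist.
    fig, ax = plt.subplots()
    ax.hist(turnstile_weather[column], bins=bins)
    ax.set_xlabel(column)
    ax.set_ylabel("Frequency")
    ax.set_title("Histogram of " + column)
    return fig, ax


# Made-up values, only to exercise the function.
sample = pandas.DataFrame({"ENTRIESn_hourly": [0, 5, 5, 12, 40, 300, 7, 3]})
fig, ax = entries_histogram_sketch(sample, bins=4)
```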

Mengyuz / Problem set 3-2 - Welch's t-Test?

Created June 5, 2015 15:15

	No


	No. The rain and no-rain samples differ in size, and, more to the point, Welch's t-test assumes roughly normally distributed data.

Mengyuz / Problem set 3-3 - Mann-Whitney U-Test

Created June 5, 2015 15:17

	import numpy as np
	import scipy
	import scipy.stats
	import pandas

	def mann_whitney_plus_means(turnstile_weather):
		'''
		This function will consume the turnstile_weather dataframe containing
		our final turnstile weather data.
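
The rest of the docstring and the body are not shown. A minimal sketch of a function in this spirit splits the hourly entries by a rain indicator, computes both means, and runs `scipy.stats.mannwhitneyu`. The column names `ENTRIESn_hourly` and `rain`, the 0/1 encoding of the indicator, and the explicit two-sided alternative are assumptions.

```python
import numpy as np
import pandas
import scipy.stats


def mann_whitney_plus_means_sketch(turnstile_weather,
                                   value_col="ENTRIESn_hourly",
                                   rain_col="rain"):
    # Column names and the 0/1 rain encoding are hypothetical.
    with_rain = turnstile_weather.loc[turnstile_weather[rain_col] == 1, value_col]
    without_rain = turnstile_weather.loc[turnstile_weather[rain_col] == 0, value_col]
    # The alternative is passed explicitly because the default has
    # differed between scipy versions.
    U, p = scipy.stats.mannwhitneyu(with_rain, without_rain,
                                    alternative="two-sided")
    return np.mean(with_rain), np.mean(without_rain), U, p


# Made-up data, only to exercise the function.
toy = pandas.DataFrame({
    "rain": [1, 1, 1, 0, 0, 0, 0],
    "ENTRIESn_hourly": [10, 20, 30, 5, 15, 25, 35],
})
with_rain_mean, without_rain_mean, U, p = mann_whitney_plus_means_sketch(toy)
```

The Mann-Whitney U-test does not assume normality and does not need equally sized samples, so it fits skewed count data such as ridership.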

Mengyuz / Problem set 3-4 - Ridership on Rainy vs. Nonrainy Days

Created June 5, 2015 15:19

	Yes

	Step 3 shows that the means of with_rain and without_rain are quite close, but the p-value from scipy's Mann-Whitney implementation is small (below 5%), so the difference between the two distributions is statistically significant.

Mengyuz / Problem set 3-5 - Linear Regression

Created June 5, 2015 15:19

	import numpy as np
	import pandas
	import statsmodels.api as sm

	"""
	In this question, you need to:
	1) implement the linear_regression() procedure
	2) Select features (in the predictions procedure) and make predictions.

	"""

Mengyuz / Problem set 3-6 - Plotting Residuals

Created June 5, 2015 15:20

	import numpy as np
	import scipy
	import matplotlib.pyplot as plt

	def plot_residuals(turnstile_weather, predictions):
		'''
		Using the same methods that we used to plot a histogram of entries
		per hour for our data, why don't you make a histogram of the residuals
		(that is, the difference between the original hourly entry data and the predicted values).
		Try different binwidths for your histogram.
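
The function body is not part of the preview. A minimal sketch of the residual histogram the docstring asks for is given below: residuals are computed as observed minus predicted, and the bin count is a parameter so different widths can be tried. The column name `ENTRIESn_hourly` and the sample numbers are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas


def plot_residuals_sketch(turnstile_weather, predictions,
                          column="ENTRIESn_hourly", bins=10):
    # The column name is a hypothetical default, not taken from the gist.
    # Residual = original hourly entries minus the predicted value.
    residuals = turnstile_weather[column] - predictions
    fig, ax = plt.subplots()
    ax.hist(residuals, bins=bins)
    ax.set_xlabel("Residual (observed - predicted)")
    ax.set_ylabel("Frequency")
    return residuals, ax


# Made-up observations and predictions, only to exercise the function.
observed = pandas.DataFrame({"ENTRIESn_hourly": [10, 20, 30, 40]})
predicted = np.array([12, 18, 33, 40])
residuals, residual_ax = plot_residuals_sketch(observed, predicted, bins=5)
```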