Created
September 25, 2019 17:42
Save viniciusmss/7809aff0c4781e74c145874e756c8ffb to your computer and use it in GitHub Desktop.
######### RUN THE CODE BELOW IN R. R-STUDIO IS THE RECOMMENDED IDE. BOTH R AND R-STUDIO ARE FREE.
######### QUESTIONS SHOULD BE POSTED TO PIAZZA
######### THE ACTUAL ASSIGNMENT BEGINS AFTER THE SETUP CODE BELOW
### Multilateral Development Institution Data
foo <- read.csv("https://tinyurl.com/yb4phxx8") # read in the data
# column names
names(foo)
# dimensions of the data set
dim(foo)
# quick look at the data structure
head(foo)
# one thing to be very careful with (in this data set) is the use of dates. 8 columns involve dates.
# take note of the columns representing calendar dates
date.columns <- c(11, 12, 14, 15, 16, 17, 18, 25)
# these columns need some tweaking--I want to address missing values, calling the blank (empty)
# elements "NA" instead of leaving them blank, and I wish to tell R these are "Date" objects.
for(i in date.columns) # this "for loop" only loops through the "date.columns" -- no other columns.
{
  # identify which values are missing in the "i"th column of the foo data set
  which_values_are_missing <- which(as.character(foo[, i]) == "")
  # those values that are missing (blank) in the "i"th column are replaced by <NA>
  # because R knows how to handle "NA" -- NA means something special in R--blanks are handled
  # more unpredictably (which is bad).
  foo[which_values_are_missing, i] <- NA
  # last step--replace each of these columns (which is structured as a column of "factor" values)
  # as a column of dates--i.e., convert them to an object of "class" = Date. They are dates, after all.
  # And if you convert them to the Date class, R will know they are dates and you can manipulate
  # dates in a simple, straightforward way. Otherwise, you won't be able to easily manipulate them
  # arithmetically. E.g., see the simple Date operations just below...
  # **By the way, if you don't understand what a "factor" is in R, you should Google it.**
  foo[, i] <- as.Date(as.character(foo[, i]))
}
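# A quick aside on the factor gotcha mentioned above (a toy illustration, not
# part of the assignment data): converting a factor straight to numbers uses its
# internal integer codes, which is why the loop goes factor -> character -> Date.
toy_dates <- factor(c("1968-03-13", "1968-07-03"))
as.numeric(toy_dates) # [1] 1 2 -- the internal codes, not anything date-like
as.Date(as.character(toy_dates)) # the safe route: character first, then Date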
# Now R knows that these columns are comprised of dates
# for example... Replicate this yourself...
foo[3,12]
# [1] "1968-03-13"
foo[4,12]
# [1] "1968-07-03"
foo[3,12] - foo[4,12]
# Time difference of -112 days
# Also, one additional helpful hint... How to eliminate rows with NAs...
# The "is.na" function--for more info, Google it or type ?is.na at the R command prompt in the console.
which.have.NAs <- which(is.na(foo$Rating)) # for which rows is the claim "is.na" a TRUE claim?
# Then, if you wanted to, e.g., remove all those rows, retaining only the rows with ratings...
new_foo <- foo[-which.have.NAs, ]
# Notice I called this tweaked data set "new_foo" instead of rewriting over the original data set...
# It's a bit safer to do this, in case I decide I want to quickly revert back to the original data set.
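# An equivalent one-liner (a sketch; same result as the two steps above):
# logical indexing with !is.na() avoids the intermediate which() vector entirely.
new_foo_alt <- foo[!is.na(foo$Rating), ]
identical(dim(new_foo_alt), dim(new_foo)) # should be TRUE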
###########################################################################
### ASSIGNMENT 1 -- You may want to read ALL the questions before you begin.
### NOTE: FOR ALL QUESTIONS BELOW, ONLY CONSIDER PROJECTS WITH
### non-missing "CirculationDate" >= 2008-01-01.
### EXCLUDE ALL OTHER PROJECTS FROM YOUR ANALYSIS.
### YOU MUST provide a link to your R code. ------ DON'T FORGET TO DO THIS!!!!!!!!!!!!
# Take note of the column names: i.e., you can type: names(foo)
# fyi: the column called "Rating" is the success rating at completion. 0 = lowest, 3 = highest.
noNA_foo <- new_foo[!is.na(new_foo$CirculationDate), ]
df <- noNA_foo[which(noNA_foo$CirculationDate >= as.Date("2008-01-01")), ]
# Check:
summary(df$CirculationDate)
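# An optional defensive check (a sketch): stop early if any project slipped
# through the date filter, rather than discovering it downstream.
stopifnot(!any(is.na(df$CirculationDate)),
          all(df$CirculationDate >= as.Date("2008-01-01")))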
# (1) When projects are approved, they are approved for a certain period of time (until the time of
# "original completion date"). While projects are active, this "original" completion date is
# often pushed out (extended), and then there is a "revised" completion date.
# You have been told that project duration at approval is generally about
# 2 years (24 months). In other words, (purportedly) when projects are approved, the difference
# between the original project completion date and the approval date is (supposedly)
# approximately 24 months.
# (a) Is this claim true? Explain. (Remember, for this ENTIRE assignment, only consider
# projects with CirculationDate >= 2008-01-01. This will be your only reminder...)
# Some projects have no original completion dates.
# We need to disregard them.
# (Logical indexing with !is.na() is safer than df[-which(...), ]: if no rows
# matched, -which() would return an empty index and drop every row.)
df_withDates <- df[!is.na(df$OriginalCompletionDate), ]
expected_duration <- mean(df_withDates$OriginalCompletionDate) - mean(df_withDates$ApprovalDate)
expected_duration
# Time difference of 643.5879 days
# which is equivalent to 1.76 years. Hence, the claim is false, although quite close.
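# The same gap expressed in months (a rough sketch, using ~30.44 days per average
# month) makes the comparison with the claimed 24 months explicit:
as.numeric(expected_duration) / 30.44 # roughly 21 months, i.e., about 3 months short of the claim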
# Has project duration at approval changed over time (consider projects circulated earlier
# and circulated later). Be sure to discuss mean durations, median durations, and the
# interquartile range of durations (using the "quantile" function).
# Approximate suggested length: 3-5 sentences
# Create a circulation year column (numeric, so it plots cleanly on the x-axis below)
df_withDates$CirculationYear <- as.numeric(format(df_withDates$CirculationDate, "%Y"))
# And an expected duration column
df_withDates$ExpectedDuration <- df_withDates$OriginalCompletionDate - df_withDates$ApprovalDate
# The simplest way to do it probably involves a for loop in which you iterate through
# circulation years. I'm providing a fancier solution so that you can get used to
# reading more complex R code. The dplyr library provides a data processing pipeline
# that is very similar to data analysis in SQL. DataCamp has tons of resources on it.
library(dplyr)
durationByYear <- df_withDates %>% # %>% means "pass what is to the left to the function that follows"
  group_by(CirculationYear) %>%
  summarise(mean.duration = mean(ExpectedDuration),
            median.duration = median(ExpectedDuration),
            IQR.duration = quantile(ExpectedDuration, 0.75) - quantile(ExpectedDuration, 0.25))
durationByYear
plot(durationByYear$CirculationYear, durationByYear$mean.duration,
     pch=15, col="red", ylim=c(100, 800),
     xlab="Circulation Year", ylab="Duration (days)", main="Expected Duration of Projects")
points(durationByYear$CirculationYear, durationByYear$median.duration, pch=16, col="blue")
points(durationByYear$CirculationYear, durationByYear$IQR.duration, pch=17, col="green")
legend("bottomleft", pch=c(15,16,17),
       col=c("red","blue","green"),
       legend=c("Mean Duration", "Median Duration", "IQR of Durations"))
grid(nx=NA, ny=NULL)
# RESPONSE:
# Project duration at approval has generally increased over the years. The mean duration has risen
# from about 500 days in 2008 to upwards of 700 days in 2018. The median shows a similar trend,
# increasing from about 600 days in 2008 to 700 in 2018. The IQR of durations has remained largely stable,
# fluctuating close to 400 days but reaching a low of 300 in 2017.
# (b) How does original planned project duration differ from actual duration (if actual duration is
# measured as the duration between "ApprovalDate" and "RevisedCompletionDate")? Once again, use
# means, medians, and interquartile ranges to explain your results.
# Approximate suggested length: 3-5 sentences
# I did a year-by-year analysis, but this was not necessary.
# Are there rows with NAs in the Revised Completion Date column?
sum(is.na(df_withDates$RevisedCompletionDate)) # 0 -- no missing values
# Create an actual duration column
df_withDates$ActualDuration <- df_withDates$RevisedCompletionDate - df_withDates$ApprovalDate
mean(df_withDates$ActualDuration)
median(df_withDates$ActualDuration) / 30 # approximate duration in months
quantile(df_withDates$ActualDuration)
IQR(df_withDates$ActualDuration)
df_withDates$PlanningFallacy <- df_withDates$ActualDuration - df_withDates$ExpectedDuration
mean(df_withDates$PlanningFallacy)
median(df_withDates$PlanningFallacy)
quantile(df_withDates$PlanningFallacy)
IQR(df_withDates$PlanningFallacy)
range(df_withDates$PlanningFallacy)
DurationDifferenceByYear <- df_withDates %>%
  group_by(CirculationYear) %>%
  summarise(mean.duration = mean(PlanningFallacy),
            median.duration = median(PlanningFallacy),
            IQR.duration = quantile(PlanningFallacy, 0.75) - quantile(PlanningFallacy, 0.25))
DurationDifferenceByYear
plot(DurationDifferenceByYear$CirculationYear,
     DurationDifferenceByYear$mean.duration, pch=15, col="red", ylim=c(300, 700),
     xlab="Circulation Year", ylab="Difference (days)", main="Difference in Expected and Actual Duration")
points(DurationDifferenceByYear$CirculationYear,
       DurationDifferenceByYear$median.duration, pch=16, col="blue")
points(DurationDifferenceByYear$CirculationYear,
       DurationDifferenceByYear$IQR.duration, pch=17, col="green")
legend("bottomleft", pch=c(15,16,17),
       col=c("red","blue","green"),
       legend=c("Mean", "Median", "IQR"))
grid(nx=NA, ny=NULL)
# RESPONSE:
# Projects take, on average, 573 days longer than expected to complete.
# The median of the difference between actual and expected completion time is 485 days. The
# fact that the median is smaller than the mean shows that the data is right-skewed. The IQR
# of 517 days indicates that there is considerable spread in how much longer projects take than expected.
# The mean difference between expected and actual duration varies without a clear underlying pattern.
# Its value in both 2008 and 2018 is close to 600 days, reaching a low of 500 days in 2013. The median,
# however, seems to have decreased over the years, starting off close to 600 days in 2008 but sinking
# below 400 days in 2016 and 2018. Finally, there is a lot of variation in the IQR metric,
# which peaks close to 700 in 2015 and sinks to little above 400 in 2018.
# (2) What % of projects that have ratings were rated 0?
# What % were rated 1? What % were rated 2? What % were rated 3? Answer these questions using a table
# or a figure. Provide a title AND an explanatory sentence or two that provides the numerical % results
# rounded to the nearest percentage-point.
# The most straightforward way is...
print("Distribution of Project Ratings")
prop.table(table(df$Rating)) * 100
# 3% of projects received a rating of 0, 16% a rating of 1, 68% a rating of 2, and 13% a rating of 3.
# Hence, the majority of projects receive a rating of 2.
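# The same distribution as a figure (an optional alternative to the table above,
# using base graphics):
barplot(prop.table(table(df$Rating)) * 100,
        xlab = "Rating", ylab = "% of rated projects",
        main = "Distribution of Project Ratings")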
# (3) Repeat problem 2, but this time exclude all PPTA projects. PPTA projects are more prone to
# negative ratings, because after a certain point in time only the low-rated PPTA projects required
# ratings. PPTA stands for "Project Preparatory Technical Assistance" and it is basically a project
# intended to set up a loan (often a very large multi-million-dollar loan). Only PPTAs that fail to
# "eventuate" to a loan are rated, which is why they are usually rated negatively.
# Excluding PPTA projects. (Logical indexing is safer than df[-which(...), ]:
# if no rows matched, -which() would return an empty index and drop every row.)
df_noPPTA <- df[df$Type != "PPTA", ]
print("Distribution of Project Ratings (PPTA excluded)")
table(df_noPPTA$Rating) / length(df_noPPTA$Rating) * 100
# 2% of projects received a rating of 0, 14% a rating of 1, 70% a rating of 2, and 14% a rating of 3.
# We see that non-PPTA projects tend to receive slightly better ratings.
# (4) Identify the top 25% of projects by "RevisedAmount" and the bottom 25% of projects by
# "RevisedAmount". ("RevisedAmount" shows the final project budget.)
# Compare the ratings of these projects. Can you draw a causal conclusion about the effect of
# budget size on ratings? Why or why not?
# Hint: Compare the characteristics of the two project groupings,
# e.g., "Dept", "Division", "Cluster", "Country"
# Approximate suggested length: 3-5 sentences.
# There are several ways to do it. Here's a way:
df_TopQuant <- df[which(df$RevisedAmount >= quantile(df$RevisedAmount, 0.75)), ]
df_BottomQuant <- df[which(df$RevisedAmount <= quantile(df$RevisedAmount, 0.25)), ]
# Compare the ratings
(table(df_TopQuant$Rating) / length(df_TopQuant$Rating) * 100) -
  (table(df_BottomQuant$Rating) / length(df_BottomQuant$Rating) * 100)
# There doesn't seem to be any major difference
# Compare other features
# Note: with many categories, percentage differences can be misleading -- a single
# project of a rare type can swing its category by a large percentage -- so
# interpret the differences below with the underlying counts in mind.
# Dept
(table(df_TopQuant$Dept) / length(df_TopQuant$Dept) * 100) -
  (table(df_BottomQuant$Dept) / length(df_BottomQuant$Dept) * 100)
# More projects in the top quantile are within the SERD or SDCC, whereas
# more projects in the bottom quantile are in the EARD and SARD departments.
# Cluster
(table(df_TopQuant$Cluster) / length(df_TopQuant$Cluster) * 100) -
  (table(df_BottomQuant$Cluster) / length(df_BottomQuant$Cluster) * 100)
# No significant differences
# Country
(table(df_TopQuant$Country) / length(df_TopQuant$Country) * 100) -
  (table(df_BottomQuant$Country) / length(df_BottomQuant$Country) * 100)
# More projects in the top quantile are listed under REG (regional), whereas more of those
# in the bottom quantile are in the PRC.
# Hence, budget seems to be uncorrelated with the rating. However,
# it might be associated with some other project characteristics, such as country
# or department. Nevertheless, we would not be able to draw causal conclusions given that
# the data is observational and neither observable nor unobservable characteristics
# are balanced across the "treatment" and control groups.
# (5) Imagine your manager asks you to apply Jeremy Howard's drivetrain model to the
# problem of optimal budget-setting to maximize project success (i.e., "Rating").
# In such a situation, what would be the:
# (a) decision problem or objective?
# R: Maximize project rating/success
# (b) lever or levers?
# R: Budget allocation (we may want to consider levers which describe how the budget is spent as well)
# (c) ideal RCT design?
# R: Allocate budget randomly, which will in expectation render projects with more and less
# budget comparable because their other characteristics will overlap, and then measure
# the difference in ratings.
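# A toy sketch of what estimating the effect from such an RCT could look like.
# The data below is simulated and the effect size is made up -- this is purely
# illustrative, not part of the assignment data.
set.seed(42)
n <- 1000
budget <- sample(c("low", "high"), n, replace = TRUE) # the randomized lever
# hypothetical outcome: ratings 0-3, with a slightly higher success probability
# in the high-budget arm
rating <- rbinom(n, size = 3, prob = ifelse(budget == "high", 0.60, 0.55))
# randomization makes this simple difference in means an unbiased effect estimate
mean(rating[budget == "high"]) - mean(rating[budget == "low"])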
# (d) dependent variable(s) and independent variable(s) in the model?
# R: The independent variable will be the budget allocated, whereas the dependent variable
# will be the rating.
# (e) And---Why would running RCTs and modeling/optimizing over RCT results be preferable
# to using (observational, non-RCT) "foo" data?
# R: Because in observational data, projects with higher and lower budget allocations
# are not guaranteed to be comparable in expectation. For example, such projects may be deployed
# in different regions of the world, which might affect the difference in ratings. Hence,
# we cannot infer that the difference in ratings is due to the difference in budget, given that
# there are confounding variables.