Day 3 of the Match 5 Kaggle Benchmarks in 5 Days challenge
In the Bike Sharing Demand competition on Kaggle, the goal is to predict bike share demand in Washington, D.C. from historical usage data. This is a regression problem, and the evaluation metric is RMSLE (Root Mean Squared Logarithmic Error).
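For reference, here is the metric written out as I understand Kaggle's definition for this competition. The symbols are my own labels: n is the number of rows being scored, p_i is the predicted count, a_i is the actual count, and log is the natural logarithm.

```latex
\mathrm{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \bigl( \log(p_i + 1) - \log(a_i + 1) \bigr)^2 }
```

Because the errors are measured on log(count + 1), being off by a fixed number of bikes costs more when the actual count is small than when it is large.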
I decided to recreate the mean value benchmark using Unix command-line tools. The benchmark predicts the overall mean of the training set counts for every test set datetime (i.e. the same single value for every predicted count).
I used the csvkit suite of tools along with sed to recreate the benchmark. This was my first time using csvkit and I'm happy so far!
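# Calculate the mean of the training set counts (count is column 12 of train.csv)
MEAN=$(csvcut -c 12 train.csv | csvstat --mean)
# Write the test set datetime stamps and the mean value to csv:
# line 1 (the header) gets ",count" appended, then "1n" prints it and moves on,
# so only the remaining data lines get ",$MEAN" appended
csvcut -c 1 test.csv | sed -e "1 s/$/,count/; 1n; s/$/,$MEAN/" > mean-benchmark.csv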
This scores an RMSLE of 1.58456 on the public leaderboard.
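A rough local sanity check for a constant prediction can be sketched with plain awk. This is only a sketch under some assumptions: rmsle_constant is a hypothetical helper name, sample-train.csv is a tiny made-up file (not competition data), and the CSV is assumed to have a header row, the count in the last column, and no quoted commas.

```shell
#!/bin/sh
# Hypothetical helper: RMSLE of one constant prediction against the
# last column of a CSV file.
# Assumes: a header row, numeric target in the last column, no quoted commas.
rmsle_constant() {
  # $1 = constant prediction, $2 = CSV file
  awk -F, -v p="$1" '
    NR > 1 {                              # skip the header line
      d = log(p + 1) - log($NF + 1)       # log error for this row
      s += d * d                          # running sum of squared log errors
      n++
    }
    END { if (n > 0) printf "%.5f\n", sqrt(s / n) }
  ' "$2"
}

# Tiny made-up sample, just to show the call (not real competition data)
cat > sample-train.csv <<'EOF'
datetime,count
2011-01-01 00:00:00,3
2011-01-01 01:00:00,3
EOF

rmsle_constant 3 sample-train.csv
```

With the real files, something like rmsle_constant "$MEAN" train.csv would give the RMSLE of the mean on the training set itself. That is not the leaderboard score, since the leaderboard is computed on test set counts that aren't public, so treat it as a rough check only.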