
Here's an interesting (at least to me) dataset for practicing regression analysis.

The data

First I installed ImageMagick (see the ImageMagick documentation for installation instructions for your OS). Then, using ImageMagick's convert utility, I generated 57 PNG files filled with random noise. The file dimensions run from 512px x 512px to 4096px x 4096px, incrementing by 64px in each direction.

mkdir -p ./data/RandomNoise
cd ./data/RandomNoise && \
for i in `seq 512 64 4096`; do
  # write an ${i}x${i} PNG of random noise
  convert -size ${i}x${i} xc: +noise Random random-${i}_$i.png
done

Next, in R, I installed the magick package. I also had the usual Hadleyverse packages installed: dplyr, tidyr, stringr, purrr, pryr, etc.

Then I benchmarked magick::image_read() performance over these 57 files, repeating the whole run 10 times. You can see the complete R code in magick_image_read_perf.R.
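In outline, the benchmark might look something like the minimal sketch below. This is an assumption about how the original script is structured (the real code is in magick_image_read_perf.R); the column names match the dataset description that follows.

library(magick)
library(pryr)
library(purrr)

files <- list.files("./data/RandomNoise", pattern = "\\.png$", full.names = TRUE)

# Read one file once and record size, memory, and timing measurements
read_one <- function(f, iter) {
  mem_before <- pryr::mem_used()
  timing <- system.time(img <- magick::image_read(f))
  data.frame(
    reading_function = "magick::image_read",
    Iteration        = iter,
    File.Size        = file.size(f),
    Object.Size      = as.numeric(pryr::object_size(img)),
    Mem.Increase     = as.numeric(pryr::mem_used() - mem_before),
    User.Time        = timing[["user.self"]],
    System.Time      = timing[["sys.self"]],
    Elapsed.Time     = timing[["elapsed"]]
  )
}

# 10 iterations x 57 files = 570 rows
perf <- purrr::map_df(1:10, function(iter) purrr::map_df(files, read_one, iter = iter))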

This gave me a 570-row (57 files x 10 iterations) dataset with the columns described below.

  • reading_function: constant value 'magick::image_read'
  • Iteration: iteration number, 1 to 10
  • File.Size: size of the input file in bytes
  • Object.Size: size of the read object in R's memory, in bytes
  • Mem.Increase: increase in R's memory usage after the image was read
  • User.Time, System.Time, Elapsed.Time: times in seconds obtained from system.time(magick::image_read())

You can see the full dataset in magick_image_read.perf.csv.

Next I took the mean and the median of every measurement over the 10 iterations for each of the 57 files; see magick_image_read.perf.mean.csv and magick_image_read.perf.median.csv. Both are 57-row datasets with one row per file.
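With dplyr this aggregation is short. A sketch, assuming File.Size uniquely identifies each of the 57 files:

library(dplyr)
library(readr)

perf <- read_csv("magick_image_read.perf.csv")

# Collapse the 10 iterations per file into one mean (or median) row per file
perf_mean <- perf %>%
  group_by(File.Size) %>%
  summarise_at(vars(Object.Size, Mem.Increase, User.Time, System.Time, Elapsed.Time), mean)

perf_median <- perf %>%
  group_by(File.Size) %>%
  summarise_at(vars(Object.Size, Mem.Increase, User.Time, System.Time, Elapsed.Time), median)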

The problem

The challenge is to predict the time taken to process a file, using the file size and any other variables you find necessary.

You can take User.Time or Elapsed.Time as your dependent variable, and File.Size, Object.Size, Mem.Increase, and System.Time as your predictors. You can work with the raw data (570 obs.) or the mean/median datasets (57 obs. each).

Note that the following combinations will exhibit strong collinearity (the sketch after the list shows one way to check):

  • User.Time & Elapsed.Time
  • System.Time & Elapsed.Time
  • File.Size & Object.Size & Mem.Increase
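A sketch of how to check this yourself, reusing the hypothetical perf_mean data frame from above:

# Pairwise correlations among the candidate variables
round(cor(perf_mean[, c("File.Size", "Object.Size", "Mem.Increase",
                        "User.Time", "System.Time", "Elapsed.Time")]), 2)

# Variance inflation factors for a candidate model (requires the car package)
library(car)
vif(lm(User.Time ~ File.Size + Object.Size + Mem.Increase, data = perf_mean))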

If you are going to do linear regression, e.g. lm(User.Time ~ File.Size), make sure to check the validity of the regression model using residual analysis, not just the model's goodness-of-fit statistics.
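Base R covers the standard diagnostics; a minimal sketch, again assuming the perf_mean data frame:

fit <- lm(User.Time ~ File.Size, data = perf_mean)
summary(fit)       # goodness of fit: coefficients, R-squared, F-test

par(mfrow = c(2, 2))
plot(fit)          # residuals vs fitted, normal Q-Q, scale-location, leverage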

You can also try regression trees, with or without bagging/boosting, as well as random forest models. Feel free to try any model you like.
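For instance, minimal sketches using the rpart and randomForest packages (assuming they are installed):

library(rpart)
library(randomForest)

# Single regression tree
tree_fit <- rpart(User.Time ~ File.Size + Object.Size + Mem.Increase, data = perf_mean)

# Random forest (bagged trees with random feature subsets)
rf_fit <- randomForest(User.Time ~ File.Size + Object.Size + Mem.Increase, data = perf_mean)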

All the best!
