@ogrisel
Last active December 22, 2015 03:59
import os

import numpy as np
import bloscpack as bp
# Note: fetch_mldata was removed from later scikit-learn releases
# (fetch_openml is the replacement there).
from sklearn.datasets import fetch_mldata

# Load MNIST and scale the pixel values to float32 in [0, 1].
mnist = fetch_mldata('MNIST original')
data = mnist.data.astype(np.float32) / 255.
print("dtype: {}".format(data.dtype))
print("shape: {}".format(data.shape))
print(data.flags)

# Pack without the shuffle filter and report the file size.
filename_noshuffle = '/tmp/mnist_noshuffle.blp'
bp.pack_ndarray_file(data, filename_noshuffle,
                     blosc_args={'typesize': 4, 'shuffle': False, 'clevel': 7})
print("Size without shuffle: %0.3fMB"
      % (os.stat(filename_noshuffle).st_size / 1e6))

# Pack with the shuffle filter enabled, same typesize and compression level.
filename_shuffle = '/tmp/mnist_shuffle.blp'
bp.pack_ndarray_file(data, filename_shuffle,
                     blosc_args={'typesize': 4, 'shuffle': True, 'clevel': 7})
print("Size with shuffle: %0.3fMB"
      % (os.stat(filename_shuffle).st_size / 1e6))

ogrisel commented Sep 2, 2013

MNIST is a dataset of 28 x 28 gray-level pictures of handwritten digits. Many values are 0 (black background pixels).

Here is the output of this script:

dtype: float32
shape: (70000, 784)
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False
Size without shuffle: 31.769MB
Size with shuffle: 57.185MB

This means that shuffling is actually detrimental on this data: the shuffled file is about 1.8 times larger (57.185MB versus 31.769MB). How is this possible?

Changing the value of typesize from 4 to 8 does not change the size of the blp files.
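
One way to investigate the question without bloscpack is to reproduce a byte shuffle by hand and compare compressed sizes on a small array. The following is a minimal sketch under several assumptions: `byte_shuffle`, `byte_unshuffle` and `compressed_sizes` are hypothetical helpers written for this illustration, the data is a synthetic stand-in for MNIST (mostly zeros, a few gray levels), and `zlib` from the standard library replaces the Blosc codec. Blosc applies its shuffle per block and uses its own compressor, so the sizes printed here are not expected to match the .blp sizes above.

```python
import zlib

import numpy as np


def byte_shuffle(arr):
    """Group the k-th byte of every element together (k = 0 .. itemsize - 1)."""
    itemsize = arr.dtype.itemsize
    raw = np.frombuffer(arr.tobytes(), dtype=np.uint8)
    # Rows are elements, columns are byte positions; transposing puts all
    # first bytes first, then all second bytes, and so on.
    return raw.reshape(-1, itemsize).T.copy()


def byte_unshuffle(shuffled, dtype):
    """Invert byte_shuffle and return a 1-D array of the given dtype."""
    return np.frombuffer(shuffled.T.tobytes(), dtype=dtype)


def compressed_sizes(arr, level=7):
    """Return (plain, shuffled) zlib-compressed sizes in bytes."""
    plain = len(zlib.compress(arr.tobytes(), level))
    shuffled = len(zlib.compress(byte_shuffle(arr).tobytes(), level))
    return plain, shuffled


# Synthetic stand-in for MNIST (an assumption, not the real dataset):
# gray levels in [0, 1] with roughly 80% of the values forced to zero.
rng = np.random.RandomState(0)
pixels = rng.randint(0, 256, size=100000).astype(np.float32) / 255.
pixels[rng.rand(pixels.size) < 0.8] = 0.

plain, shuffled = compressed_sizes(pixels)
print("zlib without shuffle: %d bytes" % plain)
print("zlib with shuffle:    %d bytes" % shuffled)
```

Running the same comparison on a slice of the real `data` array, and with other dtypes such as `uint8`, may help tell whether the effect comes from the data itself or from how the shuffle is applied.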
