@chrrrisw
Last active April 10, 2017 04:25
numpy and scipy treat bytes and bytearrays differently

Open a Python interpreter and import numpy:

>>> import numpy as np

Now create a bytes object and create a numpy array from it:

>>> my_bytes = bytes([0, 1, 127, 255])
>>> my_np_bytes = np.array(my_bytes)
>>> my_np_bytes.size
1
>>> my_np_bytes.shape
()
>>> my_np_bytes.dtype
dtype('S4')

As we can see, the bytes object produces a zero-dimensional ndarray with a single element: one 4-byte string (dtype S4).

Now let's repeat for bytearray:

>>> my_bytearray = bytearray([0, 1, 127, 255])
>>> my_np_bytearray = np.array(my_bytearray)
>>> my_np_bytearray.size
4
>>> my_np_bytearray.shape
(4,)
>>> my_np_bytearray.dtype
dtype('uint8')

This is what we expect: a one-dimensional ndarray of four uint8 values.
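
If the goal is a uint8 array from a bytes object, one option is np.frombuffer, which reads the buffer as raw data rather than as a single string. The following is a minimal sketch reusing the values above; the names as_string and as_uint8 are only for illustration.

```python
import numpy as np

my_bytes = bytes([0, 1, 127, 255])

# np.array() treats a bytes object as one byte string (a 0-d array).
as_string = np.array(my_bytes)

# np.frombuffer() reads the same buffer as four unsigned 8-bit integers.
as_uint8 = np.frombuffer(my_bytes, dtype=np.uint8)

print(as_string.shape, as_string.dtype)
print(as_uint8.shape, as_uint8.dtype)
```

The array returned by np.frombuffer shares memory with the immutable bytes object, so it is read-only. Call .copy() on it if a writable array is needed.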

Why does this matter? Consider the following example, where we round-trip a dictionary through scipy's savemat and loadmat:

>>> import numpy as np
>>> import io
>>> import scipy.io
>>> a_int = 3
>>> a_bytes = bytes([0, 1, 127])
>>> a_bytearray = bytearray([0, 1, 127, 255])
>>> bio = io.BytesIO()
>>> my_dict = {'VAR': {'a_int': a_int, 'a_bytes': a_bytes, 'a_bytearray': a_bytearray}}
>>> scipy.io.savemat(bio, my_dict, long_field_names=True)
>>> data_out = scipy.io.loadmat(bio, struct_as_record=False, squeeze_me=True)
>>> recovered_data = {k: getattr(data_out['VAR'], k) for k in data_out['VAR']._fieldnames}
>>> recovered_data
{'a_bytes': '\x00\x01\x7f', 'a_int': 3, 'a_bytearray': array([  0,   1, 127, 255], dtype=uint8)}

So far, all good. Notice, however, that I dropped the 255 value from the bytes object. That is because 255 (0xff) is not a valid ASCII character, and it can never appear in valid UTF-8 either. If instead we have:

>>> a_bytes = bytes([0, 1, 127, 255])

We get the following when we call loadmat():

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.4/dist-packages/scipy/io/matlab/mio.py", line 132, in loadmat
    matfile_dict = MR.get_variables(variable_names)
  File "/usr/local/lib/python3.4/dist-packages/scipy/io/matlab/mio5.py", line 292, in get_variables
    res = self.read_var_array(hdr, process)
  File "/usr/local/lib/python3.4/dist-packages/scipy/io/matlab/mio5.py", line 252, in read_var_array
    return self._matrix_reader.array_from_header(header, process)
  File "mio5_utils.pyx", line 625, in scipy.io.matlab.mio5_utils.VarReader5.array_from_header (scipy/io/matlab/mio5_utils.c:5993)
  File "mio5_utils.pyx", line 673, in scipy.io.matlab.mio5_utils.VarReader5.array_from_header (scipy/io/matlab/mio5_utils.c:5585)
  File "mio5_utils.pyx", line 931, in scipy.io.matlab.mio5_utils.VarReader5.read_struct (scipy/io/matlab/mio5_utils.c:8694)
  File "mio5_utils.pyx", line 623, in scipy.io.matlab.mio5_utils.VarReader5.read_mi_matrix (scipy/io/matlab/mio5_utils.c:5184)
  File "mio5_utils.pyx", line 667, in scipy.io.matlab.mio5_utils.VarReader5.array_from_header (scipy/io/matlab/mio5_utils.c:5503)
  File "mio5_utils.pyx", line 824, in scipy.io.matlab.mio5_utils.VarReader5.read_char (scipy/io/matlab/mio5_utils.c:7306)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 3: invalid start byte
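
The last line of the traceback can be reproduced without scipy at all. The sketch below decodes the two bytes objects directly with the standard library; it assumes nothing about scipy's internals beyond the codec named in the error message, and the variable names are only for illustration.

```python
ascii_only = bytes([0, 1, 127])
with_255 = bytes([0, 1, 127, 255])

# Every byte below 128 is a valid single-byte UTF-8 character.
decoded = ascii_only.decode("utf-8")

# 0xff can never appear in valid UTF-8, so this decode raises.
try:
    with_255.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(repr(decoded))
print(error_message)
```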

So, although 255 is a valid value in a bytes object, scipy stores the bytes object as a character array, and loadmat then tries to decode it as a UTF-8 string and fails.
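
One possible workaround, sketched here as an assumption and not as anything scipy provides, is to convert every bytes object to a uint8 array before saving, so that it takes the same path as the bytearray did. The helper bytes_to_uint8 is hypothetical and only handles nested dictionaries.

```python
import numpy as np


def bytes_to_uint8(value):
    """Recursively replace bytes objects with uint8 arrays (hypothetical helper)."""
    if isinstance(value, bytes):
        # Interpret the raw buffer as unsigned 8-bit integers.
        return np.frombuffer(value, dtype=np.uint8)
    if isinstance(value, dict):
        return {key: bytes_to_uint8(item) for key, item in value.items()}
    # Anything else (ints, bytearrays, arrays) is passed through unchanged.
    return value


my_dict = {'VAR': {'a_int': 3,
                   'a_bytes': bytes([0, 1, 127, 255]),
                   'a_bytearray': bytearray([0, 1, 127, 255])}}

safe_dict = bytes_to_uint8(my_dict)
print(safe_dict['VAR']['a_bytes'])
```

Passing safe_dict to scipy.io.savemat() in place of my_dict should then avoid the string decoding step on load, at the cost of a_bytes coming back as a uint8 array instead of a string.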
