CI engineers observed an issue when attempting to retrieve data from a PREST raw data product. The dataset 8f13001a08a14162abfcc0288840f491 comprises a ViewCoverage, a ComplexCoverage and several SimplexCoverages. Each SimplexCoverage comprises several HDF5 files, and each file contains a time-series dataset for a specific parameter. The ComplexCoverage combines the SimplexCoverages and provides a seamless API that aggregates the data over a universal time axis. The ViewCoverage provides a filtered view of the data; in this case all of the data is presented, so it is a transparent API that delegates all calls to the ComplexCoverage. When the dataset was queried through PyDAP an exception was raised and logged. Initial investigation led to the conclusion that the data file for the data product was corrupted.
The corrupted data file was an HDF5 file belonging to the ComplexCoverage. The file contained only meta-data: how the SimplexCoverages were arranged on the file-system, the extents of their data, and meta-data about the dataset itself. Each piece of meta-data was stored as a variable-length string attribute in the HDF5 file. The specific error presented to users was:
File "./ion/util/pydap/handlers/coverage/coverage_handler.py", line 32, in wrapper
return func(*args, **kwargs)
File "./ion/util/pydap/handlers/coverage/coverage_handler.py", line 309, in parse_constraints
coverage = self.get_coverage(base[0], base[1])
File "./ion/util/pydap/handlers/coverage/coverage_handler.py", line 100, in get_coverage
result = AbstractCoverage.load(root_path, dataset_id,mode='r')
File "./extern/coverage-model/coverage_model/coverage.py", line 140, in load
return ccls(root_dir, persistence_guid, mode=mode)
File "./extern/coverage-model/coverage_model/coverage.py", line 896, in __init__
_doload(self)
File "./extern/coverage-model/coverage_model/coverage.py", line 885, in _doload
self.reference_coverage = AbstractCoverage.load(self._persistence_layer.rcov_loc, mode='r')
File "./extern/coverage-model/coverage_model/coverage.py", line 129, in load
ctype = get_coverage_type(os.path.join(root_dir, persistence_guid, '{0}_master.hdf5'.format(persistence_guid)))
File "./extern/coverage-model/coverage_model/persistence_helpers.py", line 33, in get_coverage_type
ctype = unpack(f.attrs['coverage_type'])
File "./eggs/h5py-2.1.1a2-py2.7-linux-x86_64.egg/h5py/_hl/attrs.py", line 43, in __getitem__
attr.read(arr)
File "h5a.pyx", line 357, in h5py.h5a.AttrID.read (h5py/h5a.c:4125)
File "_proxy.pyx", line 61, in h5py._proxy.attr_rw (h5py/_proxy.c:855)
IOError: unable to read attribute (Attribute: Read failed)
The file was corrupted to the point where the HDF5 library could not read it correctly or parse enough information to continue operating on it normally. The attributes in the HDF5 data file referenced a location that lay beyond the file's end-of-file offset. In some versions of HDF5 this causes a segmentation fault (SIGSEGV) and the process is killed; in other versions the API reports an error that the file is unreadable.
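A minimal sketch of how the failure surfaces through h5py: opening the file still succeeds, and the error is only raised once the attribute value is dereferenced. The file name below is illustrative.

import h5py

# Illustrative file name; the real path is resolved by the coverage model.
master = "8f13001a08a14162abfcc0288840f491_master.hdf5"

with h5py.File(master, "r") as f:          # opening the corrupted file still succeeds
    try:
        ctype = f.attrs["coverage_type"]   # reading the value dereferences the heap
                                           # pointer and raises the IOError shown above
    except (IOError, OSError) as err:
        print("unable to read attribute:", err)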
Variable-length string attributes have four parts that compose the entire message in the file. The first part is the message header, which consists of an identifier (0x0C, the attribute message type) and the size of the message in the object header stack. The second part is the datatype message, which identifies that the attribute contains a string and records the character encoding used. The third part identifies the data as a variable-length array with no dimensionality. The fourth part is a pointer to the global heap section of the file where the variable-length value resides, together with the length of the string. This particular set of attributes pointed to an offset that did not exist within the file but is where a new global heap would be created and addressed during an attribute modification.
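As a concrete illustration of the fourth part, the sketch below decodes the raw bytes of such a pointer as I read the HDF5 file format: a 4-byte length followed by a global heap ID, i.e. the heap collection address plus a 4-byte object index, assuming the usual 8-byte file offsets. The function name is mine, not part of any library.

import struct

def decode_vlen_string_value(raw):
    """Decode the 16-byte on-disk form of a variable-length string attribute
    value (assumes a little-endian file with 8-byte offsets)."""
    length, heap_addr, heap_index = struct.unpack("<IQI", raw[:16])
    return {
        "length": length,            # length of the string in bytes
        "heap_address": heap_addr,   # file offset of the global heap collection
        "heap_object_index": heap_index,
    }

# A heap_address at or beyond the file's end-of-file offset is exactly the
# condition described above: a pointer into a heap that was never written.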
With the file driver we currently use, all changes to the file are persisted when the data is flushed or when the file is closed. In both cases the following write system calls are made:
- write 96 bytes to the superblock, usually at the beginning of the file (offset 0x0). This write is important: it updates the superblock and the end-of-file address.
- write 40 bytes to the symbol table cache; this is usually a no-op because no new datasets were created.
- write 4096 bytes to the end of the file, creating a new global heap.
- write 120 bytes to update the attribute pointers so they point to the new global heap where the new values now reside.
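A minimal h5py sketch of the operation that triggers this write sequence; the file name and attribute value are illustrative, and the comments simply restate the sequence described above.

import h5py

str_vlen = h5py.special_dtype(vlen=str)       # variable-length string dtype

with h5py.File("master.hdf5", "a") as f:      # illustrative file name
    f.attrs.create("coverage_type", "complex", dtype=str_vlen)
    f.flush()                                 # dirty metadata is written out here:
                                              # superblock, symbol table cache, a new
                                              # global heap and the attribute pointers
# closing the file performs the same flush implicitly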
HDF5 relies on the POSIX semantics of the write system call, which state that it returns the number of bytes successfully written. HDF5 iterates over the block of data and calls write until all of the data is written or an error is returned. The code (edited for brevity) is as follows:
while(size > 0) {
    ssize_t bytes_in = 0;     /* # of bytes to write */
    ssize_t bytes_wrote = -1; /* # of bytes written  */

    bytes_in = (ssize_t)size; /* write the remaining bytes; the real code caps
                                 this at a maximum I/O size */
    do {
        bytes_wrote = write(file->fd, buf, bytes_in);
    } while(-1 == bytes_wrote && EINTR == errno);

    if(-1 == bytes_wrote) { /* error */
        ...
    }
    HDassert(bytes_wrote > 0);
    HDassert((size_t)bytes_wrote <= size);

    size -= (size_t)bytes_wrote;
    addr += (haddr_t)bytes_wrote;
    buf = (const char *)buf + bytes_wrote;
} /* end while */
The state of the corrupted file is indicative of an I/O problem outside the scope of the software itself. Note that the retry loop above only protects against write calls that report an error or a short write; if write reports success but the data never reaches the file, HDF5 has no way to detect it. In the sequence of four write calls, if any of them had reported a failure its successors would not have executed, yet the file reflected the following state:
- [✗] 96 bytes - The superblock was NOT updated, but for HDF5 to continue to the next step write must have reported success (returning 96); the file does not reflect the updated state.
- [?] 40 bytes - There is no way of knowing whether this write succeeded, since the result is identical to the previous state.
- [✗] 4096 bytes - The end of the file was never extended and the new global heap was never added.
- [√] 120 bytes - This write succeeded: the pointers were updated to a section of the file that did not exist yet.
If a write had failed outright, or the process had been interrupted before the last write, the file would still have been usable; this specific pattern of failure, however, left the file in a state the HDF5 library could not parse.
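In the HDF5 versions that raise an error rather than segfault, the symptom can at least be detected without hand-inspecting the file: listing attribute names only walks the object header, while reading a value dereferences the heap pointer. A hedged sketch (the function name is mine):

import h5py

def find_unreadable_attrs(path):
    """Return the names of root-group attributes whose values cannot be read,
    the observable symptom of heap pointers that reference space past the
    end of the file."""
    bad = []
    with h5py.File(path, "r") as f:
        for name in f.attrs.keys():   # listing names only walks the object header
            try:
                f.attrs[name]         # reading the value dereferences the heap pointer
            except (IOError, OSError):
                bad.append(name)
    return bad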
Some speculative causes:
- Hardware Failure
- NFS Issue/Bug
- Operating system / file-system driver bug (ext3, or whichever file system is in use)
- Race condition where two processes attempted to write to the same file, updating similar information
In this specific case I was able to fix the corruption by changing the attributes' pointers to the most recent global heap offset that exists in the file. The last set of attribute modifications is not reflected in the file, but the file is now usable.
I do not believe there is a programmatic way to identify this level of corruption in a file and repair it, at least not within a reasonable amount of time. I have reverse-engineered a very low-level HDF5 file reader as a pure Python module. The module decomposes the file into its basic building blocks and could assist engineers in identifying the part of the file where the corruption exists. For now, fixing a file (if it is recoverable at all) requires an engineer's attention and knowledge.
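To give a flavour of what such a module does, here is a minimal, self-contained sketch that reads the end-of-file address recorded in the superblock, assuming a version-0 superblock at offset 0 with 8-byte addresses (the common default). The field offsets follow my reading of the HDF5 file format specification and are not taken from the module itself.

import os
import struct

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

def read_superblock_eof(path):
    """Return the end-of-file address recorded in a version-0 superblock."""
    with open(path, "rb") as fh:
        block = fh.read(48)
    if block[:8] != HDF5_SIGNATURE:
        raise ValueError("no HDF5 superblock at offset 0")
    # The signature is followed by 8 bytes of version/size fields and 8 bytes of
    # group/consistency fields; the base address, free-space address and
    # end-of-file address (8 bytes each) start at offset 24.
    base_addr, free_addr, eof_addr = struct.unpack_from("<QQQ", block, 24)
    return eof_addr

# A recorded end-of-file address that disagrees with os.path.getsize(path), or
# attribute heap pointers beyond it, are the kind of inconsistency to look for.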
The overhead introduced to reads and writes of HDF5 files that use variable-length string attributes is egregious. Each change to a subset of attributes results in a new global heap being allocated for the file, which costs 4096 bytes at a minimum. In short, if you open the file and change one character of one string, the file grows by 4096 bytes. By using fixed-length strings I was able to reduce overall file sizes, reduce the block sizes of data written to disk, and reduce the probability of the corruption we identified.
- Fixed-length strings do not store their data in a global heap, so there is no 4 KiB overhead to using them; the overhead is 10-20 bytes.
- If a write fails partway through, the corruption is minimized: there are no pointers that could end up pointing at unallocated space, because the attribute values reside in a contiguous block with the attributes themselves.
- If the changed string is longer than the original, a new attribute is created at the end of the object header stack and the original is changed to a NIL message (deleted). If a write failed at any point you would end up with a complete, usable file that either had a duplicate attribute with two different values or had the old value with the new value missing.
- To synchronize the in-memory HDF5 file with the persisted data file, fewer system calls are made: three instead of four, which reduces the probability of corruption.
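A hedged sketch of how the size difference could be measured with h5py; the file names are throwaway, and the exact growth depends on the HDF5 version and driver, but per the behaviour described above the variable-length file is expected to end up markedly larger.

import os
import h5py
import numpy as np

def rewrite_attr_repeatedly(path, dtype, times=10):
    """Create one attribute, rewrite it `times` times, and return the file size."""
    with h5py.File(path, "w") as f:
        f.attrs.create("coverage_type", "complex", dtype=dtype)
    for i in range(times):
        with h5py.File(path, "a") as f:
            f.attrs.modify("coverage_type", "complex_%d" % i)
    return os.path.getsize(path)

vlen_size  = rewrite_attr_repeatedly("vlen_demo.hdf5",  h5py.special_dtype(vlen=str))
fixed_size = rewrite_attr_repeatedly("fixed_demo.hdf5", np.dtype("S64"))
print(vlen_size, fixed_size)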
In the event of future corruption, there may be a way to programmatically reconstruct the file from system resources, similar to how CoverageDoctor works. This would be the preferred method of fixing corruption, in lieu of low-level HDF5 file modification, which is time consuming, prone to errors and requires human attention.