CI engineers observed an issue when attempting to retrieve data from a PREST raw data product. The dataset 8f13001a08a14162abfcc0288840f491 comprises a ViewCoverage, a ComplexCoverage and several SimplexCoverages. Each SimplexCoverage comprises several HDF5 files, and each file contains a time-series dataset for a specific parameter. The ComplexCoverage combines the SimplexCoverages and provides a seamless API that aggregates the data over a universal time axis. The ViewCoverage provides a filtered view of the data; in this case all of the data is presented, so it is a transparent API that delegates all calls to the ComplexCoverage. When the dataset was queried through PyDAP an exception was raised and logged. Initial investigation led to the conclusion that the data file for the data product was corrupted.
The corrupted data file was an HDF5 file belonging to the ComplexCoverage. The file contained only meta-data: how the SimplexCoverages were arranged on the file-system, the extents of their data, and meta-data about the dataset itself. Each piece of meta-data was stored as a variable-length string attribute in the HDF5 file. The specific error presented to users was:
File "./ion/util/pydap/handlers/coverage/coverage_handler.py", line 32, in wrapper
return func(*args, **kwargs)
File "./ion/util/pydap/handlers/coverage/coverage_handler.py", line 309, in parse_constraints
coverage = self.get_coverage(base[0], base[1])
File "./ion/util/pydap/handlers/coverage/coverage_handler.py", line 100, in get_coverage
result = AbstractCoverage.load(root_path, dataset_id,mode='r')
File "./extern/coverage-model/coverage_model/coverage.py", line 140, in load
return ccls(root_dir, persistence_guid, mode=mode)
File "./extern/coverage-model/coverage_model/coverage.py", line 896, in __init__
_doload(self)
File "./extern/coverage-model/coverage_model/coverage.py", line 885, in _doload
self.reference_coverage = AbstractCoverage.load(self._persistence_layer.rcov_loc, mode='r')
File "./extern/coverage-model/coverage_model/coverage.py", line 129, in load
ctype = get_coverage_type(os.path.join(root_dir, persistence_guid, '{0}_master.hdf5'.format(persistence_guid)))
File "./extern/coverage-model/coverage_model/persistence_helpers.py", line 33, in get_coverage_type
ctype = unpack(f.attrs['coverage_type'])
File "./eggs/h5py-2.1.1a2-py2.7-linux-x86_64.egg/h5py/_hl/attrs.py", line 43, in __getitem__
attr.read(arr)
File "h5a.pyx", line 357, in h5py.h5a.AttrID.read (h5py/h5a.c:4125)
File "_proxy.pyx", line 61, in h5py._proxy.attr_rw (h5py/_proxy.c:855)
IOError: unable to read attribute (Attribute: Read failed)
The file was corrupted to the point where the HDF5 library could not read it correctly or parse enough information to continue operating on it normally. The attributes in the HDF5 data file referenced a location that lay beyond the file's end-of-file offset. In some versions of HDF5 this causes a segmentation fault (SIGSEGV) and the process is killed; in other versions the API reports an error that the file is unreadable.
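A minimal sketch of how the failure surfaces through h5py: opening the file still succeeds, and the error is only raised once the attribute value is dereferenced. The file name below is illustrative.

import h5py

# Illustrative file name; the real path is resolved by the coverage model.
master = "8f13001a08a14162abfcc0288840f491_master.hdf5"

with h5py.File(master, "r") as f:          # opening the corrupted file still succeeds
    try:
        ctype = f.attrs["coverage_type"]   # reading the value dereferences the heap
                                           # pointer and raises the IOError shown above
    except (IOError, OSError) as err:
        print("unable to read attribute:", err)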
Variable-length string attributes have four parts that compose the entire message in the file. The first part is the message header, which consists of an identifier (0x0C, the attribute message type) and the size of the message in the object header stack. The second part is the datatype message, which identifies that the attribute contains a string and records the character encoding used. The third part identifies the data as a variable-length array with no dimensionality. The fourth part is a pointer to the global heap section of the file where the variable-length value resides, together with the length of the string. This particular set of attributes pointed to an offset that did not exist within the file but is where a new global heap would be created and addressed during an attribute modification.
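As a concrete illustration of the fourth part, the sketch below decodes the raw bytes of such a pointer as I read the HDF5 file format: a 4-byte length followed by a global heap ID, i.e. the heap collection address plus a 4-byte object index, assuming the usual 8-byte file offsets. The function name is mine, not part of any library.

import struct

def decode_vlen_string_value(raw):
    """Decode the 16-byte on-disk form of a variable-length string attribute
    value (assumes a little-endian file with 8-byte offsets)."""
    length, heap_addr, heap_index = struct.unpack("<IQI", raw[:16])
    return {
        "length": length,            # length of the string in bytes
        "heap_address": heap_addr,   # file offset of the global heap collection
        "heap_object_index": heap_index,
    }

# A heap_address at or beyond the file's end-of-file offset is exactly the
# condition described above: a pointer into a heap that was never written.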
With the file driver we currently use, all changes to the file are persisted when the data is flushed or when the file is closed. In both cases the following write system calls are made:
- write 96 bytes to the superblock, usually at the beginning of the file (offset 0x0). This write is important: it updates the superblock and the end-of-file address.
- write 40 bytes to the symbol table cache; this is usually a no-op because no new datasets were created.
- write 4096 bytes to the end of the file, creating a new global heap.
- write 120 bytes to update the attribute pointers so they point to the new global heap where the new values now reside.
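A minimal h5py sketch of the operation that triggers this write sequence; the file name and attribute value are illustrative, and the comments simply restate the sequence described above.

import h5py

str_vlen = h5py.special_dtype(vlen=str)       # variable-length string dtype

with h5py.File("master.hdf5", "a") as f:      # illustrative file name
    f.attrs.create("coverage_type", "complex", dtype=str_vlen)
    f.flush()                                 # dirty metadata is written out here:
                                              # superblock, symbol table cache, a new
                                              # global heap and the attribute pointers
# closing the file performs the same flush implicitly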
HDF5 relies on the POSIX semantics of the write system call, which state that it returns the number of bytes successfully written. HDF5 iterates over the block of data and calls write until all of the data is written or an error is returned. The code (edited for brevity) is as follows:
while(size > 0) {
    ssize_t bytes_in = 0;     /* # of bytes to write */
    ssize_t bytes_wrote = -1; /* # of bytes written  */

    bytes_in = (ssize_t)size; /* write the remaining bytes; the real code caps
                                 this at a maximum I/O size */
    do {
        bytes_wrote = write(file->fd, buf, bytes_in);
    } while(-1 == bytes_wrote && EINTR == errno);

    if(-1 == bytes_wrote) { /* error */
        ...
    }
    HDassert(bytes_wrote > 0);
    HDassert((size_t)bytes_wrote <= size);

    size -= (size_t)bytes_wrote;
    addr += (haddr_t)bytes_wrote;
    buf = (const char *)buf + bytes_wrote;
} /* end while */
The state of the corrupted file is indicative of an I/O problem outside the scope of the software itself. Note that the retry loop above only protects against write calls that report an error or a short write; if write reports success but the data never reaches the file, HDF5 has no way to detect it. In the sequence of four write calls, if any of them had reported a failure its successors would not have executed, yet the file reflected the following state:
- [✗] 96 bytes - The superblock was NOT updated, but for HDF5 to continue to the next step write must have reported success (returning 96); the file does not reflect the updated state.
- [?] 40 bytes - There is no way of knowing whether this write succeeded, since the result is identical to the previous state.
- [✗] 4096 bytes - The end of the file was never extended and the new global heap was never added.
- [√] 120 bytes - This write succeeded: the pointers were updated to a section of the file that did not exist yet.
If a write had failed outright, or the process had been interrupted before the last write, the file would still have been usable; this specific pattern of failure, however, left the file in a state the HDF5 library could not parse.
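In the HDF5 versions that raise an error rather than segfault, the symptom can at least be detected without hand-inspecting the file: listing attribute names only walks the object header, while reading a value dereferences the heap pointer. A hedged sketch (the function name is mine):

import h5py

def find_unreadable_attrs(path):
    """Return the names of root-group attributes whose values cannot be read,
    the observable symptom of heap pointers that reference space past the
    end of the file."""
    bad = []
    with h5py.File(path, "r") as f:
        for name in f.attrs.keys():   # listing names only walks the object header
            try:
                f.attrs[name]         # reading the value dereferences the heap pointer
            except (IOError, OSError):
                bad.append(name)
    return bad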
Some speculative causes:
- Hardware Failure
- NFS Issue/Bug
- Operating system / file-system driver bug (ext3, or whichever file system is in use)
- Race condition where two processes attempted to write to the same file, updating similar information
In this specific case I was able to fix the corruption by changing the attributes' pointers to the most recent global heap offset that exists in the file. The last set of attribute modifications is not reflected in the file, but the file is now usable.
I do not believe there is a programmatic way to identify this level of corruption in a file and repair it, at least not within a reasonable amount of time. I have reverse-engineered a very low-level HDF5 file reader as a pure Python module. The module decomposes the file into its basic building blocks and could assist engineers in identifying the part of the file where the corruption exists. For now, fixing a file (if it is recoverable at all) requires an engineer's attention and knowledge.
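To give a flavour of what such a module does, here is a minimal, self-contained sketch that reads the end-of-file address recorded in the superblock, assuming a version-0 superblock at offset 0 with 8-byte addresses (the common default). The field offsets follow my reading of the HDF5 file format specification and are not taken from the module itself.

import os
import struct

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

def read_superblock_eof(path):
    """Return the end-of-file address recorded in a version-0 superblock."""
    with open(path, "rb") as fh:
        block = fh.read(48)
    if block[:8] != HDF5_SIGNATURE:
        raise ValueError("no HDF5 superblock at offset 0")
    # The signature is followed by 8 bytes of version/size fields and 8 bytes of
    # group/consistency fields; the base address, free-space address and
    # end-of-file address (8 bytes each) start at offset 24.
    base_addr, free_addr, eof_addr = struct.unpack_from("<QQQ", block, 24)
    return eof_addr

# A recorded end-of-file address that disagrees with os.path.getsize(path), or
# attribute heap pointers beyond it, are the kind of inconsistency to look for.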
The overhead introduced to reads and writes of HDF5 files that use variable-length string attributes is egregious. Each change to a subset of attributes results in a new global heap being allocated for the file, which costs 4096 bytes at a minimum. In short, if you open the file and change one character of one string, the file grows by 4096 bytes. By using fixed-length strings I was able to reduce overall file sizes, reduce the block sizes of data written to disk, and reduce the probability of the corruption we identified.
- Fixed-length strings do not store their data in a global heap, so there is no 4 KiB overhead to using them; the overhead is 10-20 bytes.
- If a write fails partway through, the corruption is minimized: there are no pointers that could end up pointing at unallocated space, because the attribute values reside in a contiguous block with the attributes themselves.
- If the changed string is longer than the original, a new attribute is created at the end of the object header stack and the original is changed to a NIL message (deleted). If a write failed at any point you would end up with a complete, usable file that either had a duplicate attribute with two different values or had the old value with the new value missing.
- To synchronize the in-memory HDF5 file with the persisted data file, fewer system calls are made: three instead of four, which reduces the probability of corruption.
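A hedged sketch of how the size difference could be measured with h5py; the file names are throwaway, and the exact growth depends on the HDF5 version and driver, but per the behaviour described above the variable-length file is expected to end up markedly larger.

import os
import h5py
import numpy as np

def rewrite_attr_repeatedly(path, dtype, times=10):
    """Create one attribute, rewrite it `times` times, and return the file size."""
    with h5py.File(path, "w") as f:
        f.attrs.create("coverage_type", "complex", dtype=dtype)
    for i in range(times):
        with h5py.File(path, "a") as f:
            f.attrs.modify("coverage_type", "complex_%d" % i)
    return os.path.getsize(path)

vlen_size  = rewrite_attr_repeatedly("vlen_demo.hdf5",  h5py.special_dtype(vlen=str))
fixed_size = rewrite_attr_repeatedly("fixed_demo.hdf5", np.dtype("S64"))
print(vlen_size, fixed_size)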
In the event of future corruption, there may be a way to programmatically reconstruct the file from system resources, similar to how CoverageDoctor works. This would be the preferred method of fixing corruption, in lieu of low-level HDF5 file modification, which is time consuming, prone to errors and requires human attention.