Last active
August 29, 2015 14:01
NetCDF text attribute encoding test.
%TESTNCTEXT NetCDF text attribute test.
%
% Input and output of text attributes to NetCDF files is not consistent when
% they contain non-ASCII characters. Saving the attribute to a file and loading
% it again does not recover the original value.
%
% The cause of the problem seems to be the different data types used by MATLAB
% and NetCDF to represent character data, and that it is not documented how
% the conversion is done:
%
%   - CHAR class in MATLAB is 2 bytes, and characters are encoded in what seems
%     to be UCS-2 (2-byte Universal Character Set), equivalent to UTF-16
%     without surrogate pairs.
%
%   - NetCDF data type NC_CHAR is 1 byte, and the format does not specify any
%     encoding.
%
% The conversion for writing text attributes seems to follow these rules:
%
%   - The text attribute in the NetCDF file and the corresponding MATLAB value
%     have exactly the same length: a 13-element CHAR array is written as a
%     13-element NC_CHAR attribute value.
%
%   - Each NC_CHAR value is set to the least significant byte of the respective
%     CHAR value. Thus only CHAR codes in the range from 0 to 255 are stored
%     unaltered in the NetCDF file.
%
% To read text attributes the conversion seems to be done as follows:
%
%   - The sequence of NC_CHAR elements is decoded according to the current
%     character set.
%
%   - The null character, if present, terminates the string no matter the value
%     of the following NC_CHAR elements.
%
% If the above assumptions are true, by using UCS-2 for internal representation
% of character data but storing only the least significant byte when writing to
% NetCDF, MATLAB encodes the text attributes according to iso8859-1 (latin1).
% Thus if a different character set is in use (default is UTF-8) the encoding
% and decoding procedures are not consistent.
%
% For example, if the character set is UTF-8, only the ASCII characters are
% preserved. They are the NC_CHAR values with codes in the range from 0 to 127.
% Codes from 128 to 255 are replaced by the 'replacement character' (U+FFFD,
% 0xfffd in UTF-16, decimal value 65533), because they are not valid in UTF-8.
%
% Hence, there are two ways to achieve write-read consistency:
%
%   - Keep the current encoding approach, always decode text attributes
%     assuming they are encoded in iso8859-1 (latin1), and clearly state the
%     text encoding and its limitations in the documentation. This requires
%     a trivial modification in NETCDF.GETATT. Text attributes should be
%     read as bytes in the NETCDFLIB call and then:
%       attrvalue = native2unicode(attrbytes, 'latin1')
%
%   - Keep the current decoding approach, and explicitly encode text attributes
%     according to the current character set. This requires modifying the
%     NETCDFLIB mex interface, whose code is not available. A hacky alternative
%     is to perform the encoding in NETCDF.PUTATT, before the NETCDFLIB call:
%       attrvalue = char(unicode2native(attrvalue))
%
% The above solutions would only provide read-write consistency within a
% MATLAB session. To achieve complete compatibility, the user should be able
% to set the encoding of the text attributes when reading and when writing,
% either as an option in the function calls or as a preference, with a sensible
% default value (e.g. the default character set). This would probably require
% modifications to the mex interface NETCDFLIB and/or to the functions
% NETCDF.GETATT and NETCDF.PUTATT.
%
% All this should be noted in the documentation.
%
% See also:
%   NATIVE2UNICODE
%   UNICODE2NATIVE
%
% Author: Joan Pau Beltran
% Email: [email protected]

%% Create test file.
% Create a file with several text attributes,
% some of them with non-ASCII characters.
nc_globalid = netcdf.getConstant('NC_GLOBAL');
vocals = 'aàáeèéiíoòóuú';
codes = char(1:255);
complete = ['stop' char(0) 'here'];
ncid_out = netcdf.create('vocals.nc', 'NC_CLOBBER');
netcdf.putAtt(ncid_out, nc_globalid, 'vocals', vocals);
netcdf.putAtt(ncid_out, nc_globalid, 'codes', codes);
netcdf.putAtt(ncid_out, nc_globalid, 'complete', complete);
netcdf.close(ncid_out);
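
%% Sketch: bytes expected in the file under the low-byte assumption.
% Not part of the original test. This cell only models the write rule that
% the header describes as a guess (each NC_CHAR element is the least
% significant byte of the CHAR code); it does not inspect the file.
% The *_bytes names are hypothetical, introduced here for illustration.
vocals_bytes = uint8(bitand(double(vocals), 255));
codes_bytes = uint8(bitand(double(codes), 255));
% All codes in VOCALS and CODES are below 256, so under this assumption
% they would be stored unaltered, i.e. as iso8859-1 (latin1) bytes.
% A code above 255 would lose its high byte: the euro sign (U+20AC,
% decimal 8364) would be stored as 0xAC (decimal 172).
euro_byte = uint8(bitand(double(char(8364)), 255));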

%% Load test file.
% Load the file again and try to read the same attributes.
% KOMPLETE is truncated at the null character. This is acceptable.
% Other attributes should return the same contents,
% but they do not if the character set is UTF-8.
ncid_in = netcdf.open('vocals.nc', 'NC_NOWRITE');
vokals = netcdf.getAtt(ncid_in, nc_globalid, 'vocals');
kodes = netcdf.getAtt(ncid_in, nc_globalid, 'codes');
komplete = netcdf.getAtt(ncid_in, nc_globalid, 'complete');
netcdf.close(ncid_in);
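
%% Sketch: compare the written and read values.
% Not part of the original test. A possible way to report the outcome,
% assuming the variables defined in the cells above. The *_ok names are
% hypothetical, introduced here for illustration.
vocals_ok = isequal(vocals, vokals);
codes_ok = isequal(codes, kodes);
% COMPLETE is expected to come back truncated at the null character.
complete_ok = strcmp(komplete, 'stop');
fprintf('vocals preserved:    %d\n', vocals_ok);
fprintf('codes preserved:     %d\n', codes_ok);
fprintf('complete truncated:  %d\n', complete_ok);

%% Sketch: model of the suspected round trip, without the file.
% Under the assumptions in the header, writing keeps the least significant
% byte of each code and reading decodes those bytes with the current
% character set. If that is right, VOCALS_MODEL should match VOKALS, and
% decoding as latin1 (the first solution proposed in the header) should
% recover VOCALS.
vocals_bytes = uint8(bitand(double(vocals), 255));
vocals_model = native2unicode(vocals_bytes);            % default character set
vocals_latin1 = native2unicode(vocals_bytes, 'latin1'); % proposed decoding
model_ok = isequal(vocals_model, vokals);
latin1_ok = isequal(vocals_latin1, vocals);
fprintf('model matches read value: %d\n', model_ok);
fprintf('latin1 decoding recovers: %d\n', latin1_ok);

%% Sketch: explicit encoding before writing.
% A variant of the hacky alternative from the header, applied on the caller
% side instead of inside NETCDF.PUTATT, and with the encoding fixed to UTF-8.
% The file name 'vocals_utf8.nc' and the variable names are hypothetical.
% This would only round-trip if the low-byte write rule holds and the
% character set used for reading is UTF-8.
vocals_utf8 = char(unicode2native(vocals, 'UTF-8'));
ncid_tmp = netcdf.create('vocals_utf8.nc', 'NC_CLOBBER');
netcdf.putAtt(ncid_tmp, nc_globalid, 'vocals', vocals_utf8);
netcdf.close(ncid_tmp);
ncid_tmp = netcdf.open('vocals_utf8.nc', 'NC_NOWRITE');
vokals_utf8 = netcdf.getAtt(ncid_tmp, nc_globalid, 'vocals');
netcdf.close(ncid_tmp);
utf8_ok = isequal(vocals, vokals_utf8);
fprintf('explicit UTF-8 encoding round trip: %d\n', utf8_ok);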