joanpau · August 29, 2015 14:01
diff --git a/testnctext.m b/testnctext.m
 %TESTNCTEXT  NetCDF text attribute test.
 %
 %  Input and output of text attributes to NetCDF files is not consistent when
 %  they contain non-ASCII characters. Saving the attribute to a file and loading
 %  it again does not recover the original value.
 %
 %  The cause of the problem seems to be the different data types used by MATLAB 
 %  and NetCDF to represent character data, and that it is not documented how 
 %  the conversion is done:
 %
 %    - CHAR class in MATLAB is 2 bytes, and characters are encoded in what seems
 %      to be UCS-2 (2-byte Universal Character Set), equivalent to UTF-16 
 %      without surrogate pairs.
 %
 %    - NetCDF data type NC_CHAR is 1 byte, and the format does not specify any 
 %      encoding. 
 %
 %  The conversion for writing text attributes seems to follow these rules:
 %
 %    - The text attribute in the NetCDF file and the corresponding MATLAB value
 %      have exactly the same length: a 13-element CHAR array is written as a 
 %      13-element NC_CHAR attribute value.
 %
 %    - Each NC_CHAR value is set to the least significant byte of the respective
 %      CHAR value. Thus only CHAR codes in the range from 0 to 255 are stored 
 %      unaltered in the NetCDF file.
 %
 %  To read text attributes the conversion seems to be done as follows:
 %
 %    - The sequence of NC_CHAR elements is decoded according to the current 
 %      character set.
 %
 %    - The null character, if present, terminates the string no matter the value
 %      of the following NC_CHAR elements.
 %  
 %  If the above assumptions are true, by using UCS-2 for internal representation
 %  of character data but storing only the least significant byte when writing to
 %  NetCDF, MATLAB encodes the text attributes according to iso8859-1 (latin1).
 %  Thus if a different character set is in use (default is UTF-8) the encoding
 %  and decoding procedures are not consistent.
 %
 %  For example, if the character set is UTF-8, only the ASCII characters are
 %  preserved. They are the NC_CHAR values with codes in the range from 0 to 127.
 %  Codes from 128 to 255 are replaced by the 'replacement character' (U+FFFD,
 %  0xfffd in UTF-16, decimal value 65533), because a lone byte in that range
 %  is not a valid UTF-8 sequence.
 %
 %  Hence, there are two ways to achieve write-read consistency:
 %
 %    - Keep the current encoding approach, and always decode text attributes
 %      assuming they are encoded in iso8859-1 (latin1) and clearly state the
 %      text encoding and its limitations in the documentation. This requires 
 %      a trivial modification in NETCDF.GETATT. Text attributes should be 
 %      read as bytes in the NETCDFLIB call and then:
 %        attrvalue = native2unicode(attrbytes, 'latin1')
 % 
 %    - Keep the current decoding approach, and explicitly encode text attributes
 %      according to the current character set. This requires modifying the
 %      NETCDFLIB mex interface, whose code is not available. A hacky alternative
 %      is to perform the encoding in NETCDF.PUTATT, before the NETCDFLIB call:
 %        attrvalue = char(unicode2native(attrvalue))
 %
 %  The above solutions would only provide MATLAB session read-write consistency.
 %  To achieve complete compatibility, the user should be able to set the 
 %  encoding of the text attributes when reading and when writing, either as an 
 %  option in the function calls or as a preference, with a sensible default
 %  value (e.g. the default character set). This would probably require
 %  modifications to the mex interface NETCDFLIB and/or to the functions 
 %  NETCDF.GETATT and NETCDF.PUTATT.
 % 
 %  All this should be noted in the documentation.
 %
 %  See also:
 %    NATIVE2UNICODE
 %    UNICODE2NATIVE
 %
 %  Author: Joan Pau Beltran
 %  Email: [email protected]


 %% Create test file.
 % Create a file with several text attributes, 
 % some of them with non-ASCII characters.
 nc_globalid = netcdf.getConstant('NC_GLOBAL');
 vocals = 'aàáeèéiíoòóuú';
 codes = char(1:255);
 complete = ['stop' char(0) 'here'];
 ncid_out = netcdf.create('vocals.nc', 'NC_CLOBBER');
 netcdf.putAtt(ncid_out, nc_globalid, 'vocals', vocals);
 netcdf.putAtt(ncid_out, nc_globalid, 'codes', codes);
 netcdf.putAtt(ncid_out, nc_globalid, 'complete', complete);
 netcdf.close(ncid_out);


 %% Load test file.
 % Load the file again and try to read the same attributes.
 % KOMPLETE is truncated at the null character. This is acceptable.
 % Other attributes should return the same contents, 
 % but they do not if character set is UTF-8.
 ncid_in = netcdf.open('vocals.nc', 'NC_NOWRITE');
 vokals = netcdf.getAtt(ncid_in, nc_globalid, 'vocals');
 kodes = netcdf.getAtt(ncid_in, nc_globalid, 'codes');
 komplete = netcdf.getAtt(ncid_in, nc_globalid, 'complete');
 netcdf.close(ncid_in);
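

 %% Compare written and read values (illustrative sketch).
 % This cell is not part of the original test. It is a minimal check that
 % assumes the variables created by the cells above are still in the
 % workspace. ISEQUAL reports whether each attribute survived the round
 % trip, and DOUBLE shows the character codes that were actually read back.
 fprintf('vocals preserved: %d\n', isequal(vocals, vokals));
 fprintf('codes preserved:  %d\n', isequal(codes, kodes));
 fprintf('complete read as: ''%s'' (%d of %d characters)\n', ...
         komplete, numel(komplete), numel(complete));
 disp(double(vokals));


 %% Try the encoding workaround from the help text (illustrative sketch).
 % Assumption: text attributes are decoded as UTF-8 when read, as the help
 % text above supposes for the default character set. The file name
 % 'vocals_utf8.nc' and the variables NCID_FIX and VOKALS_FIX are
 % hypothetical names introduced only for this example.
 % The attribute is encoded to UTF-8 bytes before writing, so that each
 % NC_CHAR element holds one byte of the encoded text.
 vocals_bytes = unicode2native(vocals, 'UTF-8');
 ncid_fix = netcdf.create('vocals_utf8.nc', 'NC_CLOBBER');
 netcdf.putAtt(ncid_fix, nc_globalid, 'vocals', char(vocals_bytes));
 netcdf.close(ncid_fix);
 ncid_fix = netcdf.open('vocals_utf8.nc', 'NC_NOWRITE');
 vokals_fix = netcdf.getAtt(ncid_fix, nc_globalid, 'vocals');
 netcdf.close(ncid_fix);
 % If the assumption about decoding holds, this should print 1.
 fprintf('vocals preserved with workaround: %d\n', isequal(vocals, vokals_fix));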