@mohawk2
Last active August 29, 2015 14:05
Perl Text

NAME

perltext - Perl Text thoughts

DESCRIPTION

This document assumes you have read perlunitut, which in turn assumes you understand the distinction between characters and octets, and that a "string" is a sequence of characters (aka codepoints).

CONTEXT

Currently, Perl distinguishes between strings that are internally UTF-8 (and can therefore contain codepoints above 255) and those that are not (and can therefore only contain codepoints 0-255). It does not currently distinguish between strings that are text and those that are not.

This is relevant because when it comes to opening files or directories, Perl currently just passes its internal representation to the relevant system call, without further interpretation. On Unix-like systems, this works because the OS is agnostic about the data given it, and the "right thing" happens. However, on systems that are not agnostic in this way, like Windows, the wrong thing happens. In both cases, the "right thing" means users seeing correctly-named files (and directories) resulting from Perl program activity.

DEMONSTRATION

This code produces a listing with the correct file on Linux (where locales normally use UTF-8), but die()s on Win32:

my $file = "\x{00e4}\x{00f6}\x{00fc}\x{263a}.txt";
open my $fh, '>', $file or die "open: $!";
my @cmd = $^O eq 'MSWin32' ? qw(dir) : qw(ls -l);
system @cmd, $file;

Current Win32 workaround:

sub writefile {
  use Encode qw(encode);
  use Data::Dump qw(dump);
  use Win32API::File qw(:ALL);
  my $file = shift;
  my $enc = encode("UTF-16LE", $file); # NTFS stores names as UTF-16LE
  my $binary = eval dump($enc);        # round-trip to drop the UTF-8 flag
  $binary .= chr(0) . chr(0);          # NUL-terminate the wide string
  my $F = Win32API::File::CreateFileW  # create file via the Win32 API
    ($binary, GENERIC_WRITE, 0, [], OPEN_ALWAYS, 0, 0);
  $F or die "CreateFileW: $^E";        # report any error
  local *FILE;
  OsFHandleOpen(FILE, $F, "w") or die "Cannot open file: $^E";
  \*FILE;
}
 
my $file = "\x{00e4}\x{00f6}\x{00fc}\x{263a}.txt";
writefile($file);

PROPOSAL

So that the first snippet works in the expected way on Windows, it is proposed that Perl explicitly know when a string is intended as text. This might be implemented as an SV flag, SvTEXT. This flag would be set on the return value of Encode::decode (and on data read from filehandles with an :encoding(…) layer), and would be off on returns from Encode::encode. There would also be another mechanism for explicitly setting this flag to true.
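From user-level code, the proposal might look like the following sketch. The decode/encode lines run today; the explicit-marking API named in the comments does not exist and the names are invented here purely for illustration:

```perl
use Encode qw(encode decode);

my $octets = "\xC3\xA4";                 # the UTF-8 encoding of "ä"
my $text   = decode("UTF-8", $octets);   # under the proposal: SvTEXT set
my $bytes  = encode("UTF-8", $text);     # under the proposal: SvTEXT off

# A hypothetical explicit-marking API (names invented for illustration):
#   text::mark($string);      # force the flag on
#   text::is_text($string);   # query it
```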

Concatenating non-text with non-text, and text with text, would obviously not affect the "text-ness" of the result.

The purpose of this would be that when filesystem-orientated functions, among others, received a string known to be "text", they could do the right thing.

It is intended that Encode::Locale be used to find and make easily-available the correct encodings for filesystems.
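A minimal sketch of what this could look like today: Encode::Locale (a CPAN module, assumed installed here) registers a locale_fs alias for the filesystem's native encoding, so filenames can be encoded explicitly before being handed to the OS:

```perl
use Encode qw(encode);
use Encode::Locale;   # registers "locale_fs" (also "locale", "console_out", ...)

my $file   = "\x{00e4}\x{00f6}\x{00fc}\x{263a}.txt";
my $octets = encode('locale_fs', $file);  # octets in the filesystem's encoding
open my $fh, '>', $octets or die "open: $!";
```

Under the proposal, filesystem functions could do this encoding themselves whenever they receive a string flagged as text.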

ISSUES

The behaviour of concatenating text and non-text needs to be defined.

Option 1: parsimonious

Non-text-ness would be like the behaviour of tainting, and would "win": the result would be non-text.

Option 2: moar detail

Here, there would also be a "binary" flag, which would be set on returns from Encode::encode, pack, and on data read from filehandles with an explicit :raw layer. If neither "text" nor "binary" flags were set, the string's "text-ness" would be considered "unknown". In this option, the outcomes of concatenation would be (b = binary, t = text, u = unknown):

  . | b | t | u
  b | b | ? | b
  t | ? | t | t
  u | b | t | u

While it may seem desirable for the "?" results above to use Encode::decode to "upgrade" the known-binary string to text, this is impossible without knowing the correct encoding. Once a cautious version has been implemented, an updated version which allowed specifying a default (probably UTF-8) might be made.
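The table above can be expressed as a small pure-Perl function (the name concat_kind is invented here; it models only the Option 2 flag propagation, not actual concatenation):

```perl
# Result kind of concatenating two strings under Option 2:
# 'b' = binary, 't' = text, 'u' = unknown, '?' = undefined behaviour.
sub concat_kind {
    my ($a, $b) = @_;
    return $a if $a eq $b;     # like kinds keep their kind
    return '?' if ($a eq 'b' && $b eq 't') or ($a eq 't' && $b eq 'b');
    return $a eq 'u' ? $b : $a;   # unknown adopts the known kind
}
```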

SEE ALSO

Stackoverflow answer

Perl5 Porting/todo.pod

@rjbs

rjbs commented Aug 29, 2014

On Unix-like systems, this works because the OS is agnostic about the data given it, and the "right thing" happens.

The right thing happens sometimes. If the filesystem has Latin-1 filenames and the string is stored in UTF-8 in Perl, the wrong thing happens. The reverse is also possible.

@rjbs

rjbs commented Aug 29, 2014

Presumably it would be desirable for the "?" results above to use Encode::decode to "upgrade" the known-binary string to text, with the outcome then being "text"."text", i.e. "text".

You can't do that unless you know what kind of binary data it is, which you don't from just a flag. I can give you a string containing a bunch of bytes and tell you that it's binary data. Then you do $binary . $text and you can't decode $binary unless you know what its encoding is. (Heck, it might not even be encoded text, but just a part of a program's heap or something.) You can't encode $text for similar reasons. If you automatically encode text to UTF-8 for concatenation to binary data, you'll be in a world of hurt if $binary was in KOI8-R.

@ilmari

ilmari commented Sep 3, 2014

there would also be a "binary" flag, which would be set on returns from Encode::decode.

That should be Encode::encode. Similarly, other clearly binary-returning functions such as pack should set the binary flag.

Data read from filehandles with an :encoding(…) layer will also have the text flag (because it uses Encode::decode). Data read from filehandles with an explicit :raw layer should probably be binary, while filehandles with neither should probably be unknown (at least for now).

@mohawk2
Author

mohawk2 commented Sep 8, 2014

@rjbs, updated for your points on encodings. Also added reference to Encode::Locale which I found extremely helpful when I addressed Unicode issues in EUMM.

@ilmari, added your points in. Entertainingly, I had to add an =encoding utf8 at the top so it correctly handled your .

@jhi

jhi commented Sep 8, 2014

An additional consideration (one way of putting it…) is that saying ”UTF-8” doesn’t cover Unicode properly. As in ”doesn’t nail it down enough” — doesn’t tell the Unicode normalization.

And if you bring in the difference between creating filesystem objects and looking them up, you need to also think of folding (in addition to normalization). UNIX is again a black box (though I've got a tingly sense that some Linux filesystems might do more?), but NTFS and HFS+ are case-forgiving. (Store the case on creation, but ignore it on lookup.)

(And +1 on rjbs' warnings on trying to guess the encoding heuristically... just don't do it. One can get a reasonably high confidence for "this piece is UTF-8", part of the design of it, but even then it's not 100%. For many encodings, there just isn't a good way, at least without bringing in priors like "this piece is maybe probably in the following human languages..." Just. Don't.)

@jhi

jhi commented Sep 8, 2014

By the "UTF-8 not nailing it down enough" I meant that filesystems [1] might normalize [2] the text. So what you put in is not necessarily what you get out, and if you concatenate strings you might get results of mixed normalization.

[1] https://en.wikipedia.org/wiki/HFS_Plus uses almost Unicode NFD
[2] https://en.wikipedia.org/wiki/Unicode_equivalence
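The normalization point can be illustrated with the core Unicode::Normalize module: the same text can be one codepoint or two, so byte-for-byte filename comparison may fail even when both sides are valid UTF-8:

```perl
use Unicode::Normalize qw(NFC NFD);

my $nfc = "\x{00e4}";   # "ä" as one precomposed codepoint (NFC)
my $nfd = NFD($nfc);    # "a" followed by U+0308 COMBINING DIAERESIS (NFD)
# Equal as text, but different codepoint sequences:
# length($nfc) == 1, length($nfd) == 2, yet NFC($nfd) eq $nfc.
```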
