@mohawk2
Last active August 29, 2015 14:05
Perl Text

NAME

perltext - Perl Text thoughts

DESCRIPTION

This document assumes you have read perlunitut, which in turn assumes you understand the distinction between characters and octets, and that a "string" is a sequence of characters (aka codepoints).

CONTEXT

Currently, Perl distinguishes between strings that are internally UTF-8 (and can therefore contain codepoints above 255) and those that are not (and can therefore only contain codepoints 0-255). It does not currently distinguish between strings that are text and those that are not.

This is relevant because when it comes to opening files or directories, Perl currently just passes its internal representation to the relevant system call, without further interpretation. On Unix-like systems, this works because the OS is agnostic about the data given it, and the "right thing" happens. However, on systems that are not agnostic in this way, like Windows, the wrong thing happens. In both cases, the "right thing" means users seeing correctly-named files (and directories) resulting from Perl program activity.

DEMONSTRATION

This code produces a listing with the correct file on Linux (where locales normally use UTF-8), but die()s on Win32:

my $file = "\x{00e4}\x{00f6}\x{00fc}\x{263a}.txt";
open my $fh, '>', $file or die "open: $!";
my @cmd = $^O eq 'MSWin32' ? qw(dir) : qw(ls -l);
system @cmd, $file;

Current Win32 workaround:

sub writefile {
  use Encode qw(encode);
  use Data::Dump qw(dump);
  use Win32API::File qw(:ALL);
  my $file = shift;
  my $enc = encode("UTF-16LE", $file); # NTFS stores names as UTF-16LE
  my $binary = eval dump($enc);        # round-trip to drop the UTF-8 flag
  $binary .= chr(0) . chr(0);          # NUL-terminate the wide string
  my $F = Win32API::File::CreateFileW  # create file via the Win32 API
    ($binary, GENERIC_WRITE, 0, [], OPEN_ALWAYS, 0, 0);
  $F or die "CreateFileW: $^E";        # report any error
  local *FILE;
  OsFHandleOpen(FILE, $F, "w") or die "Cannot open file: $^E";
  \*FILE;
}
 
my $file = "\x{00e4}\x{00f6}\x{00fc}\x{263a}.txt";
writefile($file);

PROPOSAL

So that the first snippet works in the expected way on Windows, it is proposed that Perl explicitly know when a string is intended as text. This might be implemented as an SV flag, SvTEXT. This flag would be set on the return value of Encode::decode (and on data read from filehandles with an :encoding(…) layer), and would be off on returns from Encode::encode. There would also be another mechanism for explicitly setting this flag to true.
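From user-level code, the proposal might look like the following sketch. The decode/encode lines run today; the explicit-marking API named in the comments does not exist and the names are invented here purely for illustration:

```perl
use Encode qw(encode decode);

my $octets = "\xC3\xA4";                 # the UTF-8 encoding of "ä"
my $text   = decode("UTF-8", $octets);   # under the proposal: SvTEXT set
my $bytes  = encode("UTF-8", $text);     # under the proposal: SvTEXT off

# A hypothetical explicit-marking API (names invented for illustration):
#   text::mark($string);      # force the flag on
#   text::is_text($string);   # query it
```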

Concatenating non-text with non-text, and text with text, would obviously not affect the "text-ness" of the result.

The purpose of this would be that when filesystem-orientated functions, among others, received a string known to be "text", they could do the right thing.

It is intended that Encode::Locale be used to find and make easily-available the correct encodings for filesystems.
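A minimal sketch of what this could look like today: Encode::Locale (a CPAN module, assumed installed here) registers a locale_fs alias for the filesystem's native encoding, so filenames can be encoded explicitly before being handed to the OS:

```perl
use Encode qw(encode);
use Encode::Locale;   # registers "locale_fs" (also "locale", "console_out", ...)

my $file   = "\x{00e4}\x{00f6}\x{00fc}\x{263a}.txt";
my $octets = encode('locale_fs', $file);  # octets in the filesystem's encoding
open my $fh, '>', $octets or die "open: $!";
```

Under the proposal, filesystem functions could do this encoding themselves whenever they receive a string flagged as text.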

ISSUES

The behaviour of concatenating text and non-text needs to be defined.

Option 1: parsimonious

Non-text-ness would be like the behaviour of tainting, and would "win": the result would be non-text.

Option 2: moar detail

Here, there would also be a "binary" flag, which would be set on returns from Encode::encode, pack, and on data read from filehandles with an explicit :raw layer. If neither "text" nor "binary" flags were set, the string's "text-ness" would be considered "unknown". In this option, the outcomes of concatenation would be (b = binary, t = text, u = unknown):

  . | b | t | u
  b | b | ? | b
  t | ? | t | t
  u | b | t | u

While it may seem desirable for the "?" results above to use Encode::decode to "upgrade" the known-binary string to text, this is impossible without knowing the correct encoding. Once a cautious version has been implemented, an updated version which allowed specifying a default (probably UTF-8) might be made.
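The table above can be expressed as a small pure-Perl function (the name concat_kind is invented here; it models only the Option 2 flag propagation, not actual concatenation):

```perl
# Result kind of concatenating two strings under Option 2:
# 'b' = binary, 't' = text, 'u' = unknown, '?' = undefined behaviour.
sub concat_kind {
    my ($a, $b) = @_;
    return $a if $a eq $b;     # like kinds keep their kind
    return '?' if ($a eq 'b' && $b eq 't') or ($a eq 't' && $b eq 'b');
    return $a eq 'u' ? $b : $a;   # unknown adopts the known kind
}
```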

SEE ALSO

Stackoverflow answer

Perl5 Porting/todo.pod

@rjbs

rjbs commented Aug 29, 2014

On Unix-like systems, this works because the OS is agnostic about the data given it, and the "right thing" happens.

The right thing happens sometimes. If the filesystem has Latin-1 filenames and the string is stored in UTF-8 in Perl, the wrong thing happens. The reverse is also possible.

@rjbs

rjbs commented Aug 29, 2014

Presumably it would be desirable for the "?" results above to use Encode::decode to "upgrade" the known-binary string to text, with the outcome then being "text"."text", i.e. "text".

You can't do that unless you know what kind of binary data it is, which you don't from just a flag. I can give you a string containing a bunch of bytes and tell you that it's binary data. Then you do $binary . $text and you can't decode $binary unless you know what its encoding is. (Heck, it might not even be encoded text, but just a part of a program's heap or something.) You can't encode $text for similar reasons. If you automatically encode text to UTF-8 for concatenation to binary data, you'll be in a world of hurt if $binary was in KOI8-R.

@ilmari

ilmari commented Sep 3, 2014

there would also be a "binary" flag, which would be set on returns from Encode::decode.

That should be Encode::encode. Similarly, other clearly binary-returning functions such as pack should set the binary flag.

Data read from filehandles with an :encoding(…) layer will also have the text flag (because it uses Encode::decode). Data read from filehandles with an explicit :raw layer should probably be binary, while filehandles with neither should probably be unknown (at least for now).

@mohawk2
Author

mohawk2 commented Sep 8, 2014

@rjbs, updated for your points on encodings. Also added reference to Encode::Locale which I found extremely helpful when I addressed Unicode issues in EUMM.

@ilmari, added your points in. Entertainingly, I had to add an =encoding utf8 at the top so it correctly handled your .

@jhi

jhi commented Sep 8, 2014

An additional consideration (one way of putting it…) is that saying ”UTF-8” doesn’t cover Unicode properly. As in ”doesn’t nail it down enough” — doesn’t tell the Unicode normalization.

And if you bring in the difference between creating filesystem objects and looking them up, you need to also think of folding (in addition to normalization). UNIX is again a black box (though I've got a tingly sense that some Linux filesystems might do more?), but NTFS and HFS+ are case-forgiving. (Store the case on creation, but ignore it on lookup.)

(And +1 on rjbs' warnings on trying to guess the encoding heuristically... just don't do it. One can get a reasonably high confidence for "this piece is UTF-8", part of the design of it, but even then it's not 100%. For many encodings, there just isn't a good way, at least without bringing in priors like "this piece is maybe probably in the following human languages..." Just. Don't.)

@jhi

jhi commented Sep 8, 2014

By the "UTF-8 not nailing it down enough" I meant that filesystems [1] might normalize [2] the text. So what you put in is not necessarily what you get out, and if you concatenate strings you might get results of mixed normalization.

[1] https://en.wikipedia.org/wiki/HFS_Plus uses almost Unicode NFD
[2] https://en.wikipedia.org/wiki/Unicode_equivalence
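The normalization point can be illustrated with the core Unicode::Normalize module: the same text can be one codepoint or two, so byte-for-byte filename comparison may fail even when both sides are valid UTF-8:

```perl
use Unicode::Normalize qw(NFC NFD);

my $nfc = "\x{00e4}";   # "ä" as one precomposed codepoint (NFC)
my $nfd = NFD($nfc);    # "a" followed by U+0308 COMBINING DIAERESIS (NFD)
# Equal as text, but different codepoint sequences:
# length($nfc) == 1, length($nfd) == 2, yet NFC($nfd) eq $nfc.
```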
