perltext - Perl Text thoughts
This document assumes you have read perlunitut, which in turn assumes you understand the distinction between characters and octets, and that a "string" is a sequence of characters (aka codepoints).
Currently, Perl distinguishes between strings that are internally UTF-8 (and can therefore have codepoints >255), and those that are not (and can therefore only be 0-255). It does not currently distinguish between strings that are text and those that are not.
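The internal distinction can be observed from Perl code with the core utf8:: functions. The following is a minimal sketch (the variable names are illustrative only): two strings that compare equal as character sequences are forced into the two internal forms.

```perl
use strict;
use warnings;

# Two copies of the same one-character string, U+00E4.
my $narrow = "\x{e4}";
my $wide   = "\x{e4}";

utf8::downgrade($narrow);   # force the non-UTF-8 internal form
utf8::upgrade($wide);       # force the UTF-8 internal form

printf "equal as strings: %s\n", $narrow eq $wide ? "yes" : "no";
printf "narrow is UTF-8 internally: %s\n", utf8::is_utf8($narrow) ? "yes" : "no";
printf "wide is UTF-8 internally: %s\n",   utf8::is_utf8($wide)   ? "yes" : "no";

# A codepoint above 255 can only be held in the UTF-8 form.
my $smiley = "\x{263a}";
printf "smiley is UTF-8 internally: %s\n", utf8::is_utf8($smiley) ? "yes" : "no";
```

Note that utf8::is_utf8 reports only the storage form; it says nothing about whether the string is meant as text.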
This is relevant because when it comes to opening files or directories, Perl currently just passes its internal representation to the relevant system call, without further interpretation. On Unix-like systems, this works because the OS is agnostic about the data given it, and the "right thing" happens. However, on systems that are not agnostic in this way, like Windows, the wrong thing happens. In both cases, I am referring to the "right thing" as users seeing correctly-named files (and directories) resulting from Perl program activity.
This code produces a listing with the correct file on Linux (where locales normally use UTF-8), but die()s on Win32:
my $file = "\x{00e4}\x{00f6}\x{00fc}\x{263a}.txt";
open my $fh, '>', $file or die "open: $!";
my @cmd = $^O eq 'MSWin32' ? qw(dir) : qw(ls -l);
system @cmd, $file;
Current Win32 workaround:
sub writefile {
    my $file = shift;
    use Data::Dump qw(dump);
    use Encode qw/encode/;
    use Win32API::File qw(:ALL);
    my $enc = encode("UTF-16LE", $file); # Format supported by NTFS
    my $binary = eval dump($enc); # Remove UTF ness
    $binary .= chr(0).chr(0); # 0 terminate string
    my $F = Win32API::File::CreateFileW(
        $binary, GENERIC_WRITE, 0, [], OPEN_ALWAYS, 0, 0
    ); # Create file via Win32API
    die $^E if $^E; # Write any error message
    local *FILE;
    OsFHandleOpen(FILE, $F, "w") or die "Cannot open file: $^E";
    \*FILE;
}
my $file = "\x{00e4}\x{00f6}\x{00fc}\x{263a}.txt";
writefile($file);
In order that the first snippet would work in the expected way on Windows, it is proposed that Perl explicitly know when a string is intended as text. This might be implemented by an SV flag SvTEXT. This flag would be set on the return value of Encode::decode (and when read from filehandles with an :encoding(…)), and would be off on returns from Encode::encode. There would also be another mechanism for explicitly setting this flag to true.
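The proposed flag does not exist in Perl today, so it can only be modelled. The following sketch pairs each string with an is_text marker in a hash; decode_as_text, encode_as_octets and mark_as_text are hypothetical names invented for this illustration, wrapping the real Encode::decode and Encode::encode.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for the proposed SvTEXT flag: a hash pairing the
# string value with an "is text" marker.
sub decode_as_text {
    my ($encoding, $octets) = @_;
    return { value => decode($encoding, $octets), is_text => 1 };
}

# Encoding yields octets, so the marker is off on the result.
sub encode_as_octets {
    my ($encoding, $text) = @_;
    return { value => encode($encoding, $text->{value}), is_text => 0 };
}

# Hypothetical explicit mechanism for marking an existing string as text.
sub mark_as_text {
    my ($string) = @_;
    return { value => $string, is_text => 1 };
}

my $name  = decode_as_text('UTF-8', "\xc3\xa4.txt");
my $bytes = encode_as_octets('UTF-16LE', $name);
printf "decoded: text=%d, length=%d\n", $name->{is_text},  length $name->{value};
printf "encoded: text=%d, length=%d\n", $bytes->{is_text}, length $bytes->{value};
```

A real implementation would keep the marker on the SV itself rather than in a wrapper, so existing code would not need to change how it handles strings.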
Concatenating non-text with non-text, and text with text, would obviously not affect the "text-ness" of the result.
The purpose of this would be that when filesystem-orientated functions, among others, received a string known to be "text", they could do the right thing.
It is intended that Encode::Locale be used to find and make easily-available the correct encodings for filesystems.
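As a rough illustration of that intent, the sketch below encodes a file name using the locale_fs alias that Encode::Locale registers. Encode::Locale is a CPAN module rather than part of core Perl, so the sketch falls back to UTF-8 when it is not installed; that fallback, and the helper name fs_octets, are assumptions made for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode::Locale is a CPAN module, not part of core Perl. If it is missing,
# this sketch falls back to UTF-8, which is an assumption, not a detected value.
my $fs_encoding = eval { require Encode::Locale; 1 } ? 'locale_fs' : 'UTF-8';

# Hypothetical helper: turn a character-string file name into the octets
# the filesystem is assumed to expect, dying if a character cannot be encoded.
sub fs_octets {
    my ($name) = @_;
    return encode($fs_encoding, $name,
                  Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

my $octets = fs_octets("report.txt");
printf "%d octets via %s\n", length $octets, $fs_encoding;
```

Under the proposal, filesystem functions would perform this step themselves for strings known to be text, instead of leaving it to the caller.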
The behaviour of concatenating text and non-text needs to be defined.
In one option, non-text-ness would behave like tainting and would "win": concatenating text with non-text would give a non-text result.
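The taint-like rule can be sketched with a small model in which each value carries an is_text marker next to the string. The hash layout and the name concat_taint_like are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical model: each value carries an is_text marker alongside the string.
sub concat_taint_like {
    my ($left, $right) = @_;
    return {
        value   => $left->{value} . $right->{value},
        # Non-text "wins": the result is text only if both operands are text.
        is_text => ($left->{is_text} && $right->{is_text}) ? 1 : 0,
    };
}

my $text = { value => "caf\x{e9}", is_text => 1 };
my $raw  = { value => "\x00\x01",  is_text => 0 };

printf "text . text -> text=%d\n", concat_taint_like($text, $text)->{is_text};
printf "text . raw  -> text=%d\n", concat_taint_like($text, $raw)->{is_text};
printf "raw  . raw  -> text=%d\n", concat_taint_like($raw,  $raw)->{is_text};
```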
In another option, there would also be a "binary" flag, which would be set on returns from Encode::encode, pack, and on data read from filehandles with an explicit :raw layer. If neither the "text" nor the "binary" flag were set, the string's "text-ness" would be considered "unknown". In this option, the outcomes of concatenation would be (b = binary, t = text, u = unknown):
X | b | t | u
--+---+---+---
b | b | ? | b
t | ? | t | t
u | b | t | u
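The table can be transcribed directly into a lookup structure. In this sketch the names %CONCAT_KIND and concat_kind are hypothetical, and the "?" entries are kept as-is because their behaviour is not yet defined.

```perl
use strict;
use warnings;

# Outcome table transcribed from the text: b = binary, t = text, u = unknown.
# '?' marks the combinations whose behaviour is still undecided.
my %CONCAT_KIND = (
    b => { b => 'b', t => '?', u => 'b' },
    t => { b => '?', t => 't', u => 't' },
    u => { b => 'b', t => 't', u => 'u' },
);

# Hypothetical helper: look up the kind of the result of concatenating
# a left operand of kind $left with a right operand of kind $right.
sub concat_kind {
    my ($left, $right) = @_;
    die "unknown kind '$left'"  unless exists $CONCAT_KIND{$left};
    die "unknown kind '$right'" unless exists $CONCAT_KIND{$left}{$right};
    return $CONCAT_KIND{$left}{$right};
}

# Print one row per left operand, in the same order as the table.
for my $left (qw(b t u)) {
    print join(' ', $left, map { concat_kind($left, $_) } qw(b t u)), "\n";
}
```

The table is symmetric, so the order of the operands does not affect the kind of the result.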
While it may seem desirable for the "?" results above to use Encode::decode to "upgrade" the known-binary string to text, this is impossible without knowing the correct encoding. Once a cautious version has been implemented, an updated version that allowed a default encoding (probably UTF-8) to be specified might be made.
The right thing happens sometimes. If the filesystem has Latin-1 filenames and the string is stored in UTF-8 in Perl, the wrong thing happens. The reverse is also possible.
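The mismatch can be made visible by looking at the octets in each string's internal buffer, which is what a system call receives when Perl passes the buffer through unchanged. In this sketch internal_octets is a hypothetical helper written for the illustration; it relies only on the core utf8:: functions.

```perl
use strict;
use warnings;

# Hypothetical helper: return the octets of the string's internal buffer,
# i.e. what a system call would see when Perl passes the buffer as-is.
sub internal_octets {
    my ($string) = @_;
    my $copy = $string;
    # For an internally-UTF-8 string the buffer holds the UTF-8 encoding;
    # otherwise it already holds one octet per character.
    utf8::encode($copy) if utf8::is_utf8($copy);
    return $copy;
}

# The same file name in both internal forms.
my $latin1_form = "\x{e4}.txt";
my $utf8_form   = "\x{e4}.txt";
utf8::downgrade($latin1_form);
utf8::upgrade($utf8_form);

for my $name ($latin1_form, $utf8_form) {
    print join(' ', map { sprintf '%02x', ord } split //, internal_octets($name)), "\n";
}
```

This prints "e4 2e 74 78 74" for the first form and "c3 a4 2e 74 78 74" for the second, although the two strings compare equal in Perl. Which of the two the filesystem displays correctly depends on the encoding it expects.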