NAME

perltext - Perl Text thoughts

DESCRIPTION

This document assumes you have read perlunitut, which in turn assumes you understand the distinction between characters and octets, and that a "string" is a sequence of characters (aka codepoints).

CONTEXT

Currently, Perl distinguishes between strings that are internally UTF-8 (and can therefore have codepoints >255), and those that are not (and can therefore only hold codepoints 0-255). It does not currently distinguish between strings that are text and those that are not.
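
The storage distinction described above can be observed with the core utf8::is_utf8 and utf8::upgrade functions. The following is a minimal sketch with illustrative variable names (none of them come from this proposal). It suggests that the flag describes how a string is stored, not whether it is text: upgrading changes the flag without changing the string's value.

```perl
use strict;
use warnings;

my $narrow = "\x{e4}";      # codepoint <= 255: normally stored one byte per character
my $wide   = "\x{263a}";    # codepoint > 255: has to be stored as UTF-8 internally

my $narrow_flag = utf8::is_utf8($narrow) ? 1 : 0;
my $wide_flag   = utf8::is_utf8($wide)   ? 1 : 0;

# Upgrading changes only the internal storage, not the characters.
my $upgraded = $narrow;
utf8::upgrade($upgraded);
my $upgraded_flag = utf8::is_utf8($upgraded) ? 1 : 0;
my $same_value    = $upgraded eq $narrow     ? 1 : 0;

print "narrow:   UTF-8 flag = $narrow_flag\n";
print "wide:     UTF-8 flag = $wide_flag\n";
print "upgraded: UTF-8 flag = $upgraded_flag, same value = $same_value\n";
```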

This matters because, when opening files or directories, Perl currently just passes its internal representation to the relevant system call, without further interpretation. On Unix-like systems this works because the OS is agnostic about the data it is given, and the "right thing" happens. On systems that are not agnostic in this way, such as Windows, the wrong thing happens. Here, the "right thing" means that users see correctly-named files (and directories) as a result of the Perl program's activity.

DEMONSTRATION

This code produces a listing with the correct file on Linux (where locales normally use UTF-8), but die()s on Win32:

my $file = "\x{00e4}\x{00f6}\x{00fc}\x{263a}.txt";
open my $fh, '>', $file or die "open: $!";
my @cmd = $^O eq 'MSWin32' ? qw(dir) : qw(ls -l); # $^O is 'MSWin32' on Windows
system @cmd, $file;
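
To see what "internal representation" means for this particular filename, the sketch below prints the octets that a UTF-8-flagged string holds. It uses only the core utf8::encode function; the variable names are illustrative. These octets are what a UTF-8 locale interprets as the intended name, and what a system expecting some other encoding would misread.

```perl
use strict;
use warnings;

my $file = "\x{00e4}\x{00f6}\x{00fc}\x{263a}.txt";

# Work on a copy: utf8::encode replaces the characters in place with
# their UTF-8 octets.
my $octets = $file;
utf8::encode($octets);

# One two-digit hex pair per octet.
my $hex = join ' ', unpack '(H2)*', $octets;
print "$hex\n";   # c3 a4 c3 b6 c3 bc e2 98 ba 2e 74 78 74
```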

Current Win32 workaround:

sub writefile {
  my $file = shift;
  use Encode qw/encode/;
  use Win32API::File qw(:ALL);
  # encode() already returns octets, so no extra step is needed to remove UTF ness
  my $binary = encode("UTF-16LE", $file); # Format supported by NTFS
  $binary .= chr(0).chr(0);               # 0 terminate string
  my $F  = Win32API::File::CreateFileW
   ($binary, GENERIC_WRITE, 0, [], OPEN_ALWAYS, 0, 0) # Create file via Win32API
    or die "CreateFileW: $^E";            # Test the return value; $^E alone may be stale
  local *FILE;
  OsFHandleOpen(FILE, $F, "w") or die "Cannot open file: $^E";
  return *FILE; # Return the glob itself; a reference to the localized glob would lose the handle
}
 
my $file = "\x{00e4}\x{00f6}\x{00fc}\x{263a}.txt";
writefile($file);

PROPOSAL

In order that the first snippet would work in the expected way on Windows, it is proposed that Perl explicitly know when a string is intended as text. This might be implemented by an SV flag SvTEXT. This flag would be set on the return value of Encode::decode (and when read from filehandles with an :encoding(…)), and would be off on returns from Encode::encode. There would also be another mechanism for explicitly setting this flag to true.

Concatenating non-text with non-text, and text with text, would obviously not affect the "text-ness" of the result.

The purpose of this would be that when filesystem-orientated functions, among others, received a string known to be "text", they could do the right thing.

It is intended that Encode::Locale be used to find and make easily-available the correct encodings for filesystems.
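
As a rough illustration of how Encode::Locale might be used for this, the sketch below encodes a filename with the "locale_fs" encoding alias that the module registers. Encode::Locale is a CPAN module rather than part of core Perl, so the sketch falls back to UTF-8 when it is not installed; that fallback is an assumption made only to keep the example runnable, not part of this proposal.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Use the filesystem encoding reported by Encode::Locale if it is available;
# otherwise assume UTF-8 (illustrative fallback only).
my $fs_encoding = eval { require Encode::Locale; 1 } ? 'locale_fs' : 'UTF-8';

my $name   = "\x{00e4}\x{00f6}\x{00fc}.txt";   # a filename, as text
my $octets = encode($fs_encoding, $name);       # the octets for the filesystem

printf "%s -> %s\n", $fs_encoding, join ' ', unpack '(H2)*', $octets;
```

Under the proposal, a filesystem function that received a string flagged as text could perform this encoding step itself.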

ISSUES

The behaviour of concatenating text and non-text needs to be defined.

Option 1: parsimonious

Non-text-ness would be like the behaviour of tainting, and would "win": the result would be non-text.
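
A minimal sketch of this rule, modelling a flagged string as a hash with a "value" and a "text" entry. The hash layout and the function name concat_parsimonious are hypothetical, chosen only for illustration; a real implementation would live in the SV flags.

```perl
use strict;
use warnings;

# Hypothetical model: { value => string, text => 0 or 1 }.
# The result is text only if both operands are text.
sub concat_parsimonious {
    my ($left, $right) = @_;
    return {
        value => $left->{value} . $right->{value},
        text  => ($left->{text} && $right->{text}) ? 1 : 0,
    };
}

my $text_name  = { value => "report", text => 1 };
my $raw_suffix = { value => ".txt",   text => 0 };

my $mixed = concat_parsimonious($text_name, $raw_suffix);
my $pure  = concat_parsimonious($text_name, $text_name);

print "mixed: '$mixed->{value}' text=$mixed->{text}\n";
print "pure:  '$pure->{value}' text=$pure->{text}\n";
```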

Option 2: moar detail

Here, there would also be a "binary" flag, which would be set on returns from Encode::encode, pack, and on data read from filehandles with an explicit :raw layer. If neither "text" nor "binary" flags were set, the string's "text-ness" would be considered "unknown". In this option, the outcomes of concatenation would be (b = binary, t = text, u = unknown):

X | b | t | u
--+---+---+--
b | b | ? | b
t | ? | t | t
u | b | t | u

While it may seem desirable for the "?" results above to use Encode::decode to "upgrade" the known-binary string to text, this is impossible without knowing the correct encoding. Once a cautious version has been implemented, a later version that allows a default encoding (probably UTF-8) to be specified might follow.
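
The table above can be transcribed directly as a lookup, which makes the rules easy to check. In this sketch the names %CONCAT and concat_kind are hypothetical, and the "?" outcome is simply returned as-is, since what a real implementation should do in that case is still undecided.

```perl
use strict;
use warnings;

# b = binary, t = text, u = unknown, ? = not yet defined.
# Outer key: left operand; inner key: right operand.
my %CONCAT = (
    b => { b => 'b', t => '?', u => 'b' },
    t => { b => '?', t => 't', u => 't' },
    u => { b => 'b', t => 't', u => 'u' },
);

sub concat_kind {
    my ($left, $right) = @_;
    die "unknown kind '$left'\n"  unless exists $CONCAT{$left};
    die "unknown kind '$right'\n" unless exists $CONCAT{$left}{$right};
    return $CONCAT{$left}{$right};
}

# Print every combination, one row per left operand.
for my $left (qw(b t u)) {
    print join(' ', map { "$left.$_=" . concat_kind($left, $_) } qw(b t u)), "\n";
}
```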

mohawk2/perltext.pod