@mohawk2
Last active August 29, 2015 14:05
Perl Text

NAME

perltext - Perl Text thoughts

DESCRIPTION

This document assumes you have read perlunitut, which in turn assumes you understand the distinction between characters and octets, and that a "string" is a sequence of characters (aka codepoints).

CONTEXT

Currently, Perl distinguishes between strings whose internal representation is UTF-8 (and which can therefore contain codepoints above 255) and strings whose internal representation is not (and which can therefore only contain codepoints 0-255). It does not distinguish between strings that are text and strings that are not.
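A minimal sketch of that distinction, using only core modules. utf8::is_utf8 reports the internal-representation flag; there is no comparable call that reports whether a string is text, which is the gap this document is about.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $text   = decode('UTF-8', "\xe2\x98\xba");  # one character, U+263A
my $octets = encode('UTF-8', $text);           # the same character as three octets
my $latin  = "\xe4";                           # one codepoint <= 255: text or binary?

# The internal-representation flag can be inspected...
for my $pair ([text => $text], [octets => $octets], [latin => $latin]) {
    my ($label, $string) = @$pair;
    printf "%-6s length %d, internally UTF-8: %s\n",
        $label, length($string), utf8::is_utf8($string) ? "yes" : "no";
}

# ...but it says nothing about text-ness: upgrading flips the flag
# while the string, as Perl sees it, stays the same.
my $upgraded = $latin;
utf8::upgrade($upgraded);
print "equal after upgrade: ", ($upgraded eq $latin ? "yes" : "no"), "\n";
```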

This is relevant because when it comes to opening files or directories, Perl currently just passes the octets of its internal representation to the relevant system call, without further interpretation. On Unix-like systems, this works because the OS is agnostic about the data given to it, and the "right thing" happens. On systems that are not agnostic in this way, like Windows, the wrong thing happens. Here the "right thing" means that users see correctly named files (and directories) as a result of Perl program activity.
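One way to observe this pass-through, sketched with core modules only and assuming a filesystem that accepts arbitrary octets in names (as typical Linux filesystems do). The two strings compare equal in Perl, but differ in internal representation, so they can end up naming two different files.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

my $native = "\xe4.txt";   # stored internally as one octet per character
my $wide   = $native;
utf8::upgrade($wide);      # same characters, now stored internally as UTF-8

print "equal as Perl strings: ", ($native eq $wide ? "yes" : "no"), "\n";

# open() hands over whatever octets the string holds internally.
for my $name ($native, $wide) {
    if (open my $fh, '>', "$dir/$name") {
        close $fh;
    }
    else {
        warn "open failed: $!\n";   # e.g. a filesystem that rejects non-UTF-8 names
    }
}

opendir my $dh, $dir or die "opendir: $!";
my @entries = grep { !/^\.\.?$/ } readdir $dh;
closedir $dh;
printf "directory entries created: %d\n", scalar @entries;
```

The count printed depends on the platform and filesystem, which is exactly the problem: the program cannot say which name it meant.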

DEMONSTRATION

This code produces a listing with the correct file on Linux (where locales normally use UTF-8), but die()s on Win32:

my $file = "\x{00e4}\x{00f6}\x{00fc}\x{263a}.txt";
open my $fh, '>', $file or die "open: $!";
close $fh;
# $^O is 'MSWin32' on native Windows builds of Perl
my @cmd = $^O eq 'MSWin32' ? qw(dir) : qw(ls -l);
system @cmd, $file;

Current Win32 workaround:

sub writefile {
  my $file = shift;
  use Encode qw/encode/;
  use Symbol qw(gensym);
  use Win32API::File qw(:ALL);
  # UTF-16LE is what the wide-character API expects; "\0" becomes its two-octet terminator
  my $binary = encode("UTF-16LE", "$file\0");
  my $F = Win32API::File::CreateFileW
   ($binary, GENERIC_WRITE, 0, [], OPEN_ALWAYS, 0, 0) # Create file via Win32API
    or die "Cannot create file: $^E"; # Test the handle: OPEN_ALWAYS can set $^E even on success
  my $fh = gensym();                  # Fresh glob that outlives this sub, unlike local *FILE
  OsFHandleOpen($fh, $F, "w") or die "Cannot open file: $^E";
  $fh;                                # Filehandle for the caller to write to
}
 
my $file = "\x{00e4}\x{00f6}\x{00fc}\x{263a}.txt";
writefile($file);

PROPOSAL

So that the first snippet works as expected on Windows, it is proposed that Perl explicitly know when a string is intended as text. This might be implemented by an SV flag, SvTEXT. The flag would be set on the return value of Encode::decode (and on data read from filehandles with an :encoding(…) layer), and would be off on the return value of Encode::encode. There would also be another mechanism for explicitly setting the flag to true.

Concatenating non-text with non-text, and text with text, would obviously not affect the "text-ness" of the result.

The purpose of this would be that when filesystem-orientated functions, among others, received a string known to be "text", they could do the right thing.

It is intended that Encode::Locale be used to find and make easily-available the correct encodings for filesystems.
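The flag itself would live on the SV and cannot be shown from pure Perl. As a rough emulation only, the sketch below uses a hypothetical wrapper class, TextString, to stand in for a string with SvTEXT set, and a hypothetical helper, filename_octets, to stand in for what a filesystem function could do with that knowledge. The encoding names passed in are assumptions for illustration; in the proposal they would come from Encode::Locale.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Scalar::Util qw(blessed);

# Hypothetical stand-in for a string whose SvTEXT flag is set.
package TextString {
    sub new   { my ($class, $chars) = @_; return bless { chars => $chars }, $class }
    sub chars { return $_[0]{chars} }
}

# Hypothetical: choose the octets to hand to the OS.
sub filename_octets {
    my ($name, $fs_encoding) = @_;
    # Known text: encode it the way this filesystem API expects.
    return encode($fs_encoding, $name->chars)
        if blessed($name) && $name->isa('TextString');
    # Not known to be text: pass through untouched, as Perl does today.
    return $name;
}

my $text = TextString->new(decode('UTF-8', "\xc3\xa4.txt"));

# %vX prints each octet in hex, joined by dots.
printf "wide-character API: %vX\n", filename_octets($text, 'UTF-16LE');
printf "UTF-8 filesystem:   %vX\n", filename_octets($text, 'UTF-8');
printf "plain string:       %vX\n", filename_octets("\xe4.txt", 'UTF-8');
```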

ISSUES

The behaviour of concatenating text and non-text needs to be defined.

Option 1: parsimonious

Non-text-ness would be like the behaviour of tainting, and would "win": the result would be non-text.
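A pure-Perl emulation of this rule, using a hypothetical class, Flagged, that pairs a string with an "is text" bit and overloads concatenation. It is a sketch of the semantics only, not of how the flag would be implemented in the core.

```perl
use strict;
use warnings;

# Hypothetical emulation: a string paired with an "is text" bit.
package Flagged {
    use Scalar::Util qw(blessed);
    use overload
        '.'  => \&concat,
        '""' => sub { $_[0]{value} };

    sub new {
        my ($class, $value, $is_text) = @_;
        return bless { value => $value, text => $is_text ? 1 : 0 }, $class;
    }
    sub is_text { return $_[0]{text} }

    sub concat {
        my ($self, $other, $swapped) = @_;
        # A plain Perl string carries no flag, so it counts as non-text.
        my $other_text = blessed($other) && $other->isa('Flagged') ? $other->is_text : 0;
        my $other_value = "$other";
        my $value = $swapped ? $other_value . $self->{value}
                             : $self->{value} . $other_value;
        # Non-text "wins", as taint does: the result is text only if both operands are.
        return Flagged->new($value, $self->{text} && $other_text);
    }
}

my $txt = Flagged->new("caf\x{e9}", 1);
my $bin = Flagged->new("\x00\x01", 0);

printf "text . text is text:     %d\n", ($txt . $txt)->is_text;
printf "text . non-text is text: %d\n", ($txt . $bin)->is_text;
printf "non-text . text is text: %d\n", ($bin . $txt)->is_text;
```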

Option 2: moar detail

Here, there would also be a "binary" flag, which would be set on returns from Encode::encode, pack, and on data read from filehandles with an explicit :raw layer. If neither "text" nor "binary" flags were set, the string's "text-ness" would be considered "unknown". In this option, the outcomes of concatenation would be (b = binary, t = text, u = unknown):

X | b | t | u
b | b | ? | b
t | ? | t | t
u | b | t | u

While it may seem desirable for the "?" results above to use Encode::decode to "upgrade" the known-binary string to text, this is impossible without knowing the correct encoding. Once a cautious version has been implemented, a later version that allows a default encoding to be specified (probably UTF-8) might follow.
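The table above can be written down as a lookup, which also makes its properties easy to check. The helper name state_after_concat is hypothetical, and the "?" entries are returned as-is because their behaviour is still undecided.

```perl
use strict;
use warnings;

# Rows are the left operand, columns the right; '?' marks the undecided cases.
my %CONCAT = (
    b => { b => 'b', t => '?', u => 'b' },
    t => { b => '?', t => 't', u => 't' },
    u => { b => 'b', t => 't', u => 'u' },
);

# Hypothetical helper: the state of $x . $y given the states of $x and $y.
sub state_after_concat {
    my ($x, $y) = @_;
    die "unknown state '$x' or '$y'\n" unless exists $CONCAT{$x} && exists $CONCAT{$x}{$y};
    return $CONCAT{$x}{$y};
}

# Print the table row by row, in the same order as above.
for my $x (qw(b t u)) {
    print join(' ', $x, map { state_after_concat($x, $_) } qw(b t u)), "\n";
}
```

Two properties fall out of the table: the result does not depend on operand order, and "unknown" never changes the state of the other operand.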

SEE ALSO

Stackoverflow answer

Perl5 Porting/todo.pod

@jhi
jhi commented Sep 8, 2014

By the "UTF-8 not nailing it down enough" I meant that filesystems [1] might normalize [2] the text. So what you put in is not necessarily what you get out, and if you concatenate strings you might get results of mixed normalization.

[1] https://en.wikipedia.org/wiki/HFS_Plus uses almost Unicode NFD
[2] https://en.wikipedia.org/wiki/Unicode_equivalence
