@raiph
Last active January 22, 2021 22:33
trans DWEM

What this is

My documentation of P6's .trans routine for anyone who reads it.

It may be a step toward updating the official doc and/or cleaning up the relevant spec tests and/or functionality.

I have tried to be clear enough that a future me will be able to make sense of it, and hopefully anyone else who reads this.

It's supposedly exhaustive (definitely exhausting!).

Hmm Bugs and known failures appear in "Hmm" notes at the ends of some sections.

Why I wrote it

On Sept 4th, 2018, after considering .trans as an answer to an SO question "Simultaneous substitutions with s///?", I tried to ensure I understood .trans.

I began by reading the P6 doc page. I found it very hard to understand. Next, I tried experimenting with the function, as documented in "My early .trans experiment" near the end of this gist. I looked at the source and tests (see "Compiler source code / tests" at the end of this gist). I went down a series of rabbit holes. When I came up for air I thought I'd document what I'd found.

Arguments

All trans arguments must be pairs:

  • Positional argument pairs are matchers/replacers expressions, explained in the next section. You may pass as many matchers/replacers expression pairs as you like. Passing none means .trans doesn't do anything.

  • The "adverbs" :complement, :delete, and :squash may be passed as named arguments. See the section Positional pairs vs named arguments below to make sure you don't accidentally pass them as positional pairs. (I can't see how this is really an issue right now. I obviously thought so when I first wrote this. I recall falling into this trap. How did I? Why don't I now? Is this really a valid concern?)

Positional pairs: matchers expression => replacers expression

A matchers expression (the key or LHS of a positional pair passed to .trans) generates one or more matchers. A replacers expression (the value or RHS of a pair), coupled with automatic replacer extension (see Replacers extension below), generates one or more replacers that correspond to the matchers generated by the LHS of the pair.

This gist assumes that, at least semantically, all the expressions in all the pairs are converted into a single overall list of matchers and a single corresponding list of replacers, one for each matcher, because that's what .trans seems to do.
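As a minimal sketch of passing several positional pairs in one call, the following decodes ROT13 text with two single-string pairs. The input text and the variable name `$decoded` are illustrative choices, not taken from the experiments in this gist.

```raku
# Two positional pairs: one for lowercase letters, one for uppercase.
# Each side is a single string whose ranges expand to 26 characters.
my $decoded = 'Whfg nabgure Crey unpxre'.trans(
    'a..z' => 'n..za..m',
    'A..Z' => 'N..ZA..M',
);
say $decoded;   # Just another Perl hacker
```

Both pairs contribute matchers to the same pass over the string, which fits the "single overall list" assumption described above.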

Using a single string as a matchers expression or replacers expression

A matchers expression or replacers expression (LHS or RHS of a positional pair argument passed to .trans) that is just a single string of one or more characters generates one or more matchers or replacers, one per character the expression specifies.

Most characters are literal but there are notable powers and quirks.

For example, consider the pair 'abc' => 'de' which has a single string on both sides. The 'abc' turns into three matchers, one for 'a', one for 'b', and one for 'c'. The 'de' turns into three corresponding replacers. There's a 'd' replacer corresponding to the 'a' matcher, an 'e' replacer for replacing a 'b', and a 'd' replacer for replacing a 'c'. (See Replacers extension for where the final 'd' replacer comes from. In other scenarios the third replacer would be 'e' or a null string instead.)

A single string matchers/replacers expression supports character ranges that expand to their range. For example, 'a..dm..q' expands to 'abcdmnopq' which in turn becomes 9 matchers or 9 replacers depending on which side of the => it's specified.
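A short sketch of the two points above, using made-up input strings. If the description of 'abc' => 'de' holds, the first call should print ded; I have left that line without an output comment because it depends on the replacers extension rules described later.

```raku
# 'abc' yields three matchers; 'de' yields two replacers and is then
# extended to three (see "Replacers extension").
say 'abc'.trans('abc' => 'de');

# Ranges inside a single string expand before the string is split
# into individual characters.
my $upper = 'hello world'.trans('a..z' => 'A..Z');
say $upper;   # HELLO WORLD
```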

Hmm There's crazier stuff too that's been added to the test files in the roast directory for transliteration. Other than this link I'm not going to mention or document them in this gist for now.

Using regexes in a matchers expression and/or closures in a replacers expression

You can specify a regex as a matchers expression.

You can specify a closure as a replacers expression.

A single closure can be paired with a single regex to make sane use of $/. But if you also have non-regexes or multiple regexes things won't turn out well. Here be bugs or at least insane semantics.
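A minimal sketch of each form, kept to the simple cases (one regex, one closure) that the paragraph above calls sane. The inputs and the `$count` counter are illustrative. The closure example carries no output comment because I have not pinned down exactly how often the closure runs.

```raku
# A regex as the matchers expression: delete every vowel together
# with the word character that follows it.
my $stripped = 'abcdefghij'.trans(/<[aeiou]> \w/ => '');
say $stripped;   # cdgh

# A closure as the replacers expression: it is called to produce
# the replacement text.
my $count = 0;
say 'aXbXcXd'.trans('X' => { ++$count });
```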

Using an array (or list) as a matchers or replacers expression

You can pass an array or list of matchers/replacers as a matchers/replacers expression.

This is how you can specify a string as a single unit, as against a string that specifies each of its characters as its own unit as explained in the previous section: use a matchers or replacers expression that's a list or array or turns into one. For example:

  • <foo bar> is a list specifying two strings, 'foo' and 'bar'. These do not turn into six single character matchers or replacers! Instead they are just two matchers that match three character sub-strings (or two replacers each of which replaces a match with a three character string).

  • ['baz'] would match, or replace a match with, the single (sub-)string 'baz'.

  • The string range 'aa'..'bb' turns into ('aa','ab','ba','bb'), i.e. four two-character string matchers/replacers.

A matchers/replacers expression list/array can include any mix of elements, each of which is a:

  • string (which then always means itself as a unit, not its constituent characters);

  • string range;

  • nested list/array (which flattens, recursively, into the outer list/array);

  • regex (in a matchers expression) or closure (in a replacers expression).

If a list is passed as a matchers or replacers expression, ranges are expanded (and the results coerced to strings if they're numeric), regexes (on LHS of a pair) and closures (on RHS of a pair) are left as is, and if there's anything else it is either coerced into a string or an error is generated.
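The list forms above can be sketched as follows. The input strings and replacement words are invented for illustration. The last two lines carry no output comments: they are there so you can check the string-range expansion on your own Rakudo.

```raku
# Each list element is a whole-string matcher or replacer.
my $swapped = 'foo bar baz'.trans(<foo bar> => <X Y>);
say $swapped;   # X Y baz

# A one-element array keeps 'baz' as a single unit.
my $unit = 'foo bar baz'.trans(['baz'] => ['qux']);
say $unit;      # foo bar qux

# A string range used as a list of two-character matchers.
say ('aa'..'bb').list;
say 'aa ab ba bb'.trans('aa'..'bb' => <1st 2nd 3rd 4th>);
```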

<foo> and ('foo') are single strings, not lists

In Perl 6 both <foo> and ('foo') are the single string 'foo', not a one element list containing a string. So if it's the only element you pass as a matchers or replacers expression it gets turned into individual characters rather than meaning itself as a string.

The idiomatic way to specify a single string as an indivisible unit is ['foo']. (This is an Array value. Arrays are a type of List. Array literals ([...] where a value is expected) are always arrays unlike (...) which is only a list if it contains one or more commas or semi-colons just as <...> is only a list if it contains multiple "words" separated by one or more spaces.)
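A quick way to see the distinction is to ask each literal for its type. The two .trans calls are a sketch with an invented input: going by the text above, the first should treat f, o and o as separate matchers and the second should treat 'foo' as one.

```raku
say <foo>.^name;       # Str
say ('foo').^name;     # Str
say ['foo'].^name;     # Array
say <foo bar>.^name;   # List
say ('foo',).^name;    # List

say 'food'.trans(<foo>   => 'X');   # single string: per-character matchers
say 'food'.trans(['foo'] => 'X');   # array: one three-character matcher
```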

Mixing string and array/list specifiers

You can use a single string on the LHS of a transliteration pair to specify a list of individual character matchers to be replaced by their corresponding entry in a list of strings/closures on the RHS.

Or you can use a list to specify a list of sub-string matches/regexes on the LHS to be replaced by the corresponding character according to a single string replacer specification on the RHS.
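Both mixes can be sketched like this, with illustrative inputs and variable names:

```raku
# Single string on the left (three character matchers),
# list of strings on the right.
my $spelled = 'abc'.trans('abc' => <one two three>);
say $spelled;   # onetwothree

# List of sub-strings on the left, single string on the right
# (two character replacers, 'x' and 'y').
my $short = 'foobar'.trans(<foo bar> => 'xy');
say $short;     # xy
```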

Hmm If one of the matchers on the left hand side is a null string or regex, and no other matchers match at a given position in the input string, then .trans goes into an infinite loop.

Replacers extension

If the list of matchers resulting from the matcher specification of a transliteration pair passed to .trans is longer than the list of replacers initially generated by the replacer specification of the pair, then the list of replacers is extended to make up the difference.

If :delete has been specified then a null string is repeated to make the two lists the same length.

Otherwise, if a pair consists of just a single string specifier on both the left and right, then the list of replacers (each one an individual character) is extended by repeating the initial list.

Otherwise, the list is extended by repeating the last replacer in the list.
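The three extension rules can be tried side by side. This is a sketch with an invented input. If the rules above hold, the three lines should print xy, xyxy and xyyy respectively, but treat that as a prediction from the text rather than a guarantee.

```raku
# With :delete, the missing replacers become null strings.
my $deleted = 'abcd'.trans(:delete, 'abcd' => 'xy');
say $deleted;

# Single string on both sides, no adverbs: the replacer list repeats.
my $repeated = 'abcd'.trans('abcd' => 'xy');
say $repeated;

# Any other case: the last replacer repeats.
my $padded = 'abcd'.trans(<a b c d> => <x y>);
say $padded;
```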

The transliteration process

.trans starts at the start of the input string. It attempts to match at that position. If multiple matchers match it picks a single winner. Depending on what matches and the setting of the :complement, :delete, and :squash adverbs it either keeps, replaces, or deletes the matched character or sub-string or, if nothing matched, then keeps/replaces/deletes the character in the input string at the current matching position. Then it moves the matching position forward to skip over the kept/replaced/deleted character(s).

At each iteration of matching, there are various decisions about what to do including deciding which matcher/replacer pair wins, if any.

If no matchers match at a position, action depends on use (or not) of :complement and :delete

If no matchers match at a character position, then the character at the current matching position is kept if both :complement and :delete are False.

If :complement is True then, provided that there is at least one match somewhere in the string during the overall transliteration processing of the entire string being transformed, the character is replaced by the first replacer in the first pair passed to .trans.

If :complement is False but :delete is True then the character is deleted. (Note that specifying :complement renders :delete irrelevant.)
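A small sketch of the :complement case, with an illustrative input. Without the adverb the letters are replaced and the digits kept; with it, the unmatched digits are replaced by the first replacer and the letters are kept.

```raku
my $plain = 'a123b123c'.trans(['a'..'z'] => 'x');
say $plain;      # x123x123x

my $inverted = 'a123b123c'.trans(:complement, ['a'..'z'] => 'x');
say $inverted;   # axxxbxxxc
```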

If several matchers tie, choosing which wins depends on use (or not) of regexes/closures

If several matchers tie for equal longest match, then one of them is chosen:

  • If any matcher is a regex or any replacer is a closure (regardless of whether it matches/replaces any of the input string in this or any other iteration), then the leftmost matcher wins.

  • Otherwise, the rightmost matcher wins.

Hmm This section just isn't right, or isn't the whole story. Investigation continues.

If some matcher wins, action depends on use (or not) of :squash

If one matcher matches more of the input string than any other, then it "wins" outright.
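A minimal sketch of an outright win, using invented matchers. Both 'ab' and 'abc' match at the start of the input, and the longer one is used:

```raku
# 'abc' matches more of the input than 'ab', so its replacer 'Y' is used.
my $longest = 'abcd'.trans(<ab abc> => <X Y>);
say $longest;   # Yd
```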

If some matcher wins, its corresponding replacer is used to replace the matching character or sub-string. The exception is when :squash has been specified and the winning matcher also won the previous iteration: then the matched sub-string is removed and the corresponding replacer is ignored (though if it's a closure, the closure is still called even though its result is ignored).
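A sketch of :squash on an illustrative input. Runs of the same matched letter collapse to a single replacement, while the unmatched digits (including the repeated 1) are left alone:

```raku
my $squashed = 'aaa1123bb123c'.trans(:squash, ['a'..'z'] => ['A'..'Z']);
say $squashed;   # A1123B123C
```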

Positional pairs vs named arguments

The following is intended to remind me and readers of the distinction between these two forms of pairs:

.say for

# Some idiomatic ways to positionally pass pairs with a *single value* for the LHS of a pair:

  # with string 'foo' for key:
  'foo'          => 'baz',
  <foo>          => 'baz',

  # with a regex for key:
  /foo/          => 'baz',

# Some idiomatic ways to positionally pass pairs with a *value list* for the LHS of a pair:
 
  # Array with one element, the string 'foo':
  ['foo']        => 'baz',

  # List with multiple elements, the strings 'foo' and 'bar':
  <foo bar>      => 'baz',

  # List with multiple elements, the string 'foo' and a regex:
  ('foo', /bar/) => 'baz',

  "\n",     

# Some erroneous vs correct ways to pass pairs as named arguments (adverbs):

  # Adverb `foo` is not known to `.trans`. This would get silently ignored.
  foo            => 'baz',
 
  # `.trans` supports these adverbs:
  delete         => True,
  :complement,
  :squash

displays:

foo => baz
foo => baz
/foo/ => baz
[foo] => baz
(foo bar) => baz
(foo /bar/) => baz

foo => baz
delete => True
complement => True
squash => True

Note that the two initial foo => baz arguments work the same way in the .say for ... context as the one near the end because for is a keyword and it treats all pairs the same way. Only routines make the positional/named argument distinction that pairs have to straddle.
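When the same pairs are passed to .trans itself, the distinction matters. The following sketch uses an invented input. Going by the first output line of "My early .trans experiment", where the unquoted abc => '1Aa' pair had no effect, the middle call should leave the string unchanged.

```raku
# Quoted key: a positional pair, so it is a matchers/replacers expression.
my $positional = 'abc'.trans('abc' => 'xyz');
say $positional;      # xyz

# Unquoted identifier key: a named argument that .trans does not know.
say 'abc'.trans(abc => 'xyz');

# Extra parentheses turn the same pair back into a positional argument.
my $parenthesized = 'abc'.trans((abc => 'xyz'));
say $parenthesized;   # xyz
```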

My early .trans experiment

my $a = 'a b c d e f g h i j k l m n o p q r s t u v w x y z';
my $b = 'abcdefghijklmnopqrstuvwxyz';

say .trans: abc         => '1Aa',
            <de>        => '2Dd',
            <fg h>      => '3Ff',
            <ij k>      => <4Ii>,
            <l n>       => <5Ll 6Nn>,
            /o/         => '7Oo',
            /p/         => <8Pp 9Qq>,
            (/r/, /s/)  => <0Rr 1Ss>,
            'tuv'       => '2Tt',
            'wxy'       => <3Ww 4Xx>

for $a, $b;

# a b c 2 D f g F i j I 5Ll m 6Nn 7Oo 8Pp 9Qq q 0Rr 1Ss 2 T t 3Ww 4Xx 4Xx z
# abc2D3F4I5Llm6Nn7Oo8Pp 9Qqq0Rr1Ss2Tt3Ww4Xx4Xxz

Compiler source code / tests

On my journey I checked out the method trans implementation in the relevant Rakudo source code. It was too complex for me to figure out what was intended.

(As a result of this SO I also checked out the tr "nibbler" in its relevant Rakudo source code. It was also too complex for me to figure out what was intended.)

I also checked the test files in the roast directory for transliteration. (This showed me how much appeared to technically be part of 6.c but undocumented, especially the powers and quirks when specifying a single string as a matchers or replacers argument.)


raiph commented Dec 2, 2020

put "wallé" .comb .trans(:delete, ('a'..'z', 'é') .flat => ('a'..'z') .join) .ords
  • .comb always produces a list. In this case w, a, l, l, é.

  • .trans always coerces its input to a string. The default stringification of a list is its elements separated by a space. So it coerces w, a, l, l, é, to w a l l é. The sole pair argument in the trans call does not have a space. So the four spaces remain in the result. Which is w a l l .

  • .ords always produces a Seq (list). In this case, a list of 8 ordinals.

  • put coerces its list of arguments to strings. Again, the default stringification of a list is its elements separated by a space. So you get the result you see.

Dropping the .comb fixes all of that:

put "wallé" .trans(:delete, ('a'..'z', 'é') .flat => ('a'..'z') .join) .ords

put "wallé" .comb .trans(:delete, ('a'..'z', 'é') .flat => ('a'..'z')).join .ords;
  • The .join in ('a'..'z').join on the RHS of a trans pair argument makes no difference if it isn't immediately followed by some other transformation acting on its result. That's because trans treats lists of single characters exactly the same as a string of those characters. So dropping it has no effect.

  • Putting .join outside the trans also has no effect, for a different reason. The result of .trans is a single string. .join is for joining a list of values; it has no effect if its invocant is a single value.
