My documentation of P6's .trans routine for anyone who reads it.
It may be a step toward updating the official doc and/or cleaning up the relevant spec tests and/or functionality.
I have tried to be clear enough that a future me will be able to make sense of it, and hopefully anyone else who reads this.
It's supposedly exhaustive (definitely exhausting!).
Hmm Bugs and known failures appear in Hmm notes at the ends of some sections.
On Sept 4th, 2018, after considering .trans as an answer to an SO question "Simultaneous substitutions with s///?", I tried to ensure I understood .trans.
I began by reading the P6 doc page. I found it very hard to understand. Next, I tried experimenting with the function as documented in My early .trans experiment near the end of this gist. I looked at source tests (see Compiler source code / tests at the end of this gist). I went down a series of rabbit holes. When I came up for air I thought I'd document what I'd found.
All trans arguments must be pairs:
- Positional argument pairs are matchers/replacers expressions, explained in the next section. You may pass as many matchers/replacers expression pairs as you like. Passing none means .trans doesn't do anything.
- The "adverbs" :complement, :delete, and :squash may be passed as named arguments. See the section Positional pairs vs named arguments below to make sure you don't accidentally pass them as positional pairs. (I can't see how this is really an issue right now. I obviously thought so when I first wrote this. I recall falling into this trap. How did I? Why don't I now? Is this really a valid concern?)
A matchers expression (the key or LHS of a positional pair passed to .trans) generates one or more matchers. A replacers expression (the value or RHS of a pair), coupled with automatic replacer extension (see Replacers extension below), generates one or more replacers that correspond to the matchers generated by the LHS of the pair.
This gist assumes that, at least semantically, all the expressions in all the pairs are converted into a single overall list of matchers and a single corresponding list of replacers, one for each matcher, because that's what .trans seems to do.
A matchers expression or replacers expression (LHS or RHS of a positional pair argument passed to .trans) that is just a single string of one or more characters generates one or more matchers or replacers, one per character the expression specifies.
Most characters are literal but there are notable powers and quirks.
For example, consider the pair 'abc' => 'de' which has a single string on both sides. The 'abc' turns into three matchers, one for 'a', one for 'b', and one for 'c'. The 'de' turns into three corresponding replacers. There's a 'd' replacer corresponding to the 'a' matcher, an 'e' replacer for replacing a 'b', and a 'd' replacer for replacing a 'c'. (See Replacers extension for where the final 'd' replacer comes from. In other scenarios the third replacer would be 'e' or a null string instead.)
A single string matchers/replacers expression supports character ranges that expand to their range. For example, 'a..dm..q' expands to 'abcdmnopq' which in turn becomes 9 matchers or 9 replacers depending on which side of the => it's specified.
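A minimal sketch of range expansion inside single-string expressions, assuming a reasonably recent Rakudo. The input strings and variable names are made up for illustration; the commented results follow the expansion rule described above and are worth re-running on your own compiler version.

```raku
# 'A..CE..G' expands to the six matchers A B C E F G, and
# 'a..ce..g' to the six replacers a b c e f g. D and H are not
# covered by either range, so they pass through unchanged.
my $ranges = 'ABCDEFGH'.trans('A..CE..G' => 'a..ce..g');
say $ranges;    # abcDefgH

# Plain characters and ranges can be mixed in the same string.
my $mixed = 'ABCDEF'.trans('AB..E' => 'ab..e');
say $mixed;     # abcdeF
```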
Hmm There's crazier stuff too that's been added to the test files in the roast directory for transliteration. Other than this link I'm not going to mention or document them in this gist for now.
You can specify a regex as a matchers expression.
You can specify a closure as a replacers expression.
A single closure can be paired with a single regex to make sane use of $/. But if you also have non-regex matchers or multiple regexes things won't turn out well. Here be bugs or at least insane semantics.
You can pass an array or list of matchers/replacers as a matchers/replacers expression.
This is how you can specify a string as a single unit, as against a string that specifies each of its characters as its own unit as explained in the previous section: use a matchers or replacers expression that's a list or array or turns into one. For example:
- <foo bar> is a list specifying two strings, 'foo' and 'bar'. These do not turn into six single character matchers or replacers! Instead they are just two matchers that match three character sub-strings (or two replacers each of which replaces a match with a three character string).
- ['baz'] would match, or replace a match with, the single (sub-)string 'baz'.
- The string range 'aa'..'bb' turns into ('aa','ab','ba','bb'), i.e. four two-character string matchers/replacers.
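The list forms above can be sketched as follows. This is an illustration under the assumption of a reasonably recent Rakudo; the sample strings and variable names are invented for the example.

```raku
# A list on each side: every element is matched, or substituted,
# as a whole string rather than character by character.
my $words = 'foo bar baz'.trans(<foo bar> => <X Y>);
say $words;      # X Y baz

# Single-character matchers on the left, multi-character
# replacers on the right.
my $escaped = ' <>&'.trans(
    [' ', '<', '>', '&'] => ['&nbsp;', '&lt;', '&gt;', '&amp;']
);
say $escaped;    # &nbsp;&lt;&gt;&amp;
```

The second call makes a single pass over the input, so the '&' characters introduced by the replacers are not themselves replaced again.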
A matchers/replacers expression list/array can include any mix of elements each of which is a:
- string (which then always means itself as a unit, not its constituent characters);
- string range;
- nested list/array (which flattens, recursively, into the outer list/array);
- regex (in a matchers expression) or closure (in a replacers expression).
If a list is passed as a matchers or replacers expression, ranges are expanded (and the results coerced to strings if they're numeric), regexes (on LHS of a pair) and closures (on RHS of a pair) are left as is, and if there's anything else it is either coerced into a string or an error is generated.
In Perl 6 both <foo> and ('foo') are the single string 'foo', not a one element list containing a string. So if it's the only element you pass as a matchers or replacers expression it gets turned into individual characters rather than meaning itself as a string.
The idiomatic way to specify a single string as an indivisible unit is ['foo']. (This is an Array value. Arrays are a type of List. Array literals ([...] where a value is expected) are always arrays unlike (...) which is only a list if it contains one or more commas or semi-colons just as <...> is only a list if it contains multiple "words" separated by one or more spaces.)
You can use a single string on the LHS of a transliteration pair to specify a list of individual character matchers to be replaced by their corresponding entry in a list of strings/closures on the RHS.
Or you can use a list to specify a list of sub-string matches/regexes on the LHS to be replaced by the corresponding character according to a single string replacer specification on the RHS.
Hmm If one of the matchers on the left hand side is a null string or regex, and no other matchers match at a given position in the input string, then .trans goes into an infinite loop.
If the list of matchers resulting from the matcher specification of a transliteration pair passed to .trans is longer than the list of replacers initially generated by the replacer specification of the pair, then the list of replacers is extended to make up the difference.
If :delete has been specified then a null string is repeated to make the two lists the same length.
Otherwise, if a pair consists of just a single string specifier on both the left and right, then the list of replacers (each one an individual character) is extended by repeating the initial list.
Otherwise, the list is extended by repeating the last replacer in the list.
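A small sketch of replacer extension, assuming a reasonably recent Rakudo. It deliberately uses a one-character replacer, so the result does not depend on whether the list is extended by repeating the whole initial list or just its last element. The sample strings are illustrative only.

```raku
# One replacer for 26 matchers: 'x' is repeated to cover them all.
my $padded = 'ABC123DEF456GHI'.trans('A..Z' => 'x');
say $padded;     # xxx123xxx456xxx

# With :delete the missing replacers are null strings, so every
# 'o' and 'k' is removed from the input.
my $deleted = 'bookkeeper'.trans('ok' => '', :delete);
say $deleted;    # beeper
```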
.trans starts at the start of the input string. It attempts to match at that position. If multiple matchers match it picks a single winner. Depending on what matches and the setting of the :complement, :delete, and :squash adverbs it either keeps, replaces, or deletes the matched character or sub-string or, if nothing matched, then keeps/replaces/deletes the character in the input string at the current matching position. Then it moves the matching position forward to skip over the kept/replaced/deleted character(s).
At each iteration of matching, there are various decisions about what to do including deciding which matcher/replacer pair wins, if any.
If no matchers match at a character position, then the character at the current matching position is kept if both :complement and :delete are False.
If :complement is True then, provided that there is at least one match somewhere in the string during the overall transliteration processing of the entire string being transformed, the character is replaced by the first replacer in the first pair passed to .trans.
If :complement is False but :delete is True then the character is deleted. (Note that specifying :complement renders :delete irrelevant.)
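To make the effect of :complement concrete, here is a hedged sketch assuming a reasonably recent Rakudo. The input string and variable names are invented for the example. It only covers the simple case where the first replacer is a visible character.

```raku
my $input = 'ABC123DEF456GHI';

# Without :complement the letters match and are replaced.
my $plain = $input.trans('A..Z' => 'x');
say $plain;           # xxx123xxx456xxx

# With :complement the letters are kept and each unmatched
# character (here, each digit) gets the first replacer instead.
my $complemented = $input.trans('A..Z' => 'x', :complement);
say $complemented;    # ABCxxxDEFxxxGHI
```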
If several matchers tie for equal longest match, then one of them is chosen:
- If any matcher is a regex or any replacer is a closure (regardless of whether it matches/replaces any of the input string in this or any other iteration), then the leftmost matcher wins.
- Otherwise, the rightmost matcher wins.
Hmm This section just isn't right, or isn't the whole story. Investigation continues.
If one matcher matches more of the input string than any other, then it "wins" outright.
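The outright-winner case can be sketched like this, assuming a reasonably recent Rakudo. It only exercises a strictly longest match, not the tie-breaking rules, and the sample input is illustrative.

```raku
# At position 0 both '&nbsp;' and '&nbsp;&lt;' match. The longer
# one wins, so the first ten characters become 'AB' rather than
# ' ' followed by '<'.
my $decoded = '&nbsp;&lt;&gt;&amp;'.trans(
    ['&nbsp;', '&nbsp;&lt;', '&lt;', '&gt;', '&amp;'] => [' ', 'AB', '<', '>', '&']
);
say $decoded;    # AB>&
```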
If some matcher wins then its corresponding replacer is used to replace the matching character or sub-string except that if :squash has been specified, and the winning matcher won the previous iteration too, then the matched sub-string is removed and the corresponding replacer is ignored (except that if it's a closure the closure is still called even though its result is ignored).
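A short sketch of :squash, assuming a reasonably recent Rakudo. Both inputs only repeat the same character in a run, so the same matcher wins on consecutive iterations; the strings and variable names are illustrative.

```raku
# Each run of one repeated letter collapses to a single replacement.
my $squashed = 'bookkeeper'.trans('a..z' => 'a..z', :squash);
say $squashed;    # bokeper

# '111' and '222' are each a run won by the same matcher, so
# each run yields just one 'x'.
my $digits = 'ABC111DEF222GHI'.trans('0..9' => 'x', :squash);
say $digits;      # ABCxDEFxGHI
```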
The following is intended to remind me and readers of the distinction between these two forms of pairs:
.say for
# Some idiomatic ways to positionally pass pairs with a *single value* for the LHS of a pair:
# with string 'foo' for key:
'foo' => 'baz',
<foo> => 'baz',
# with a regex for key:
/foo/ => 'baz',
# Some idiomatic ways to positionally pass pairs with a *value list* for the LHS of a pair:
# Array with one element, the string 'foo':
['foo'] => 'baz',
# List with multiple elements, the strings 'foo' and 'bar':
<foo bar> => 'baz',
# List with multiple elements, the string 'foo' and a regex:
('foo', /bar/) => 'baz',
"\n",
# Some erroneous vs correct ways to pass pairs as named arguments (adverbs):
# Adverb `foo` is not known to `.trans`. This would get silently ignored.
foo => 'baz',
# `.trans` supports these adverbs:
delete => True,
:complement
:squash
displays:
foo => baz
foo => baz
/foo/ => baz
[foo] => baz
(foo bar) => baz
(foo /bar/) => baz
foo => baz
delete => True
complement => True
Note that the two initial foo => baz arguments work the same way in the .say for ... context as the one near the end because for is a keyword and it treats all pairs the same way. Only routines make the positional/named argument distinction that pairs have to straddle.
my $a = 'a b c d e f g h i j k l m n o p q r s t u v w x y z';
my $b = 'abcdefghijklmnopqrstuvwxyz';
say .trans: abc => '1Aa',
<de> => '2Dd',
<fg h> => '3Ff',
<ij k> => <4Ii>,
<l n> => <5Ll 6Nn>,
/o/ => '7Oo',
/p/ => <8Pp 9Qq>,
(/r/, /s/) => <0Rr 1Ss>,
'tuv' => '2Tt',
'wxy' => <3Ww 4Xx>
for $a, $b;
# a b c 2 D f g F i j I 5Ll m 6Nn 7Oo 8Pp 9Qq q 0Rr 1Ss 2 T t 3Ww 4Xx 4Xx z
# abc2D3F4I5Llm6Nn7Oo8Pp 9Qqq0Rr1Ss2Tt3Ww4Xx4Xxz
On my journey I checked out the method trans implementation in the relevant Rakudo source code. It was too complex for me to figure out what was intended.
(As a result of this SO I also checked out the tr "nibbler" in its relevant Rakudo source code. It was also too complex for me to figure out what was intended.)
I also checked the test files in the roast directory for transliteration. (This showed me how much appeared to technically be part of 6.c but undocumented, especially the powers and quirks when specifying a single string as a matchers or replacers argument.)
Hi @raiph, this is in regards to NB#2, not being able to delete unmatched letters (i.e. anything out of the alphabetic ASCII range). I can take @b2gills solution and adapt it to use trans, but for now I've given up on trying to get :delete to work: https://stackoverflow.com/a/63803344/7270649