Created March 21, 2009 13:10
$ cat things-I-know-about-Buf

A C<Buf> is a stringish view of an array of
integers, and has no Unicode or character properties without explicit
conversion to some kind of C<Str>. (A C<buf> is the native counterpart.)
Typically it's an array of bytes serving as a buffer. Bitwise
operations on a C<Buf> treat the entire buffer as a single large
integer. Bitwise operations on a C<Str> generally fail unless the
C<Str> in question can provide an abstract C<Buf> interface somehow.
Coercion to C<Buf> should generally invalidate the C<Str> interface.

As a generic type C<Buf> may be instantiated as (or bound to) any
of C<buf8>, C<buf16>, or C<buf32> (or to any type that provides the
appropriate C<Buf> interface), but when used to create a buffer C<Buf>
defaults to C<buf8>.

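As a rough illustration, here is how the three sized buffer types look in a present-day Raku. This is an assumption on my part: the draft text above predates current implementations, which may differ from it in detail. Each buffer holds three cells, and only the cell width changes.

```raku
# Sketch only; assumes a current Rakudo, not the 2009 draft semantics.
my $b8  = buf8.new(1, 2, 255);      # three 8-bit cells
my $b16 = buf16.new(1, 2, 65535);   # three 16-bit cells
my $b32 = buf32.new(1, 2, 70000);   # three 32-bit cells

# .elems counts cells, .bytes counts the storage those cells occupy.
for $b8, $b16, $b32 -> $b {
    say "{$b.elems} cells, {$b.bytes} bytes";
}
# 3 cells, 3 bytes
# 3 cells, 6 bytes
# 3 cells, 12 bytes
```
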
Unlike C<Str> types, C<Buf> types prefer to deal with integer string
positions, and map these directly to the underlying compact array
as indices. That is, these are not necessarily byte positions--an
integer position just counts over the number of underlying positions,
where one position means one cell of the underlying integer type.
Builtin string operations on C<Buf> types return integers and expect
integers when dealing with positions. As a limiting case, C<buf8> is
just an old-school byte string, and the positions are byte positions.
Note, though, that if you remap a section of C<buf32> memory to be
C<buf8>, you'll have to multiply all your positions by 4.

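The multiply-by-4 rule can be sketched as follows, assuming a current Rakudo (the C<read-uint32> method and the C<LittleEndian> constant belong to present-day Raku, not to the draft above). The same eight bytes are addressed once by 32-bit cell number and once by byte position.

```raku
# Sketch only; assumes a current Rakudo. Byte order is stated explicitly
# so the result does not depend on the machine.
my $bytes = buf8.new(
    0x01, 0x00, 0x00, 0x00,    # 32-bit cell 0, little-endian value 1
    0x02, 0x00, 0x00, 0x00,    # 32-bit cell 1, little-endian value 2
);

my $cell     = 1;              # position counted in 32-bit cells
my $byte-pos = $cell * 4;      # the same position counted in 8-bit cells

say $bytes.read-uint32($byte-pos, LittleEndian);   # 2
```
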
Bitwise string operators (those starting with C<~>) may only be
applied to C<Buf> types or similar compact integer arrays, and treat
the entire chunk of memory as a single huge integer. They differ from
the C<+> operators in that the C<+> operators would try to convert
the string to a number first on the assumption that the string was an
ASCII representation of a number.

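A minimal sketch of the difference, assuming a current Rakudo in which C<~&> accepts two buffers (that is my assumption about the implementation, not something the draft guarantees). For equal-length buffers, ANDing cell by cell gives the same bit pattern as ANDing the two buffers as single large integers.

```raku
# Sketch only; assumes a current Rakudo where ~& is defined for buffers.
my $a = buf8.new(0xF0, 0x0F);
my $b = buf8.new(0x3C, 0x3C);

# String-bitwise AND works on the raw cells of the two buffers.
say ($a ~& $b).list;

# Numeric-bitwise AND first converts each string to a number: 12 +& 10.
say "12" +& "10";    # 8
```
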
    Actual type     Use entries for
    ===========     ===============
    Buf             Str or Array of Int

A C<Buf> type containing any bytes or integers outside the ASCII
range may silently promote to a C<Str> type for pattern matching if
and only if its relationship to Unicode is clearly declared or typed.
This type information might come from an input filehandle, or the
C<Buf> role may be a parametric type that allows you to instantiate
buffers with various known encodings. In the absence of such typing
information, you may still do pattern matching against the buffer, but
(apart from assuming the lowest 7 bits represent ASCII) any attempt
to treat the buffer as other than a sequence of integers is erroneous,
and warnings may be generously issued.

    $_      X    Type of Match Wanted   What to use on the right
    ======  ===  ====================   ========================
    Buf     Int  buffer contains int    .match(X)

C<Buf> types are based on fixed-width cells and can therefore
handle integer positions just fine, and treat them as array indices.
In particular, C<buf8> (also known as C<buf>) is just an old-school byte string.

Matches against C<Buf> types are restricted to ASCII semantics in
the absence of an I<explicit> modifier asking for the array's values
to be treated as some particular encoding such as UTF-32. (This is
also true for those compact arrays that are considered isomorphic to
C<Buf> types.) Positions within C<Buf> types are always integers,
counting one per unit cell of the underlying array. Be aware that
"from" and "to" positions are reported as being between elements.
If matching against a compact array C<@foo>, a final position of 42
indicates that C<@foo[42]> was the first element I<not> included.

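The "between elements" convention can be seen in a small sketch. It assumes a current Rakudo, where a buffer is first decoded to a C<Str> before matching (the draft above describes matching the buffer directly, which is not what this code does). Because the text is pure ASCII, character positions and cell positions coincide.

```raku
# Sketch only; assumes a current Rakudo and ASCII-only content.
my $buf  = "hello world!".encode('ascii');   # 12 one-byte cells
my $text = $buf.decode('ascii');

if $text ~~ /world/ {
    say $/.from;            # 6   (match starts before cell 6)
    say $/.to;              # 11  (cell 11 is the first one not included)
    say $buf[$/.to].chr;    # !
    say $buf.subbuf($/.from, $/.to - $/.from).decode('ascii');   # world
}
```
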
=item open

    multi open (Str $name,
        Bool :$rw = False,
        Bool :$bin = False,
        Str  :$enc = "Unicode",
        Any  :$nl = "\n",
        Bool :$chomp = True,
        ...
        --> IO
    ) is export

A convenience method/function that hides most of the OO complexity.
It will only open normal files. Text is the default. Note that
the "Unicode" encoding implies figuring out which actual UTF is
in use, either from a BOM or other heuristics. If heuristics are
inconclusive, UTF-8 will be assumed. (No 8-bit encoding will ever
be picked implicitly.) A file opened with C<:bin> may still be
processed line-by-line, but IO will be in terms of C<Buf> rather
than C<Str> types.

=item slurp

    method slurp ($handle:
        Bool :$bin = False,
        Str  :$enc = "Unicode",
        --> Str|Buf
    ) is export
    multi slurp (Str $filename,
        Bool :$bin = False,
        Str  :$enc = "Unicode",
        --> Str|Buf
    )

Slurps the entire file into a C<Str> (or C<Buf> if C<:bin>) regardless of context.
(See also C<lines>.)

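For comparison, here is a sketch of text versus binary reading in a current Rakudo. The defaults in today's implementation may not match the draft signatures above, and the scratch file name is hypothetical, chosen only for this example.

```raku
# Sketch only; assumes a current Rakudo. The file name is made up.
my $path = $*TMPDIR.add("buf-demo-{$*PID}.txt");
spurt $path, "hi";

my $text  = slurp $path;          # text mode: decoded characters
my $bytes = slurp $path, :bin;    # binary mode: raw cells

say $text ~~ Str;      # True
say $bytes ~~ Blob;    # True
say $bytes.list;       # (104 105)

# A handle opened with :bin also hands back buffers rather than strings.
my $fh    = open $path, :bin;
my $chunk = $fh.read(2);          # read up to two bytes
$fh.close;
say $chunk.elems;      # 2

unlink $path;
```
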