@masak
Created March 21, 2009 13:10
$ cat things-I-know-about-Buf
A C<Buf> is a stringish view of an array of
integers, and has no Unicode or character properties without explicit
conversion to some kind of C<Str>. (A C<buf> is the native counterpart.)
Typically it's an array of bytes serving as a buffer. Bitwise
operations on a C<Buf> treat the entire buffer as a single large
integer. Bitwise operations on a C<Str> generally fail unless the
C<Str> in question can provide an abstract C<Buf> interface somehow.
Coercion to C<Buf> should generally invalidate the C<Str> interface.
As a generic type C<Buf> may be instantiated as (or bound to) any
of C<buf8>, C<buf16>, or C<buf32> (or to any type that provides the
appropriate C<Buf> interface), but when used to create a buffer C<Buf>
defaults to C<buf8>.
Unlike C<Str> types, C<Buf> types prefer to deal with integer string
positions, and map these directly to the underlying compact array
as indices. That is, these are not necessarily byte positions--an
integer position just counts over the number of underlying positions,
where one position means one cell of the underlying integer type.
Builtin string operations on C<Buf> types return integers and expect
integers when dealing with positions. As a limiting case, C<buf8> is
just an old-school byte string, and the positions are byte positions.
Note, though, that if you remap a section of C<buf32> memory to be
C<buf8>, you'll have to multiply all your positions by 4.
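The position arithmetic above can be sketched as follows. This is an illustration in Python, not Perl 6: the little-endian layout, the name `byte_pos`, and the sample values are assumptions made for the example, not part of the spec text.

```python
import struct

CELL_BYTES = 4  # one buf32 cell occupies four bytes

# Three 32-bit cells, laid out as raw memory.
# Assumption: little-endian layout, chosen only for the example.
cells = [0x11111111, 0x22222222, 0x33333333]
raw = struct.pack("<3I", *cells)

def byte_pos(cell_pos: int) -> int:
    """Map a buf32 cell position to the matching position in a buf8 view."""
    return cell_pos * CELL_BYTES

# Cell position 2 in the buf32 view is byte position 8 in the buf8 view.
start = byte_pos(2)
third = struct.unpack_from("<I", raw, start)[0]
```

Here `len(cells)` is 3 while `len(raw)` is 12: the same memory has three positions as 32-bit cells and twelve as bytes.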
Bitwise string operators (those starting with C<~>) may only be
applied to C<Buf> types or similar compact integer arrays, and treat
the entire chunk of memory as a single huge integer. They differ from
the C<+> operators in that the C<+> operators would try to convert
the string to a number first on the assumption that the string was an
ASCII representation of a number.
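A rough sketch of the difference, in Python rather than Perl 6. The helper `buf_and`, the big-endian byte order, and the equal-length restriction are assumptions of this sketch; the text above does not specify how cells are ordered within the large integer.

```python
def buf_and(a: bytes, b: bytes) -> bytes:
    """Bitwise AND of two buffers, each read as one large integer."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-length buffers")
    # Interpret each whole buffer as a single integer.
    # Assumption: big-endian order; the spec text does not say.
    x = int.from_bytes(a, "big")
    y = int.from_bytes(b, "big")
    # Apply the integer operator, then turn the result back into a buffer.
    return (x & y).to_bytes(len(a), "big")

# Buffer-style AND: operates on the raw cells.
result = buf_and(bytes([0xFF, 0x0F]), bytes([0x0F, 0xFF]))

# Numeric-style AND: the strings are first converted to numbers,
# on the assumption that they are ASCII representations of numbers.
numeric = int("12") & int("10")
```

The first call works on the memory itself; the second never looks at the bytes of "12" and "10" at all, only at the numbers they spell.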
    Actual type    Use entries for
    ===========    ===============
    Buf            Str or Array of Int
A C<Buf> type containing any bytes or integers outside the ASCII
range may silently promote to a C<Str> type for pattern matching if
and only if its relationship to Unicode is clearly declared or typed.
This type information might come from an input filehandle, or the
C<Buf> role may be a parametric type that allows you to instantiate
buffers with various known encodings. In the absence of such typing
information, you may still do pattern matching against the buffer, but
(apart from assuming the lowest 7 bits represent ASCII) any attempt
to treat the buffer as other than a sequence of integers is erroneous,
and warnings may be generously issued.
    $_      X    Type of Match Wanted    What to use on the right
    ======  ===  ====================    ========================
    Buf     Int  buffer contains int     .match(X)
C<Buf> types are based on fixed-width cells and can therefore
handle integer positions just fine, and treat them as array indices.
In particular, C<buf8> (also known as C<buf>) is just an old-school byte string.
Matches against C<Buf> types are restricted to ASCII semantics in
the absence of an I<explicit> modifier asking for the array's values
to be treated as some particular encoding such as UTF-32. (This is
also true for those compact arrays that are considered isomorphic to
C<Buf> types.) Positions within C<Buf> types are always integers,
counting one per unit cell of the underlying array. Be aware that
"from" and "to" positions are reported as being between elements.
If matching against a compact array C<@foo>, a final position of 42
indicates that C<@foo[42]> was the first element I<not> included.
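The "between elements" convention is the familiar half-open range. A minimal Python sketch, assuming a plain list of integers stands in for the compact array and a hypothetical helper `find_cells` stands in for the match:

```python
def find_cells(haystack, needle):
    """Return (from, to) for the first occurrence of needle, or None.

    Positions count one per cell and sit between elements, so `to`
    is the index of the first cell NOT included in the match.
    """
    n = len(needle)
    for start in range(len(haystack) - n + 1):
        if haystack[start:start + n] == needle:
            return start, start + n
    return None

foo = [10, 20, 30, 40, 50]
span = find_cells(foo, [20, 30])
```

With these values the match covers `foo[1]` and `foo[2]`, and the final position 3 names `foo[3]`, the first element left out.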
=item open
    multi open (Str $name,
        Bool :$rw = False,
        Bool :$bin = False,
        Str :$enc = "Unicode",
        Any :$nl = "\n",
        Bool :$chomp = True,
        ...
        --> IO
    ) is export
A convenience method/function that hides most of the OO complexity.
It will only open normal files. Text is the default. Note that
the "Unicode" encoding implies figuring out which actual UTF is
in use, either from a BOM or other heuristics. If heuristics are
inconclusive, UTF-8 will be assumed. (No 8-bit encoding will ever
be picked implicitly.) A file opened with C<:bin> may still be
processed line-by-line, but IO will be in terms of C<Buf> rather
than C<Str> types.
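One way the "Unicode" guess could work is BOM sniffing with a UTF-8 fallback. The sketch below is Python, not Perl 6, and `guess_utf` is a hypothetical helper; it covers only the BOM step and treats every other case as inconclusive, which is an assumption, since the text leaves the other heuristics unspecified.

```python
import codecs

# UTF-32 is tested before UTF-16 because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def guess_utf(head: bytes) -> str:
    """Guess which UTF a file uses from its first few bytes."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    # No BOM: inconclusive, so assume UTF-8.
    # An 8-bit encoding is never chosen implicitly.
    return "utf-8"
```

For example, `guess_utf(b"\xff\xfeh\x00")` reports UTF-16 LE, while a plain `b"hello"` falls through to UTF-8.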
=item slurp
    method slurp ($handle:
        Bool :$bin = False,
        Str :$enc = "Unicode",
        --> Str|Buf
    ) is export
    multi slurp (Str $filename,
        Bool :$bin = False,
        Str :$enc = "Unicode",
        --> Str|Buf
    )
Slurps the entire file into a C<Str> (or C<Buf> if C<:bin>) regardless of context.
(See also C<lines>.)
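The Str-or-Buf return can be sketched like this in Python, where `str` stands in for C<Str> and `bytes` for C<Buf>. The function below is a hypothetical analogue, not the Perl 6 routine: the parameter name `binary` and the fixed UTF-8 default (instead of the "Unicode" heuristic) are assumptions.

```python
import os
import tempfile
from pathlib import Path

def slurp(filename, binary=False, enc="utf-8"):
    """Read a whole file: decoded text by default, raw bytes if binary."""
    data = Path(filename).read_bytes()
    # Assumption: a fixed default encoding replaces the "Unicode" guess.
    return data if binary else data.decode(enc)

# Usage: write a small UTF-8 file, then read it back both ways.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    Path(path).write_bytes("héllo\n".encode("utf-8"))
    as_text = slurp(path)
    as_bytes = slurp(path, binary=True)
```

The text result has six characters and the binary result has seven cells, because "é" takes two bytes in UTF-8.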