- fixed `contents` — mapped to a literal byte array (note: shouldn't it write `magic1()` instead? but then it would have to be checked again so that it doesn't violate consistency):

  ```java
  public void _read() {
      this.magic1 = this._io.readBytes(6);
      if (!(Arrays.equals(magic1(), new byte[] { 80, 65, 67, 75, 45, 49 }))) {
          throw new KaitaiStream.ValidationNotEqualError(new byte[] { 80, 65, 67, 75, 45, 49 }, magic1(), _io(), "/seq/0");
      }
      // ...
  }

  public void _write() {
      this._io.writeBytes(new byte[] { 80, 65, 67, 75, 45, 49 });
      // ...
  }
  ```

  The problem with special `contents` handling is that it does not apply to `valid/eq`, even though the idea of `valid` when it was introduced in 0.9 was that `contents` would behave the same as the corresponding `valid/eq`.

  If we wanted to be really consistent in `contents` and `valid` handling, and also allow `valid` types other than `eq` (i.e. `valid/{any-of,min,max,expr}`), the general way would be to rely on the user to always set the correct value via the attribute setter, and we'd generate the validation expression in `_check` that throws an exception if the value is not valid.

  In the future, the compiler would be able to set some fields with `valid/eq` automatically, but probably not in the general case, because note that `valid` supports arbitrary expressions, and until you analyze all features of the expression language (and also realize what they can actually do), you have no idea whether there are edge cases that may go wrong. Anything that involves the expression language looks simple if you picture it only on simple examples - it's very tempting to draw general conclusions from these and completely forget about what "interesting" cases the expression language can create, where the idea of simplicity completely falls apart.

  But we can be confident that we can always handle at least literal values (which would probably cover most common cases in real formats anyway) — detecting whether the entire expression AST consists of a single node with a literal value is trivial too. However, I wouldn't address this in the initial version of serialization either, because it's not essential.
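  For illustration, a `_check` generated for a `valid/eq` with a literal value could look something like this (a sketch only; the field name `version` and the use of `ConsistencyError` here are my assumptions, not what the PoC currently emits):

  ```java
  // hypothetical attribute: `- id: version, type: u1, valid: 3`
  public void _check() {
      // the user must have set the only valid value via the setter;
      // _check merely verifies it so that _write stays consistent with _read
      if (version() != 3)
          throw new ConsistencyError("version", version(), 3);
      // ...
  }
  ```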
- integer types — `type: [us]1` (e.g. `writeU1`), `type: [us][248]{,be,le}` (e.g. `writeS2le`), nothing fancy here

- bit-sized integers — `type: bX`
  - over 2 years ago, I wrote an initial version of the `writeBitsInt` method (`KaitaiStream.java:431`), but:
    - it supports only the big-endian direction (we have both `readBitsInt{Be,Le}` now, so there should be corresponding `writeBitsInt{Be,Le}` methods as well)
    - it relies on seeks, so it would work only in seekable streams (which is probably not desirable; `readBitsInt{Be,Le}` don't require a seekable stream, so `writeBitsInt{Be,Le}` should work in non-seekable ones too)
    - depending on how many bit-sized integers encroach on a particular byte, a single byte may need to be overwritten up to 8 times
    - I think it can be written much better overall, the existing version is probably slow and inefficient too
  - I think a better approach to writing bit-sized ints would be not to write partial bytes (which of course leads to having to rewrite the bytes multiple times, which also requires seeking); instead, write a byte only once, when it's finished and you know it won't change (using the `bits`/`bitsLeft` properties to save state between calls makes it more like the current `readBitsInt{Be,Le}` implementation; a rough sketch of such a method follows after this list)
    - this makes it a bit more difficult to use the `writeBitsInt{Be,Le}` methods from the generated code, because if the write of the last bit field member ends on an unaligned bit position (i.e. somewhere in the middle of a byte, not on a byte boundary), you suddenly have to explicitly ensure that this partial byte gets written too (otherwise it wouldn't be)
    - so the compiler will additionally have to insert `alignToByte` calls, usually at the end of the `_write` method — but again, we don't want to add them somewhere they don't belong, so the compiler will have to spend some time figuring out what might come next in the format (is it really something byte-aligned?) and where to insert the `alignToByte` call (inside the `if` body just before reading/writing the byte-aligned type? in the `for` loop? in the `switch` case block?), and if we need this analysis for functional writing anyway, we could finally fix the numerous issues with wrong `alignToByte` insertions when parsing too (https://gitter.im/kaitai_struct/Lobby?at=628eb63309eea00adeb191b0)
      - because isn't it a pity that we have such perfect `readBitsInt{Be,Le}` implementations (that's right, I'm confident enough to call them perfect — kaitai-io/kaitai_struct#949), and then the stupid compiler ruins everything just by stuffing `alignToByte` somewhere it doesn't belong? Let's make the compiler smarter about `alignToByte` insertions and Kaitai Struct will suddenly be actually usable for unaligned bit streams! (kaitai-io/kaitai_struct#576)
    - of course, this `alignToByte` for writing will need to do something quite different from what `alignToByte` currently does for reading — although clearing `bits` and `bitsLeft` probably makes sense for writing too, the main task is to actually write the partial byte, but we don't want to write anything in reading mode (so either the runtime library will have to know whether we're in writing mode or not, or we'll just keep `alignToByte` for reading only and add a separate `writeAlignToByte` for writing)
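    A minimal sketch of what such a buffered `writeBitsIntBe` plus `writeAlignToByte` pair could look like in the Java runtime (this is my assumption, not the actual implementation; it reuses the `bits`/`bitsLeft` fields to hold bits that were accepted but not yet written out, and skips the overflow handling a real version would need):

    ```java
    public void writeBitsIntBe(int n, long val) {
        // append the n new bits to the right of the ones already buffered
        // (assumes bitsLeft + n <= 64 for simplicity)
        long mask = (n == 64) ? -1L : ((1L << n) - 1);
        bits = (bits << n) | (val & mask);
        bitsLeft += n;

        // flush every finished byte exactly once, most significant bits first
        while (bitsLeft >= 8) {
            bitsLeft -= 8;
            writeBytes(new byte[] { (byte) (bits >>> bitsLeft) });
        }
        // keep only the remaining, not-yet-written low bits in the buffer
        bits &= (bitsLeft == 0) ? 0 : ((1L << bitsLeft) - 1);
    }

    public void writeAlignToByte() {
        if (bitsLeft > 0) {
            // pad the unfinished byte with zero bits on the right and flush it
            writeBytes(new byte[] { (byte) (bits << (8 - bitsLeft)) });
            bits = 0;
            bitsLeft = 0;
        }
    }
    ```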
  - the reason why I wrote `writeBitsInt` the way I did was an "idea" to make it ready even for use from positional `instances`
    - because back then, for some reason it seemed necessary to me that in order to be able to use bit-sized integers from `instances` (which can seek anywhere to already written bytes, and you don't want to accidentally clear any of the written bits if the currently written bit-sized integer starts and/or ends at an unaligned bit position "somewhere inside a byte"), I need to seek back by 1 byte to re-read the current byte, then seek to where the currently written bits int ends and read that byte too, and only then can the actual writing to a temporary byte array start (I know, it's indeed a really stupid implementation)
    - however, `instances` currently don't work for parsing `type: bX` in the first place (unless `X` is a multiple of 8, but that's usually not the case where you want to use a bit-sized integer type; kaitai-io/kaitai_struct#564), so this is pretty useless and I wouldn't worry about it for now
    - secondly, I've now realized that even with the new approach (the less moronic one supporting non-seekable streams, although we're currently discussing `instances`, which rely on seeking by definition), support for bit-sized integers in `instances` can be added too (so the only reason to use the old approach is void, things don't look good for it) — all you need to do is call `writeAlignToByte` after the write to ensure that the partial byte at the end gets written too.

      Once we want to add bit addressing (I had a feeling that there was an issue about it, but I can't find it right now) on top of the byte addressing that we support now with the `pos` key, to allow `instances` to seek into the middle of a byte, it'll be quite simple to achieve that by adding a special "bit-level seek" method to the runtime library which will change the state of the `bits`/`bitsLeft` properties as if a previous `writeBitsInt` call had just occurred in `_write` of the `seq` structure (it will have to "peek" a byte to know what the bits are, which means reading a byte and then seeking back, but this is a "bit seeking" method anyway, so it's quite understandable that it needs seeking support).
  - I also remember that it was confusing for me that `bits` and maybe `bitsLeft` (both currently used for reading) would have to be repurposed by changing their value logic a bit for writing
    - so if you would mix writes and reads within one byte, e.g. `writeBitsInt(3)` + `readBitsInt(5)`, the `readBitsInt` will misinterpret what `writeBitsInt` has just set to the `bits` and `bitsLeft` properties, thus it would just do some undefined behavior
    - in fact, the "true" stream byte position (`_io.pos`) will be kind of repurposed for writing bit integers too — when reading, it moves onto the next byte immediately after you read even just 1 bit of the current one (and the yet unprocessed leftover 7 bits are kept in `bits`, `bitsLeft` is set to `7`), but if we want to avoid seeks for writing, we would keep `_io.pos` in the current byte and only move it to the next byte when we know what all 8 bits of the current byte will look like (so that we can write its final value into the stream)
    - but I conclude that this "value incompatibility" is not a real problem, because we already have it between the `readBitsInt{Be,Le}` methods, I've written in the docs that BE and LE bit integers can't share the same byte, and once we implement the check detecting this situation (kaitai-io/kaitai_struct#155 (comment)), this will be totally fine.

      Besides that, unlike the `bXbe`-`bXle` situation which can occur in a real `.ksy` specification, consecutive mixed `{write,read}BitsInt*` calls should not happen (once everything is correctly implemented in the compiler). So this would basically be a concern not for compiler-generated code, but only for people using the runtime library standalone, and we should really not worry about that. The runtime library solely needs to be usable for the purposes of generated code, anything else is a side effect (but if someone finds the runtime library useful as a standalone library despite our lack of interest in that use case, well, good for them, of course).
- enum types
  - a basic singular field (Enum0)

    ```yaml
    - id: pet_1
      type: u4
      enum: animal
    ```

    ```java
    public void _read() {
        this.pet1 = Animal.byId(this._io.readU4le());
        // ...
    }
    public void _write() {
        this._io.writeU4le((long) (pet1().id()));
        // ...
    }
    ```

  - repeated fields don't work (kaitai-io/kaitai_struct_compiler#181 attempts to fix it, but I'll have to replace the wrong Ast.CodeLiteral with new Ast.InternalName in the 0.10 codebase), see test DebugEnumName:

    ```yaml
    - id: array_of_ints
      type: u1
      enum: test_enum2
      repeat: expr
      repeat-expr: 1
    ```
- floating-point types — `type: f[48]{be,le}` (FloatingPoints)

- byte arrays — per delimitation type:
  - fixed `size` — use `writeBytes` in `_write`, check that the length of the user-provided byte array equals the expected one in `_check` (SwitchBytearray)

    ```java
    public void _read() {
        this.code = this._io.readBytes(1);
        // ...
    }
    public void _write() {
        this._io.writeBytes(this.code);
        // ...
    }
    public void _check() {
        if (code().length != 1)
            throw new ConsistencyError("code", code().length, 1);
    }
    ```
  - `terminator` without `size` (TermBytes)
    - default options (`consume: true`, `include: false`)

      ```yaml
      - id: s1
        terminator: 0x7c
      ```

      ```java
      public void _read() {
          this.s1 = this._io.readBytesTerm((byte) 124, false, true, true);
          // ...
      }
      public void _write() {
          this._io.writeBytes(this.s1);
          this._io.writeU1(124);
          // ...
      }
      ```

      value consistency (`terminator` + `include: false`): `field().indexOf(term) == -1`

    - `consume: false` (+ `include: false` by default) — note: any sort of consistency check is missing here (there is no guarantee that the `terminator` byte occurs)

      ```yaml
      - id: s2
        terminator: 0x7c
        consume: false
      ```

      ```java
      public void _read() {
          // ...
          this.s2 = this._io.readBytesTerm((byte) 124, false, false, true);
          // ...
      }
      public void _write() {
          // ...
          this._io.writeBytes(this.s2);
          // ...
      }
      ```
      The first thing to clear up is that when parsing, `consume: false` already serves as an implicit opt-out from support for non-seekable streams (see the `readBytesTerm` implementations — you usually find something like `if (!consumeTerm) { seek(pos() - 1) }` inside anyway), so it's fine to afford seeking for anything that `consume: false` fields need to do.

      The fact that we don't know what field the `terminator` byte is in (it may not even be part of any field, if the attribute with `consume: false` is the last of all `seq` structures in the current substream — then the required terminator byte could never be written by any further `_write`) makes it impossible to do this as a simple pure "value" check — it needs to access the stream and read the particular byte where the `terminator` should be, so we can't do it in `_check` and it would have to be in `_write`.

      The problem is that you won't always be able to do the check in the same type where the `consume: false` field is; you need to delay it until the last possible `seq` (meaning either the current type's `seq` if it has its own substream, or the parent type's `seq`, or the grandparent type's `seq` and so on, until you reach a type that is no longer in this substream) has been fully written into the substream where it is (to be absolutely sure that the byte where the `terminator` is supposed to be has been written) and then seek to the byte where it should be, read it and if it's not equal to the `terminator` as it should be, throw a `ConsistencyError`.

      Another issue would be if the `consume: false` field is effectively the very last field in the substream that gets written (then it's of course absurd to "expect" that some further field after this last one in this substream will come to the rescue and magically write the `terminator` byte that needs to be there, so this of course does not happen and users get a `ConsistencyError` that they can do nothing about). But since seeking is allowed, this has a very simple and (at least in hindsight) obvious workaround — after writing the `consume: false` field, you also write the `terminator` byte and then seek 1 byte back to have the stream position at this `terminator` byte.

      This is the ultimate one-size-fits-all solution — if other fields follow the current one, the "temporary" `terminator` byte will be rewritten and the `_write` of the topmost type still in the current substream will check whether it still (after being overwritten) has the value that it must have. Otherwise, if the current `consume: false` field is the last one, the `terminator` byte as we wrote it will not be rewritten and will retain the correct value.
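      What the generated `_write` for `s2` could then do under this workaround is roughly the following (a hedged sketch, not the current PoC output):

      ```java
      public void _write() {
          // ...
          this._io.writeBytes(this.s2);
          // write a "temporary" terminator byte so that it is present even if no
          // further field ever overwrites this position...
          this._io.writeU1(124);
          // ...and seek 1 byte back so that the next field starts where the
          // terminator byte is (a later consistency check re-reads this byte)
          this._io.seek(this._io.pos() - 1);
          // ...
      }
      ```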
      value consistency (`terminator` + `include: false`): `field().indexOf(term) == -1`
    - `include: true` (+ `consume: true` by default)

      ```yaml
      - id: s3
        terminator: 0x40
        include: true
      ```

      ```java
      public void _read() {
          // ...
          this.s3 = this._io.readBytesTerm((byte) 64, true, true, true);
      }
      public void _write() {
          // ...
          this._io.writeBytes(this.s3);
      }
      public void _check() {
          if (s3()[(s3()).length - 1] != 64)
              throw new ConsistencyError("s3", s3()[(s3()).length - 1], 64);
      }
      ```

      A small oversight with this `_check` is that it implicitly assumes `s3().length != 0`, which indeed has to hold for the `s3` value to be consistent, but the question is whether we simply let the `ArrayIndexOutOfBounds` occur in case the user causes `s3().length == 0` (the `_check` is especially meant to check arbitrary values and report if they are invalid, so it can't make any assumptions about these arbitrary values beforehand). If the `_check` method may also throw other kinds of exceptions than `ConsistencyError`, it will be more difficult for the user to catch the exceptions related to consistency violations, because it will no longer be enough to just use `try { obj._check(); } catch (ConsistencyError exc) { ... }`.
      value consistency (`terminator` + `include: true`): `field().length != 0 && field().indexOf(term) == field().length - 1`
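      A sketch of a `_check` that would cover both conditions without relying on `ArrayIndexOutOfBounds` might look like this (hypothetical output; the error arguments are illustrative only):

      ```java
      public void _check() {
          // the value must not be empty, otherwise there is nowhere to put the terminator
          if (s3().length == 0)
              throw new ConsistencyError("s3", s3().length, 1);
          // the terminator byte must be the last byte...
          if (s3()[s3().length - 1] != 64)
              throw new ConsistencyError("s3", s3()[s3().length - 1], 64);
          // ...and must not occur anywhere before that
          for (int i = 0; i < s3().length - 1; i++) {
              if (s3()[i] == 64)
                  throw new ConsistencyError("s3", i, s3().length - 1);
          }
      }
      ```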
    - `consume: false` + `include: true` is a degenerate combination that doesn't make sense and should be forbidden (you suddenly get one byte that is included in two consecutive `seq` fields at the same time, violating the idea of `seq` that a field starts exactly where the previous one ends)
  - `size` with `terminator` and/or `pad-right` (BytesPadTerm) — use `writeBytesLimit` in `_write`, check that the byte array length does not exceed the maximum serializable size
    - the anatomy of the serialized bytes is the following (see also the sketch after these diagrams):
      - with `terminator` (`pad-right: 0x00` is assumed if not given explicitly); the diagram is drawn for the default setting `include: false` (implying that the `terminator` byte is outside the user-provided bytes) and `field().length < size`:

        ```
                 specified `size` of the field in .ksy
        |<---------------------------------------------------------->|
        |                     |              |                        |
        | user-provided bytes | `terminator` |      `pad-right`       |
        |_____________________|______________|________________________|
            field().length           1         `size` - field().length - 1
        ```
      - without `terminator` (only `pad-right`), or also `terminator` with `include: true`:

        ```
                 `size` of the field in .ksy
        |<------------------------------------------->|
        |                     |                        |
        | user-provided bytes |      `pad-right`       |
        |_____________________|________________________|
            field().length      `size` - field().length
        ```

      - regardless of the options, note that the presence of the `terminator` byte is optional here (when `size` is specified), meaning that if the user-provided bytes fill the entire `size`, it's OK — `field().length == size`:

        ```
                 `size` of the field in .ksy
        |<------------------------------------------->|
        |                                              |
        |             user-provided bytes              |
        |______________________________________________|
                         field().length
        ```
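      The kind of logic a `writeBytesLimit(bytes, size, term, pad)` helper could implement to produce this layout might be (a simplified sketch of mine; the actual signature and error handling in the runtime may differ):

      ```java
      public void writeBytesLimit(byte[] buf, long size, byte term, byte pad) {
          writeBytes(buf);
          long n = size - buf.length;
          if (n > 0) {
              // the terminator is written only if the user bytes don't fill `size`...
              writeBytes(new byte[] { term });
              // ...and `pad-right` fills whatever is left
              for (long i = 1; i < n; i++)
                  writeBytes(new byte[] { pad });
          }
      }
      ```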
    - only `pad-right` — `field().length` can be at most `size` (padding is optional)

      ```yaml
      - id: str_pad
        size: 20
        pad-right: 0x40
      ```

      ```java
      public void _read() {
          this.strPad = KaitaiStream.bytesStripRight(this._io.readBytes(20), (byte) 64);
          // ...
      }
      public void _write() {
          this._io.writeBytesLimit(this.strPad, 20, (byte) 64, (byte) 64);
          // ...
      }
      public void _check() {
          if (strPad().length > 20)
              throw new ConsistencyError("str_pad", strPad().length, 20);
          // ...
      }
      ```

      If we take "consistency" seriously, the last byte of the user-provided bytes (warning: if there is a last byte; this user-provided byte array can have length 0 and that's fine) must not have the value of `pad-right`, because then parsing wouldn't be able to distinguish it from the padding. So there should likely be a consistency check for this.

      value consistency (`size` + only `pad-right`): `field().length == 0 || field()[field().length - 1] != padByte`
    - `terminator` and `pad-right` — `field().length` can be at most `size` (both the terminator byte and padding are optional)

      ```yaml
      - id: str_term_and_pad
        size: 20
        terminator: 0x40
        pad-right: 0x2b
      ```

      ```java
      public void _read() {
          // ...
          this.strTermAndPad = KaitaiStream.bytesTerminate(
              KaitaiStream.bytesStripRight(this._io.readBytes(20), (byte) 43),
              (byte) 64, false
          );
          // ...
      }
      public void _write() {
          // ...
          this._io.writeBytesLimit(this.strTermAndPad, 20, (byte) 64, (byte) 43);
          // ...
      }
      public void _check() {
          // ...
          if (strTermAndPad().length > 20)
              throw new ConsistencyError("str_term_and_pad", strTermAndPad().length, 20);
          // ...
      }
      ```

      value consistency ([`size` +] `terminator` + `include: false`): `field().indexOf(term) == -1`
    - only `terminator` (+ `include: false` by default) — `field().length` can be at most `size` (note: the existing check `if (field().length >= 20) { throw ... }` in the generated code below is wrong, because it does not match this description — it has to be `field().length > 20` instead) and it must not contain the `terminator` byte at all (not yet checked in `_check`); implicitly `pad-right: 0x00`

      ```yaml
      - id: str_term
        size: 20
        terminator: 0x40
        # pad-right: 0x00 # implicit
      ```

      ```java
      public void _read() {
          // ...
          this.strTerm = KaitaiStream.bytesTerminate(this._io.readBytes(20), (byte) 64, false);
          // ...
      }
      public void _write() {
          // ...
          this._io.writeBytesLimit(this.strTerm, 20, (byte) 64, (byte) 0);
          // ...
      }
      public void _check() {
          // ...
          if (strTerm().length >= 20)
              throw new ConsistencyError("str_term", strTerm().length, 20);
          // ...
      }
      ```

      value consistency ([`size` +] `terminator` + `include: false`): `field().indexOf(term) == -1`
    - `terminator` with `include: true` — `field().length` can be at most `size`, `field()` must end immediately after the first occurrence of the `terminator` byte or may not contain the `terminator` at all (because as mentioned earlier, the presence of the `terminator` is always optional for `size`-delimited fields) — note that the current consistency check `if (field()[(field()).length - 1] != 64) { throw ... }` (see the generated code snippet below) doesn't match this characterization and is therefore incorrect

      ```yaml
      - id: str_term_include
        size: 20
        terminator: 0x40
        include: true
      ```

      ```java
      public void _read() {
          // ...
          this.strTermInclude = KaitaiStream.bytesTerminate(this._io.readBytes(20), (byte) 64, true);
      }
      public void _write() {
          // ...
          this._io.writeBytesLimit(this.strTermInclude, 20, (byte) 0, (byte) 0);
      }
      public void _check() {
          // ...
          if (strTermInclude().length > 20)
              throw new ConsistencyError("str_term_include", strTermInclude().length, 20);
          if (strTermInclude()[(strTermInclude()).length - 1] != 64)
              throw new ConsistencyError("str_term_include", strTermInclude()[(strTermInclude()).length - 1], 64);
      }
      ```

      value consistency ([`size` +] `terminator` + `include: true`): `field().indexOf(term) == -1 || field().indexOf(term) == field().length - 1` (note: `field().indexOf(term) == -1` already implicitly matches the 0-length value, so the `||` operator will short-circuit and never get to evaluating the second operand, meaning that the second expression doesn't even have to worry about failing on a 0-length value)
    - TODO: add tests for and analyze these with `size-eos: true` — update: this is important, because after a quick check, the `terminator` is not handled correctly here
- byte arrays with `process` — again, if a `terminator` is used, the encoded bytes from the `process` encoder must be scanned for a premature occurrence of the `terminator` byte

- strings — everything should be the same as for byte arrays, only with the added "string ⟶ bytes" encoding conversion
  - `terminator` without `size` (TermStrz) — looks correct (`git diff --no-index compiled/java/src/io/kaitai/struct/testwrite/Term{Bytes,Strz}.java`):

    ```diff
     public void _write() {
    -    this._io.writeBytes(this.s1);
    +    this._io.writeBytes((s1()).getBytes(Charset.forName("UTF-8")));
         this._io.writeU1(124);
    -    this._io.writeBytes(this.s2);
    -    this._io.writeBytes(this.s3);
    +    this._io.writeBytes((s2()).getBytes(Charset.forName("UTF-8")));
    +    this._io.writeBytes((s3()).getBytes(Charset.forName("UTF-8")));
     }

     public void _check() {
    -    if (s3()[(s3()).length - 1] != 64)
    -        throw new ConsistencyError("s3", s3()[(s3()).length - 1], 64);
    +    if ((s3()).getBytes(Charset.forName("UTF-8"))[((s3()).getBytes(Charset.forName("UTF-8"))).length - 1] != 64)
    +        throw new ConsistencyError("s3", (s3()).getBytes(Charset.forName("UTF-8"))[((s3()).getBytes(Charset.forName("UTF-8"))).length - 1], 64);
     }
    ```
  - `size` with `terminator` and/or `pad-right` (StrPadTerm) — FIXME: `_check` seems correct, but `_write` is wrong (it has to use the `writeBytesLimit` method too) — see `git diff --word-diff --no-index compiled/java/src/io/kaitai/struct/testwrite/{Bytes,Str}PadTerm.java`:

    ```diff
     public void _write() {
    -    this._io.writeBytesLimit(this.strPad, 20, (byte) 64, (byte) 64);
    -    this._io.writeBytesLimit(this.strTerm, 20, (byte) 64, (byte) 0);
    -    this._io.writeBytesLimit(this.strTermAndPad, 20, (byte) 64, (byte) 43);
    -    this._io.writeBytesLimit(this.strTermInclude, 20, (byte) 0, (byte) 0);
    +    this._io.writeBytes((strPad()).getBytes(Charset.forName("UTF-8")));
    +    this._io.writeBytes((strTerm()).getBytes(Charset.forName("UTF-8")));
    +    this._io.writeU1(64);
    +    this._io.writeBytes((strTermAndPad()).getBytes(Charset.forName("UTF-8")));
    +    this._io.writeU1(64);
    +    this._io.writeBytes((strTermInclude()).getBytes(Charset.forName("UTF-8")));
     }

     public void _check() {
    -    if (strPad().length > 20)
    -        throw new ConsistencyError("str_pad", strPad().length, 20);
    -    if (strTerm().length >= 20)
    -        throw new ConsistencyError("str_term", strTerm().length, 20);
    -    if (strTermAndPad().length > 20)
    -        throw new ConsistencyError("str_term_and_pad", strTermAndPad().length, 20);
    -    if (strTermInclude().length > 20)
    -        throw new ConsistencyError("str_term_include", strTermInclude().length, 20);
    -    if (strTermInclude()[(strTermInclude()).length - 1] != 64)
    -        throw new ConsistencyError("str_term_include", strTermInclude()[(strTermInclude()).length - 1], 64);
    +    if ((strPad()).getBytes(Charset.forName("UTF-8")).length > 20)
    +        throw new ConsistencyError("str_pad", (strPad()).getBytes(Charset.forName("UTF-8")).length, 20);
    +    if ((strTerm()).getBytes(Charset.forName("UTF-8")).length >= 20)
    +        throw new ConsistencyError("str_term", (strTerm()).getBytes(Charset.forName("UTF-8")).length, 20);
    +    if ((strTermAndPad()).getBytes(Charset.forName("UTF-8")).length > 20)
    +        throw new ConsistencyError("str_term_and_pad", (strTermAndPad()).getBytes(Charset.forName("UTF-8")).length, 20);
    +    if ((strTermInclude()).getBytes(Charset.forName("UTF-8")).length > 20)
    +        throw new ConsistencyError("str_term_include", (strTermInclude()).getBytes(Charset.forName("UTF-8")).length, 20);
    +    if ((strTermInclude()).getBytes(Charset.forName("UTF-8"))[((strTermInclude()).getBytes(Charset.forName("UTF-8"))).length - 1] != 64)
    +        throw new ConsistencyError("str_term_include", (strTermInclude()).getBytes(Charset.forName("UTF-8"))[((strTermInclude()).getBytes(Charset.forName("UTF-8"))).length - 1], 64);
     }
    ```
- subtypes, no substreams (IfStruct)

  ```java
  public void _read() {
      this.op1 = new Operation(this._io, this, _root);
      this.op1._read();
      // ...
  }
  public void _write() {
      this.op1._write(this._io);
      // ...
  }
  // ...
  public static class Operation extends KaitaiStruct.ReadWrite {
      // ...
      public void _write() { /* ... */ }
  }
  ```

  ```java
  public class KaitaiStruct {
      // ...
      public abstract static class ReadWrite extends ReadOnly {
          // ...
          public void _write(KaitaiStream io) {
              this._io = io;
              _write();
          }
          // ...
      }
  }
  ```
- subtypes with own substreams (BufferedStruct)
  - `size` key — as you can see, the substream length must be set manually for now (which is obviously not ideal, but it works for now):

    ```yaml
    - id: len1
      type: u4
    - id: block1
      type: block
      size: len1
    ```

    ```java
    public void _read() {
        this.len1 = this._io.readU4le();
        this._raw_block1 = this._io.readBytes(len1());
        KaitaiStream _io__raw_block1 = new ByteBufferKaitaiStream(_raw_block1);
        this.block1 = new Block(_io__raw_block1, this, _root);
        this.block1._read();
        // ...
    }
    public void _write() {
        this._io.writeU4le(this.len1);
        KaitaiStream _io__raw_block1 = new ByteBufferKaitaiStream(len1());
        this.block1._write(_io__raw_block1);
        this._io.writeStream(_io__raw_block1);
        // ...
    }
    ```

    ```java
    public class ByteBufferKaitaiStream extends KaitaiStream {
        // ...
        @Override
        public void writeStream(KaitaiStream other) {
            bb.put(other.toByteArray());
        }
        // ...
    }
    ```

    ```java
    public abstract class KaitaiStream implements Closeable {
        // ...
        abstract public void writeStream(KaitaiStream other);
        // ...
        public byte[] toByteArray() {
            long pos = pos();
            seek(0);
            byte[] r = readBytesFull();
            seek(pos);
            return r;
        }
        // ...
    }
    ```
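    From the application code side, this means computing and setting the substream length yourself before serializing, roughly like this (a hypothetical usage sketch of the generated API above; the concrete length value is up to the user to know):

    ```java
    BufferedStruct r = new BufferedStruct();
    BufferedStruct.Block block1 = new BufferedStruct.Block();
    block1.number1(0x42);
    block1.number2(0x43);
    r.block1(block1);
    // the substream length is not derived automatically yet, so the user has to
    // know how many bytes the serialized Block will occupy (say 0x10 here)
    r.len1(0x10);
    // ... (remaining fields set the same way)
    r._write(io);
    ```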
  - `terminator` — actually, it seems that we're completely missing tests on wrapping a type in a substream created using `terminator`.

    The thing I would like to test here is the consistency check scanning the substream bytes (after the subtype has been serialized there) for a forbidden occurrence of the `terminator` byte (because if it were silently allowed, the substream would be prematurely terminated there when parsing). If `pad-right` is used, it has its own consistency check as well, etc. There should be tests for that.
- repetitions — `_write` is easy (once you can write one entry, you just repeat what you've just done), but in `_check` each type of repetition has its own ways in which consistency can be violated
  - `repeat: expr` — the easy one, check the number of elements in the user-provided array for equality with `repeat-expr` (RepeatNStruct)

    ```yaml
    - id: qty
      type: u4
    - id: chunks
      type: chunk
      repeat: expr
      repeat-expr: qty
    ```

    ```java
    public void _write() {
        this._io.writeU4le(this.qty);
        for (int i = 0; i < this.chunks.size(); i++) {
            chunks().get((int) i)._write(this._io);
        }
    }
    public void _check() {
        if (chunks().size() != qty())
            throw new ConsistencyError("chunks", chunks().size(), qty());
        for (int i = 0; i < this.chunks.size(); i++) {
        }
    }
    ```
  - `repeat: until` — in principle exactly the same problems as with `terminator`, so in short: an element matching the `repeat-until` condition must be at the end and nowhere else.

    My pull request kaitai-io/kaitai_struct_compiler#183 adds the check that the last element must match `repeat-until`, but misses the checks that all non-last elements must not match the `repeat-until` condition (of course, consistency would also be violated if any non-last element matched, since parsing would end after that element instead of the actual last one), as @UnePierre correctly pointed out in kaitai-io/kaitai_struct_compiler#183 (comment).

    Obviously, `terminator` has quite the same logic as `repeat: until`, so they should both have the same set of consistency checks.

    Unfortunately, as I mentioned in kaitai-io/kaitai_struct_compiler#183 (comment), it's sometimes tempting to use `_io` in the `repeat-until` expression in real formats, but it's not possible to reference `_io` in the `_check` method (this actually applies to all expressions, for example).
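    For a simple case, the full set of checks could look something like this (a sketch for a hypothetical `- id: numbers, type: u1, repeat: until, repeat-until: _ == 0` field; error arguments are illustrative):

    ```java
    public void _check() {
        // there must be at least one element, because parsing always reads the
        // element that finally satisfies the repeat-until condition
        if (numbers().size() == 0)
            throw new ConsistencyError("numbers", numbers().size(), 1);
        for (int i = 0; i < numbers().size(); i++) {
            boolean matches = (numbers().get(i) == 0);     // the repeat-until condition
            boolean isLast = (i == numbers().size() - 1);
            // only the last element may satisfy the condition, and it must do so
            if (matches != isLast)
                throw new ConsistencyError("numbers", i, numbers().size() - 1);
        }
    }
    ```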
repeat: eos
— there is currently no end-of-stream check to ensure consistency:- id: numbers type: u4 repeat: eos
public void _write() { for (int i = 0; i < this.numbers.size(); i++) { this._io.writeU4le(numbers().get((int) i)); } } public void _check() { for (int i = 0; i < this.numbers.size(); i++) { } }
Since we will only deal with fixed-size streams in the initial serialization support anyway, we actually can check and require
_io.eof
to betrue
after writing the field withrepeat: eos
in_write
(it's not unlike moving checks involving any_io
from_check
to_write
— you can read about this later in these notes). If we're not at the end of the stream, throw aConsistencyError
— this should be enough ensure consistency with parsing.
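    In generated code, that could amount to something like this (a sketch, assuming the `isEof()` accessor that parsing already uses):

    ```java
    public void _write() {
        for (int i = 0; i < this.numbers.size(); i++) {
            this._io.writeU4le(numbers().get((int) i));
        }
        // repeat: eos consistency — the written elements must fill the (fixed-size)
        // stream exactly, otherwise parsing would read a different number of elements
        if (!this._io.isEof())
            throw new ConsistencyError("numbers", this._io.pos(), this._io.size());
    }
    ```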
- type switching — in principle relatively simple (SwitchManualInt)

  ```java
  public void _read() {
      this.code = this._io.readU1();
      switch (code()) {
      case 73: {
          this.body = new Intval(this._io, this, _root);
          ((Intval) (this.body))._read();
          break;
      }
      case 83: {
          this.body = new Strval(this._io, this, _root);
          ((Strval) (this.body))._read();
          break;
      }
      }
  }
  public void _write() {
      this._io.writeU1(this.code);
      switch (code()) {
      case 73: {
          ((Intval) (this.body))._write(this._io);
          break;
      }
      case 83: {
          ((Strval) (this.body))._write(this._io);
          break;
      }
      }
  }
  // ...
  private KaitaiStruct.ReadWrite body;
  ```

  - in practice there are a few cases with missing type casts (resulting namely in errors like "error: incompatible types: `Object` cannot be converted to `byte[]`" and "error: incompatible types: `Long` cannot be converted to `int`"), but these should be straightforward to fix
perhaps in
_check
there could be a check asserting that the actual type of the value stored in the combined property matches the expected type according to the value we're switching on. In Java, though, if there was an incorrect type, the type cast chosen based on the control value would typically fail in_write
anyway with aClassCastException
, so it's a question whether there is any benefit in doing the "same" work in_check
.However, in dynamically-typed languages (JavaScript, Python, etc.), there will be no type conversion in
_write
that could fail — the
-
- `instances` — quite problematic due to lazy evaluation and caching
  - value instances: they don't need setters (actually they should not have them; the expected procedure is to change the values they depend on), but there must be a way to regenerate them after the values of properties they depend on change
    - in fact, some method to "invalidate" or unset the cached value of the instance would be enough, which in practice just sets the cache property to `null`, and thus the next call of the instance getter will recalculate it
    - the naming suggestions that I can think of are `_invalidate{Inst}`, `_reset{Inst}` or `_unset{Inst}` — but "unset" is my least favorite, it sounds misleading and I'd rather avoid it (it just feels like "the opposite of a setter" or "like a setter, but it just clears the property instead of setting it to a new value", but these interpretations suggest that you're permanently (un)setting something and you would read back the "no value". That's incorrect, and it misses the point of why you'd want to invalidate the value instance: to allow it to fetch a new meaningful value upon a future request.)
    - vision for the future: value instances could actually be invalidated automatically already in the setters of the values they depend on, and the methods for invalidating value instances could in turn invalidate other value instances in a recursive manner (however, in some cases it may be a problem to reach all types that can be anywhere in the object structure, e.g. in all entries of a repeated field, etc.)
      - but this is again quite an advanced thing and we can do without it (by invalidating all dependent instances manually after changing something), so it's rather unlikely to be in the initial shot at serialization
    - simple example (Expr2) — if you change `len_orig` via a setter, `len_mod` must be invalidated, otherwise the serialized structure will be inconsistent (because `len_orig` has been updated, but `str` would still use the old, wrong cached `len_mod`)

      ```yaml
      types:
        mod_str:
          seq:
            - id: len_orig
              type: u2
            - id: str
              type: str
              size: len_mod
              encoding: UTF-8
            - id: rest
              type: tuple
              size: 3
          instances:
            len_mod:
              value: len_orig - 3
      ```
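      What the generated code for this could look like, picking the `_invalidate{Inst}` naming (a sketch based on how value instances are cached today, not actual compiler output):

      ```java
      private Integer lenMod;

      public int lenMod() {
          if (this.lenMod != null)
              return this.lenMod;
          this.lenMod = (int) (lenOrig() - 3);   // the value instance expression
          return this.lenMod;
      }

      public void _invalidateLenMod() {
          // drop the cached value; the next lenMod() call recalculates it
          this.lenMod = null;
      }
      ```

      Application code that changes `len_orig` would then call `_invalidateLenMod()` afterwards, so that `_write` serializes `str` with a `len_mod` consistent with the new `len_orig`.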
    - actually, it now seems clear to me that "invalidating" is the correct operation to implement and "regenerating" is not, because only invalidation allows you to delay the evaluation until you actually want it to happen. The moment you know that the old cached value is no longer valid may not be the right time to load a new value.

      For example, consider saving `_io.pos` in a value instance (from https://github.com/kaitai-io/coreldraw_cdr.ksy/blob/333988b/coreldraw_cdr.ksy#L1451-L1469, slightly simplified):

      ```yaml
      types:
        # ...
        points_list:
          params:
            - id: num_points
              type: s4
          seq:
            - size: 0
              if: ofs_points < 0
            - id: points
              type: point(point_types[_index])
              repeat: expr
              repeat-expr: num_points
          instances:
            ofs_points:
              value: _io.pos
            point_types:
              pos: ofs_points + point_size * num_points
              type: u1
              repeat: expr
              repeat-expr: num_points
            point_size:
              value: sizeof<s4> * 2
      ```

      When parsing, the `ofs_points` value instance remembers the stream position just where `points` start — this is nice because you have a fixed reference point. However, when editing the parsed file, the `points_list` may get repositioned in the stream, so `ofs_points` needs to be invalidated. But the important part is that you invalidate `ofs_points` in application code (after you changed the length of any preceding field and you know this will eventually affect the offset of `points`), and it doesn't make sense to re-evaluate `_io.pos` at that same moment.

      In this phase of editing the parsed objects in application code (which will eventually be serialized, but only when everything is ready), `_io` would probably still be in the state in which the parsing in this stream ended (maybe at the end of the stream, but not necessarily); or the input stream may already be closed (so `_io.pos` may even throw an `IOException`) and writing will be done to a completely different output stream, which will be supplied as a parameter to `_write` and thus is not available at this point yet.

      So what you really want is to just invalidate `ofs_points` in application code and not touch it until the `_write` method of the `points_list` type is called, where `ofs_points` would finally be invoked and would fetch the new `_io.pos` value valid in the output stream.
  - parse instances — quite problematic, it's difficult to foresee all the problems here
    - they need setters
    - the `_write` method meant as a counterpart of `_read` will not (and should not) serialize them, so they must have their own write mechanism
    - likewise, `_check` probably won't deal with them either (it should exactly accompany `_write`)
    - not all parse instances should actually be written: some are just lookaheads (purely parsing helpers), some are union members (so only one union member will be used) etc. — it's up to the user to deal with it
    - so the proposed interface would be as follows (assume a parse instance called `inst`; a usage sketch follows at the end of this section):
      - setter
      - `_checkInst` — includes all kinds of consistency checks that `_check` would do, but only for the particular instance
      - `_writeInst` method — similar to `_write`, but only for the particular instance (and if you think about it, if the instance is of a user-defined type, `_writeInst` would only deal with its `seq` structure, inner `instances` again need to be written explicitly)
        - the `_write` overload allowing to override `_io`, i.e. the following:

          ```java
          public class KaitaiStruct {
              // ...
              public abstract static class ReadWrite extends ReadOnly {
                  // ...
                  public void _write(KaitaiStream io) {
                      this._io = io;
                      _write();
                  }
                  // ...
              }
              // ...
          }
          ```

          will not be needed, because it's assumed that the `_write` of `seq` has already been called, so the current `this._io` is set to the correct output stream that the user wants (or it may also be a substream created by a parent `_write`).

          Parse instances also have the `io` key, so they may use a different stream than the `_io` of the current type, and overriding the stream that is set in `io` clearly makes no sense and can only cause inconsistency.

          Actually, the `io` key itself also excludes any naive ideas like "why not write all instances at the end of the `_write` method" — and let's pretend for a while that the already mentioned fact that many instances are not meant to be written at all isn't a problem (even if it is), to make it more interesting. The referenced stream in `io` can easily be a "future" one, i.e. it's not that hard to imagine this situation in practice (I've seen it several times):

          ```yaml
          seq:
            - id: a
              type: foo
            - id: b
              size: 0x100
              type: slot
          types:
            foo:
              seq:
                - id: baz
                  size: 4
              instances:
                inst:
                  io: _parent.b._io
                  pos: 0x10
                  type: u1
            slot: {}
          ```
          Instances are lazy, so this will work seamlessly when parsing (well, unless we defeat the laziness, e.g. by requiring the value of `inst` in some parsing decision in `seq` before the substream of `b` has been created and assigned — then it doesn't work, of course).

          If we were to naively write `inst` at the end of `Foo._write()`, it would be wrong (too soon), because the output substream for `b` hasn't been created yet (so depending on what `_parent.b._io` would be at this point, we would either be writing to an old substream left over from parsing, which will be garbage-collected after it's replaced by the output substream — so any writes there will be lost, and that's when we're lucky that it wasn't read-only, because in principle it could be; or to a `null` stream if `b` is a fresh object with no `_io` set by the user via a setter, in which case we would usually get a NullPointerException and the `_write` would end there). It's basically the same reason why this instance must be parsed lazily and eager parsing doesn't work — as you can see, eager writing doesn't work either.

          So this just underlines the correctness of the decision that the user is in charge of writing instances one by one, and there is no general easy way around that without very careful consideration.
    - the fact that writing of each parse instance is disconnected from the main `_write` actually creates room for inconsistencies, because I imagine it won't be hard for the user to forget to write some instance deep in the object tree (so parsing this instance from the serialized stream would read garbage). Unfortunately, I don't think we can help much with that in the general case, because we don't know what `instances` the user wants to write.
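    Putting the proposed per-instance interface together, user code might look roughly like this (purely hypothetical names: a type `Header` with a parse instance `checksum` stored in the type's own stream; nothing like this is generated yet):

    ```java
    Header h = r.header();
    h.checksum(0x1234L);     // setter for the parse instance value
    h._checkChecksum();      // consistency checks for this one instance only
    r._write(io);            // writes the whole `seq` tree first (no instances)
    h._writeChecksum();      // then writes the instance at its `pos` in h._io
    ```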
- user-defined and implicit `_root`, `_parent`, `_io` and `_is_le` parameters of subtypes
  - they can always be set to an arbitrary value by instantiating the type in application code (for example, when creating a new file from scratch) and injecting it into the enclosing type via a setter, so in `_check` of the type in which the parametric type is placed, the parameter values should be checked for consistency (from the perspective of the parent type where this `_check` is performed, `subtype._root` and `subtype._parent` should be the same as what the `_read` method would normally pass to the corresponding parameters when instantiating the type; `subtype._io` will actually be overridden in `_write`, so it doesn't matter; user-defined parameters must be equal to what the expression used as the param value in the type invocation would evaluate to)
  - if the parametric type was created during parsing, it would be useful to be able to update its parameters (without having to recreate the type to set the new parameter value via the constructor and then copy all the values from the old object to this new one)
    <future-plans>

    I don't think classic setters for direct use by the user are the right answer, because the parameter value should always be the same as the result of the particular expression in the invocation of the parametric type in the parent type, not set to an independent arbitrary value. Although it can't silently cause any inconsistency (because the `_check` will throw a `ConsistencyError` if the parameter value does not equal what is expected), it's probably not really a great API to have a setter just so that the user can (in fact has to) set the only valid value at the time (i.e. why doesn't Kaitai Struct just set the value itself when it's completely determined).

    A better API would be not to generate the usual setters, but a new method `_updateParams` (naming completely random, of course), which would update `_root`, `_parent`, ~~`_io`~~ (actually, the value of `_io` doesn't matter at this point and can be `null` or anything else, because it will be replaced by a parent `_write` anyway), `_is_le` and user-defined parameters (using the same expressions as when invoking the type normally in `_read` or instance getters).

    The `_updateParams` method must be in the type that includes the parametric type in question — for simplicity, it would update the parameters of all subtypes included via `seq` within the type (so that we don't have to worry about targeting a specific field, or even a specific entry index for repeated fields). OK, the name is now confusing, let's maybe call it `_updateParams_Seq`. For parametric subtypes in `instances`, `_updateParams{Inst}` would probably do. As for the implementation of `_updateParams_Seq` and `_updateParams{Inst}`, they are outside the type whose parameters they are supposed to update, so they can only use its public API (i.e. nothing more than the application code can use). Which means that the parametric type either needs to have public setters, or the `_updateParams*` methods would again have to resort to recreating a new instance of the parametric type from scratch and copying everything to it (which is ugly).

    </future-plans>

    Note: well, this initially simple idea quite grew in complexity, and in the end it looks like we'll need public setters for parameters anyway, so let's simply add them for now and call it a day. The user will use the parameter setters if needed, and the `_check` in the parent type will inspect the parameters set in its subtypes and report any inconsistencies found.
  - ignore the previous point — just generate setters and add consistency checks to the parent `_check` for:
    - `_root`
    - `_parent`
    - `_io` — unneeded, it would end up not used anyway
      - in fact, the `_write` method already serves as the `_io` setter, so you can ask any type to serialize to any `KaitaiStream` you provide (it's obviously also used by the generated code to serialize all subtypes, as you may have noticed earlier):

        ```java
        public class KaitaiStruct {
            // ...
            public abstract static class ReadWrite extends ReadOnly {
                // ...
                public void _write(KaitaiStream io) {
                    this._io = io;
                    _write();
                }
                // ...
            }
        }
        ```

        This ability will also be useful for purposes of calculating CRC-32 checksums in application code, for example — you take the object already filled with data that is enough to give you the entire input for CRC-32 once serialized, you `_write` it to a temporary stream, then extract the bytes it produced and pass them to the CRC-32 function. Since `_write` is deterministic (unless you do something funny with opaque types, I guess), it will produce the same bytes for those values once the actual `_write` is performed, so the CRC-32 value calculated this way will be valid.
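        For instance, something along these lines (a hypothetical application-code sketch; `header`, `lenHeader` and `checksum` are made-up names, and the size-taking `ByteBufferKaitaiStream` constructor is the one used by the generated code shown earlier):

        ```java
        // serialize the part of the object tree covered by the checksum into a
        // scratch stream and feed the resulting bytes to java.util.zip.CRC32
        KaitaiStream scratch = new ByteBufferKaitaiStream(header.lenHeader());
        header._write(scratch);

        java.util.zip.CRC32 crc = new java.util.zip.CRC32();
        crc.update(scratch.toByteArray());

        // now store the checksum in the object and do the real _write later
        header.checksum(crc.getValue());
        ```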
    - `_is_le` (if inherited endianness is used)
    - user-defined parameters

- calculated default endianness (`meta/endian/{switch-on,cases}`)
  - no new conceptual problems, the only tricky thing seems to be the `_is_le` parameter (for inherited endianness), and parameters have already been addressed in the previous bullet point
- using `_io` in expressions (and other places which ultimately have an impact on parsing decisions) and how it can mess with the compiler's (in)ability to derive sizes automatically (in the future, so that users don't have to do the tedious calculations themselves) — this actually implies quite a fascinating logical problem, I realized
  - maybe a small introduction to the problem by showing how `if` expressions are used in `_write` (IfStruct):

    ```java
    public void _read() {
        this.opcode = this._io.readU1();
        if (opcode() == 84) {
            this.argTuple = new ArgTuple(this._io, this, _root);
            this.argTuple._read();
        }
        if (opcode() == 83) {
            this.argStr = new ArgStr(this._io, this, _root);
            this.argStr._read();
        }
    }
    public void _write() {
        this._io.writeU1(this.opcode);
        if (opcode() == 84) {
            this.argTuple._write(this._io);
        }
        if (opcode() == 83) {
            this.argStr._write(this._io);
        }
    }
    ```

    Nothing too surprising, I'd say, but it's important to realize that the user's `if` expressions are used "as is" in `_write`, in the same form as in `_read`.
the following occurred to me when pondering about the misery of users having to set all stream lengths themselves, if it can be fully automated in the future by the generated code while retaining all the features that Kaitai Struct has (including the use of
_io
in parsing decisions) -
I realized that it's actually logically impossible in the following case:
seq: - id: len_a type: u1 - id: a size: len_a type: foo types: foo: seq: - id: prefix size: 4 - id: optional_suffix size: 6 if: not _io.eof
Imagine that you want to derive
len_a
automatically (let's say we want the contents of the type to exactly fit into the length we calculate, meaning that the last field of the type will align with EOS and parsing even 1 byte from there would result in an EOF error).prefix
is easy, it always needs 4 bytes, solen_a
must be at least 4. Now we need to evaluate the expression "not _io.eof
" to know ifoptional_suffix
is present or not. Well, we somehow can't seem to know if we're already at EOF at this point (in order to be able to answer the_io.eof
query), because we would have to know that all following fields inseq
will occupy 0 bytes.Let's break down both answers to the
_io.eof
question:-
true
—not _io.eof
is thenfalse
and thereforeoptional_suffix
won't be parsed (so we setlen_a = 4
).A quick check whether this is consistent when parsing - after
prefix
, are we at EOF? Yes, because_io.pos == 4
and_io.size == 4
, so_io.eof
is indeedtrue
. -
false
—not _io.eof
is thentrue
and thereforeoptional_suffix
will be parsed (so we would setlen_a = 4 + 6 = 10
).Consistency check - after
prefix
, are we at EOF? No, because_io.pos == 4
and_io.size == 10
, so_io.eof
isfalse
.
So note that if we attempt to calculate what the length must be (without having this size set in advance by the user),
if: not _io.eof
doesn't tell us anything - we can set it both ways and it always satisfies itself. It looks like it's not thisif
expression that makes decisions, it's the other way around - the decision that user makes (in this case whether to includeoptional_suffix
or not) affects hownot _io.eof
will evaluate.In some cases, it also seems like you need to be able to reverse expressions to do the right thing:
types: foo: seq: - id: prefix size: 4 - id: suffix size: _io.size - _io.pos - 2
Now the compiler would be probably supposed to derive that the right size for the stream would be
4 + suffix.length + 2
(because the unparsed two bytes at the end would remain when parsing as well).Of course, it doesn't take much work to come up with a number of even weirder cases, because the ability to use
_io
everywhere in expressions allows you to do a lot of things (and it's not even always possible to guess what the user wants, so even the most automatic implementation that can be would sometimes still have to ask the user to disambiguate). -
- following the previous point, I think it's actually a correct decision to require precalculated lengths of substreams and only support fixed-size streams, because it automatically solves the problems with referring to `_io` (when everything has a known size, it's basically as easy as parsing), and it also allows users to reserve additional space for writing `instances` afterwards and whatnot
  - generally, it follows the whole idea of the initial serialization version - the assumption is that the user somehow knows a bunch of (quite low-level) things and gives them to the KS-generated serializer, which would not have come up with these values itself, but at least it's able to give feedback on whether they're right or wrong (ideally, if the `_check` methods succeed and `_write` doesn't throw EOF or other errors either, it should be a guarantee that the parser will read back the exact same values properly)
  - however, inside the `_check` method, we will have to be careful that none of the user expressions that may appear there refer to `_io`, because we don't have `_io` available in `_check`.

    Simple example - I think the existing PoC implementation checks whether the `size` value matches the actual length of a byte array, so this would most likely be a problem:

    ```yaml
    seq:
      - id: rest
        size: _io.size - _io.pos
    ```

    The above essentially does the same as `size-eos: true`. Nevertheless, I think it still makes sense to check the length of `rest` (the user must have provided the final `_io.size` because we only support fixed-size streams, and `_io.pos` is known as well), you just have to do it in `_write` just before writing the particular attribute, not in `_check` as usual. Only in `_write` do you have the required streams, and they will be in the correct state.

    So we need to analyze all expressions that would normally go into `_check` for whether they reference any `_io` directly or indirectly (i.e. by referring to value `instances`, which again may refer to `_io` directly or indirectly), and if they do, these checks must be moved into `_write` before the particular attribute is written.
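    For the `rest` example, that moved check could look roughly like this in the generated `_write` (a sketch, not actual compiler output):

    ```java
    public void _write() {
        // ...
        // the `size` expression refers to _io, so this length check can't live in
        // _check; it runs here, right before the attribute is written
        long _expectedSize = this._io.size() - this._io.pos();
        if (rest().length != _expectedSize)
            throw new ConsistencyError("rest", rest().length, _expectedSize);
        this._io.writeBytes(this.rest);
    }
    ```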
- setters - per Java conventions, a getter should obviously be called `get{Something}` and a setter `set{Something}`. However, we already don't follow that for getters, so suddenly "following" it only for setters looks wrong to me:

  ```java
  public long ofsTags() { return ofsTags; }
  public void setOfsTags(long _v) { ofsTags = _v; }
  public long numTags() { return numTags; }
  public void setNumTags(long _v) { numTags = _v; }
  public ArrayList<Tag> tags() { return tags; }
  public void setTags(ArrayList<Tag> _v) { tags = _v; }
  ```

  In our case, it would in my opinion be more consistent to leave out the common `set` prefix for setters too:

  ```java
  public long ofsTags() { return ofsTags; }
  public void ofsTags(long _v) { ofsTags = _v; }
  public long numTags() { return numTags; }
  public void numTags(long _v) { numTags = _v; }
  public ArrayList<Tag> tags() { return tags; }
  public void tags(ArrayList<Tag> _v) { tags = _v; }
  ```

  Overloading like this is not a problem in Java, and it would even simplify the implementation a bit inside the compiler, as we wouldn't have to construct a special name for setters.
  Also, it solves the question of setters for `_root` and `_parent`, which currently look weird, as if they were using "snake case" (and renaming them to something like `set_Root` doesn't help in my opinion):

  ```java
  public SwitchBytearray _root() { return _root; }
  public void set_root(SwitchBytearray _v) { _root = _v; }
  public KaitaiStruct.ReadWrite _parent() { return _parent; }
  public void set_parent(KaitaiStruct.ReadWrite _v) { _parent = _v; }
  ```

  I don't think removing `set` makes the API any less enjoyable to use - this is how the asserts after parsing look:

  ```java
  assertEquals(r.len1(), 0x10);
  assertEquals(r.block1().number1(), 0x42);
  assertEquals(r.block1().number2(), 0x43);
  assertEquals(r.len2(), 0x8);
  assertEquals(r.block2().number1(), 0x44);
  assertEquals(r.block2().number2(), 0x45);
  assertEquals(r.finisher(), 0xee);
  ```

  and one would serialize the same values as follows:

  ```java
  BufferedStruct r = new BufferedStruct();
  r.len1(0x10);
  BufferedStruct.Block block1 = new BufferedStruct.Block();
  block1.number1(0x42);
  block1.number2(0x43);
  r.block1(block1);
  r.len2(0x8);
  BufferedStruct.Block block2 = new BufferedStruct.Block();
  block2.number1(0x44);
  block2.number2(0x45);
  r.block2(block2);
  r.finisher(0xee);
  r._write(io);
  ```

  instead of:

  ```java
  BufferedStruct r = new BufferedStruct();
  r.setLen1(0x10);
  BufferedStruct.Block block1 = new BufferedStruct.Block();
  block1.setNumber1(0x42);
  block1.setNumber2(0x43);
  r.setBlock1(block1);
  r.setLen2(0x8);
  BufferedStruct.Block block2 = new BufferedStruct.Block();
  block2.setNumber1(0x44);
  block2.setNumber2(0x45);
  r.setBlock2(block2);
  r.setFinisher(0xee);
  r._write(io);
  ```

  It's just a matter of habit.

  This `set` prefix would probably be specific to Java anyway, because other target languages that we support often use different approaches to setters.
- testing
  - our test suite of `.ksy` format specs can of course be reused to test serialization (that's kind of the point), but it's not clear how such tests will be conducted
  - there are several options for what you can do, and each option may expose a slightly different set of errors that may exist in the implementation — they correspond to different use cases of how you can use serialization (see kaitai-io/kaitai_struct#27 (comment))
  - naturally, we can take advantage of the parsing that we already have (to get the initial objects to edit and then serialize, or to make sure that the serialized output is consistent with the values that have been written), which actually gives us interesting testing opportunities
  - on ensuring the consistency property (i.e. that `_check` methods report violated constraints of user-provided values, and that `_write` then produces output that parses back to those values)
    - unchanged parsed objects are always consistent; inconsistencies arise from the user setting properties with constraints to arbitrary values
    - this allows us to do property-based testing, which is easy to implement and I think will be effective in finding bugs (because we already have pretty solid parsing, so if the serialization machinery is doing something wrong, I think we can rely on parsing to detect it):
      - one easy thing to do that might detect faulty `_check`s would be to run `_check`s everywhere in the object tree hierarchy of freshly parsed objects — they must report no errors
      - another easy thing to reveal faulty `_write`s would be to try writing unchanged parsed objects: if parsing the serialized bytes doesn't yield exactly what was written, you know that some `_write` messed up
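      A sketch of what such a roundtrip test could look like for one of the existing test formats (hypothetical JUnit-style code; the input file name is a placeholder, and the stream API follows the snippets shown earlier):

      ```java
      import static org.junit.Assert.assertArrayEquals;

      import java.nio.file.Files;
      import java.nio.file.Paths;

      import org.junit.Test;

      import io.kaitai.struct.ByteBufferKaitaiStream;

      public class TestExpr2Roundtrip {
          @Test
          public void testWriteReadRoundtrip() throws Exception {
              byte[] original = Files.readAllBytes(Paths.get("expr_2.bin"));

              // a freshly parsed object tree must pass its consistency checks
              Expr2 r = new Expr2(new ByteBufferKaitaiStream(original));
              r._read();
              r._check();

              // writing the unchanged objects into an equally sized stream must
              // reproduce the original bytes (and therefore parse back identically)
              ByteBufferKaitaiStream out = new ByteBufferKaitaiStream(original.length);
              r._write(out);
              assertArrayEquals(original, out.toByteArray());
          }
      }
      ```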
  - the general idea is that if there is some scenario in which the user is able to use setters, constructors to create new objects, etc. to set any values that the actual data types of the properties in the target language accept (for example, in Java anything with type `KaitaiStruct` would accept `null`, because it's just a reference type), then calls the `_check`s and `_write` in the way they're supposed to (of course, if you e.g. don't call the `_write{Inst}` method for a particular instance, it won't be written and thus you won't be able to read it back, but this is expected behavior and purely user error) and everything succeeds, and yet parsing of the serialized output does not exactly match the set values, it means that either there is an error in some of the `_write` methods (in case the values set by the user are actually consistent), or the user-provided values were not consistent and the appropriate `_check` method failed to report it
  - following the previous point, it would be ideal to integrate fuzzing that would set some interesting values via the setters that the KS-generated classes expose. Inevitably, some sets of values that the fuzzer provides will be inconsistent, and the idea is to reveal insufficient `_check`s by first finding a combination of values that passes the `_check` (I believe that coverage-based fuzzing can help here — as I imagine it, if the first check in `_check` fails, it would try to adjust the value so that this first check passes, and when it does, it would keep it and try to fix the other reported `ConsistencyError`s) and then doing a write-read roundtrip to see whether the values that passed `_check` were really consistent.

    I would like to avoid having to manually enumerate each specific hardcoded case (because it's too easy to miss important cases, especially when the consistency of serialization will depend so much on user interaction with the objects). Some special serialization use cases would probably have to be tested as traditional unit tests with all these asserts and different input values, but it would be nice to keep that to a minimum, as I believe that a fuzzer can do a better job (instead of a few hardcoded values that take a human a long time to figure out and provide little assurance that it maybe works, a fuzzer has the potential to test millions of such values in almost no time).