- fixed `contents` — mapped to a literal byte array (note: shouldn't it write `magic1()` instead? but then it would have to be checked again so that it doesn't violate consistency):

  ```java
  public void _read() {
      this.magic1 = this._io.readBytes(6);
      if (!(Arrays.equals(magic1(), new byte[] { 80, 65, 67, 75, 45, 49 }))) {
          throw new KaitaiStream.ValidationNotEqualError(new byte[] { 80, 65, 67, 75, 45, 49 }, magic1(), _io(), "/seq/0");
      }
      // ...
  }

  public void _write() {
      this._io.writeBytes(new byte[] { 80, 65, 67, 75, 45, 49 });
      // ...
  }
  ```
  The problem with special `contents` handling is that it does not apply to `valid/eq`, even though the idea of `valid` when introduced in 0.9 was that `contents` would behave the same as the corresponding `valid/eq`.

  If we wanted to be really consistent in `contents` and `valid` handling, and also allow other `valid` types than `eq` (i.e. `valid/{any-of,min,max,expr}`), the general way would be to rely on the user to always set the correct value via the attribute setter, and we'd generate a validation expression in `_check` that throws an exception if the value is not valid.

  In the future, the compiler could set some fields with `valid/eq` automatically, but probably not in the general case: `valid` supports arbitrary expressions, and until you analyze all features of the expression language (and realize everything they can actually do), you have no idea whether there are edge cases that may go wrong. Anything that involves the expression language looks simple if you picture it only on simple examples - it's very tempting to draw general conclusions from these and completely forget what "interesting" cases the expression language can create, where the idea of simplicity completely falls apart.

  But we can be confident that we can always handle at least literal values (which would probably cover most common cases in real formats anyway) — detecting whether the entire expression AST consists of a single node with a literal value is trivial too. However, I wouldn't address this in the initial version of serialization either, because it's not essential.
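  To make the `valid/eq` side concrete, this is a minimal sketch of what such a generated `_check` could look like for a field equivalent to the `contents` example above, but declared with `valid/eq` (hypothetical output; the `ConsistencyError` arguments are illustrative):

  ```java
  public void _check() {
      if (!(Arrays.equals(magic1(), new byte[] { 80, 65, 67, 75, 45, 49 })))
          throw new ConsistencyError("magic1", magic1(), new byte[] { 80, 65, 67, 75, 45, 49 });
      // ...
  }
  ```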
- integer types — `type: [us]1` (e.g. `writeU1`), `type: [us][248]{,be,le}` (e.g. `writeS2le`), nothing fancy here
- bit-sized integers — `type: bX`
  - over 2 years ago, I wrote an initial version of the `writeBitsInt` method (KaitaiStream.java:431), but:
    - it supports only the big-endian direction (we have both `readBitsInt{Be,Le}` now, so there should be corresponding `writeBitsInt{Be,Le}` methods as well)
    - it relies on seeks, so it would work only in seekable streams (which is probably not desirable; `readBitsInt{Be,Le}` don't require a seekable stream, so `writeBitsInt{Be,Le}` should work in non-seekable ones too)
    - depending on how many bit-sized integers encroach on a particular byte, a single byte may need to be overwritten up to 8 times
    - I think it can be written much better overall; the existing version is probably slow and inefficient too
  - I think a better approach to writing bit-sized ints would be not to write partial bytes (which of course leads to having to rewrite the bytes multiple times, which also requires seeking); instead, write a byte only once when it's finished and you know it won't change (using the `bits`/`bitsLeft` properties to save state between calls makes it more like the current `readBitsInt{Be,Le}` implementation; see the sketch after this list)
    - this makes it a bit more difficult to use the `writeBitsInt{Be,Le}` methods from the generated code, because if the write of the last bit field member ends on an unaligned bit position (i.e. somewhere in the middle of a byte, not on a byte boundary), you suddenly have to explicitly ensure that this partial byte gets written too (otherwise it wouldn't be)
    - so the compiler will additionally have to insert `alignToByte` calls, usually at the end of the `_write` method — but again, we don't want to add them somewhere they don't belong, so the compiler will have to spend some time figuring out what might come next in the format (is it really something byte-aligned?) and where to insert the `alignToByte` call (inside the `if` body just before reading/writing the byte-aligned type? in the `for` loop? in the `switch` case block?), and if we need this analysis for functional writing anyway, we could finally fix the numerous issues with wrong `alignToByte` insertions when parsing too (https://gitter.im/kaitai_struct/Lobby?at=628eb63309eea00adeb191b0)
      - because isn't it a pity that we have such perfect `readBitsInt{Be,Le}` implementations (that's right, I'm confident enough to call them perfect — kaitai-io/kaitai_struct#949), and then the stupid compiler ruins everything just by stuffing `alignToByte` somewhere it doesn't belong? Let's make the compiler smarter about `alignToByte` insertions and Kaitai Struct will suddenly be actually usable for unaligned bit streams! (kaitai-io/kaitai_struct#576)
    - of course, this `alignToByte` for writing will need to do something quite different from what `alignToByte` currently does for reading — although clearing `bits` and `bitsLeft` probably makes sense for writing too, the main task is to actually write the partial byte, and we don't want to write anything in reading mode (so the runtime library will have to know whether we're in writing mode or not, or we'll just keep `alignToByte` for reading only and add a separate `writeAlignToByte` for writing)
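    A minimal sketch of this "flush a byte only once it's complete" idea for the big-endian direction, plus the companion `writeAlignToByte` (illustration only, not the actual runtime implementation; it assumes `bits`/`bitsLeft` fields and a `writeU1()`-style single-byte writer, and it keeps the accumulator in one `long`, so it only handles widths that fit there together with the leftover bits):

    ```java
    private long bits = 0;     // partially accumulated byte(s), kept in the low-order bits
    private int bitsLeft = 0;  // number of accumulated bits not yet flushed to the stream

    public void writeBitsIntBe(int n, long val) {
        long mask = n < 64 ? (1L << n) - 1 : ~0L;
        // append the new bits below the ones already accumulated
        bits = (bits << n) | (val & mask);
        bitsLeft += n;
        // flush every completed byte, most significant bits first
        while (bitsLeft >= 8) {
            bitsLeft -= 8;
            writeU1((int) (bits >>> bitsLeft) & 0xff);
        }
        bits &= (1L << bitsLeft) - 1; // keep only the unfinished remainder
    }

    public void writeAlignToByte() {
        // pad the unfinished byte with zero bits on the right and emit it
        if (bitsLeft > 0) {
            int b = (int) (bits << (8 - bitsLeft)) & 0xff;
            bits = 0;
            bitsLeft = 0;
            writeU1(b);
        }
    }
    ```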
  - the reason why I wrote the `writeBitsInt` the way I did was an "idea" to make it ready even for use from positional `instances`
    - because back then, for some reason it seemed necessary to me that in order to be able to use bit-sized integers from `instances` (which can seek anywhere into already written bytes, and you don't want to accidentally clear any of the written bits if the currently written bit-sized integer starts and/or ends at an unaligned bit position "somewhere inside a byte"), I need to seek back by 1 byte to re-read the current byte, then seek to where the currently written bits int ends and read that byte too, and only then can the actual writing to a temporary byte array start (I know, it's indeed a really stupid implementation)
    - however, `instances` currently don't work for parsing `type: bX` in the first place (unless `X` is a multiple of 8, but that's usually not the case where you want to use a bit-sized integer type; kaitai-io/kaitai_struct#564), so this is pretty useless and I wouldn't worry about that for now
    - secondly, I've now realized that even with the new approach (less moronic and supporting non-seekable streams, although we're currently discussing `instances`, which rely on seeking by definition), the support for bit-sized integers in `instances` can be added too (so the only reason to use the old approach is void, things don't look good for it): all you need to do is call `writeAlignToByte` after the write to ensure the partial byte at the end gets written too.

      Once we want to add bit addressing (I had a feeling that there was an issue about it, but I can't find it right now) on top of the byte addressing that we support now with the `pos` key, to allow `instances` to seek into the middle of a byte, it'll be quite simple to achieve by adding a special "bit-level seek" method to the runtime library which will change the state of the `bits`/`bitsLeft` properties as if a previous `writeBitsInt` call had just occurred in `_write` of the `seq` structure (it will have to "peek" a byte to know what the bits are, which means reading a byte and then seeking back, but this is a "bit seeking" method anyway, so it's quite understandable that it needs seeking support).
  - I also remember that it was confusing for me that `bits` and maybe `bitsLeft` (both currently used for reading) would have to be repurposed by changing their value logic a bit for writing
    - so if you mixed writes and reads within one byte, e.g. `writeBitsInt(3)` + `readBitsInt(5)`, the `readBitsInt` would misinterpret what `writeBitsInt` has just set in the `bits` and `bitsLeft` properties, so it would just do some undefined behavior
    - in fact, the "true" stream byte position (`_io.pos`) will be kind of repurposed for writing bit integers too — when reading, it moves onto the next byte immediately after you read even just 1 bit of the current one (and the yet unprocessed leftover 7 bits are kept in `bits`, `bitsLeft` is set to `7`), but if we want to avoid seeks for writing, we would keep `_io.pos` in the current byte and only move it to the next byte once we know what all 8 bits of the current byte will look like (so that we can write its final value into the stream)
    - but I conclude that this "value incompatibility" is not a real problem, because we already have it between the `readBitsInt{Be,Le}` methods: I've written in the docs that BE and LE bit integers can't share the same byte, and once we implement the check detecting this situation (kaitai-io/kaitai_struct#155 (comment)), this will be totally fine.

      Besides that, unlike the `bXbe`-`bXle` situation, which can occur in a real `.ksy` specification, consecutive mixed `{write,read}BitsInt*` calls should not happen (once everything is correctly implemented in the compiler). So this would basically be a concern not for compiler-generated code, but only for people using the runtime library standalone, and we should really not worry about that. The runtime library solely needs to be usable for the purposes of generated code, anything else is a side effect (but if someone finds the runtime library useful as a standalone library despite our lack of interest in that use case, well, good for them, of course).
- enum types
  - a basic singular field (Enum0)

    ```yaml
    - id: pet_1
      type: u4
      enum: animal
    ```

    ```java
    public void _read() {
        this.pet1 = Animal.byId(this._io.readU4le());
        // ...
    }

    public void _write() {
        this._io.writeU4le((long) (pet1().id()));
        // ...
    }
    ```
  - repeated fields don't work (kaitai-io/kaitai_struct_compiler#181 attempts to fix it, but I'll have to replace the wrong `Ast.CodeLiteral` with new `Ast.InternalName` in the 0.10 codebase), see test DebugEnumName:

    ```yaml
    - id: array_of_ints
      type: u1
      enum: test_enum2
      repeat: expr
      repeat-expr: 1
    ```
- floating-point types — `type: f[48]{be,le}` (FloatingPoints)
- byte arrays — per delimitation type:
  - fixed `size` — use `writeBytes` in `_write`, check that the length of the user-provided byte array equals the expected one in `_check` (SwitchBytearray)

    ```java
    public void _read() {
        this.code = this._io.readBytes(1);
        // ...
    }

    public void _write() {
        this._io.writeBytes(this.code);
        // ...
    }

    public void _check() {
        if (code().length != 1)
            throw new ConsistencyError("code", code().length, 1);
    }
    ```
  - `terminator` without `size` (TermBytes)
    - default options (`consume: true`, `include: false`)

      ```yaml
      - id: s1
        terminator: 0x7c
      ```

      ```java
      public void _read() {
          this.s1 = this._io.readBytesTerm((byte) 124, false, true, true);
          // ...
      }

      public void _write() {
          this._io.writeBytes(this.s1);
          this._io.writeU1(124);
          // ...
      }
      ```

      value consistency (`terminator` + `include: false`): `field().indexOf(term) == -1`
    - `consume: false` (+ `include: false` by default) — note: any sort of consistency check is missing here (there is no guarantee that the `terminator` byte occurs)

      ```yaml
      - id: s2
        terminator: 0x7c
        consume: false
      ```

      ```java
      public void _read() {
          // ...
          this.s2 = this._io.readBytesTerm((byte) 124, false, false, true);
          // ...
      }

      public void _write() {
          // ...
          this._io.writeBytes(this.s2);
          // ...
      }
      ```

      The first thing to clear out is that when parsing, `consume: false` already serves as an implicit opt-out from support for non-seekable streams (see the `readBytesTerm` implementations — you usually find something like `if (!consumeTerm) { seek(pos() - 1) }` inside anyway), so it's fine to afford seeking for anything that `consume: false` fields need to do.

      The fact that we don't know what field the `terminator` byte is in (it may not even be part of any field, if the attribute with `consume: false` is the last of all `seq` structures in the current substream — then the required terminator byte could never be written by any further `_write`) makes it impossible to do this as a simple pure "value" check — it needs to access the stream and read the particular byte where the `terminator` should be, so we can't do it in `_check` and it would have to be in `_write`.

      The problem is that you won't always be able to do the check in the same type where the `consume: false` field is: you need to delay it until the last possible `seq` (meaning either the current type's `seq` if it has its own substream, or the parent type's `seq`, or the grandparent type's `seq` and so on, until you reach a type that is no longer in this substream) has been fully written into the substream where it is (to be absolutely sure that the byte where the `terminator` is supposed to be has been written), and then seek to that byte, read it, and if it's not equal to `terminator` as it should be, throw a `ConsistencyError`.

      Another issue would be if the `consume: false` field is effectively the very last field in the substream that gets written (then it's of course absurd to "expect" that some further field after this last one in this substream will come to the rescue and magically write the `terminator` byte that needs to be there, so this of course does not happen and users get a `ConsistencyError` that they can do nothing about). But since seeking is allowed, this has a very simple and (at least in hindsight) obvious workaround — after writing the `consume: false` field, you also write the `terminator` byte and then seek 1 byte back so that the stream position is at this `terminator` byte (sketched below).

      This is the ultimate one-size-fits-all solution — if other fields follow the current one, the "temporary" `terminator` byte will be overwritten and the `_write` of the topmost type still in the current substream will check whether it still (after being overwritten) has the value that it must have. Otherwise, if the current `consume: false` field is the last one, the `terminator` byte as we wrote it will not be overwritten and will retain the correct value.

      value consistency (`terminator` + `include: false`): `field().indexOf(term) == -1`
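      A rough sketch of that workaround in generated-code style (illustrative only, assuming a seekable `KaitaiStream` and the `s2` field above with `terminator: 0x7c`):

      ```java
      public void _write() {
          // ...
          this._io.writeBytes(this.s2);
          // write a "temporary" terminator byte so that parsing would stop here even
          // if no following field ever overwrites this position...
          this._io.writeU1(0x7c);
          // ...and seek back so that the next field starts where parsing expects it
          this._io.seek(this._io.pos() - 1);
          // ...
      }
      ```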
    - `include: true` (+ `consume: true` by default)

      ```yaml
      - id: s3
        terminator: 0x40
        include: true
      ```

      ```java
      public void _read() {
          // ...
          this.s3 = this._io.readBytesTerm((byte) 64, true, true, true);
      }

      public void _write() {
          // ...
          this._io.writeBytes(this.s3);
      }

      public void _check() {
          if (s3()[(s3()).length - 1] != 64)
              throw new ConsistencyError("s3", s3()[(s3()).length - 1], 64);
      }
      ```

      A small oversight with this `_check` is that it implicitly assumes `s3().length != 0`. It's true that this has to hold for the `s3` value to be consistent, but the question is whether we simply let the `ArrayIndexOutOfBoundsException` occur in case the user causes `s3().length == 0` (`_check` is specifically meant to check arbitrary values and report if they are invalid, so it can't make any assumptions about these arbitrary values beforehand). If the `_check` method may also throw other kinds of exceptions than `ConsistencyError`, it will be more difficult for the user to catch the exceptions related to consistency violations, because it will no longer be enough to just use `try { obj._check(); } catch (ConsistencyError exc) { ... }`.

      value consistency (`terminator` + `include: true`): `field().length != 0 && field().indexOf(term) == field().length - 1`
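      A sketch of a `_check` that would match the rule above without risking an `ArrayIndexOutOfBoundsException` on an empty value (hypothetical, not the currently generated code; the `ConsistencyError` arguments are illustrative):

      ```java
      public void _check() {
          if (s3().length == 0 || s3()[s3().length - 1] != 64)
              throw new ConsistencyError("s3", s3().length, 64);
          // the terminator byte must not occur anywhere before the last position either
          for (int i = 0; i < s3().length - 1; i++) {
              if (s3()[i] == 64)
                  throw new ConsistencyError("s3", i, 64);
          }
      }
      ```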
    - `consume: false` + `include: true` is a degenerate combination that doesn't make sense and should be forbidden (you suddenly get one byte that is included in two consecutive `consume: false` + `include: true` `seq` fields at the same time, violating the idea of `seq` that a field starts exactly where the previous one ends)
  - `size` with `terminator` and/or `pad-right` (BytesPadTerm) - use `writeBytesLimit` in `_write`, check that the byte array length does not exceed the maximum serializable size
    - the anatomy of serialized bytes is the following:
      - with `terminator` (`pad-right: 0x00` is assumed if not given explicitly); the diagram is drawn for the default setting `include: false` (implying that the `terminator` byte is outside user-provided bytes) and `field().length < size`:

        ```
                  specified `size` of the field in .ksy
        |<------------------------------------------------->|
        |                     |              |              |
        | user-provided bytes | `terminator` | `pad-right`  |
        |_____________________|______________|______________|
           field().length            1          `size` - field().length - 1
        ```

      - without `terminator` (only `pad-right`), or also `terminator` with `include: true`:

        ```
             `size` of the field in .ksy
        |<---------------------------------->|
        |                     |              |
        | user-provided bytes | `pad-right`  |
        |_____________________|______________|
           field().length       `size` - field().length
        ```

      - regardless of the options, note that the presence of the `terminator` byte is optional here (when `size` is specified), meaning that if the user-provided bytes fill the entire `size`, it's OK — `field().length == size`:

        ```
             `size` of the field in .ksy
        |<---------------------------------->|
        |                                    |
        |        user-provided bytes         |
        |____________________________________|
                   field().length
        ```
    - only `pad-right` — `field().length` can be at most `size` (padding is optional)

      ```yaml
      - id: str_pad
        size: 20
        pad-right: 0x40
      ```

      ```java
      public void _read() {
          this.strPad = KaitaiStream.bytesStripRight(this._io.readBytes(20), (byte) 64);
          // ...
      }

      public void _write() {
          this._io.writeBytesLimit(this.strPad, 20, (byte) 64, (byte) 64);
          // ...
      }

      public void _check() {
          if (strPad().length > 20)
              throw new ConsistencyError("str_pad", strPad().length, 20);
          // ...
      }
      ```

      If we take "consistency" seriously, the last byte of user-provided bytes (warning: if there is a last byte; this user-provided byte array can have length 0 and that's fine) must not have the value of `pad-right`, because then parsing wouldn't be able to distinguish it from the padding. So likely there should be a consistency check for this.

      value consistency (`size` + only `pad-right`): `field().length == 0 || field()[field().length - 1] != padByte`
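      A sketch of the missing check described above (hypothetical, not currently generated): the last user-provided byte must not equal the pad byte, otherwise `bytesStripRight` would strip it when the data is parsed back.

      ```java
      public void _check() {
          if (strPad().length > 20)
              throw new ConsistencyError("str_pad", strPad().length, 20);
          if (strPad().length != 0 && strPad()[strPad().length - 1] == 64)
              throw new ConsistencyError("str_pad", strPad()[strPad().length - 1], 64);
      }
      ```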
    - `terminator` and `pad-right` — `field().length` can be at most `size` (both terminator byte and padding are optional)

      ```yaml
      - id: str_term_and_pad
        size: 20
        terminator: 0x40
        pad-right: 0x2b
      ```

      ```java
      public void _read() {
          // ...
          this.strTermAndPad = KaitaiStream.bytesTerminate(
              KaitaiStream.bytesStripRight(this._io.readBytes(20), (byte) 43),
              (byte) 64,
              false
          );
          // ...
      }

      public void _write() {
          // ...
          this._io.writeBytesLimit(this.strTermAndPad, 20, (byte) 64, (byte) 43);
          // ...
      }

      public void _check() {
          // ...
          if (strTermAndPad().length > 20)
              throw new ConsistencyError("str_term_and_pad", strTermAndPad().length, 20);
          // ...
      }
      ```

      value consistency ([`size` +] `terminator` + `include: false`): `field().indexOf(term) == -1`
    - only `terminator` (+ `include: false` by default) — `field().length` can be at most `size` (note: the existing check `if (field().length >= 20) { throw ... }` in the generated code below is wrong, because it does not match this description — it has to be `field().length > 20` instead) and it must not contain the `terminator` byte at all (not yet checked in `_check`); implicitly `pad-right: 0x00`

      ```yaml
      - id: str_term
        size: 20
        terminator: 0x40
        # pad-right: 0x00  # implicit
      ```

      ```java
      public void _read() {
          // ...
          this.strTerm = KaitaiStream.bytesTerminate(this._io.readBytes(20), (byte) 64, false);
          // ...
      }

      public void _write() {
          // ...
          this._io.writeBytesLimit(this.strTerm, 20, (byte) 64, (byte) 0);
          // ...
      }

      public void _check() {
          // ...
          if (strTerm().length >= 20)
              throw new ConsistencyError("str_term", strTerm().length, 20);
          // ...
      }
      ```

      value consistency ([`size` +] `terminator` + `include: false`): `field().indexOf(term) == -1`
    - `terminator` with `include: true` — `field().length` can be at most `size`, `field()` must end immediately after the first occurrence of the `terminator` byte or may not contain `terminator` at all (because, as mentioned earlier, the presence of `terminator` is always optional for `size`-delimited fields) — note that the current consistency check `if (field()[(field()).length - 1] != 64) { throw ... }` (see the generated code snippet below) doesn't match this characterization and is therefore incorrect

      ```yaml
      - id: str_term_include
        size: 20
        terminator: 0x40
        include: true
      ```

      ```java
      public void _read() {
          // ...
          this.strTermInclude = KaitaiStream.bytesTerminate(this._io.readBytes(20), (byte) 64, true);
      }

      public void _write() {
          // ...
          this._io.writeBytesLimit(this.strTermInclude, 20, (byte) 0, (byte) 0);
      }

      public void _check() {
          // ...
          if (strTermInclude().length > 20)
              throw new ConsistencyError("str_term_include", strTermInclude().length, 20);
          if (strTermInclude()[(strTermInclude()).length - 1] != 64)
              throw new ConsistencyError("str_term_include", strTermInclude()[(strTermInclude()).length - 1], 64);
      }
      ```

      value consistency ([`size` +] `terminator` + `include: true`): `field().indexOf(term) == -1 || field().indexOf(term) == field().length - 1` (note: `field().indexOf(term) == -1` already implicitly matches the 0-length value, so the `||` operator will short-circuit and won't get into evaluating the second operand, meaning that the second expression doesn't even have to worry about failing on the 0-length value)
    - TODO: add tests and analyze these with `size-eos: true` — update: this is important, because after doing a quick check the `terminator` is not handled correctly here
- byte arrays with `process` — again, if the `terminator` is used, encoded bytes from the `process` encoder must be scanned for a premature occurrence of the `terminator` byte
- strings — everything should be the same as for byte arrays, only with the added encoding "string ⟶ bytes" conversion
  - `terminator` without `size` (TermStrz) - looks correct (`git diff --no-index compiled/java/src/io/kaitai/struct/testwrite/Term{Bytes,Strz}.java`)

    ```diff
     public void _write() {
    -    this._io.writeBytes(this.s1);
    +    this._io.writeBytes((s1()).getBytes(Charset.forName("UTF-8")));
         this._io.writeU1(124);
    -    this._io.writeBytes(this.s2);
    -    this._io.writeBytes(this.s3);
    +    this._io.writeBytes((s2()).getBytes(Charset.forName("UTF-8")));
    +    this._io.writeBytes((s3()).getBytes(Charset.forName("UTF-8")));
     }
     public void _check() {
    -    if (s3()[(s3()).length - 1] != 64)
    -        throw new ConsistencyError("s3", s3()[(s3()).length - 1], 64);
    +    if ((s3()).getBytes(Charset.forName("UTF-8"))[((s3()).getBytes(Charset.forName("UTF-8"))).length - 1] != 64)
    +        throw new ConsistencyError("s3", (s3()).getBytes(Charset.forName("UTF-8"))[((s3()).getBytes(Charset.forName("UTF-8"))).length - 1], 64);
     }
    ```

  - `size` with `terminator` and/or `pad-right` (StrPadTerm) FIXME: `_check` seems correct, but `_write` is wrong (it has to use the `writeBytesLimit` method too) — see `git diff --word-diff --no-index compiled/java/src/io/kaitai/struct/testwrite/{Bytes,Str}PadTerm.java`:

    ```diff
     public void _write() {
    -    this._io.writeBytesLimit(this.strPad, 20, (byte) 64, (byte) 64);
    -    this._io.writeBytesLimit(this.strTerm, 20, (byte) 64, (byte) 0);
    -    this._io.writeBytesLimit(this.strTermAndPad, 20, (byte) 64, (byte) 43);
    -    this._io.writeBytesLimit(this.strTermInclude, 20, (byte) 0, (byte) 0);
    +    this._io.writeBytes((strPad()).getBytes(Charset.forName("UTF-8")));
    +    this._io.writeBytes((strTerm()).getBytes(Charset.forName("UTF-8")));
    +    this._io.writeU1(64);
    +    this._io.writeBytes((strTermAndPad()).getBytes(Charset.forName("UTF-8")));
    +    this._io.writeU1(64);
    +    this._io.writeBytes((strTermInclude()).getBytes(Charset.forName("UTF-8")));
     }
     public void _check() {
    -    if (strPad().length > 20)
    -        throw new ConsistencyError("str_pad", strPad().length, 20);
    -    if (strTerm().length >= 20)
    -        throw new ConsistencyError("str_term", strTerm().length, 20);
    -    if (strTermAndPad().length > 20)
    -        throw new ConsistencyError("str_term_and_pad", strTermAndPad().length, 20);
    -    if (strTermInclude().length > 20)
    -        throw new ConsistencyError("str_term_include", strTermInclude().length, 20);
    -    if (strTermInclude()[(strTermInclude()).length - 1] != 64)
    -        throw new ConsistencyError("str_term_include", strTermInclude()[(strTermInclude()).length - 1], 64);
    +    if ((strPad()).getBytes(Charset.forName("UTF-8")).length > 20)
    +        throw new ConsistencyError("str_pad", (strPad()).getBytes(Charset.forName("UTF-8")).length, 20);
    +    if ((strTerm()).getBytes(Charset.forName("UTF-8")).length >= 20)
    +        throw new ConsistencyError("str_term", (strTerm()).getBytes(Charset.forName("UTF-8")).length, 20);
    +    if ((strTermAndPad()).getBytes(Charset.forName("UTF-8")).length > 20)
    +        throw new ConsistencyError("str_term_and_pad", (strTermAndPad()).getBytes(Charset.forName("UTF-8")).length, 20);
    +    if ((strTermInclude()).getBytes(Charset.forName("UTF-8")).length > 20)
    +        throw new ConsistencyError("str_term_include", (strTermInclude()).getBytes(Charset.forName("UTF-8")).length, 20);
    +    if ((strTermInclude()).getBytes(Charset.forName("UTF-8"))[((strTermInclude()).getBytes(Charset.forName("UTF-8"))).length - 1] != 64)
    +        throw new ConsistencyError("str_term_include", (strTermInclude()).getBytes(Charset.forName("UTF-8"))[((strTermInclude()).getBytes(Charset.forName("UTF-8"))).length - 1], 64);
     }
    ```
- subtypes, no substreams (IfStruct)

  ```java
  public void _read() {
      this.op1 = new Operation(this._io, this, _root);
      this.op1._read();
      // ...
  }

  public void _write() {
      this.op1._write(this._io);
      // ...
  }

  // ...

  public static class Operation extends KaitaiStruct.ReadWrite {
      // ...
      public void _write() { /* ... */ }
  }
  ```

  ```java
  public class KaitaiStruct {
      // ...
      public abstract static class ReadWrite extends ReadOnly {
          // ...
          public void _write(KaitaiStream io) {
              this._io = io;
              _write();
          }
          // ...
      }
  }
  ```
- subtypes with own substreams (BufferedStruct)
  - `size` key — as you can see, the substream length must be set manually for now (which is obviously not ideal, but it works for now):

    ```yaml
    - id: len1
      type: u4
    - id: block1
      type: block
      size: len1
    ```

    ```java
    public void _read() {
        this.len1 = this._io.readU4le();
        this._raw_block1 = this._io.readBytes(len1());
        KaitaiStream _io__raw_block1 = new ByteBufferKaitaiStream(_raw_block1);
        this.block1 = new Block(_io__raw_block1, this, _root);
        this.block1._read();
        // ...
    }

    public void _write() {
        this._io.writeU4le(this.len1);
        KaitaiStream _io__raw_block1 = new ByteBufferKaitaiStream(len1());
        this.block1._write(_io__raw_block1);
        this._io.writeStream(_io__raw_block1);
        // ...
    }
    ```

    ```java
    public class ByteBufferKaitaiStream extends KaitaiStream {
        // ...
        @Override
        public void writeStream(KaitaiStream other) {
            bb.put(other.toByteArray());
        }
        // ...
    }
    ```

    ```java
    public abstract class KaitaiStream implements Closeable {
        // ...
        abstract public void writeStream(KaitaiStream other);
        // ...
        public byte[] toByteArray() {
            long pos = pos();
            seek(0);
            byte[] r = readBytesFull();
            seek(pos);
            return r;
        }
        // ...
    }
    ```
  - `terminator` — actually, it seems that we're completely missing tests on wrapping a type in a substream created using `terminator`.

    The thing I would like to test here is the consistency check scanning the substream bytes (after the subtype has been serialized there) for a forbidden occurrence of the `terminator` byte (because if it were silently allowed, the substream would be prematurely terminated there when parsing). If `pad-right` is used, it has its own consistency check as well, etc. There should be tests for that.
- repetitions — `_write` is easy (once you can write one entry, you just repeat what you've just done), but in `_check` each type of repetition has its own ways in which consistency can be violated
  - `repeat: expr` — the easy one, check the number of elements in the user-provided array for equality with `repeat-expr` (RepeatNStruct)

    ```yaml
    - id: qty
      type: u4
    - id: chunks
      type: chunk
      repeat: expr
      repeat-expr: qty
    ```

    ```java
    public void _write() {
        this._io.writeU4le(this.qty);
        for (int i = 0; i < this.chunks.size(); i++) {
            chunks().get((int) i)._write(this._io);
        }
    }

    public void _check() {
        if (chunks().size() != qty())
            throw new ConsistencyError("chunks", chunks().size(), qty());
        for (int i = 0; i < this.chunks.size(); i++) {
        }
    }
    ```
  - `repeat: until` — in principle exactly the same problems as with `terminator`, so in short, an element matching the `repeat-until` condition must be at the end and nowhere else.

    My pull request kaitai-io/kaitai_struct_compiler#183 adds the check that the last element must match `repeat-until`, but misses the checks that all non-last elements must not match the `repeat-until` condition (of course, consistency would also be violated if any non-last element matched, since parsing would end after that element instead of the actual last one), as @UnePierre correctly pointed out in kaitai-io/kaitai_struct_compiler#183 (comment).

    Obviously, `terminator` has much the same logic as `repeat: until`, so they should both have the same set of consistency checks.

    Unfortunately, as I mentioned in kaitai-io/kaitai_struct_compiler#183 (comment), it's sometimes tempting to use `_io` in the `repeat-until` expression in real formats, but it's not possible to reference `_io` in the `_check` method (this actually applies to all expressions in general).
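    A hypothetical sketch of the full set of `repeat: until` consistency checks discussed above, assuming `repeat-until: _ == 0` on a field `numbers` stored as an `ArrayList<Long>` (not actual generated code):

    ```java
    public void _check() {
        if (numbers().isEmpty())
            throw new ConsistencyError("numbers", 0, 1);
        for (int i = 0; i < numbers().size(); i++) {
            boolean matches = (numbers().get(i) == 0); // the repeat-until expression
            if (i == numbers().size() - 1 && !matches)
                throw new ConsistencyError("numbers", numbers().get(i), 0); // the last element must match
            if (i != numbers().size() - 1 && matches)
                throw new ConsistencyError("numbers", numbers().get(i), 0); // non-last elements must not match
        }
    }
    ```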
  - `repeat: eos` — there is currently no end-of-stream check to ensure consistency:

    ```yaml
    - id: numbers
      type: u4
      repeat: eos
    ```

    ```java
    public void _write() {
        for (int i = 0; i < this.numbers.size(); i++) {
            this._io.writeU4le(numbers().get((int) i));
        }
    }

    public void _check() {
        for (int i = 0; i < this.numbers.size(); i++) {
        }
    }
    ```

    Since we will only deal with fixed-size streams in the initial serialization support anyway, we can actually check and require `_io.eof` to be `true` after writing the field with `repeat: eos` in `_write` (it's not unlike moving checks involving any `_io` from `_check` to `_write` — you can read about this later in these notes). If we're not at the end of the stream, throw a `ConsistencyError` — this should be enough to ensure consistency with parsing.
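    A sketch of that end-of-stream check (hypothetical; `isEof()`, `pos()` and `size()` exist in the Java runtime, but this check is not generated today):

    ```java
    public void _write() {
        for (int i = 0; i < this.numbers.size(); i++) {
            this._io.writeU4le(numbers().get((int) i));
        }
        // repeat: eos must consume the rest of the stream, so after writing we must be at EOF
        if (!this._io.isEof())
            throw new ConsistencyError("numbers", this._io.pos(), this._io.size());
    }
    ```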
- type switching - in principle relatively simple (SwitchManualInt)

  ```java
  public void _read() {
      this.code = this._io.readU1();
      switch (code()) {
      case 73: {
          this.body = new Intval(this._io, this, _root);
          ((Intval) (this.body))._read();
          break;
      }
      case 83: {
          this.body = new Strval(this._io, this, _root);
          ((Strval) (this.body))._read();
          break;
      }
      }
  }

  public void _write() {
      this._io.writeU1(this.code);
      switch (code()) {
      case 73: {
          ((Intval) (this.body))._write(this._io);
          break;
      }
      case 83: {
          ((Strval) (this.body))._write(this._io);
          break;
      }
      }
  }

  // ...

  private KaitaiStruct.ReadWrite body;
  ```
  - in practice, there are a few cases with missing type casts (resulting namely in errors like "error: incompatible types: `Object` cannot be converted to `byte[]`" and "error: incompatible types: `Long` cannot be converted to `int`"), but these should be straightforward to fix
  - perhaps in `_check` there could be a check asserting that the actual type of the value stored in the combined property matches the expected type according to the value we're switching on. In Java, though, if there was an incorrect type, the type cast chosen based on the control value would typically fail in `_write` anyway with a `ClassCastException`, so it's a question whether there is any benefit in doing the "same" work in `_check`.

    However, in dynamically-typed languages (JavaScript, Python, etc.), there will be no type conversion in `_write` that could fail, so the mismatch would not necessarily be caught there at all, which makes an explicit type check in `_check` more valuable.
-
- `instances` — quite problematic due to lazy evaluation and caching
  - value instances: they don't need setters (actually should not have them, the expected procedure is to change dependency values), but there must be a way to regenerate them after the values of properties they depend on change
-
    - in fact, some method to "invalidate" or unset the cached value of the instance would be enough, which in practice just sets the cache property to `null`, and thus the next call of the instance will recalculate it
    - the naming suggestions that I can think of are `_invalidate{Inst}`, `_reset{Inst}` or `_unset{Inst}` — but "unset" is my least favorite: it sounds misleading and I'd rather avoid it (it just feels like "the opposite of a setter", or "like a setter, but it just clears the property instead of setting it to a new value", but these interpretations suggest that you're permanently (un)setting something and would read back the "no value". That's incorrect, and it misses the point of why you'd want to invalidate the value instance: to allow it to fetch a new meaningful value upon future request.)
    - vision for the future: value instances could actually be invalidated automatically right in the setters of their dependency values, and the methods for invalidating value instances could in turn invalidate other value instances recursively (however, in some cases it may be a problem to reach all types that can be anywhere in the object structure, e.g. in all entries of a repeated field, etc.)
      - but this is again quite an advanced thing, and we can do without it (by invalidating all dependent instances manually after changing something), so it's rather unlikely to be in the initial shot of serialization
-
    - simple example (Expr2) — if you change `len_orig` via a setter, `len_mod` must be invalidated, otherwise the serialized structure will be inconsistent (because `len_orig` has been updated, but `str` would still use the old, wrong cached `len_mod`):

      ```yaml
      types:
        mod_str:
          seq:
            - id: len_orig
              type: u2
            - id: str
              type: str
              size: len_mod
              encoding: UTF-8
            - id: rest
              type: tuple
              size: 3
          instances:
            len_mod:
              value: len_orig - 3
      ```
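      A hypothetical sketch of how this could look in application code, assuming an `_invalidateLenMod()` method gets generated (the method and accessor names here are illustrative only):

      ```java
      // illustrative only: m is a mod_str object obtained from parsing
      static void updateLenOrig(Expr2.ModStr m) {
          m.lenOrig(10);          // change the dependency value via its setter
          m._invalidateLenMod();  // drop the cached len_mod so the next access recomputes it
          m.str("1234567");       // str must now be len_mod = len_orig - 3 = 7 chars long
          m._check();
      }
      ```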
-
    - actually, it now seems clear to me that "invalidating" is the correct operation to implement and "regenerating" is not, because only invalidation allows you to delay the evaluation until you actually want it to happen. The moment you learn that the old cached value is no longer valid may not be the right time to load a new value.
      For example, consider saving `_io.pos` in a value instance (from https://github.com/kaitai-io/coreldraw_cdr.ksy/blob/333988b/coreldraw_cdr.ksy#L1451-L1469, slightly simplified):

      ```yaml
      types:
        # ...
        points_list:
          params:
            - id: num_points
              type: s4
          seq:
            - size: 0
              if: ofs_points < 0
            - id: points
              type: point(point_types[_index])
              repeat: expr
              repeat-expr: num_points
          instances:
            ofs_points:
              value: _io.pos
            point_types:
              pos: ofs_points + point_size * num_points
              type: u1
              repeat: expr
              repeat-expr: num_points
            point_size:
              value: sizeof<s4> * 2
      ```
      When parsing, the `ofs_points` value instance remembers the stream position just where `points` start — this is nice because you have a fixed reference point. However, when editing the parsed file, the `points_list` may get repositioned in the stream, so `ofs_points` needs to be invalidated. But the important part is that you invalidate `ofs_points` in application code (after you have changed the length of any preceding field and know this will eventually affect the offset of `points`), and it doesn't make sense to re-evaluate `_io.pos` at that same moment.

      In this phase of editing the parsed objects in application code (which will eventually be serialized, but only when everything is ready), `_io` would probably still be in the state in which parsing in this stream has ended (maybe at the end of the stream, but not necessarily); or the input stream may already be closed (so `_io.pos` may even throw an `IOException`) and writing will be done to a completely different output stream, which will be supplied as a parameter to `_write` and thus is not available at this point yet.

      So what you really want is to just invalidate `ofs_points` in application code and not touch it until the `_write` method of the `points_list` type is called, where `ofs_points` would finally be invoked and would fetch the new `_io.pos` value valid in the output stream.
-
-
  - parse instances — quite problematic, it's difficult to foresee all the problems here
-
    - they need setters
    - the `_write` method, meant as the counterpart of `_read`, will not (and should not) serialize them, so they must have their own write mechanism
    - likewise, `_check` probably won't deal with them either (it should exactly accompany `_write`)
    - not all parse instances should actually be written: some are just lookaheads (purely parsing helpers), some are union members (so only one union member will be used), etc. — it's up to the user to deal with it
-
    - so the proposed interface would be as follows (assume a parse instance called `inst`):
setter
-
_checkInst— includes all kinds of consistency checks that_checkwould do, but only for the particular instance -
_writeInstmethod — similar to_write, but only for the particular instance (and if you think about it, if the instance is of a user-defined type, the_writeInstwould only deal with theseqstructure of it, innerinstancesneed to be written again explicitly)-
the
_writeoverload allowing to override_io, i.e. the following:public class KaitaiStruct { // ... public abstract static class ReadWrite extends ReadOnly { // ... public void _write(KaitaiStream io) { this._io = io; _write(); } // ... } // ... }
will not be needed, because it's assumed that the
_writeofseqhas been already called, so the currentthis._iois set to the correct output stream that user wants (or it may also be a substream created by a parent_write).Parse instances also have the
iokey, so they may use a different stream than the_ioof the current type, and overriding the stream that is set inioclearly makes no sense and can only cause inconsistency.
Actually, the
iokey itself also excludes any naive ideas like "why not write all instances at the end of the_writemethod", and let's pretend for a while that the already mentioned fact that all instances are often not meant to be written isn't a problem (even if it is) to make it more interesting. The referenced stream iniocan easily be a "future" one, i.e. it's not that hard to imagine this situation in practice (I've seen it several times):seq: - id: a type: foo - id: b size: 0x100 type: slot types: foo: seq: - id: baz size: 4 instances: inst: io: _parent.b._io pos: 0x10 type: u1 slot: {}
Instances are lazy, so this will work seamlessly when parsing (well, unless we void the laziness e.g. by requiring the value of
instin some parsing decision inseqbefore the substream ofbhas been created and assigned — then it doesn't work, of course).If we were to naively write
instat the end ofFoo._write(), it would be wrong (too soon), because the output substream forbhasn't been created yet (so depending on what_parent.b._iowould be at this point, we would be either writing to an old substream left from parsing which will be garbage-collected after it's replaced by the output substream — so that any writes there will be lost, and that's when we are lucky that it wasn't read-only, because in principle it could be; or to anullstream ifbis a fresh object with no_ioset by the user via a setter, in which case we would usually get a NullPointerException and the_writewould end here). It's basically the same reason why this instance must be parsed lazily and eager parsing doesn't work — as you can see, eager writing doesn't work either.So this just underlines the correctness of the decision that the user is in charge of writing instances one by one, without any general easy way around it without very careful consideration.
-
-
-
    - the fact that writing of each parse instance is disconnected from the main `_write` actually creates room for inconsistencies, because I imagine it won't be hard for the user to forget to write some instance deep in the object tree (so parsing this instance from the serialized stream would read garbage). Unfortunately, I don't think we can help much with that in the general case, because we don't know what `instances` the user wants to write.
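      A hypothetical usage sketch of the per-instance interface proposed above (the method names follow the proposal and do not exist yet; `io` and `out` are assumed to be the input stream and a fixed-size output stream):

      ```java
      Foo foo = new Foo(io);
      foo._read();

      foo.inst(42);        // set the instance value via its setter
      foo._checkInst();    // consistency checks for this instance only

      foo._write(out);     // _write covers only the seq fields...
      foo._writeInst();    // ...so the instance must be written explicitly, or it stays unwritten
      ```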
-
-
-
- user-defined and implicit `_root`, `_parent`, `_io` and `_is_le` parameters of subtypes
  - they can always be set to an arbitrary value by instantiating the type in application code (for example, when creating a new file from scratch) and injecting it into the enclosing type via a setter, so in `_check` of the type in which the parametric type is placed, the parameter values should be checked for consistency (from the perspective of the parent type where this `_check` is performed, `subtype._root` and `subtype._parent` should be the same as what the `_read` method would normally pass to the corresponding parameters when instantiating the type; `subtype._io` will actually be overridden in `_write`, so it doesn't matter; user-defined parameters must be equal to what the expression used as the param value in the type invocation would evaluate to)
  - if the parametric type was created during parsing, it would be useful to be able to update its parameters (without having to recreate the type to set the new parameter value via the constructor and then copy all the values from the old object to this new one)
-
  - <future-plans>
    I don't think classic setters for direct use by the user are the right answer, because the parameter value should always be the same as the result of the particular expression in the invocation of the parametric type in the parent type, not set to an independent arbitrary value. Although it can't silently cause any inconsistency (because `_check` will throw a `ConsistencyError` if the parameter value does not equal what is expected), it's probably not a great API to have a setter just so that the user can (in fact, has to) set the only valid value at the time (i.e. why doesn't Kaitai Struct just set the value itself when it's completely determined).

    A better API would be not to generate the usual setters, but a new method `_updateParams` (naming completely random, of course), which would update `_root`, `_parent`, `_io` (actually, the value of `_io` doesn't matter at this point and can be `null` or anything else, because it will be replaced by a parent `_write` anyway), `_is_le` and user-defined parameters (using the same expressions as when invoking the type normally in `_read` or instance getters).

    The `_updateParams` method must be in the type that includes the parametric type in question — for simplicity, it would update parameters of all subtypes included via `seq` within the type (so that we don't have to worry about targeting a specific field, or even a specific entry index for repeated fields). OK, the name is now confusing, let's maybe call it `_updateParams_Seq`. For parametric subtypes in `instances`, `_updateParams{Inst}` would probably do. As for the implementation of `_updateParams_Seq` and `_updateParams{Inst}`, they are outside the type whose parameters they are supposed to update, so they can only use its public API (i.e. nothing more than the application code can use). Which means that the parametric type either needs to have public setters, or the `_updateParams*` methods would again have to resort to recreating a new instance of the parametric type from scratch and copying everything over to it (which is ugly).
    </future-plans>
    Note: well, this initially simple idea grew quite a bit in complexity, and in the end it looks like we'll need public setters for parameters anyway, so let's simply add them for now and call it a day. The user will use the parameter setters if needed, and the `_check` in the parent type will inspect the parameters set in its subtypes and report any inconsistencies found.
-
  - ignore the previous point — just generate setters and add consistency checks to the parent `_check` for:
    - `_root`
    - `_parent`
    - `_io` — unneeded, it would end up not used anyway
      - in fact, the `_write` method already serves as the `_io` setter, so you can ask any type to serialize to any `KaitaiStream` you provide (it's obviously also used by the generated code to serialize all subtypes, as you may have noticed earlier):

        ```java
        public class KaitaiStruct {
            // ...
            public abstract static class ReadWrite extends ReadOnly {
                // ...
                public void _write(KaitaiStream io) {
                    this._io = io;
                    _write();
                }
                // ...
            }
        }
        ```

        This ability will also be useful for calculating CRC-32 checksums in application code, for example — you take the object already filled with enough data to give you the entire input for CRC-32 once serialized, you `_write` it to a temporary stream, then extract the bytes it produced and pass them to the CRC-32 function. Since `_write` is deterministic (unless you do something funny with opaque types, I guess), it will produce the same bytes for those values once the actual `_write` is performed, so the CRC-32 value calculated this way will be valid.
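        This CRC-32 idea could look roughly like this (an illustration only, using `java.util.zip.CRC32`; it leans on the size-based `ByteBufferKaitaiStream` constructor and `toByteArray()` shown earlier, and `header`/`headerSize` are placeholders for whatever object and length cover the checksummed bytes):

        ```java
        ByteBufferKaitaiStream tmp = new ByteBufferKaitaiStream(headerSize);
        header._write(tmp);               // serialize into the temporary stream
        byte[] bytes = tmp.toByteArray(); // extract the bytes it produced

        java.util.zip.CRC32 crc = new java.util.zip.CRC32();
        crc.update(bytes);
        long checksum = crc.getValue();   // set this value on the checksum field before the real _write
        ```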
-
-
    - `_is_le` (if inherited endianness is used)
    - user-defined parameters
-
-
-
-
- calculated default endianness (`meta/endian/{switch-on,cases}`)
  - no new conceptual problems; the only tricky thing seems to be the `_is_le` parameter (for inherited endianness), and parameters have already been addressed in the previous bullet point
-
- using `_io` in expressions (and other places which ultimately have an impact on parsing decisions) and how it can mess with the compiler's (in)ability to derive sizes automatically (in the future, so that users don't have to do the tedious calculations themselves) — this actually poses quite a fascinating logical problem, I realized
  - maybe a small introduction to the problem, showing how `if` expressions are used in `_write` (IfStruct):

    ```java
    public void _read() {
        this.opcode = this._io.readU1();
        if (opcode() == 84) {
            this.argTuple = new ArgTuple(this._io, this, _root);
            this.argTuple._read();
        }
        if (opcode() == 83) {
            this.argStr = new ArgStr(this._io, this, _root);
            this.argStr._read();
        }
    }

    public void _write() {
        this._io.writeU1(this.opcode);
        if (opcode() == 84) {
            this.argTuple._write(this._io);
        }
        if (opcode() == 83) {
            this.argStr._write(this._io);
        }
    }
    ```

    Nothing too surprising, I'd say, but it's important to realize that the user's `if` expressions are used "as is" in `_write`, in the same form as in `_read`.
  - the following occurred to me when pondering the misery of users having to set all stream lengths themselves, and whether it can be fully automated in the future by the generated code while retaining all the features that Kaitai Struct has (including the use of `_io` in parsing decisions)
  - I realized that it's actually logically impossible in the following case:

    ```yaml
    seq:
      - id: len_a
        type: u1
      - id: a
        size: len_a
        type: foo
    types:
      foo:
        seq:
          - id: prefix
            size: 4
          - id: optional_suffix
            size: 6
            if: not _io.eof
    ```
    Imagine that you want to derive `len_a` automatically (let's say we want the contents of the type to exactly fit into the length we calculate, meaning that the last field of the type will align with EOS and parsing even 1 byte beyond it would result in an EOF error). `prefix` is easy, it always needs 4 bytes, so `len_a` must be at least 4. Now we need to evaluate the expression `not _io.eof` to know whether `optional_suffix` is present or not. Well, we somehow can't seem to know whether we're already at EOF at this point (in order to be able to answer the `_io.eof` query), because we would have to know that all following fields in `seq` will occupy 0 bytes.

    Let's break down both answers to the `_io.eof` question:

    - `true` — `not _io.eof` is then `false` and therefore `optional_suffix` won't be parsed (so we set `len_a = 4`).

      A quick check whether this is consistent when parsing - after `prefix`, are we at EOF? Yes, because `_io.pos == 4` and `_io.size == 4`, so `_io.eof` is indeed `true`.
    - `false` — `not _io.eof` is then `true` and therefore `optional_suffix` will be parsed (so we would set `len_a = 4 + 6 = 10`).

      Consistency check - after `prefix`, are we at EOF? No, because `_io.pos == 4` and `_io.size == 10`, so `_io.eof` is `false`.
    So note that if we attempt to calculate what the length must be (without having this size set in advance by the user), `if: not _io.eof` doesn't tell us anything - we can set it both ways and it always satisfies itself. It looks like it's not this `if` expression that makes decisions; it's the other way around - the decision that the user makes (in this case, whether to include `optional_suffix` or not) affects how `not _io.eof` will evaluate.

    In some cases, it also seems like you need to be able to reverse expressions to do the right thing:

    ```yaml
    types:
      foo:
        seq:
          - id: prefix
            size: 4
          - id: suffix
            size: _io.size - _io.pos - 2
    ```
    Now the compiler would presumably be supposed to derive that the right size for the stream would be `4 + suffix.length + 2` (because the two unparsed bytes at the end would remain when parsing as well).

    Of course, it doesn't take much work to come up with a number of even weirder cases, because the ability to use `_io` everywhere in expressions allows you to do a lot of things (and it's not even always possible to guess what the user wants, so even the most automatic implementation possible would sometimes still have to ask the user to disambiguate).
-
-
- following the previous point, I think it's actually a correct decision to require precalculated lengths of substreams and only support fixed-size streams, because it automatically solves the problems with referring to `_io` (when everything has a known size, it's basically as easy as parsing), and also allows users to reserve additional space for writing `instances` afterwards, and whatnot
  - generally, it follows the whole idea of the initial serialization version - the assumption is that the user somehow knows a bunch of (quite low-level) things and gives them to the KS-generated serializer, which would not have come up with these values itself, but at least it's able to give feedback on whether they're right or wrong (ideally, if the `_check` methods succeed and `_write` doesn't throw EOF or other errors either, it should be a guarantee that the parser will read back the exact same values properly)
  - however, inside the `_check` method, we will have to be careful that none of the user expressions that may appear there refer to `_io`, because we don't have `_io` available in `_check`.

    Simple example - I think the existing PoC implementation checks whether the `size` value matches the actual length of a byte array, so this would most likely be a problem:

    ```yaml
    seq:
      - id: rest
        size: _io.size - _io.pos
    ```

    The above essentially does the same as `size-eos: true`. Nevertheless, I think it still makes sense to check the length of `rest` (the user must have provided the final `_io.size`, because we only support fixed-size streams, and `_io.pos` is known as well), you just have to do it in `_write` just before writing the particular attribute, not in `_check` as usual. Only in `_write` do you have the required streams, and they will be in the correct state.

    So we need to analyze all expressions that would normally go into `_check` for whether they reference any `_io` directly or indirectly (i.e. by referring to value `instances`, which again may refer to `_io` directly or indirectly), and if they do, these checks must be moved into `_write` before the particular attribute is written.
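    A sketch of the idea (hypothetical generated code): the length check for `rest` references `_io`, so it is emitted in `_write` right before the field is written, instead of in `_check`:

    ```java
    public void _write() {
        // ...
        if (rest().length != this._io.size() - this._io.pos())
            throw new ConsistencyError("rest", rest().length, this._io.size() - this._io.pos());
        this._io.writeBytes(this.rest);
    }
    ```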
-
-
- setters - per Java conventions, getters should obviously be called `get{Something}` and setters `set{Something}`. However, we already don't follow that for getters, so suddenly "following" it only for setters looks wrong to me:

  ```java
  public long ofsTags() { return ofsTags; }
  public void setOfsTags(long _v) { ofsTags = _v; }
  public long numTags() { return numTags; }
  public void setNumTags(long _v) { numTags = _v; }
  public ArrayList<Tag> tags() { return tags; }
  public void setTags(ArrayList<Tag> _v) { tags = _v; }
  ```

  In our case, it would be more consistent to leave out the common `set` prefix for setters too, in my opinion:

  ```java
  public long ofsTags() { return ofsTags; }
  public void ofsTags(long _v) { ofsTags = _v; }
  public long numTags() { return numTags; }
  public void numTags(long _v) { numTags = _v; }
  public ArrayList<Tag> tags() { return tags; }
  public void tags(ArrayList<Tag> _v) { tags = _v; }
  ```
  Overloading like this is not a problem in Java, and it would even simplify the implementation a bit inside the compiler, as we wouldn't have to construct a special name for setters.
  It also solves the question of setters for `_root` and `_parent`, which currently look weird, as if they were using "snake case" (and renaming them like `set_Root` doesn't help, in my opinion):

  ```java
  public SwitchBytearray _root() { return _root; }
  public void set_root(SwitchBytearray _v) { _root = _v; }
  public KaitaiStruct.ReadWrite _parent() { return _parent; }
  public void set_parent(KaitaiStruct.ReadWrite _v) { _parent = _v; }
  ```
  I don't think removing `set` makes the API any less enjoyable to use - this is how the asserts after parsing look:

  ```java
  assertEquals(r.len1(), 0x10);
  assertEquals(r.block1().number1(), 0x42);
  assertEquals(r.block1().number2(), 0x43);
  assertEquals(r.len2(), 0x8);
  assertEquals(r.block2().number1(), 0x44);
  assertEquals(r.block2().number2(), 0x45);
  assertEquals(r.finisher(), 0xee);
  ```

  and one would serialize the same values as follows:

  ```java
  BufferedStruct r = new BufferedStruct();
  r.len1(0x10);
  BufferedStruct.Block block1 = new BufferedStruct.Block();
  block1.number1(0x42);
  block1.number2(0x43);
  r.block1(block1);
  r.len2(0x8);
  BufferedStruct.Block block2 = new BufferedStruct.Block();
  block2.number1(0x44);
  block2.number2(0x45);
  r.block2(block2);
  r.finisher(0xee);
  r._write(io);
  ```

  instead of:

  ```java
  BufferedStruct r = new BufferedStruct();
  r.setLen1(0x10);
  BufferedStruct.Block block1 = new BufferedStruct.Block();
  block1.setNumber1(0x42);
  block1.setNumber2(0x43);
  r.setBlock1(block1);
  r.setLen2(0x8);
  BufferedStruct.Block block2 = new BufferedStruct.Block();
  block2.setNumber1(0x44);
  block2.setNumber2(0x45);
  r.setBlock2(block2);
  r.setFinisher(0xee);
  r._write(io);
  ```
It's just a matter of habit.
  This `set` prefix would probably be specific to Java anyway, because other target languages that we support often use different approaches to setters.
- testing
  - our test suite of `.ksy` format specs can of course be reused to test serialization (that's kind of the point), but it's not clear how such tests will be conducted
  - there are several options for what you can do, and each option may expose a slightly different set of errors that may exist in the implementation — they correspond to different use cases of how you can use serialization (see kaitai-io/kaitai_struct#27 (comment))
-
  - naturally, we can take advantage of the parsing that we already have (to get the initial objects to edit and then serialize, or to make sure that the serialized output is consistent with the values that have been written); it actually gives us interesting testing opportunities
-
  - on ensuring the consistency property (i.e. `_check` methods report violated constraints of user-provided values, `_write`)
    - unchanged parsed objects are always consistent; inconsistencies arise from the user setting properties with constraints to arbitrary values
-
    - this allows us to do property-based testing, which is easy to implement and which I think will be effective at finding bugs (because we already have pretty solid parsing, so if the serialization machinery is doing something wrong, I think we can rely on parsing to detect it):
      - one easy thing to do that might detect faulty `_check`s would be to run `_check`s everywhere in the object tree hierarchy of freshly parsed objects, and they must report no errors
      - another easy thing, to reveal faulty `_write`s, would be to try writing unchanged parsed objects; if parsing the serialized bytes doesn't yield exactly the same as what was written, you know that some `_write` messed up
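      A sketch of that parse → write → re-parse roundtrip (test-framework-agnostic; the format class, sample file and compared field are placeholders, and the usual `java.util.Arrays`/`java.nio.file` imports are assumed):

      ```java
      public static void roundtripTermBytes() throws Exception {
          byte[] original = Files.readAllBytes(Paths.get("src/term_bytes.bin"));

          TermBytes r = new TermBytes(new ByteBufferKaitaiStream(original));
          r._read();
          r._check(); // freshly parsed objects must pass all consistency checks

          ByteBufferKaitaiStream out = new ByteBufferKaitaiStream(original.length);
          r._write(out);

          // parsing the serialized bytes back must yield exactly the values that were written
          TermBytes r2 = new TermBytes(new ByteBufferKaitaiStream(out.toByteArray()));
          r2._read();
          if (!Arrays.equals(r2.s1(), r.s1()))
              throw new AssertionError("roundtrip mismatch for s1");
      }
      ```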
-
-
-
  - the general idea is that if there is some scenario in which the user is able to use setters, constructors to create new objects, etc. to set any values that the actual data types of the properties in the target language accept (for example, in Java anything with type `KaitaiStruct` would accept `null`, because it's just a reference type), then calls the `_check`s and `_write` the way they're supposed to (of course, if you e.g. don't call the `_write{Inst}` method for a particular instance, it won't be written and thus you won't be able to read it back, but this is expected behavior and purely user error), and everything succeeds, and yet parsing of the serialized output does not exactly match the set values, it means that either there is an error in some of the `_write` methods (in case the values set by the user are actually consistent), or the user-provided values were not consistent and the appropriate `_check` method failed to report it
KaitaiStructwould acceptnull, because it's just a reference type) and then calls_checks and_writein the way they're supposed to (of course, if you e.g. don't call the_write{Inst}method for a particular instance, it won't be written and thus you won't be able to read it back, but this is expected behavior and purely user error) and everything succeeds, and yet parsing of the serialized output does not exactly match the set values, it means that either there is an error in some of the_writemethods (in case the values set by the user are actually consistent), or it was because the user provided values were not consistent and the appropriate_checkmethod failed to report it -
  - following the previous point, it would be ideal to integrate fuzzing that would set some interesting values via the setters that the KS-generated classes expose. Inevitably, some sets of values that the fuzzer provides will be inconsistent, and the idea is to reveal insufficient `_check`s by first finding a combination of values that passes `_check` (I believe that coverage-based fuzzing can help here; as I imagine it, if the first check in `_check` fails, it would try to adjust the value so that this first check passes, and when it does, it would keep it and try to fix the other reported `ConsistencyError`s) and then doing a write-read roundtrip to see whether the `_write` was really consistent.

    I would like to avoid having to manually enumerate each specific hardcoded case (because it's too easy to miss important cases, especially when the consistency of serialization depends so much on user interaction with the objects). Some special serialization use cases would probably still have to be tested as traditional unit tests with all the asserts and different input values, but it would be nice to keep that to a minimum, as I believe a fuzzer can do a better job (instead of a few hardcoded values that take a human a long time to figure out and provide little assurance that it maybe works, a fuzzer has the potential to test millions of such values in almost no time).