Created
December 21, 2010 13:51
-
-
Save paulfitz/749949 to your computer and use it in GitHub Desktop.
This is format specification for table differences
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
default: | |
asciidoc tdiff_spec_draft.txt |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
tDiff Format Specification | |
========================== | |
:Author: Paul Fitzpatrick, Joe Panico | |
:Email: [email protected], [email protected] | |
:Date: 2011-01-12 | |
:Revision: 0.2 | |
[abstract] | |
Purpose | |
------- | |
This document defines the tDiff format; a text encoding for describing | |
the differences between two tables of data. | |
tDiff output is intended for two audiences: end-users who are looking | |
for an easily readable report that describes the differences between | |
two tables; and tools that will interpret tDiff output in order to | |
perform further processing or analysis (e.g. "patch" tools). | |
Scope | |
----- | |
tDiff should be able to describe the differences between *any types* | |
of data tables. That includes: tables that have an identifiable unique | |
key (e.g. RDBMS tables with primary key) and tables that don't (RDBMS | |
tables without primary key, or heaps); tables where row ordering is | |
meaningful, whether explicit or implied, such as spreadsheets or CSV | |
files; and tables where row ordering has no inherent meaning (RDBMS | |
tables). Further, tDiff should be able to represent the differences | |
between two different *types* of table, e.g. an RDBMS table versus a | |
spreadsheet table. | |
Limitations | |
----------- | |
tDiff is designed to described the differences in *content* between | |
two tables-- it does not have any facilities for describing the | |
differences in *structure* between two tables. A future revision will | |
expand the specification to include structure or schema information. | |
tDiff is designed to described text (character) content. It has no | |
facilities to describe differences in content that does not have a | |
natural default character representation. A future revision may expand | |
the specification to include arbitrary binary data. | |
tDiff can represent the differences between exactly *two* tables. It | |
cannot represent an n-way compare where n > 2. It also cannot | |
represent the differences between collections of tables, such | |
as multiple-sheet spreadsheets. | |
Representation | |
-------------- | |
tDiff documents use the UTF-8 character encoding. | |
Top-level structure | |
------------------- | |
tDiff documents comprise any number of <<Comment,comment blocks>>, | |
<<Control,control blocks>>, and | |
<<Hunk,diff hunks>>, interleaved in any order. | |
Each diff hunk describes a related set of differences | |
between two tables. Each hunk could stand on its own as | |
an independent tDiff document. When there is a choice | |
in how to decompose differences between two tables as | |
a sequence of hunks, | |
generators are encouraged to choose a decomposition that | |
minimizes ordering effects between hunks. | |
.Document example | |
---- | |
# tdiff version 0.2 | |
/* | |
* this comment could describe the whole document | |
*/ | |
/* | |
* this comment could describe the following hunk | |
*/ | |
* |bridge=Brooklyn|designer:'J.A.Roebling'|length:1595| | |
= |bridge=Williamsburg|designer:'D.D.Duck'->'L.L.Buck'|length:1600| | |
* |bridge=Queensborough|designer:'Palmer & Hornbostel'|length:1182| | |
/* | |
* this comment could describe the following hunk, even | |
* though it's not immediately adjacent to the hunk (there is | |
* an intervening newline). | |
*/ | |
- |bridge=Spamspan|designer:'S.Spamington'|length:10000| | |
+ |bridge=Manhattan|designer:'G.Lindenthal'|length:1470| | |
/* | |
* final, wrap up, comment | |
*/ | |
---- | |
[[Comment]] | |
Comment blocks | |
-------------- | |
Comment blocks are delimited using: /* */ (C style). Any content can | |
occur within a comment block. | |
.Example comment | |
---- | |
/* This is an example single-line comment */ | |
/* This is an | |
example multi-line | |
comment */ | |
---- | |
[[Control]] | |
Control blocks | |
-------------- | |
Control blocks begin with "# ", and are delimited by a newline or | |
linefeed. Control blocks may hold meta information about diffs, | |
or environmental information that might be useful to an interpreter. | |
Apart from the special <<Header,header control block>>, they | |
lie outside of the scope of this specification. | |
[[Header]] | |
Header | |
------ | |
A tdiff document should begin with a special | |
<<Control, control block>> called the header. | |
The header begins with the characters "# tdiff". It is there | |
to aid in rapid identification of tdiff documents. Documents | |
*without* this block should be accepted by interpreters, but | |
this block should always be included by generators. | |
.Example header | |
---- | |
# tdiff version 0.2 | |
---- | |
[[Hunk]] | |
Hunk | |
---- | |
A hunk is a series of one or more adjacent <<Line,diff lines>>, | |
optionally preceeded by a <<Column,column line>>, where each | |
diff line represents the | |
differences between the source tables for a single row. The | |
lines within a hunk should be separated by only the newline | |
characters that terminate each diff line, so that they all appear as | |
adjacent lines within a text editor. Within a tDiff document, | |
hunks are delimited from each other via intermediate filler | |
or comment blocks. | |
.Example hunk | |
---- | |
- |bridge=Spamspan|designer:'S.Spamington'|length:10000| | |
= |bridge=Williamsburg|designer:'D.D.Duck'->'L.L.Buck'|length:1600| | |
+ |bridge=Manhattan|designer:'G.Lindenthal'|length:1470| | |
---- | |
[[Line]] | |
Diff line | |
--------- | |
A diff line describes differences in a single row of the two tables that | |
were compared. One table is designated the left or local table (called *L*) | |
and the other table is designated the right or remote table (called *R*). | |
There are three types of diff lines: | |
- MISSING line: describes a row that is present in *L* but absent in *R*. | |
- EXTRA line: describes a row that is absent in *L* but present in *R*. | |
- CHANGE line: describes a row that is present in both tables, but differs | |
in some specific column values. | |
Each diff line occupies its own line in the document, and begins with | |
one of three characters. These three characters are called "line type" | |
characters: | |
- MISSING lines begin with plus: _'+'_. In order to make *R* look like *L* we would have to add the missing row to *R*. | |
- EXTRA lines begin with minus: _'-'_. In order to make *R* look like *L* we would remove the extra row from *R*. | |
- CHANGE lines begin with equals: _'='_. In order to make *R* look like *L* we would update some of its column values. | |
The line type character can be left or right padded with any amount of | |
whitespace, for readability. The line type character is followed by | |
any number of name-value pairs, where the names represent column | |
names, and the values are the values for the corresponding column name | |
in that particular row. The name is separated from the value by an | |
equals ('=') sign for identifying columns (usually part of the primary | |
key, but see <<Identity,Keys versus identity>>) | |
or a colon (':') sign for all other columns. | |
The name-value pairs, as well as the line type | |
character, are delimited with whitespace and/or a pipe _'|'_ character. | |
.Example diff line | |
---- | |
= |bridge=Williamsburg|designer:'D.D.Duck'->'L.L.Buck'|length:1600| | |
---- | |
[[Column]] | |
Column line | |
----------- | |
Optionally, key names can be removed from diff lines. To do this, a | |
special column line may be given within a hunk. This lists | |
column names, followed by "=" for identifying columns. | |
.Column line example | |
---- | |
@ |bridge=|designer|length| | |
---- | |
This establishes bridge as an identifying column that appears first, | |
followed by designer and length columns. We can now rewrite this: | |
.before | |
---- | |
= |bridge=Williamsburg|designer:'D.D.Duck'->'L.L.Buck'|length:1600| | |
---- | |
as this: | |
.after | |
---- | |
= |Williamsburg|'D.D.Duck'->'L.L.Buck'|1600| | |
---- | |
The effect of column lines is limited to within a single hunk. | |
ROW Pseudo Column | |
----------------- | |
Some sources of table data have a meaningful notion of row or line | |
number (e.g. spreadsheets, CSV files) but others do not (RDBMS | |
tables). The representation of the row number may be | |
external to the table data itself. In order to accommodate these | |
sources, document generators can make use of the ROW pseudo | |
column. | |
.pseudo column examples | |
---- | |
+ |ROW=3|name3:value3|name4:value4| | |
- |ROW=4| | |
= |ROW=5|name3:old-value9->new-value9|name4:old-value10->new-value10 | | |
---- | |
Note that a diff that uses ROW numbers may be less useful for applying to | |
RDBMS tables, and alternate row identity choices may be considered by the | |
generator (or specified to the generator by the user). See | |
<<Identity,Keys versus identity>>. | |
Column Numbers | |
-------------- | |
Some sources of table data do not use named columns-- for instance, | |
CSV files that have not designated a column header row. In those | |
cases, column names are replaced with column number designators. The | |
designators follow this pattern: | |
1, 2, ... | |
N.B. The ordinal numbers embedded within column number designators are | |
1's based, not 2's based. In other words, counting starts at 1, not at | |
zero. | |
.column number example | |
---- | |
= |1=value1|2=value2|3:old-value3->new-value3| | |
---- | |
If column names are omitted, and no <<Column,column line>> is given, | |
then ordinal numbers are assumed. All columns are assumed to be | |
identifying, unless they are modified. | |
The previous example | |
can therefore be equivalently expressed as: | |
.column number example (rewritten) | |
---- | |
= |value1|value2|old-value3->new-value3| | |
---- | |
Context | |
------- | |
Though not always strictly required to fully and accurately described | |
differences between two tables, generators are allowed to include | |
extra rows and extra column values as contextual information. These | |
can help human readers oriented themselves within the data, and might | |
help machine interpreters to resolve otherwise ambiguities in applying | |
a tDiff document. | |
In the case of missing or extra row lines, extra | |
column data can be included within the row as name-value pairs, just | |
like any other name-value pairs. In the case of a change line, | |
extra name-value pairs can be included within the line. | |
These contextual name-value pairs should also | |
include only the old value. In other words, contextual column values should | |
not include "->new-value". In addition to contextual column | |
information, a hunk may contain any number of contextual rows. These | |
rows appear with the line type character _'*'_. Streaks of contextual | |
rows can prefix and suffix the block of actual (+, -, =) real diff | |
lines, but cannot be interspersed amongst them. In other words, a | |
hunk is: | |
- <streak of context lines> | |
- <streak of real (+ or - or =) diff lines> | |
- <streak of context lines> | |
.context example | |
---- | |
* |bridge=Brooklyn|designer:'J.A.Roebling'|length:1595| | |
= |bridge=Williamsburg|designer:'D.D.Duck'->'L.L.Buck'|length:1600| | |
* |bridge=Queensborough|designer:'Palmer & Hornbostel'|length:1182| | |
---- | |
Some tables are ordered, some unordered. The context features | |
of tDiff make sense for ordered tables. When a diff between | |
ordered tables is applied as a patch to an unordered table, | |
context features may be ignored. When a diff between unordered | |
tables is applied as a patch to an ordered table, the resulting | |
order may be underspecified (but presumably does not matter). | |
[[Identity]] | |
Keys versus identity | |
-------------------- | |
Determining whether a row is present in *L* and *R* requires | |
a judgment about row identity. This judgment may be simple. | |
For example, the identity of a row may simply be the value of | |
its primary key. However, it is possible that the identity | |
of a row is distinct from its primary key. Consider for | |
example a table with an auto-incrementing integer primary key, | |
rather than something derived from the row data. Comparison | |
of that key between separately maintained copies of that table | |
will be meaningless. For meaningful comparison, an alternate | |
row identity would need to be constructed. | |
This issue lies outside the tDiff specification, but it is | |
important that implementors be aware that columns used for | |
identification may in some cases not be part of the primary key. | |
Appendix: Quoting | |
----------------- | |
Names or values may be quoted in a tDiff document. Quoting is done | |
as follows: | |
- All instances of the single-quote character are duplicated into pairs. | |
- All control characters and the backslash character are escaped as for C literals. | |
- The name or value is wrapped in single-quotes | |
It is always safe to single-quote a name or value. Names or values | |
*must* by quoted in any of the following conditions: | |
- A name or value conflicts with a reserved word: NULL, ROW. | |
- A name or value contains any character in the 7-bit ASCII range that is *not* in the following set: [A-Za-z0-9+.] | |
- A *name* begins with any of the characters [0-9+.]. | |
In the case of column diffs, for each cell that was different between | |
*L* and *R*, both the old and new values are displayed. The | |
old value must come first, followed by '->' (dash greater than), | |
followed by the new value. For all three diff line types, the | |
generator may include *L* name-value pairs that are not | |
strictly needed, but may help with row identification. | |
Appendix: Grammar | |
----------------- | |
.Grammar | |
---- | |
document ::= header (block)* | |
block ::= hunk | control | comment | filler | |
hunk ::= (hunk_header)? (diff_line)+ | |
hunk_header ::= '@' (divider name)* break | |
diff_line ::= ('-' | '+' | '=') (divider term)* break | |
term ::= (name ('='|':'))? (value '->')? value | |
break ::= divider? linebreak | |
control ::= "#" (divider value)* break | |
comment ::= | ("/*" comment_body "*/") | |
filler ::= (linebreak | divider)+ | |
header ::= '#' (divider term)* break | |
divider ::= (' ' | '\t')* ('|')? (' ' | '\t')* | |
linebreak ::= ('\r\n' | '\r' | '\n') | |
# unspecified in grammar: name, value, comment_body | |
---- | |
The linebreak non-terminal needs to be parsed greedily to work with all | |
common end-of-line formats, since the number of linebreaks is | |
significant in the grammar. The comment_body non-terminal is | |
as for C. | |
Appendix: Complete examples | |
--------------------------- | |
In example one, both tables are in an RDBMS, both tables have the same | |
column names, and the rows are identified using column1. | |
.Example 1 tables | |
---- | |
L: R: | |
column1,column2,column3,column4 column1,column2,column3,column4 | |
1, 0000, x, aaaa ---------------------------- | |
---------------------------- 2, 1111, x, aaaa | |
3, 2222, x, aaaa 3, 2222, y, aaaa | |
4, 3333, x, aaaa 4, 0000, z, bbbb | |
5, 4444, x, aaaa 5, 4444, z, bbbb | |
6, 5555, x, aaaa 6, 5555, u, aaaa | |
---------------------------- 7, 0000, v, aaaa | |
---------------------------- 8, 1111, x, aaaa | |
---- | |
.Example 1 diff, variant 1 | |
---- | |
# tdiff version 0.2 | |
/* | |
* this is the tDiff document for example 1, using 1 hunk only and no context. | |
* Note the "|" usage varies from previous examples in this document. | |
* "|" plays the same role as spaces and tabs in the spec, so varying | |
* styles are possible. | |
*/ | |
- | column1=1 | |
+ | column1=2| column2:1111| column3:x| column4:aaaa | |
= | column1=3| column3:x->y | |
= | column1=4| column2:3333->0000| column3:x->z| column4:aaaa->bbbb | |
= | column1=5| column3:x->z| column4:aaaa->bbbb | |
= | column1=6| column3:x->u | |
+ | column1=7| column2:0000| column3:v| column4:aaaa | |
+ | column1=8| column2:1111| column3:x| column4:aaaa | |
/* | |
* end of tDiff document | |
*/ | |
---- | |
.Example 1 diff, variant 2 | |
---- | |
# tdiff version 0.2 | |
/* | |
* here is a tDiff document that is equivalent to the document above, except | |
* that it uses 8 hunks, more comments, and adds in some context | |
*/ | |
/* | |
* hunk 1: notice that columns 2,3,4 are context-- not strictly necessary | |
* to specify a remove | |
*/ | |
- | column1=1| column2:0000| column3:x| column4:aaaa | |
/* | |
* hunk 2: notice that the hunks are separated by standalone newline | |
*/ | |
+ | column1=2| column2:1111| column3:x| column4:aaaa | |
/* | |
* hunk 3: notice that column2 and column4 are merely context | |
*/ | |
= | column1=3| column2:2222| column3:x->y| column4:aaaa | |
/* | |
* hunk 4: notice that the column diff line is surrounded by context rows, and | |
* that the context rows describe the values on the RHS. | |
*/ | |
* | column1=3| column2:2222| column3:x| column4:aaaa | |
= | column1=4| column2:3333->0000| column3:x->z| column4:aaaa->bbbb | |
* | column1=5| column2:4444| column3:x| column4:aaaa | |
/* | |
* hunk 5 | |
*/ | |
= | column1=5| column3:x->z| column4:aaaa->bbbb | |
/* | |
* hunk 6 | |
*/ | |
= | column1=6| column3:x->u | |
/* | |
* hunk 7 | |
*/ | |
+ | column1=7| column2:0000| column3:v| column4:aaaa | |
/* | |
* hunk 8 | |
*/ | |
+ | column1=8| column2:1111| column3:x| column4:aaaa | |
---- | |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment