@felipeochoa
Created March 8, 2017 19:37
Quick overview of how the DOCX format works

How the DOCX spec works

The standard (ISO/IEC 29500) can be downloaded from the ISO website.

A DOCX document is a zipped folder containing several interacting components. The main ones are:

  • word/document.xml: The main document content
  • word/styles.xml: Named style information (e.g. "Heading 1"), similar to CSS
  • word/numbering.xml: Sort of like CSS for numbering styles (e.g., "a)" vs "iii.")
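
Because the package is an ordinary zip archive, its parts can be inspected with any zip library. The sketch below uses Python's standard zipfile module. The helper names (list_parts, read_part) are made up for illustration, and the in-memory archive is only a stand-in with placeholder XML, not a document Word would open.

```python
import io
import zipfile


def list_parts(docx_bytes):
    """Return the names of the parts stored in a DOCX package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return package.namelist()


def read_part(docx_bytes, name):
    """Return the text of one part, e.g. 'word/document.xml'."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return package.read(name).decode("utf-8")


# Build a stand-in archive in memory so the example needs no file on disk.
# A real DOCX also carries other parts that this overview does not cover.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("word/document.xml", "<document/>")
    package.writestr("word/styles.xml", "<styles/>")
    package.writestr("word/numbering.xml", "<numbering/>")
sample = buffer.getvalue()

print(list_parts(sample))
print(read_part(sample, "word/styles.xml"))
```

For a real file, reading its bytes with open(path, "rb").read() and passing them to the same helpers should work the same way.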

Note on measures: The fundamental unit in DOCX is the TWIP, a "twentieth of a point", where a point ("pt") is 1/72 of an inch. Typically, properties referring to a physical length accept either a bare number of TWIPS or a string consisting of a number followed by one of "mm|cm|in|pt|pc|pi" to indicate the units.
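
The arithmetic behind these measures can be sketched as a small converter. The function name to_twips is hypothetical. The factors for pt, in, cm, and mm follow directly from the definitions above (20 TWIPS per point, 72 points per inch); treating both "pc" and "pi" as a pica of 12 points is an assumption worth checking against the spec.

```python
import re

# TWIPS per unit. 1 pt = 20 TWIPS and 1 in = 72 pt = 1440 TWIPS.
TWIPS_PER_UNIT = {
    "pt": 20.0,
    "in": 1440.0,
    "cm": 1440.0 / 2.54,
    "mm": 1440.0 / 25.4,
    "pc": 240.0,  # assumption: a pica of 12 pt
    "pi": 240.0,  # assumption: also a pica of 12 pt
}

_MEASURE = re.compile(r"^(-?[0-9]+(?:\.[0-9]+)?)(mm|cm|in|pt|pc|pi)$")


def to_twips(value):
    """Convert a bare TWIPS number or a 'number + unit' string to TWIPS."""
    text = str(value).strip()
    match = _MEASURE.match(text)
    if match:
        number, unit = match.groups()
        return float(number) * TWIPS_PER_UNIT[unit]
    # No unit suffix: the value is already in TWIPS.
    return float(text)


print(to_twips("1in"))
print(to_twips("12pt"))
print(to_twips(720))
```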

word/document.xml

The main document content consists of a sequence of block-level items wrapped in a body element. There are other "stories" you can include beyond body, such as comments, headers, etc. The main types of block-level content are paragraphs (p) and tables (tbl). Each block-level element has a sub-element specifying its "properties" (pPr for paragraphs and tblPr for tables), which holds the options for styling and layout of the element. Each option is a child element of the properties element. Tables and paragraphs have different properties available, as follows:
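
As a rough illustration of that structure, the sketch below walks the children of body and lists which property elements each block carries. It uses Python's xml.etree.ElementTree and the WordprocessingML main namespace. The sample XML, the style ids in it, and the function name summarize_blocks are invented for the example.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written stand-in for word/document.xml; the style ids are hypothetical.
SAMPLE = f"""<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
    </w:p>
    <w:tbl>
      <w:tblPr><w:tblStyle w:val="TableGrid"/></w:tblPr>
    </w:tbl>
    <w:p/>
  </w:body>
</w:document>"""


def local_name(element):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return element.tag.split("}", 1)[1]


def summarize_blocks(document_xml):
    """Return (block kind, [property element names]) for each child of body."""
    root = ET.fromstring(document_xml)
    body = root.find("w:body", NS)
    summary = []
    for block in body:
        kind = local_name(block)
        # Properties live in a child named after the block: pPr or tblPr.
        props = block.find(f"w:{kind}Pr", NS)
        options = [] if props is None else [local_name(opt) for opt in props]
        summary.append((kind, options))
    return summary


for kind, options in summarize_blocks(SAMPLE):
    print(kind, options)
```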

Paragraph properties

  • Style name pStyle: reference to an entry in word/styles.xml. Sort of like a CSS class

  • Numbering info numPr: reference to an entry in word/numbering.xml. The reference is both an id (numId) and a level number (ilvl). Paragraphs with this property have a number/bullet placed before the beginning of text.

  • Tab stops tabs: Contains a list of tab stops (tab) to set on the given paragraph. Each tab stop specifies a distance pos, a stop type (val), and an optional leader, indicating the fill character. Valid values for these attributes are:

    • val: "clear", "start", "center", "end", "decimal", "bar", "num"
    • leader: "none", "dot", "hyphen", "underscore", "heavy", "middleDot"
    • pos: The distance from the left margin to the tab stop
  • Indentation ind: This element has the following attributes:

    • start: Indentation from the left margin. Negative values move text backwards.
    • firstLine: Additional indentation for the first line. If hanging is also given, this is ignored
    • hanging: Negative indentation for the first line. Trumps firstLine if also given
    • end: Additional margin to leave empty on the right. Negative values move the margin backwards.

    All of these properties accept a TWIPS number or number + unit value. They also all have alternates suffixed with Chars (e.g., startChars) to specify the indentation in "character units"

  • Spacing spacing: Controls spacing between lines and above/below the paragraph. The core attributes are as follows:

    • before: Similar to CSS margin-top, in TWIPS or measure + unit.
    • beforeLines: Similar to CSS margin-top, measured in hundredths of a line
    • after: Similar to CSS margin-bottom, in TWIPS or measure + unit.
    • afterLines: Similar to CSS margin-bottom, measured in hundredths of a line
    • line: Similar to CSS line-height, in 240ths of a line. The meaning of this attribute can change if lineRule is not blank or auto (see the spec for details)
    • There are a couple more attributes; see section 17.3.1.33 of the spec
  • There are many more possible properties for a paragraph. See the spec for details on the following:

    • adjustRightInd
    • autoSpaceDE
    • autoSpaceDN
    • bidi
    • cnfStyle
    • contextualSpacing
    • divId
    • framePr
    • jc
    • keepLines
    • keepNext
    • kinsoku
    • mirrorIndents
    • outlineLvl
    • overflowPunct
    • pBdr
    • pageBreakBefore
    • shd
    • snapToGrid
    • suppressAutoHyphens
    • suppressLineNumbers
    • suppressOverlap
    • textAlignment
    • textDirection
    • textboxTightWrap
    • topLinePunct
    • widowControl
    • wordWrap
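
To make the properties above concrete, here is a minimal sketch that assembles a pPr element with a style reference, numbering info, one tab stop, spacing, and indentation, using xml.etree.ElementTree. The style id "ListParagraph", the numId "1", and all the measurements are hypothetical values chosen for illustration; a real document would need matching entries in word/styles.xml and word/numbering.xml.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)  # serialize with the conventional "w:" prefix


def w(name):
    """Qualify a tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W}}}{name}"


def build_ppr():
    """Build a sample pPr element; every value below is illustrative."""
    ppr = ET.Element(w("pPr"))
    # pStyle: reference to an entry in word/styles.xml (hypothetical id).
    ET.SubElement(ppr, w("pStyle"), {w("val"): "ListParagraph"})
    # numPr: level number plus numbering id from word/numbering.xml.
    num_pr = ET.SubElement(ppr, w("numPr"))
    ET.SubElement(num_pr, w("ilvl"), {w("val"): "0"})
    ET.SubElement(num_pr, w("numId"), {w("val"): "1"})
    # tabs: one end-aligned stop 9360 TWIPS (6.5 in) in, with a dot leader.
    tabs = ET.SubElement(ppr, w("tabs"))
    ET.SubElement(
        tabs, w("tab"), {w("val"): "end", w("pos"): "9360", w("leader"): "dot"}
    )
    # spacing: 120 TWIPS (6 pt) above and below; line=360 is 1.5 lines (360/240).
    ET.SubElement(
        ppr, w("spacing"), {w("before"): "120", w("after"): "120", w("line"): "360"}
    )
    # ind: 720 TWIPS (0.5 in) from the margin, first line pulled back 360 TWIPS.
    ET.SubElement(ppr, w("ind"), {w("start"): "720", w("hanging"): "360"})
    return ppr


xml_text = ET.tostring(build_ppr(), encoding="unicode")
print(xml_text)
```

The element would then be placed as a child of a p element in word/document.xml.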

Table properties

  • Style tblStyle: A reference to a style in word/styles.xml. Sort of like a CSS class

  • Left indent tblInd: This element has two attributes used to specify the leading indentation for a table, type and w. Depending on the value of type, w takes on a different meaning as follows:

    • dxa: w is interpreted as a number of TWIPS
    • pct: If w is a number, it is interpreted as fiftieths of a percent of the document width (excluding margins). If it ends in "%" then it specifies the percentage of document width directly
    • nil: w is ignored and margin is 0
    • auto: w is ignored and margin is deferred to parent styles
  • Borders tblBorders: Contains up to six elements top, start, bottom, end, insideH, and insideV (the first four correspond to the CSS top, left, bottom, and right). If there is a conflict between a cell border and the table border, cell borders typically win (but see tblPrEx for the corner-case where they don't). Each element has the following attributes:

    • color: "RRGGBB" in hex (no leading "#") or "auto"
    • sz: Border size in eighths of a point. Minimum border size is .25pt and maximum border size is 12pt.
    • val: Type of border. E.g., "single", "dashed", "dotted", "double", etc. See the spec for the full list (17.18.2)
    • Many more. See the spec (17.3.4)
  • Width tblW: This is an indication of the preferred width, which is an input into the overall layout algorithm. This element has the same two attributes as tblInd above.

  • Other table properties include:

    • bidiVisual
    • jc
    • shd
    • tblCaption
    • tblCellMar
    • tblCellSpacing
    • tblDescription
    • tblLayout
    • tblLook
    • tblOverlap
    • tblStyleColBandSize
    • tblStyleRowBandSize
    • tblW
    • tblpPr
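
The type/w rules shared by tblInd and tblW, and the eighths-of-a-point border size, can be sketched as two small helpers. The function names and the available_twips parameter (the document width excluding margins) are hypothetical, and returning None for "auto" is just one way to signal that the value is deferred to parent styles.

```python
def resolve_table_measure(type_, w, available_twips):
    """Interpret a type/w pair as TWIPS, or None when the value is deferred."""
    if type_ == "dxa":
        # w is already a number of TWIPS.
        return float(w)
    if type_ == "pct":
        text = str(w)
        if text.endswith("%"):
            percent = float(text[:-1])    # e.g. "50%" is 50 percent
        else:
            percent = float(text) / 50.0  # fiftieths of a percent
        return available_twips * percent / 100.0
    if type_ == "nil":
        return 0.0   # w is ignored
    if type_ == "auto":
        return None  # w is ignored; deferred to parent styles
    raise ValueError(f"unknown type: {type_!r}")


def border_points(sz):
    """Convert a border sz attribute (eighths of a point) to points."""
    return int(sz) / 8.0


# 9360 TWIPS is 6.5 in, used here as an example available width.
print(resolve_table_measure("dxa", "720", 9360))
print(resolve_table_measure("pct", "2500", 9360))
print(resolve_table_measure("pct", "50%", 9360))
print(border_points(4))
```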