We introduce Chomsky Normal Form, which is used to answer questions about context-free languages.
Chomsky Normal Form. A grammar where every production is either of the form A → BC or A → c (where A, B, C are arbitrary variables and c an arbitrary symbol). Example:
S → AS | a
A → SA | b
(If language contains ε, then we allow S → ε where S is start symbol, and forbid S on RHS.)
The key advantage is that in Chomsky Normal Form, every derivation of a string of n letters has exactly 2n − 1 steps. Thus: one can determine if a string is in the language by exhaustive search of all derivations.
The conversion to Chomsky Normal Form has four main steps:
- Get rid of all ε productions.
- Get rid of all productions where RHS is one variable.
- Replace every production that is too long by shorter productions.
- Move all terminals to productions where RHS is one terminal.
-
Eliminate ε Productions Determine the nullable variables (those that gen- erate ε) (algorithm given earlier). Go through all productions, and for each, omit every possible subset of nullable variables. For example, if P → AxB with both A and B nullable, add productions P → xB | Ax | x. After this, delete all productions with empty RHS.
-
Eliminate Variable Unit Productions A unit production is where RHS has only one symbol. Consider production A → B. Then for every pro- duction B → α, add the production A → α. Re- peat until done (but don’t re-create a unit pro- duction already deleted).
-
Replace Long Productions by Shorter Ones For example, if have production A → BCD, then replace it with A → BE and E → CD. (In theory this introduces many new variables, but one can re-use variables if careful.)
-
Move Terminals to Unit Productions For every terminal on the right of a non-unit production, add a substitute variable. For example, replace production A → bC with productions A → BC and B → b.
Consider the CFG:
S → aXbX
X → aY | bY | ε
Y → X | c
The variable X is nullable; and so therefore is Y . After elimination of ε, we obtain:
S → aXbX | abX | aXb | ab
X → aY | bY | a | b
Y → X | c
After elimination of the unit production Y → X, we obtain:
S → aXbX | abX | aXb | ab
X → aY | bY | a | b
Y → aY | bY | a | b | c
Now, break up the RHSs of S; and replace a by A, b by B and c by C wherever not units:
S → EF | AF | EB | AB
X → AY | BY | a | b
Y → AY | BY | a | b | c
E → AX
F → BX
A → a
B → b
C → c
S → AbA
A → Aa | ε
After the first step:
S → AbA | bA | Ab | b
A → Aa | a
The second step does not apply. After the third step: S → T A | bA | Ab | b A → Aa | a T → Ab
And finally: S → T A | BA | AB | b A → AC | a T → AB B → b C → a
There are special forms for CFGs such as Chomsky Normal Form, where every production has the form A → BC or A → c. The algorithm to convert to this form involves (1) determining all nullable variables and getting rid of all ε-productions, (2) getting rid of all variable unit productions, (3) breaking up long productions, and (4) moving terminals to unit productions.