@hellekin
Created November 5, 2012 23:47

Analysis of Elgg's (that is: WordPress's) autop()

Analyzing autop()

The Function

/**
 * Create paragraphs from text with line spacing
 * Borrowed from Wordpress.
 *
 * @param string $pee The string
 * @param bool   $br  Add BRs?
 *
 * @todo Rewrite
 * @return string
 **/
function autop($pee, $br = 1) {
	$pee = $pee . "\n"; // just to make things a little easier, pad the end
	$pee = preg_replace('|<br />\s*<br />|', "\n\n", $pee);
	// Space things out a little
	$allblocks = '(?:table|thead|tfoot|caption|colgroup|tbody|tr|td|th|div|dl|dd|dt|ul|ol|li|pre|select|form|map|area|blockquote|address|math|style|input|p|h[1-6]|hr)';
	$pee = preg_replace('!(<' . $allblocks . '[^>]*>)!', "\n$1", $pee);
	$pee = preg_replace('!(</' . $allblocks . '>)!', "$1\n\n", $pee);
	$pee = str_replace(array("\r\n", "\r"), "\n", $pee); // cross-platform newlines
	if (strpos($pee, '<object') !== false) {
		$pee = preg_replace('|\s*<param([^>]*)>\s*|', "<param$1>", $pee); // no pee inside object/embed
		$pee = preg_replace('|\s*</embed>\s*|', '</embed>', $pee);
	}
	$pee = preg_replace("/\n\n+/", "\n\n", $pee); // take care of duplicates
	$pee = preg_replace('/\n?(.+?)(?:\n\s*\n|\z)/s', "<p>$1</p>\n", $pee); // make paragraphs, including one at the end
	$pee = preg_replace('|<p>\s*?</p>|', '', $pee); // under certain strange conditions it could create a P of entirely whitespace
	$pee = preg_replace('!<p>([^<]+)\s*?(</(?:div|address|form)[^>]*>)!', "<p>$1</p>$2", $pee);
	$pee = preg_replace('|<p>|', "$1<p>", $pee);
	$pee = preg_replace('!<p>\s*(</?' . $allblocks . '[^>]*>)\s*</p>!', "$1", $pee); // don't pee all over a tag
	$pee = preg_replace("|<p>(<li.+?)</p>|", "$1", $pee); // problem with nested lists
	$pee = preg_replace('|<p><blockquote([^>]*)>|i', "<blockquote$1><p>", $pee);
	$pee = str_replace('</blockquote></p>', '</p></blockquote>', $pee);
	$pee = preg_replace('!<p>\s*(</?' . $allblocks . '[^>]*>)!', "$1", $pee);
	$pee = preg_replace('!(</?' . $allblocks . '[^>]*>)\s*</p>!', "$1", $pee);
	if ($br) {
		$pee = preg_replace_callback('/<(script|style).*?<\/\\1>/s', create_function('$matches', 'return str_replace("\n", "<WPPreserveNewline />", $matches[0]);'), $pee);
		$pee = preg_replace('|(?<!<br />)\s*\n|', "<br />\n", $pee); // optionally make line breaks
		$pee = str_replace('<WPPreserveNewline />', "\n", $pee);
	}
	$pee = preg_replace('!(</?' . $allblocks . '[^>]*>)\s*<br />!', "$1", $pee);
	$pee = preg_replace('!<br />(\s*</?(?:p|li|div|dl|dd|dt|th|pre|td|ul|ol)[^>]*>)!', '$1', $pee);
	//if (strpos($pee, '<pre') !== false) {
	//	mind the space between the ? and >.  Only there because of the comment.
	//	$pee = preg_replace_callback('!(<pre.*? >)(.*?)</pre>!is', 'clean_pre', $pee );
	//}
	$pee = preg_replace("|\n</p>$|", '</p>', $pee);

	return $pee;
}

Analysis

Summary: autop() takes arbitrary HTML, wraps runs of text separated by blank lines in paragraphs, and tries to leave the markup readable and properly nested. It does not prevent XSS, as it filters neither tags nor their attributes.

Nor does it try to enforce the element prohibitions defined in the XHTML 1.1 Strict specification.

Prototype

  autop($pee, $br = 1)

The first argument, $pee, is the string to convert to “proper HTML”. The second, $br, if set to TRUE (or 1), ensures that:

  • newlines left inside paragraphs are turned into line breaks (<br />),
  • no line break is added right after an existing one,
  • inline JavaScript and CSS are left unchanged.

This is the default behavior, and few callers are likely to want anything else, so the parameter should be dropped from the rewritten version.

Code Flow

autop() uses a multiple-pass strategy to transform the original text step by step. Although that is safer, every pass rescans the whole string, so it might not be very efficient. There’s probably room for improvement (famous last words).

Line by Line

$pee = $pee . "\n"; // just to make things a little easier, pad the end

Append a newline to the original string. The comment says this makes things easier, but fails to say how (especially as the original string is not trimmed along the way).

$pee = preg_replace('|<br />\s*<br />|', "\n\n", $pee);

Replace any pair of <br /> tags separated only by whitespace (spaces, tabs, or newlines) with a double newline.

At this point, we’re left with isolated <br /> tags, and blank lines wherever two of them came in a row.
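As a minimal sketch, the rule can be tried on its own. The $input string is made up for the illustration; the pattern is the one from the function.

```php
<?php
// Two <br /> tags separated only by whitespace become a paragraph break.
$input  = "first line<br />\n  <br />second line<br />third line";
$output = preg_replace('|<br />\s*<br />|', "\n\n", $input);

// The lone <br /> before "third line" is left untouched:
// $output is "first line\n\nsecond line<br />third line"
var_dump($output);
```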

// Space things out a little

The 4-line section takes care of prepending a newline before each block tag, and appending a double newline after the block.

$allblocks = '(?:table|thead|tfoot|caption|colgroup|tbody|tr|td|th|div|dl|dd|dt|ul|ol|li|pre|select|form|map|area|blockquote|address|math|style|input|p|h[1-6]|hr)';

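The $allblocks variable lists the XHTML tags treated as blocks, that is: containers for inline elements (e.g., <div> or <p>), or standalone elements (e.g., <area>). It is written as a non-capturing group, so that it can be spliced into the patterns below without shifting their back references.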

$pee = preg_replace('!(<' . $allblocks . '[^>]*>)!', "\n$1", $pee);

Prepend a newline to each opening block tag.

$pee = preg_replace('!(</' . $allblocks . '>)!', "$1\n\n", $pee);

Append a double newline to each closing block tag.
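The two spacing rules can be sketched in isolation. Both $blocks (a shortened stand-in for $allblocks) and $input are hypothetical; the patterns are copied from the function.

```php
<?php
// Shortened, hypothetical block list; the real $allblocks is much longer.
$blocks = '(?:div|p|ul|li)';
$input  = 'intro<div class="x">body</div>outro';

// One newline before each opening block tag...
$spaced = preg_replace('!(<' . $blocks . '[^>]*>)!', "\n$1", $input);
// ...and two newlines after each closing block tag.
$spaced = preg_replace('!(</' . $blocks . '>)!', "$1\n\n", $spaced);

// $spaced is "intro\n<div class=\"x\">body</div>\n\noutro"
var_dump($spaced);
```

The double newline after the closing tag is what later makes "outro" a paragraph of its own.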

$pee = str_replace(array("\r\n", "\r"), "\n", $pee); // cross-platform newlines

Replace DOS (\r\n) and old Mac (\r) line endings with UN*X newlines, so that the rest of the function only has to deal with \n.
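The order of the search array matters here. A small sketch, with a made-up $input, shows what would happen if the two entries were swapped:

```php
<?php
$input = "dos\r\nold mac\runix\n";

// "\r\n" comes first, so each DOS line ending becomes a single newline.
$output = str_replace(array("\r\n", "\r"), "\n", $input);
// $output is "dos\nold mac\nunix\n"

// Swapped order (not what autop() does): the lone "\r" rule runs first
// and turns every "\r\n" into two newlines.
$wrong = str_replace(array("\r", "\r\n"), "\n", $input);
// $wrong is "dos\n\nold mac\nunix\n"
```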

(handling of <object> tag)

if (strpos($pee, '<object') !== false) {
    $pee = preg_replace('|\s*<param([^>]*)>\s*|', "<param$1>", $pee); // no pee inside object/embed
    $pee = preg_replace('|\s*</embed>\s*|', '</embed>', $pee);
}

If the original string contains an <object>, remove all whitespace, blank lines included, around each <param> tag and each closing </embed> tag, so that the paragraph rule below finds no blank line to turn into a <p> inside the <object>.
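A short sketch of the effect, on a made-up <object> snippet (the attribute values are placeholders):

```php
<?php
$input = implode("\n", array(
    '<object width="1">',
    '  <param name="a" value="b">',
    '  <embed src="x">',
    '  </embed>',
    '</object>',
));

if (strpos($input, '<object') !== false) {
    // Glue each <param> to its neighbours...
    $input = preg_replace('|\s*<param([^>]*)>\s*|', "<param$1>", $input);
    // ...and do the same for the closing </embed>.
    $input = preg_replace('|\s*</embed>\s*|', '</embed>', $input);
}

// $input is now a single line:
// <object width="1"><param name="a" value="b"><embed src="x"></embed></object>
var_dump($input);
```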

$pee = preg_replace("/\n\n+/", "\n\n", $pee); // take care of duplicates

Collapse any run of two or more newline characters into exactly two, that is: a single blank line.

$pee = preg_replace('/\n?(.+?)(?:\n\s*\n|\z)/s', "<p>$1</p>\n", $pee); // make paragraphs, including one at the end

Wrap in a paragraph any non-empty run of text, optionally preceded by a newline, that ends either at a blank line (a newline, optional whitespace, and another newline) or at the end of the string.

The s flag makes the dot match newline characters as well, so a paragraph can span several lines. In practice, the rule recognizes a paragraph as any sequence of lines of text terminated by a double newline or by the end of the input string.
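The duplicate-newline rule and the paragraph rule are easier to follow on a concrete string. This is only a sketch: $input is made up, and its trailing newline stands in for the one autop() appends at the start.

```php
<?php
$input = "First paragraph,\nsecond line.\n\n\n\nSecond paragraph.\n";

// Collapse runs of newlines into exactly two.
$text = preg_replace("/\n\n+/", "\n\n", $input);
// $text is "First paragraph,\nsecond line.\n\nSecond paragraph.\n"

// Wrap each run of text in <p>...</p>, followed by a newline.
$html = preg_replace('/\n?(.+?)(?:\n\s*\n|\z)/s', "<p>$1</p>\n", $text);
// $html is "<p>First paragraph,\nsecond line.</p>\n<p>Second paragraph.\n</p>\n"
var_dump($html);
```

Note how the trailing newline ends up inside the last paragraph: the final rule of the function exists to clean that up.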

$pee = preg_replace('|<p>\s*?</p>|', '', $pee); // under certain strange conditions it could create a P of entirely whitespace

Remove the possibly empty paragraphs created by the previous rule “under strange conditions”.

$pee = preg_replace('!<p>([^<]+)\s*?(</(?:div|address|form)[^>]*>)!', "<p>$1</p>$2", $pee);

Find every opening <p> tag followed by text that contains no tag, then directly by a closing <div>, <address>, or <form> tag, and insert the missing closing </p> tag before that closing tag.
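A sketch with two made-up inputs shows both the case the rule repairs and one it leaves alone, because [^<]+ stops at the first tag:

```php
<?php
$pattern = '!<p>([^<]+)\s*?(</(?:div|address|form)[^>]*>)!';

// Plain text between <p> and </div>: the missing </p> is inserted.
$fixed = preg_replace($pattern, "<p>$1</p>$2", '<div><p>some text</div>');
// $fixed is '<div><p>some text</p></div>'

// Inline markup in between: the pattern does not match, nothing changes.
$untouched = preg_replace($pattern, "<p>$1</p>$2", '<div><p>some <em>text</em></div>');
// $untouched is '<div><p>some <em>text</em></div>'
```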

$pee = preg_replace('|<p>|', "$1<p>", $pee);

That line is way strange. The pattern contains no capture group, so $1 expands to the empty string, and every opening paragraph tag is replaced with nothing followed by <p>, that is: with itself. It could be more cryptic by using a back reference:

   $pee = preg_replace('|(<p>)|', "$2$1", $pee);

The purpose of that line is still unknown.
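A quick check on a made-up $input suggests that the line, as written, changes nothing: preg_replace() substitutes an empty string for a reference to a group that does not exist.

```php
<?php
$input = "<p>one</p>\n<p>two</p>\n";

// No capture group in the pattern, so "$1" expands to nothing.
$output = preg_replace('|<p>|', "$1<p>", $input);

// The "more cryptic" variant: "$2" is empty, "$1" is the matched <p>.
$cryptic = preg_replace('|(<p>)|', "$2$1", $input);

var_dump($output === $input);  // bool(true)
var_dump($cryptic === $input); // bool(true)
```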

$pee = preg_replace('!<p>\s*(</?' . $allblocks . '[^>]*>)\s*</p>!', "$1", $pee); // don't pee all over a tag

If a paragraph contains nothing but a single opening or closing block tag, remove the enclosing <p> and </p> tags.

$pee = preg_replace("|<p>(<li.+?)</p>|", "$1", $pee); // problem with nested lists

If a paragraph starts with a <li> tag, remove the enclosing <p> and </p> tags.
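Both unwrapping rules can be sketched together. $blocks is a shortened, hypothetical stand-in for $allblocks, and $input imitates what the paragraph rule tends to produce around block tags.

```php
<?php
// Shortened, hypothetical block list.
$blocks = '(?:div|ul|li|h[1-6])';
$input  = "<p><div class=\"box\"></p>\n<p>text</p>\n<p></div></p>\n<p><li>item</li></p>";

// A paragraph holding a single block tag loses its <p>...</p> wrapper.
$output = preg_replace('!<p>\s*(</?' . $blocks . '[^>]*>)\s*</p>!', "$1", $input);
// A paragraph that starts with <li is unwrapped as well.
$output = preg_replace("|<p>(<li.+?)</p>|", "$1", $output);

// $output is "<div class=\"box\">\n<p>text</p>\n</div>\n<li>item</li>"
var_dump($output);
```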

(handling of <blockquote> elements)

   $pee = preg_replace('|<p><blockquote([^>]*)>|i', "<blockquote$1><p>", $pee);
   $pee = str_replace('</blockquote></p>', '</p></blockquote>', $pee);

If a <blockquote> element is found within a paragraph, swap the elements to enclose the paragraph inside the <blockquote>.
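For instance, on a made-up quotation (the cite value is a placeholder):

```php
<?php
$input = '<p><blockquote cite="x">quoted</blockquote></p>';

// Move the opening <p> inside the <blockquote>, keeping its attributes...
$output = preg_replace('|<p><blockquote([^>]*)>|i', "<blockquote$1><p>", $input);
// ...and the closing </p> inside as well.
$output = str_replace('</blockquote></p>', '</p></blockquote>', $output);

// $output is '<blockquote cite="x"><p>quoted</p></blockquote>'
var_dump($output);
```

Only the first rule is case-insensitive; str_replace() matches the closing tags exactly as written.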

(handling of $allblocks nesting)

   $pee = preg_replace('!<p>\s*(</?' . $allblocks . '[^>]*>)!', "$1", $pee);
   $pee = preg_replace('!(</?' . $allblocks . '[^>]*>)\s*</p>!', "$1", $pee);

Detect an opening paragraph tag followed by an opening or closing block tag, and remove the <p>. Ditto with a closing paragraph tag that follows a block tag, so that a paragraph never opens right before, or closes right after, a tag of the $allblocks family.
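A sketch of these two rules, again with a shortened, hypothetical $blocks list and a made-up $input:

```php
<?php
// Shortened, hypothetical block list.
$blocks = '(?:div|ul|li)';
$input  = "<p><div>text</p>\n<p>more</div></p>";

// Drop a <p> that sits right before a block tag...
$output = preg_replace('!<p>\s*(</?' . $blocks . '[^>]*>)!', "$1", $input);
// ...and a </p> that sits right after one.
$output = preg_replace('!(</?' . $blocks . '[^>]*>)\s*</p>!', "$1", $output);

// $output is "<div>text</p>\n<p>more</div>"
var_dump($output);
```

In this particular input the remaining </p> and <p> are not rebalanced: each rule only looks at the tag immediately next to the block tag.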

(preservation of inline CSS and JS)

if ($br) {
    $pee = preg_replace_callback('/<(script|style).*?<\/\\1>/s', create_function('$matches', 'return str_replace("\n", "<WPPreserveNewline />", $matches[0]);'), $pee);
    $pee = preg_replace('|(?<!<br />)\s*\n|', "<br />\n", $pee); // optionally make line breaks
    $pee = str_replace('<WPPreserveNewline />', "\n", $pee);
}

$pee = preg_replace_callback('/<(script|style).*?<\/\\1>/s', create_function('$matches', 'return str_replace("\n", "<WPPreserveNewline />", $matches[0]);'), $pee);

Replace newlines inside <script> and <style> elements with a custom marker, which is turned back into a newline character once the line-break rule below has run.

$pee = preg_replace('|(?<!<br />)\s*\n|', "<br />\n", $pee); // optionally make line breaks

Replace each newline, along with any spaces, tabs, or blank lines before it, with a single <br /> tag followed by a newline, unless the match immediately follows an existing <br /> tag.

$pee = str_replace('<WPPreserveNewline />', "\n", $pee);

Recover real newline characters from the marker inserted two rules earlier.
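The protect, replace, restore sequence can be sketched as follows. The $input is made up, and the callback is written as a closure instead of create_function(), which was deprecated in PHP 7.2 and removed in PHP 8.0; the closure is an assumption of this sketch, not what the function uses.

```php
<?php
$input = "line one\nline two\n<script>\nvar a = 1;\n</script>\n";

// 1. Hide the newlines inside <script> and <style> behind a marker.
$protected = preg_replace_callback('/<(script|style).*?<\/\\1>/s', function ($matches) {
    return str_replace("\n", "<WPPreserveNewline />", $matches[0]);
}, $input);

// 2. Turn every remaining newline into <br /> plus a newline.
$broken = preg_replace('|(?<!<br />)\s*\n|', "<br />\n", $protected);

// 3. Put the hidden newlines back.
$restored = str_replace('<WPPreserveNewline />', "\n", $broken);

// $restored is
// "line one<br />\nline two<br />\n<script>\nvar a = 1;\n</script><br />\n"
var_dump($restored);
```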

$pee = preg_replace('!(</?' . $allblocks . '[^>]*>)\s*<br />!', "$1", $pee);

Remove any <br /> tag that follows an opening or closing tag of the $allblocks family.

$pee = preg_replace('!<br />(\s*</?(?:p|li|div|dl|dd|dt|th|pre|td|ul|ol)[^>]*>)!', '$1', $pee);

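Remove any <br /> tag that precedes an opening or closing <p>, <li>, <div>, <dl>, <dd>, <dt>, <th>, <pre>, <td>, <ul>, or <ol> tag. Note that this list is shorter than $allblocks.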
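A sketch of the two clean-up rules on a made-up fragment. $blocks is a shortened, hypothetical stand-in for $allblocks; the second pattern is copied as is.

```php
<?php
// Shortened, hypothetical block list.
$blocks = '(?:div|ul|li|p)';
$input  = "<ul><br />\n<li>one</li><br />\n</ul>\ntext<br />\n</p>";

// Drop a <br /> that comes right after a block tag...
$output = preg_replace('!(</?' . $blocks . '[^>]*>)\s*<br />!', "$1", $input);
// ...and a <br /> that comes right before one of the listed tags.
$output = preg_replace('!<br />(\s*</?(?:p|li|div|dl|dd|dt|th|pre|td|ul|ol)[^>]*>)!', '$1', $output);

// $output is "<ul>\n<li>one</li>\n</ul>\ntext\n</p>"
var_dump($output);
```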

(commented code)

   //if (strpos($pee, '<pre') !== false) {
   //	mind the space between the ? and >.  Only there because of the comment.
   //	$pee = preg_replace_callback('!(<pre.*? >)(.*?)</pre>!is', 'clean_pre', $pee );
   //}

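This is commented out, so the description is just for the sake of completeness: if the string contains a <pre tag, apply the clean_pre callback to each <pre> element. The space between the ? and the > only keeps the commented-out pattern from containing ?>, which would end the PHP block. The match is case-insensitive and spans multiple lines.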

$pee = preg_replace("|\n</p>$|", '</p>', $pee);

Finally, remove the newline preceding the closing </p> tag at the very end of the string (without the m flag, $ does not match at the end of each line). That pushes the closing tag next to the last character of the paragraph, making the code more readable.
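A small sketch, with a made-up $input, shows which closing tag the rule reaches:

```php
<?php
$input = "<p>first\n</p>\n<p>last\n</p>";

// No m flag: $ only matches at the end of the whole string.
$output = preg_replace("|\n</p>$|", '</p>', $input);

// Only the last paragraph is tightened:
// $output is "<p>first\n</p>\n<p>last</p>"
var_dump($output);
```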

return $pee;

Hopefully, the ugly code is now beautified.

Conclusions

The autop() function is a complicated beast that seems to do the job well, but it can certainly be broken spectacularly. I’m glad to see it disappear and make room for a proper parser.

I’m curious to see someone explain:

   $pee = preg_replace('|<p>|', "$1<p>", $pee);

mrclay commented Apr 28, 2015

FYI Elgg now uses a significantly better DOM-based algorithm for this https://github.com/Elgg/Elgg/blob/master/engine/classes/ElggAutoP.php