/**
 * Create paragraphs from text with line spacing
 * Borrowed from Wordpress.
 *
 * @param string $pee The string
 * @param bool $br Add BRs?
 *
 * @todo Rewrite
 * @return string
 **/
function autop($pee, $br = 1) {
    $pee = $pee . "\n"; // just to make things a little easier, pad the end
    $pee = preg_replace('|<br />\s*<br />|', "\n\n", $pee);
    // Space things out a little
    $allblocks = '(?:table|thead|tfoot|caption|colgroup|tbody|tr|td|th|div|dl|dd|dt|ul|ol|li|pre|select|form|map|area|blockquote|address|math|style|input|p|h[1-6]|hr)';
    $pee = preg_replace('!(<' . $allblocks . '[^>]*>)!', "\n$1", $pee);
    $pee = preg_replace('!(</' . $allblocks . '>)!', "$1\n\n", $pee);
    $pee = str_replace(array("\r\n", "\r"), "\n", $pee); // cross-platform newlines
    if (strpos($pee, '<object') !== false) {
        $pee = preg_replace('|\s*<param([^>]*)>\s*|', "<param$1>", $pee); // no pee inside object/embed
        $pee = preg_replace('|\s*</embed>\s*|', '</embed>', $pee);
    }
    $pee = preg_replace("/\n\n+/", "\n\n", $pee); // take care of duplicates
    $pee = preg_replace('/\n?(.+?)(?:\n\s*\n|\z)/s', "<p>$1</p>\n", $pee); // make paragraphs, including one at the end
    $pee = preg_replace('|<p>\s*?</p>|', '', $pee); // under certain strange conditions it could create a P of entirely whitespace
    $pee = preg_replace('!<p>([^<]+)\s*?(</(?:div|address|form)[^>]*>)!', "<p>$1</p>$2", $pee);
    $pee = preg_replace('|<p>|', "$1<p>", $pee);
    $pee = preg_replace('!<p>\s*(</?' . $allblocks . '[^>]*>)\s*</p>!', "$1", $pee); // don't pee all over a tag
    $pee = preg_replace("|<p>(<li.+?)</p>|", "$1", $pee); // problem with nested lists
    $pee = preg_replace('|<p><blockquote([^>]*)>|i', "<blockquote$1><p>", $pee);
    $pee = str_replace('</blockquote></p>', '</p></blockquote>', $pee);
    $pee = preg_replace('!<p>\s*(</?' . $allblocks . '[^>]*>)!', "$1", $pee);
    $pee = preg_replace('!(</?' . $allblocks . '[^>]*>)\s*</p>!', "$1", $pee);
    if ($br) {
        $pee = preg_replace_callback('/<(script|style).*?<\/\\1>/s', create_function('$matches', 'return str_replace("\n", "<WPPreserveNewline />", $matches[0]);'), $pee);
        $pee = preg_replace('|(?<!<br />)\s*\n|', "<br />\n", $pee); // optionally make line breaks
        $pee = str_replace('<WPPreserveNewline />', "\n", $pee);
    }
    $pee = preg_replace('!(</?' . $allblocks . '[^>]*>)\s*<br />!', "$1", $pee);
    $pee = preg_replace('!<br />(\s*</?(?:p|li|div|dl|dd|dt|th|pre|td|ul|ol)[^>]*>)!', '$1', $pee);
    //if (strpos($pee, '<pre') !== false) {
    // mind the space between the ? and >. Only there because of the comment.
    // $pee = preg_replace_callback('!(<pre.*? >)(.*?)</pre>!is', 'clean_pre', $pee );
    //}
    $pee = preg_replace("|\n</p>$|", '</p>', $pee);
    return $pee;
}
Summary: autop() takes arbitrary HTML and tries to format it so that the code is readable and tags are properly nested. It does not prevent XSS, as it does not filter any tag’s attributes.
Nor does it try to enforce the element prohibitions defined in the XHTML 1.1 Strict specification.
autop($pee, $br = 1)
The first argument, $pee, is the string to convert to “proper
HTML”. $br, if set to TRUE (or 1), will ensure that:
- single newlines left inside paragraphs are turned into break rules,
- no break rule follows another break rule,
- inline JavaScript and CSS are not changed.
This is the default behavior, and few people are likely to want anything else, so the option should be dropped from the rewritten version.
autop() uses a multiple-pass strategy to transform the original
text step-by-step. Although that is safer, it might not be very
efficient. There’s probably room for improvement (famous last
words).
Append a newline to the original string. The comment says it’s easier, but fails to indicate how it’s easier (especially as the original string is not “trimmed” on the way).
Replace any sequence of two break rules separated only by spaces or newlines with a double newline.
At this point, we’re left with “proper” break rules, and extra blank lines.
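To see that rule in isolation, here is a minimal sketch. The $sample string is made up for illustration; only the preg_replace() call comes from autop().

```php
<?php
// Hypothetical input: two <br /> tags separated by a newline and two spaces.
$sample = "first line<br />\n  <br />second line\n";

// Same call as in autop(): the pair of break rules becomes a double newline.
$result = preg_replace('|<br />\s*<br />|', "\n\n", $sample);

var_dump($result === "first line\n\nsecond line\n"); // bool(true)
```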
The 4-line section takes care of prepending a newline before each block tag, and appending a double newline after the block.
$allblocks = '(?:table|thead|tfoot|caption|colgroup|tbody|tr|td|th|div|dl|dd|dt|ul|ol|li|pre|select|form|map|area|blockquote|address|math|style|input|p|h[1-6]|hr)';
The $allblocks variable lists all XHTML tags that are
considered blocks, that is: are containers for inline elements
(e.g., <div> or <p>), or themselves autonomous elements
(e.g., <area>).
Prepend a newline to each opening block tag.
Append a double newline to each closing block tag.
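A short sketch of what those two passes do to a tiny fragment. The input is a hypothetical example; the $allblocks pattern and both preg_replace() calls are copied from autop().

```php
<?php
$allblocks = '(?:table|thead|tfoot|caption|colgroup|tbody|tr|td|th|div|dl|dd|dt|ul|ol|li|pre|select|form|map|area|blockquote|address|math|style|input|p|h[1-6]|hr)';

// Hypothetical input with no whitespace between the blocks.
$sample = '<div>Hello</div><p>World</p>';

// Pass 1: a newline goes in front of every opening block tag.
$spaced = preg_replace('!(<' . $allblocks . '[^>]*>)!', "\n$1", $sample);
// Pass 2: a double newline goes after every closing block tag.
$spaced = preg_replace('!(</' . $allblocks . '>)!', "$1\n\n", $spaced);

var_dump($spaced === "\n<div>Hello</div>\n\n\n<p>World</p>\n\n"); // bool(true)
```

Note the run of three newlines between the two blocks: the later "take care of duplicates" rule is what brings it back down to two.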
Convert DOS (\r\n) and old Mac (\r) line endings to UN*X newlines to simplify parsing, so that the rest of the code only deals with newline characters.
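The order of the search strings in that str_replace() call matters, because str_replace() works through the array one entry at a time. The sketch below uses a made-up input and a deliberately reversed array (my own variation, not something autop() does) to show the difference.

```php
<?php
// Hypothetical input mixing a DOS line ending and an old Mac one.
$sample = "a\r\nb\rc";

// Order used by autop(): "\r\n" is handled first, then any lone "\r".
$normalized = str_replace(array("\r\n", "\r"), "\n", $sample);

// Reversed order, for comparison only: the "\r" of "\r\n" is replaced first,
// so every DOS line ending turns into two newlines.
$reversed = str_replace(array("\r", "\r\n"), "\n", $sample);

var_dump($normalized === "a\nb\nc");  // bool(true)
var_dump($reversed === "a\n\nb\nc");  // bool(true)
```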
if (strpos($pee, '<object') !== false) {
    $pee = preg_replace('|\s*<param([^>]*)>\s*|', "<param$1>", $pee); // no pee inside object/embed
    $pee = preg_replace('|\s*</embed>\s*|', '</embed>', $pee);
}
If the original string contains an <object>, remove all spaces
and blank lines around its inner elements: <p> is not allowed
inside <object>.
Collapse any run of two or more newlines into exactly two, that is, a single blank line.
$pee = preg_replace('/\n?(.+?)(?:\n\s*\n|\z)/s', "<p>$1</p>\n", $pee); // make paragraphs, including one at the end
Enclose in a paragraph any run of text that may be preceded by a newline, and is followed either by a blank line (a newline, optional spaces, tabs, or newlines, then another newline) or by the end of the string.
The s flag lets the dot match newline characters, so a single
paragraph may span several lines. In practice, the rule recognizes
a paragraph as any sequence of lines of text terminated by a
double newline, or reaching the end of the input string.
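Here is a minimal sketch of the last two rules working together. The input string is hypothetical; both patterns are taken verbatim from autop().

```php
<?php
// Hypothetical input: two paragraphs separated by three newlines,
// with a trailing newline like the one autop() appends.
$sample = "First paragraph\nstill first\n\n\nSecond paragraph\n";

// Collapse runs of newlines down to exactly two.
$collapsed = preg_replace("/\n\n+/", "\n\n", $sample);

// Wrap each chunk of text in <p>...</p>.
$result = preg_replace('/\n?(.+?)(?:\n\s*\n|\z)/s', "<p>$1</p>\n", $collapsed);

var_dump($result === "<p>First paragraph\nstill first</p>\n<p>Second paragraph\n</p>\n"); // bool(true)
```

The last paragraph swallows the trailing newline, which leaves a newline sitting right before the final </p>. The very last rule of autop() deals with exactly that case.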
$pee = preg_replace('|<p>\s*?</p>|', '', $pee); // under certain strange conditions it could create a P of entirely whitespace
Remove the possibly empty paragraphs created by the previous rule “under strange conditions”.
Find every opening <p> tag followed by text that contains no
tag and then by a closing </div>, </address>, or </form> tag, and
insert a closing </p> tag before that closing tag.
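A quick sketch of that rule on a made-up fragment, where a paragraph was opened just before the end of a <div>:

```php
<?php
// Hypothetical input: a paragraph left open right before a closing </div>.
$sample = '<p>Some text</div>';

// Same call as in autop(): close the paragraph before the closing tag.
$result = preg_replace('!<p>([^<]+)\s*?(</(?:div|address|form)[^>]*>)!', "<p>$1</p>$2", $sample);

var_dump($result === '<p>Some text</p></div>'); // bool(true)
```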
The next line is way strange. It replaces any opening paragraph tag
with the contents of a first capture group that does not exist (so, nothing), followed by <p>. It
could be more cryptic by using a back reference:
$pee = preg_replace('|(<p>)|', "$2$1", $pee);
The purpose of that line is still unknown.
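A small check, on a hypothetical input, suggests the line does nothing at all: preg_replace() substitutes an empty string for a reference to a group that the pattern does not define.

```php
<?php
// Hypothetical input containing a few paragraphs.
$sample = "<p>one</p>\n<p>two</p>\n";

// Same call as in autop(): the pattern has no capture group, so $1 is empty.
$result = preg_replace('|<p>|', "$1<p>", $sample);

var_dump($result === $sample); // bool(true)
```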
$pee = preg_replace('!<p>\s*(</?' . $allblocks . '[^>]*>)\s*</p>!', "$1", $pee); // don't pee all over a tag
If a paragraph contains nothing but a single opening or closing
block tag, remove the enclosing <p> and </p> tags.
If a paragraph starts with a <li> element, remove the
enclosing <p> and </p> tags.
$pee = preg_replace('|<p><blockquote([^>]*)>|i', "<blockquote$1><p>", $pee);
$pee = str_replace('</blockquote></p>', '</p></blockquote>', $pee);
If a <blockquote> element is found within a paragraph, swap the
elements to enclose the paragraph inside the <blockquote>.
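The swap is easier to follow on an example. The input below is made up; the two calls are the ones from autop().

```php
<?php
// Hypothetical input: a blockquote wrongly wrapped in a paragraph.
$sample = '<p><blockquote class="q">Quoted</blockquote></p>';

// Move the opening <p> inside the opening <blockquote ...> tag.
$result = preg_replace('|<p><blockquote([^>]*)>|i', "<blockquote$1><p>", $sample);
// Move the closing </p> inside the closing </blockquote> tag.
$result = str_replace('</blockquote></p>', '</p></blockquote>', $result);

var_dump($result === '<blockquote class="q"><p>Quoted</p></blockquote>'); // bool(true)
```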
$pee = preg_replace('!<p>\s*(</?' . $allblocks . '[^>]*>)!', "$1", $pee);
$pee = preg_replace('!(</?' . $allblocks . '[^>]*>)\s*</p>!', "$1", $pee);
Remove any opening <p> tag that immediately precedes an opening
or closing block tag, and any closing </p> tag that immediately
follows one, so that an element of the $allblocks family is
never enclosed in a paragraph.
if ($br) {
    $pee = preg_replace_callback('/<(script|style).*?<\/\\1>/s', create_function('$matches', 'return str_replace("\n", "<WPPreserveNewline />", $matches[0]);'), $pee);
    $pee = preg_replace('|(?<!<br />)\s*\n|', "<br />\n", $pee); // optionally make line breaks
    $pee = str_replace('<WPPreserveNewline />', "\n", $pee);
}
$pee = preg_replace_callback('/<(script|style).*?<\/\\1>/s', create_function('$matches', 'return str_replace("\n", "<WPPreserveNewline />", $matches[0]);'), $pee);
Replace newlines in <script> and <style> tags with a custom
marker that will be replaced back to a newline character after
the rest of the replacements are completed.
Replace every newline, together with any spaces, tabs, or blank
lines right before it, with a single <br /> tag followed by a
newline, unless a <br /> tag already sits immediately before the match.
Restore real newline characters from the marker inserted two rules earlier.
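The three steps can be sketched on a made-up input as follows. One assumption to flag: create_function() was deprecated in PHP 7.2 and removed in PHP 8.0, so this sketch swaps in an anonymous function with the same body. That substitution is mine, not something the original code does.

```php
<?php
// Hypothetical input: a two-line paragraph followed by an inline script.
$sample = "<p>one\ntwo</p>\n<script>\nvar a = 1;\n</script>\n";

// Step 1: hide the newlines inside <script> and <style> behind a marker.
// An anonymous function stands in for the original create_function() call.
$protected = preg_replace_callback(
    '/<(script|style).*?<\/\\1>/s',
    function ($matches) {
        return str_replace("\n", "<WPPreserveNewline />", $matches[0]);
    },
    $sample
);

// Step 2: turn the remaining newlines into <br /> plus a newline.
$withBreaks = preg_replace('|(?<!<br />)\s*\n|', "<br />\n", $protected);

// Step 3: put the real newlines back inside the script.
$result = str_replace('<WPPreserveNewline />', "\n", $withBreaks);

var_dump($result === "<p>one<br />\ntwo</p><br />\n<script>\nvar a = 1;\n</script><br />\n"); // bool(true)
```

The script body comes out untouched. The stray <br /> left after </p> is the kind of thing the next rules clean up.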
Remove all <br /> tags that follow an opening or closing tag of the
$allblocks family.
Remove all <br /> tags that precede an opening or closing <p>, <li>,
<div>, <dl>, <dd>, <dt>, <th>, <pre>, <td>, <ul>, or <ol> tag.
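A minimal sketch of these two clean-up rules on a hypothetical list, as it might look right after the line-break pass. Both calls are copied from autop().

```php
<?php
$allblocks = '(?:table|thead|tfoot|caption|colgroup|tbody|tr|td|th|div|dl|dd|dt|ul|ol|li|pre|select|form|map|area|blockquote|address|math|style|input|p|h[1-6]|hr)';

// Hypothetical input: every newline has already been given a <br />.
$sample = "<ul><br />\n<li>item<br />\n</li><br />\n</ul><br />\n";

// Drop each <br /> that comes right after a block tag.
$afterBlock = preg_replace('!(</?' . $allblocks . '[^>]*>)\s*<br />!', "$1", $sample);
// Drop each <br /> that comes right before one of the listed tags.
$result = preg_replace('!<br />(\s*</?(?:p|li|div|dl|dd|dt|th|pre|td|ul|ol)[^>]*>)!', '$1', $afterBlock);

var_dump($afterBlock === "<ul>\n<li>item<br />\n</li>\n</ul>\n"); // bool(true)
var_dump($result === "<ul>\n<li>item\n</li>\n</ul>\n");           // bool(true)
```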
//if (strpos($pee, '<pre') !== false) {
//    mind the space between the ? and >. Only there because of the comment.
//    $pee = preg_replace_callback('!(<pre.*? >)(.*?)</pre>!is', 'clean_pre', $pee );
//}
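This is commented out, so the description is just for the sake of
completeness: if a <pre> tag is encountered, apply a clean_pre
callback to it. The space between the ? and the > is only there
because ?> would end the PHP block, even inside a // comment. The match is
case-insensitive, and the dot also matches newlines, so a single match can span multiple lines.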
Finally, remove the newline preceding the closing </p> tag at the
very end of the string (the pattern has no m flag, so $ only
matches there). That pushes the closing tag next to the last
character of the paragraph, making the code more readable.
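A short sketch, using a made-up two-paragraph string, shows that only the last paragraph is affected:

```php
<?php
// Hypothetical input: both paragraphs have a newline before their </p>.
$sample = "<p>first\n</p>\n<p>last\n</p>\n";

// Same call as in autop(): without the m flag, $ only matches at the end
// of the string (or just before its final newline).
$result = preg_replace("|\n</p>$|", '</p>', $sample);

var_dump($result === "<p>first\n</p>\n<p>last</p>\n"); // bool(true)
```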
Hopefully, the ugly code is now beautified.
The autop() function is a complicated beast that seems to do the
job well, but it is certainly possible to break it spectacularly. I’ll be glad to
see it disappear and make room for a proper parser.
I’m curious to see someone explain:
$pee = preg_replace('|<p>|', "$1<p>", $pee);
FYI, Elgg now uses a significantly better DOM-based algorithm for this: https://github.com/Elgg/Elgg/blob/master/engine/classes/ElggAutoP.php