Created
July 31, 2012 07:32
Gist lutzissler/3214524
Tidy HTML inserted by copying and pasting from Microsoft Office (PHP)
// Regexps courtesy of 1st class media
// http://www.1stclassmedia.co.uk/developers/clean-ms-word-formatting.php
function tidy_office_html($str) {
    // Patterns are applied in the order listed; later rules clean up
    // whatever the earlier ones leave behind.
    $replacements = array(
        // HTML comments and Office <o:p> placeholders
        '/<!--.*?-->/s' => '',
        '/<o:p>\s*<\/o:p>/s' => '',
        '/<o:p>.*?<\/o:p>/s' => '&nbsp;',
        // Office-specific and layout-only inline style declarations
        '/\s*mso-[^:]+:[^;"]+;?/i' => '',
        '/\s*MARGIN: 0cm 0cm 0pt\s*;/i' => '',
        '/\s*MARGIN: 0cm 0cm 0pt\s*"/i' => '',
        '/\s*TEXT-INDENT: 0cm\s*;/i' => '',
        '/\s*TEXT-INDENT: 0cm\s*"/i' => '',
        '/\s*TEXT-ALIGN: [^\s;]+;?"/i' => '',
        '/\s*PAGE-BREAK-BEFORE: [^\s;]+;?"/i' => '',
        '/\s*FONT-VARIANT: [^\s;]+;?"/i' => '',
        '/\s*tab-stops:[^;"]*;?/i' => '',
        '/\s*tab-stops:[^"]*/i' => '',
        // Font settings
        '/\s*face="[^"]*"/i' => '',
        '/\s*face=[^ >]*/i' => '',
        '/\s*FONT-FAMILY:[^;"]*;?/i' => '',
        // class, style and lang attributes
        '/<(\w[^>]*) class=([^ |>]*)([^>]*)/i' => '<$1$3',
        '/<(\w[^>]*) style="([^\"]*)"([^>]*)/i' => '<$1$3',
        '/\s*style="\s*"/i' => '',
        // Empty or attribute-less <span> and <font> wrappers
        '/<SPAN\s*[^>]*>\s*&nbsp;\s*<\/SPAN>/i' => '&nbsp;',
        '/<SPAN\s*[^>]*><\/SPAN>/i' => '',
        '/<(\w[^>]*) lang=([^ |>]*)([^>]*)/i' => '<$1$3',
        '/<SPAN\s*>(.*?)<\/SPAN>/i' => '$1',
        '/<FONT\s*>(.*?)<\/FONT>/i' => '$1',
        // Paragraphs that contain only a non-breaking space
        ':<p>&nbsp;</p>:i' => '',
        // XML declarations: an optional backslash, then a literal "?".
        // (In a single-quoted PHP string, '\\?\?' would reach PCRE as
        // \?\? and only match "<??xml".)
        '/<\\\\?\?xml[^>]*>/i' => '',
        // Namespaced tags such as <o:...>, <w:...>, <st1:...>
        '/<\/?\w+:[^>]*>/i' => '',
        // Any element left without content
        '/<([^\s>]+)[^>]*>\s*<\/\1>/s' => '',
    );
    foreach ($replacements as $pattern => $replacement) {
        $str = preg_replace($pattern, $replacement, $str);
    }
    return $str;
}
The foreach loop is there to avoid the two extra internal loops that array_keys() and array_values() would run if the patterns and replacements were split out of $replacements and passed to a single preg_replace() call.
I did not benchmark this, though. For further optimization (at the cost of readability), the patterns and replacements could be defined as two separate arrays, which removes the need to split one array into keys and values.
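The three approaches mentioned in this comment can be compared side by side. The following is only a sketch under assumptions: the rule list is a shortened, hypothetical subset of the one in tidy_office_html(), and the function names tidy_with_loop, tidy_with_split and tidy_with_lists, as well as the sample input, are made up for illustration. It relies on preg_replace() accepting arrays for both the pattern and the replacement argument and applying them pairwise in order.

```php
<?php
// Hypothetical, shortened rule set (pattern => replacement).
$replacements = array(
    '/<!--.*?-->/s'            => '',
    '/<o:p>\s*<\/o:p>/s'       => '',
    '/\s*mso-[^:]+:[^;"]+;?/i' => '',
    '/\s*style="\s*"/i'        => '',
);

// Variant 1: one preg_replace() call per rule, as in the gist.
function tidy_with_loop($str, array $replacements) {
    foreach ($replacements as $pattern => $replacement) {
        $str = preg_replace($pattern, $replacement, $str);
    }
    return $str;
}

// Variant 2: a single call; the array is split into keys and values first.
function tidy_with_split($str, array $replacements) {
    return preg_replace(
        array_keys($replacements),
        array_values($replacements),
        $str
    );
}

// Variant 3: patterns and replacements kept in two parallel arrays,
// so nothing has to be split (but the pairs are harder to read).
$patterns = array(
    '/<!--.*?-->/s',
    '/<o:p>\s*<\/o:p>/s',
    '/\s*mso-[^:]+:[^;"]+;?/i',
    '/\s*style="\s*"/i',
);
$substitutes = array('', '', '', '');

function tidy_with_lists($str, array $patterns, array $substitutes) {
    return preg_replace($patterns, $substitutes, $str);
}

// Made-up fragment resembling pasted Word markup.
$input = '<p style="mso-line-height-rule: exactly;">Hello<o:p></o:p></p><!-- Word -->';

echo tidy_with_loop($input, $replacements), "\n";               // prints <p>Hello</p>
echo tidy_with_split($input, $replacements), "\n";              // prints <p>Hello</p>
echo tidy_with_lists($input, $patterns, $substitutes), "\n";    // prints <p>Hello</p>
```

All three variants should give the same result for the same rules. Which one is fastest is not established here and would need a benchmark on realistic input.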