<?php
// Collect libxml parse errors internally instead of emitting warnings
libxml_use_internal_errors(true);
libxml_clear_errors();

// http://stackoverflow.com/q/10237238/99923
// http://stackoverflow.com/q/12034235/99923
// http://stackoverflow.com/q/8218230/99923

// original input (unknown encoding)
$html = 'hi</b><p>سلام<div>の家庭に、9 ☆';
print $html . PHP_EOL;

// 1. Plain loadHTML(): multi-byte characters come out garbled
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
$doc->loadHTML($html);
print $doc->saveHTML($doc->documentElement) . PHP_EOL . PHP_EOL;

// 2. Encoding passed to the constructor and set after loading: still garbled
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML($html);
$doc->encoding = 'utf-8';
print $doc->saveHTML($doc->documentElement) . PHP_EOL . PHP_EOL;

// 3. Prepend an XML encoding declaration so the parser reads the input as UTF-8
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="utf-8"?>' . $html);
$doc->encoding = 'utf-8';
print $doc->saveHTML($doc->documentElement) . PHP_EOL . PHP_EOL;

// 4. Convert non-ASCII characters to HTML entities before loading
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
print $doc->saveHTML($doc->documentElement) . PHP_EOL . PHP_EOL;

// Benchmark
print "Testing XML encoding spec" . PHP_EOL;
$time = microtime(TRUE);
for ($i = 0; $i < 10000; $i++) {
    $doc = new DOMDocument();
    $doc->loadHTML('<?xml encoding="utf-8"?>' . $html);
    foreach ($doc->childNodes as $item) {
        if ($item->nodeType == XML_PI_NODE) {
            $doc->removeChild($item); // remove hack
        }
    }
    $doc->encoding = 'utf-8';
    $doc->saveHTML();
    unset($doc);
}
print (microtime(TRUE) - $time) . " seconds" . PHP_EOL . PHP_EOL;

print "Testing mb_convert_encoding" . PHP_EOL;
$time = microtime(TRUE);
for ($i = 0; $i < 10000; $i++) {
    $doc = new DOMDocument();
    $doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
    $doc->saveHTML();
    unset($doc);
}
print (microtime(TRUE) - $time) . " seconds" . PHP_EOL . PHP_EOL;
hi</b><p>سلام<div>の家庭に、9 ☆
<html><body>
<p>hi</p>
<p>سÙا٠</p>
<div>ã®å®¶åºã«ã9 â</div>
</body></html>
<html><body>
<p>hi</p>
<p>سÙا٠</p>
<div>ã®å®¶åºã«ã9 â</div>
</body></html>
<html><body>
<p>hi</p>
<p>سلام</p>
<div>の家庭に、9 ☆</div>
</body></html>
<html><body>
<p>hi</p>
<p>سلام</p>
<div>の家庭に、9 ☆</div>
</body></html>
Testing XML encoding spec
0.45506000518799 seconds
Testing mb_convert_encoding
0.47111082077026 seconds
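The XML-declaration approach from the benchmark could be wrapped in a small helper. This is only a sketch: the function name `loadUtf8Html` is hypothetical and not part of the gist, and it assumes the input really is UTF-8. It copies the child list before removing the processing-instruction node, since removing nodes while iterating a live `childNodes` list can skip siblings.

```php
<?php
// Hypothetical helper (not from the original gist): parse a UTF-8 HTML
// fragment by prepending an XML encoding declaration, then drop that node.
function loadUtf8Html(string $html): DOMDocument
{
    // Silence parser warnings for sloppy markup; remember the old setting.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML('<?xml encoding="utf-8"?>' . $html);

    // Iterate over a copy so removing a node cannot disturb the loop.
    foreach (iterator_to_array($doc->childNodes) as $node) {
        if ($node->nodeType === XML_PI_NODE) {
            $doc->removeChild($node); // remove the encoding hack
        }
    }
    $doc->encoding = 'UTF-8';

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

$doc = loadUtf8Html('hi</b><p>سلام<div>の家庭に、9 ☆');
print $doc->saveHTML($doc->documentElement) . PHP_EOL;
```

Restoring the previous `libxml_use_internal_errors()` setting keeps the helper from changing error handling for the rest of the script.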
Nice, that works for me.
In the age of PHP 8.2 deprecations (mb_convert_encoding() with 'HTML-ENTITIES' is deprecated as of 8.2), I am finding that prepending <?xml encoding="utf-8" ?> seems to be the best option. The only other thing that seems to work is mb_encode_numericentity($str, [0x80, 0x10FFFF, 0, ~0], 'UTF-8'), but I prefer the readability of the former. It also makes sense in that it forces DOMDocument to use UTF-8 when it seems to ignore other attempts to do so.
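For comparison, here is a minimal sketch of the mb_encode_numericentity() alternative mentioned above. The wrapper name `encodeNonAsciiAsEntities` is hypothetical; the call inside it is the one quoted in the comment, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical wrapper name; the call inside is the one quoted above.
function encodeNonAsciiAsEntities(string $html): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    // Everything from U+0080 to U+10FFFF becomes a numeric entity;
    // ASCII (including the markup itself) is left untouched.
    return mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, ~0], 'UTF-8');
}

$html = 'hi</b><p>سلام<div>の家庭に、9 ☆';

libxml_use_internal_errors(true); // ignore warnings about the sloppy markup
$doc = new DOMDocument();
$doc->loadHTML(encodeNonAsciiAsEntities($html));
libxml_clear_errors();

print $doc->saveHTML($doc->documentElement) . PHP_EOL;
```

Because the parser only ever sees ASCII plus numeric entities, this route does not depend on DOMDocument guessing the input encoding.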
In the age of PHP 8.2, I would really recommend you all leave PHP for Go or Rust. They are an order of magnitude faster and designed from the ground up for multiple CPU cores. Better error handling, no goofy workarounds for UTF-8, rock-solid standard libraries, best-in-class community libraries, dead-simple deployments, and the list goes on and on.
I maintained a PHP framework. After forcing PHP to do so many things it did poorly (NLP, encryption, HTML parsing, etc.), Go and Rust are a breath of fresh air. PHP offers nothing at this point that they don't have and do better.
Amazing examples! It works for me:
$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadHTML(mb_convert_encoding($nav, 'HTML-ENTITIES', 'utf-8'));
return $dom->saveHTML();