<?php
// Ignore parser errors (loose or HTML5 markup triggers libxml warnings)
libxml_use_internal_errors(true);
libxml_clear_errors();

// http://stackoverflow.com/q/10237238/99923
// http://stackoverflow.com/q/12034235/99923
// http://stackoverflow.com/q/8218230/99923

// original input (unknown encoding)
$html = 'hi</b><p>سلام<div>の家庭に、9 ☆';
print $html . PHP_EOL;

// 1. Plain load: the parser guesses the encoding and garbles the text
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
$doc->loadHTML($html);
print $doc->saveHTML($doc->documentElement) . PHP_EOL . PHP_EOL;

// 2. Encoding passed to the constructor and set after loading: still garbled
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML($html);
$doc->encoding = 'utf-8';
print $doc->saveHTML($doc->documentElement) . PHP_EOL . PHP_EOL;

// 3. XML encoding declaration prepended to the input: works
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="utf-8"?>' . $html);
$doc->encoding = 'utf-8';
print $doc->saveHTML($doc->documentElement) . PHP_EOL . PHP_EOL;

// 4. Non-ASCII characters converted to HTML entities before loading: works
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
print $doc->saveHTML($doc->documentElement) . PHP_EOL . PHP_EOL;

// Benchmark
print "Testing XML encoding spec" . PHP_EOL;
$time = microtime(TRUE);
for ($i = 0; $i < 10000; $i++) {
    $doc = new DOMDocument();
    $doc->loadHTML('<?xml encoding="utf-8"?>' . $html);
    foreach ($doc->childNodes as $item) {
        if ($item->nodeType == XML_PI_NODE) {
            $doc->removeChild($item); // remove hack
        }
    }
    $doc->encoding = 'utf-8';
    $doc->saveHTML();
    unset($doc);
}
print (microtime(TRUE) - $time) . " seconds" . PHP_EOL . PHP_EOL;

print "Testing mb_convert_encoding" . PHP_EOL;
$time = microtime(TRUE);
for ($i = 0; $i < 10000; $i++) {
    $doc = new DOMDocument();
    $doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
    $doc->saveHTML();
    unset($doc);
}
print (microtime(TRUE) - $time) . " seconds" . PHP_EOL . PHP_EOL;
hi</b><p>سلام<div>の家庭に、9 ☆
<html><body>
<p>hi</p>
<p>سÙا٠</p>
<div>ã®å®¶åºã«ã9 â</div>
</body></html>
<html><body>
<p>hi</p>
<p>سÙا٠</p>
<div>ã®å®¶åºã«ã9 â</div>
</body></html>
<html><body>
<p>hi</p>
<p>سلام</p>
<div>の家庭に、9 ☆</div>
</body></html>
<html><body>
<p>hi</p>
<p>سلام</p>
<div>の家庭に、9 ☆</div>
</body></html>
Testing XML encoding spec
0.45506000518799 seconds
Testing mb_convert_encoding
0.47111082077026 seconds
Thank you so much, your code was very helpful!
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
print $doc->saveHTML($doc->documentElement) . PHP_EOL . PHP_EOL;
Thank you!
<?php
// Suppress parser warnings (needed for HTML5 markup)
libxml_use_internal_errors(true);
libxml_clear_errors();
$document = new \DOMDocument('1.0', 'UTF-8');
$document->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
print $document->saveHTML($document->documentElement).PHP_EOL.PHP_EOL;
?>
Maybe I'm missing something, but isn't this all that is needed:
...
$doc->encoding = 'UTF-8';
$doc->C14N();
...
Unfortunately, this was useless for me.
The last two examples work as long as you call $document->saveHTML($document->documentElement).
If you want the whole document and call $document->saveHTML(), you'll notice that you get HTML entities instead.
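To make the difference described above easier to see, here is a minimal sketch that serializes the same document both ways. The sample string and variable names are only illustrative, and the exact output can vary with the PHP and libxml versions, so the sketch decodes entities afterwards rather than relying on one form.

<?php
// Sketch only: $html and the variable names are illustrative.
libxml_use_internal_errors(true); // suppress warnings for loose markup

$html = '<p>سلام</p><div>の家庭に、9 ☆</div>';

$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="utf-8"?>' . $html);

// Serializing one node: reported above to keep the characters literal.
$fragment = $doc->saveHTML($doc->documentElement);

// Serializing the whole document: reported above to emit entities instead.
$whole = $doc->saveHTML();

// Decoding entities recovers the original characters in either case.
$decoded = html_entity_decode($whole, ENT_QUOTES | ENT_HTML5, 'UTF-8');

print $fragment . PHP_EOL;
print $whole . PHP_EOL;
print $decoded . PHP_EOL;

If the whole document is needed as literal UTF-8, decoding the result as shown is one possible workaround; whether it is appropriate depends on what else the markup contains.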
Amazing examples! This works for me:
$dom = new DOMDocument('5.0', 'utf-8');
$dom->loadHTML(mb_convert_encoding($nav, 'HTML-ENTITIES', 'utf-8'));
return $dom->saveHTML();
Nice, that works for me.
In the age of PHP 8.2 deprecations (the 'HTML-ENTITIES' encoding used with mb_convert_encoding above is one of them), I am finding that adding <?xml encoding="utf-8" ?> seems to be the best option. The only other thing that seems to work is mb_encode_numericentity($str, [0x80, 0x10FFFF, 0, ~0], 'UTF-8'), but I prefer the readability of the former. It also makes sense in that it forces DOMDocument to use UTF-8 when it seems to ignore other attempts to do so.
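For comparison, the two options mentioned in this comment can be sketched side by side. The helper names loadWithXmlPrefix() and loadWithNumericEntities() are hypothetical and not part of the gist, and the sample string is only illustrative.

<?php
// Sketch only: both helper names are hypothetical.
libxml_use_internal_errors(true); // suppress warnings for loose markup

function loadWithXmlPrefix(string $html): DOMDocument
{
    $doc = new DOMDocument();
    // The prepended declaration makes the parser read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="utf-8" ?>' . $html);
    return $doc;
}

function loadWithNumericEntities(string $html): DOMDocument
{
    $doc = new DOMDocument();
    // Turn every code point from U+0080 upward into a numeric entity,
    // so the parser only ever sees ASCII.
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, ~0], 'UTF-8');
    $doc->loadHTML($ascii);
    return $doc;
}

$html = '<p>سلام</p><div>の家庭に、9 ☆</div>';

$viaPrefix = loadWithXmlPrefix($html)->getElementsByTagName('p')->item(0)->textContent;
$viaEntities = loadWithNumericEntities($html)->getElementsByTagName('p')->item(0)->textContent;

// textContent is returned as UTF-8 regardless of how the input was loaded.
print $viaPrefix . PHP_EOL;
print $viaEntities . PHP_EOL;

The prefix variant leaves an extra processing-instruction node in the document, which the benchmark in the gist removes by hand; the entity variant needs the mbstring extension.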
In the age of PHP 8.2 I would really recommend you all leave PHP for Go or Rust. They are an order of magnitude faster and designed from the ground up for multiple CPU cores. They have better error handling, no goofy workarounds for UTF-8, rock-solid standard libraries, best-in-class community libraries, and dead-simple deployments, and the list goes on and on.
I maintained a PHP framework. After forcing PHP to do so many things it did poorly (NLP, encryption, HTML parsing, etc.), Go and Rust are a breath of fresh air. At this point PHP offers nothing that they don't have and do better.
According to http://stackoverflow.com/a/37834812 (see that answer for the approach it recommends for a correct result), this operation:
mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8');
is a bad approach, because $profile may already contain escaped special characters such as &lt; and &gt;, and they will not be converted a second time by mb_convert_encoding. That leaves a hole for XSS and produces incorrect HTML.