<?php

// Ignore parser warnings caused by the malformed sample markup
libxml_use_internal_errors(true);
libxml_clear_errors();

// http://stackoverflow.com/q/10237238/99923
// http://stackoverflow.com/q/12034235/99923
// http://stackoverflow.com/q/8218230/99923

// original input (unknown encoding)
$html = 'hi</b><p>سلام<div>の家庭に、9 ☆';
print $html . PHP_EOL;

// Attempt 1: default constructor, no encoding hint
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
$doc->loadHTML($html);
print $doc->saveHTML($doc->documentElement) . PHP_EOL . PHP_EOL;

// Attempt 2: encoding passed to the constructor and set after loading
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML($html);
$doc->encoding = 'utf-8';
print $doc->saveHTML($doc->documentElement) . PHP_EOL . PHP_EOL;

// Attempt 3: XML declaration prefix as an encoding hint for the parser
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="utf-8"?>' . $html);
$doc->encoding = 'utf-8';
print $doc->saveHTML($doc->documentElement) . PHP_EOL . PHP_EOL;

// Attempt 4: convert non-ASCII characters to HTML entities before loading
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
print $doc->saveHTML($doc->documentElement) . PHP_EOL . PHP_EOL;

// Benchmark
print "Testing XML encoding spec" . PHP_EOL;
$time = microtime(true);
for ($i = 0; $i < 10000; $i++) {
    $doc = new DOMDocument();
    $doc->loadHTML('<?xml encoding="utf-8"?>' . $html);
    // Iterate over a copy so that removing a node does not disturb the live list
    foreach (iterator_to_array($doc->childNodes) as $item) {
        if ($item->nodeType == XML_PI_NODE) {
            $doc->removeChild($item); // remove hack
        }
    }
    $doc->encoding = 'utf-8';
    $doc->saveHTML();
    unset($doc);
}
print (microtime(true) - $time) . " seconds" . PHP_EOL . PHP_EOL;

print "Testing mb_convert_encoding" . PHP_EOL;
$time = microtime(true);
for ($i = 0; $i < 10000; $i++) {
    $doc = new DOMDocument();
    $doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
    $doc->saveHTML();
    unset($doc);
}
print (microtime(true) - $time) . " seconds" . PHP_EOL . PHP_EOL;
hi</b><p>سلام<div>の家庭に、9 ☆

<html><body>
<p>hi</p>
<p>Ø³ÙØ§Ù </p>
<div>ã®å®¶åºã«ã9 â</div>
</body></html>

<html><body>
<p>hi</p>
<p>Ø³ÙØ§Ù </p>
<div>ã®å®¶åºã«ã9 â</div>
</body></html>

<html><body>
<p>hi</p>
<p>سلام</p>
<div>の家庭に、9 ☆</div>
</body></html>

<html><body>
<p>hi</p>
<p>سلام</p>
<div>の家庭に、9 ☆</div>
</body></html>

Testing XML encoding spec
0.45506000518799 seconds

Testing mb_convert_encoding
0.47111082077026 seconds
Unfortunately, this was useless for me.
The last two examples work only as long as you call $document->saveHTML($document->documentElement);
If you want the whole document and call $document->saveHTML(), you'll notice that you get HTML entities instead.
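To make the difference described above easy to reproduce, here is a minimal sketch that runs both serialization calls on the gist's sample input. The variable names `$nodeOnly` and `$whole` are illustrative and not part of the original gist, and the exact form of the whole-document output is an assumption that may vary with the PHP and libxml versions installed.

```php
<?php
// Sketch: compare node-level and whole-document serialization.
// Requires the dom extension.
libxml_use_internal_errors(true); // suppress warnings from the malformed sample markup

$html = 'hi</b><p>سلام<div>の家庭に、9 ☆';

$doc = new DOMDocument();
// The XML declaration prefix is the encoding hint used in the gist.
$doc->loadHTML('<?xml encoding="utf-8"?>' . $html);

// Serializing one node: the gist's output shows the text staying as UTF-8.
$nodeOnly = $doc->saveHTML($doc->documentElement);

// Serializing the whole document: reported above to produce HTML entities
// instead; this may depend on the PHP/libxml versions (assumption).
$whole = $doc->saveHTML();

print $nodeOnly . PHP_EOL;
print $whole . PHP_EOL;
```

Running it and comparing the two printed blocks shows whether your environment behaves the way this comment reports.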
Amazing examples! It works for me:
$dom = new DOMDocument('5.0', 'utf-8');
$dom->loadHTML(mb_convert_encoding($nav, 'HTML-ENTITIES', 'utf-8'));
return $dom->saveHTML();
Nice, that works for me.
In the age of PHP 8.2 deprecations (passing 'HTML-ENTITIES' to mb_convert_encoding is deprecated as of 8.2), I am finding that prepending <?xml encoding="utf-8" ?> seems to be the best option. The only other thing that seems to work is mb_encode_numericentity($str, [0x80, 0x10FFFF, 0, ~0], 'UTF-8'), but I think I prefer the readability of the former. It also makes sense, in that it forces DOMDocument to use UTF-8 when it seems to ignore other attempts to do so.
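The two options mentioned above can be compared side by side. This is only a sketch, assuming the dom and mbstring extensions are available; the shortened sample markup and the variable names (`$viaPrefix`, `$viaEntities`, `$encoded`) are made up for illustration.

```php
<?php
// Sketch: two ways to hand UTF-8 markup to DOMDocument::loadHTML().
// Requires the dom and mbstring extensions.
libxml_use_internal_errors(true);

$html = '<p>سلام</p>'; // shortened sample input (assumption, not the gist's string)

// Option 1: prefix an XML declaration as an encoding hint.
$a = new DOMDocument();
$a->loadHTML('<?xml encoding="utf-8"?>' . $html);
$viaPrefix = $a->saveHTML($a->documentElement);

// Option 2: turn every code point from U+0080 upward into a decimal
// character reference, so the parser only ever sees ASCII.
$convmap = [0x80, 0x10FFFF, 0, ~0]; // [start, end, offset, mask], as quoted above
$encoded = mb_encode_numericentity($html, $convmap, 'UTF-8');
$b = new DOMDocument();
$b->loadHTML($encoded);
$viaEntities = $b->saveHTML($b->documentElement);

print $encoded . PHP_EOL;     // the ASCII-only string passed to the parser
print $viaPrefix . PHP_EOL;
print $viaEntities . PHP_EOL;
```

The first option leaves the input untouched but adds a processing instruction to the document, which the gist's benchmark loop removes again. The second keeps the document clean at the cost of rewriting the input string first.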
In the age of PHP 8.2 I would really recommend you all leave PHP for Go or Rust. They are an order of magnitude faster and designed from the ground up for multiple CPU cores. Better error handling, no goofy workarounds for UTF-8, rock-solid standard libraries, best-in-class community libraries, dead-simple deployments, and the list goes on and on.
I maintained a PHP framework. After forcing PHP to do so many things it did poorly (NLP, encryption, HTML parsing, etc.), Go and Rust are a breath of fresh air. At this point PHP offers nothing that they don't also have and do better.
I might be missing something, but isn't this all that is needed:
...
$doc->encoding = 'UTF-8';
$doc->C14N();
...