Strip HTML tags from a string. Does not use (much) regex.
Never loads any resources (<img>, <script>, etc.) referenced in the input.
The treatment of whitespace is probably consistent across browsers but is not guaranteed.
Rather than using fragile regexes, the DOM is used, and the resulting text nodes are pulled out.
This is safe because the input is added to the DOM with node.innerHTML, which does not run <script> elements.
node.innerHTML does load images and other resources, though. To prevent that, the input is first munged to replace src and href with srco and hrefo.
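The renaming step could look roughly like the sketch below. The function name mungeResourceAttributes and the exact regex are assumptions made for illustration, not the original code; the sketch only renames attributes that are preceded by whitespace and followed by "=", and it would also touch matching text outside of tags.

```javascript
// Hypothetical helper (not the original implementation): rename src/href
// attributes to srco/hrefo so that assigning the string to innerHTML does
// not make the browser fetch anything those attributes point to.
function mungeResourceAttributes(html) {
  // $1 = leading whitespace, $2 = attribute name, $3 = optional spaces + "="
  // Limitation of this sketch: an attribute not preceded by whitespace
  // (for example <img/src=x>) is not matched.
  return html.replace(/(\s)(src|href)(\s*=)/gi, '$1$2o$3');
}
```

For example, mungeResourceAttributes('<img src="a.png">') gives '<img srco="a.png">', which the browser parses as an unknown attribute and never requests.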
To exclude the contents of <script> and <style>, the DOM tree is walked recursively, skipping those elements and collecting the text nodes.
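A minimal sketch of that recursive walk follows. The names collectText and SKIPPED_TAGS are hypothetical, and the code is written in ES5 style on the assumption that old IE still matters. It relies only on the standard nodeType, nodeName, childNodes and nodeValue properties.

```javascript
// Node type values defined by the DOM (Node.ELEMENT_NODE, Node.TEXT_NODE).
// Spelled out here because old IE does not expose the Node constants.
var ELEMENT_NODE = 1;
var TEXT_NODE = 3;

// Elements whose contents must not end up in the output.
var SKIPPED_TAGS = { SCRIPT: true, STYLE: true };

// Hypothetical helper: collect the values of all text nodes under `node`,
// in document order, ignoring everything inside <script> and <style>.
function collectText(node, parts) {
  parts = parts || [];
  if (node.nodeType === TEXT_NODE) {
    parts.push(node.nodeValue);
    return parts;
  }
  if (node.nodeType === ELEMENT_NODE &&
      SKIPPED_TAGS[node.nodeName.toUpperCase()] === true) {
    return parts; // skip this whole subtree
  }
  var children = node.childNodes || [];
  for (var i = 0; i < children.length; i++) {
    collectText(children[i], parts);
  }
  return parts;
}
```

In a browser this would be called on the container element whose innerHTML was set to the munged input, with the result joined into one string: collectText(container).join('').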
- Look into DOM Mutation events.
- Any other elements/attributes to worry about?
Safari, Chrome, Firefox, IE6-11.
- Regex: ugly
- <noscript> tag's .innerHTML: inconsistent across browsers. Firefox treats .innerHTML as .innerText.
- <div> tag's .innerHTML inside an <iframe security="restricted" sandbox="allow-same-origin">: clever, but .innerHTML already doesn't run scripts.
- <noscript> tag's .innerHTML inside an <iframe security="restricted" sandbox="allow-same-origin">: inconsistent across browsers. Firefox still treats .innerHTML as .innerText.
This is old and ugly :) Look at DOMParser instead.