Regular expression for parsing HTML
~
(?(DEFINE)
    (?<entity>
        &
        (
            [a-z][a-z0-9]+ # named entity
            |
            \#\d+ # decimal number
            |
            \#x[0-9a-f]+ # hexadecimal number
        )
        ;
    )
    (?<attribute>
        \s+ # at least one whitespace character before the attribute
        [^\s"'<>=`/]+ # attribute name
        (
            \s*=\s* # equals sign before the value
            (
                " # value enclosed in double quotes
                (
                    [^"] # any character except double quote
                    |
                    (?&entity) # or HTML entity
                )*
                "
                |
                ' # value enclosed in single quotes
                (
                    [^'] # any character except single quote
                    |
                    (?&entity) # or HTML entity
                )*
                '
                |
                [^\s"'<>=`]+ # value without quotes
            )
        )? # value is optional
    )
    (?<void_element>
        < # start of tag
        ( # element name
            img|hr|br|input|meta|area|embed|keygen|source|base|col
            |link|param|basefont|frame|isindex|wbr|command|track
        )
        (?&attribute)* # optional attributes
        \s*
        /? # optional /
        > # end of tag
    )
    (?<special_element>
        < # start tag
        (?<special_element_name>
            script|style|textarea|title # element name
        )
        (?&attribute)* # optional attributes
        \s*
        > # end of start tag
        (?> # atomic group
            .*? # smallest possible number of any characters including new lines
            </ # end tag
            (?P=special_element_name)
        )
        \s*
        > # end of end tag
    )
    (?<element>
        < # start tag
        (?<element_name>
            [a-z][^\s/>]* # element name
        )
        (?&attribute)* # optional attributes
        \s*
        > # end of start tag
        (?&content)*
        </ # end tag
        (?P=element_name)
        \s*
        > # end of end tag
    )
    (?<comment>
        <!--
        (?> # atomic group
            .*? # smallest possible number of any characters including new lines
            -->
        )
    )
    (?<doctype>
        <!doctype
        \s
        [^>]* # any characters except '>'
        >
    )
)
\s*
(?&doctype)? # optional doctype
(?<content>
    (?&void_element) # void element
    |
    (?&special_element) # special element
    |
    (?&element) # paired element
    |
    (?&comment) # comment
    |
    (?&entity) # entity
    |
    [^<] # character
)*
~xis
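
The pattern above relies on PCRE features, `(?(DEFINE)...)` and `(?&name)` subroutine calls, that Python's standard `re` module does not provide. As a rough illustration of how the non-recursive `entity` and `attribute` subpatterns behave, here is a minimal sketch that ports just those two to `re`, composing them with string formatting instead of subroutine calls. The names `ENTITY`, `ATTRIBUTE`, `parse_attributes` and the capture groups `name`, `dq`, `sq`, `bare` are hypothetical and not part of the original gist; the recursive `element` and `content` parts are deliberately left out because `re` cannot express them.

```python
import re

# Port of the `entity` subpattern (assumption: same semantics under re.VERBOSE).
ENTITY = r"""
    &
    (?:
        [a-z][a-z0-9]+   # named entity
      |
        \#\d+            # decimal number
      |
        \#x[0-9a-f]+     # hexadecimal number
    )
    ;
"""

# Port of the `attribute` subpattern; {ENTITY} is pasted in where the
# original calls (?&entity). The named groups are added for illustration.
ATTRIBUTE = rf"""
    \s+                        # at least one whitespace character before the attribute
    (?P<name>[^\s"'<>=`/]+)    # attribute name
    (?:
        \s*=\s*                # equals sign before the value
        (?:
            "(?P<dq>(?:[^"]|{ENTITY})*)"    # value enclosed in double quotes
          |
            '(?P<sq>(?:[^']|{ENTITY})*)'    # value enclosed in single quotes
          |
            (?P<bare>[^\s"'<>=`]+)          # value without quotes
        )
    )?                         # value is optional
"""

FLAGS = re.VERBOSE | re.IGNORECASE | re.DOTALL  # mirrors the original x, i, s flags
ENTITY_RE = re.compile(ENTITY, FLAGS)
ATTRIBUTE_RE = re.compile(ATTRIBUTE, FLAGS)


def parse_attributes(tag_source):
    """Return (name, value) pairs; value is None when the attribute has no value."""
    pairs = []
    for m in ATTRIBUTE_RE.finditer(tag_source):
        value = next(
            (m.group(g) for g in ("dq", "sq", "bare") if m.group(g) is not None),
            None,
        )
        pairs.append((m.group("name"), value))
    return pairs


if __name__ == "__main__":
    print(parse_attributes("""<input type="text" name='q' disabled value=a&amp;b>"""))
    # [('type', 'text'), ('name', 'q'), ('disabled', None), ('value', 'a&amp;b')]
```

To run the full pattern, including the recursive `element` and `content` groups, an engine with PCRE-style subroutine calls is needed.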