@eleclerc
Created October 22, 2015 17:21
diff --git a/modules/src/main/java/org/archive/modules/extractor/ExtractorHTML.java b/modules/src/main/java/org/archive/modules/extractor/ExtractorHTML.java
index ef84964..dfb563b 100644
--- a/modules/src/main/java/org/archive/modules/extractor/ExtractorHTML.java
+++ b/modules/src/main/java/org/archive/modules/extractor/ExtractorHTML.java
@@ -138,7 +138,7 @@ public class ExtractorHTML extends ContentExtractor implements InitializingBean
 "(?is)<(?:((script[^>]*+)>.*?</script)" + // 1, 2
 "|((style[^>]*+)>.*?</style)" + // 3, 4
 "|(((meta)|(?:\\w{1,"+MAX_ELEMENT_REPLACE+"}))\\s+[^>]*+)" + // 5, 6, 7
- "|(!--(?!\\[if).*?--))>"; // 8
+ "|(!--(?!\\[if|>).*?--))>"; // 8
 // version w/ problems with unclosed script tags
 // static final String RELEVANT_TAG_EXTRACTOR =
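
A minimal sketch of what the changed lookahead appears to do, using only the comment alternative (group 8) rather than the full extractor regex. The class name, helper method, and sample markup are hypothetical and are not part of the patched file. The sample assumes a downlevel-revealed conditional comment, where `<!-->` is followed by ordinary markup. With the old lookahead, `<!-->` is taken as a comment opener and the lazy `.*?` runs on to the next `-->`, so the link in between would be swallowed as comment text. With `|>` added, `<!-->` is no longer treated as a comment opener.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentLookaheadDemo {
    // Simplified to the comment alternative only (group 8 in the diff);
    // the real pattern has further alternatives for script, style, and elements.
    static final Pattern OLD_COMMENT = Pattern.compile("(?is)<(!--(?!\\[if).*?--)>");
    static final Pattern NEW_COMMENT = Pattern.compile("(?is)<(!--(?!\\[if|>).*?--)>");

    // Hypothetical input: a downlevel-revealed conditional comment around a link.
    static final String SAMPLE =
        "<!--[if !IE]><!--><a href=\"x.html\">x</a><!--<![endif]-->";

    // Returns the first substring matched as a comment, or null if there is none.
    static String firstMatch(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // old: <!--><a href="x.html">x</a><!--<![endif]-->
        System.out.println("old: " + firstMatch(OLD_COMMENT, SAMPLE));
        // new: <!--<![endif]-->
        System.out.println("new: " + firstMatch(NEW_COMMENT, SAMPLE));
    }
}
```

In the full regex, skipping `<!-->` should leave the `<a ...>` tag free to be matched by the element alternative (groups 5 to 7); that part is not exercised by this sketch.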