@koert
Last active March 26, 2020 16:10
Extract plain text from HTML
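import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// `html` is the input HTML string; parse() throws IOException, so the
// enclosing method must declare or handle it.
final StringBuilder sb = new StringBuilder();
HTMLEditorKit.ParserCallback parserCallback = new HTMLEditorKit.ParserCallback() {
    // True once text has been appended since the last inserted newline.
    private boolean readyForNewline;

    @Override public void handleText(final char[] data, final int pos) {
        // Append each text run without surrounding whitespace.
        sb.append(new String(data).trim());
        readyForNewline = true;
    }

    @Override public void handleStartTag(final HTML.Tag t, final MutableAttributeSet a, final int pos) {
        // Start a new line before block-level tags, but only after some text.
        if (readyForNewline && (t == HTML.Tag.DIV || t == HTML.Tag.BR || t == HTML.Tag.P || t == HTML.Tag.LI)) {
            sb.append("\n");
            readyForNewline = false;
        }
    }

    @Override public void handleSimpleTag(final HTML.Tag t, final MutableAttributeSet a, final int pos) {
        // Empty tags such as <br> arrive here; treat them like start tags.
        handleStartTag(t, a, pos);
    }
};
// Third argument is ignoreCharSet. With false, a <meta> charset declaration in
// the input can raise ChangedCharSetException; pass true if that may occur.
new ParserDelegator().parse(new StringReader(html), parserCallback, false);
String plainText = sb.toString();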
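The snippet above assumes an existing `html` variable and an enclosing method. A minimal sketch of how it might be wrapped so it compiles and runs on its own follows. The class name `HtmlToText` and the method name `extract` are hypothetical, not part of the gist. As a further assumption, this sketch passes `true` for `ignoreCharSet`, since the input is already a decoded `String`.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical wrapper class; the gist itself only shows the bare snippet.
class HtmlToText {

    // Returns the text content of `html`, inserting a newline before each
    // DIV, BR, P, or LI tag that follows some text.
    static String extract(String html) throws IOException {
        final StringBuilder sb = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean readyForNewline;

            @Override
            public void handleText(char[] data, int pos) {
                sb.append(new String(data).trim());
                readyForNewline = true;
            }

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (readyForNewline && (t == HTML.Tag.DIV || t == HTML.Tag.BR
                        || t == HTML.Tag.P || t == HTML.Tag.LI)) {
                    sb.append("\n");
                    readyForNewline = false;
                }
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                handleStartTag(t, a, pos);
            }
        };
        // Assumption: ignoreCharSet = true, because the input is already decoded text.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(extract("<p>First</p><p>Second</p>"));
    }
}
```

Because each text run is trimmed before it is appended, text that is split by inline tags such as `<b>` may end up joined without a space; callers who need that spacing preserved would have to adjust `handleText`.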
koert commented Sep 11, 2012

@theanuradha

|| t == HTML.Tag.LI is needed
