Stéphane GRILLON sgrillon14
sgrillon14 / web-scraping-java-jsoup-htmlunit-jaunt-uij-selenium-phantomjs.md
Last active March 3, 2018 22:33
Web Scraping with Java: JSoup - HtmlUnit - Jaunt - ui4j - Selenium - PhantomJS

JSoup

JSoup is an HTML parser: it can't control the web page, only parse its content. At the time of writing it supports only CSS selectors, not XPath. It lets you select elements using jQuery-like CSS selectors and provides a slick API for traversing the HTML DOM tree to reach the elements of interest. This DOM traversal is JSoup's major strength. It can be used in web applications.
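As a rough illustration of the selector-plus-traversal style described above, here is a minimal sketch. The class name `JsoupSelectorSketch`, the sample markup, and the `https://example.com/` base URI are assumptions made up for this example; only the JSoup calls themselves (`Jsoup.parse`, `select`, `text`, `attr`, `parent`) come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class JsoupSelectorSketch {

    // Hypothetical sample markup, invented for this sketch.
    static final String HTML =
            "<html><body>"
          + "<ul class=\"nav\">"
          + "<li><a href=\"/docs\">Docs</a></li>"
          + "<li><a href=\"https://example.org/blog\">Blog</a></li>"
          + "<li><span>No link</span></li>"
          + "</ul>"
          + "<a href=\"/footer\">Footer</a>"
          + "</body></html>";

    // Returns "text -> absolute URL" for every link inside the nav list.
    static List<String> extractNavLinks(String html, String baseUri) {
        // The base URI lets JSoup resolve relative hrefs via the "abs:" prefix.
        Document doc = Jsoup.parse(html, baseUri);

        // jQuery-like CSS selector: anchors with an href under <ul class="nav"> items.
        Elements links = doc.select("ul.nav > li a[href]");

        List<String> result = new ArrayList<>();
        for (Element link : links) {
            // Traversal example: step up to the enclosing element (<li> here).
            Element parent = link.parent();
            if (parent != null && "li".equals(parent.tagName())) {
                result.add(link.text() + " -> " + link.attr("abs:href"));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        extractNavLinks(HTML, "https://example.com/").forEach(System.out::println);
    }
}
```

Note that the footer link is not returned, because the selector restricts the match to the `nav` list. Nothing here executes JavaScript or interacts with the page; JSoup only sees the markup it is given.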

HtmlUnit

HtmlUnit is a "GUI-less browser for Java programs". It can simulate the behaviour of Chrome, Firefox or Internet Explorer. It is a lightweight solution that doesn't have too many dependencies. It generally supports JavaScript and cookies, but may fail in some cases. HtmlUnit is used for testing and web scraping, and is the basis for other tools. You can simulate pretty much anything a browser can do, such as click and submit events. It is much more than an HTML parser, which makes it ideal for automated unit testing of web applications. Supports XPath, but the problem starts when you try to extrac