Web Scraping with Java: JSoup - HtmlUnit - Jaunt - ui4j - Selenium - PhantomJS

JSoup

JSoup is an HTML parser: it cannot control a web page, only parse its content. Elements are selected with jQuery-like CSS selectors, and a slick API lets you traverse the HTML DOM tree to reach the elements of interest. This DOM traversal is JSoup's major strength. It can be used in web applications.

HtmlUnit

HtmlUnit is a "GUI-less browser for Java programs". It can simulate the behaviour of Chrome, Firefox or Internet Explorer. It is a lightweight solution without many dependencies. It generally supports JavaScript and cookies, but may fail in some cases. HtmlUnit is used for testing and web scraping, and is the basis for other tools. You can simulate pretty much anything a browser can do, such as click and submit events. It is much more than just an HTML parser and is ideal for automated unit testing of web applications.

HtmlUnit supports XPath, but problems start when you try to extract structured data from modern web applications that rely on jQuery and other Ajax features and use div tags extensively: HtmlUnit and other XPath-based HTML parsers do not work well with such applications. A small project on GitHub extends HtmlUnit to support CSS selectors and limited jQuery-style querying.
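To make the XPath point concrete, here is a minimal sketch that uses the JDK's built-in `javax.xml.xpath` API rather than HtmlUnit itself (an assumption made so the example runs without extra dependencies). The markup, the class name `XPathSketch` and the `price` class attribute are hypothetical. It shows how a query tied to the exact nesting of `div` elements stops matching once the nesting differs, while an attribute-based query is less sensitive to layout.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathSketch {

    // Hypothetical, well-formed markup standing in for a div-heavy page.
    static final String PAGE =
        "<html><body>"
        + "<div><div><span class=\"price\">42</span></div></div>"
        + "</body></html>";

    // Evaluates an XPath expression against PAGE and returns the string
    // value of the first match (an empty string when nothing matches).
    static String extract(String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(PAGE)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Positional path: depends on the exact number of wrapper divs.
        System.out.println(extract("/html/body/div/div/span"));
        // Same idea with one wrapper div fewer: no longer matches anything.
        System.out.println(extract("/html/body/div/span").isEmpty());
        // Attribute-based path: independent of nesting depth.
        System.out.println(extract("//span[@class='price']"));
    }
}
```

Real pages are rarely well-formed XML, which is one reason a tolerant HTML parser or a browser engine is normally used instead of a plain XML parser.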

HtmlUnitDriver

HtmlUnitDriver is the most lightweight and fastest headless-browser implementation of WebDriver. It is based on HtmlUnit.

Jaunt

Jaunt is similar to JSoup and adds integrated support for working with REST APIs and JSON. It is fast, but it does not support JavaScript. It is a commercial library.

ui4j

Ui4j is a web-automation library for Java. It is a thin wrapper around the JavaFX WebKit engine (including headless modes) and can be used for automating and testing web pages. It is a pure Java 8 solution.

Selenium

Selenium is a suite of tools for automating web browsers across many platforms. Although it is aimed at testing, it can also be used for web scraping. It is composed of several components, each taking on a specific role in the development of web application test automation.

Selenium WebDriver

A collection of language-specific bindings to drive a browser.

Remote WebDriver

Remote WebDriver separates where the tests run from where the browser runs. This allows tests to be run with browsers that are not available on the current OS (because the browser can be elsewhere). It can be used in the same way as WebDriver; the primary difference is that Remote WebDriver needs to be configured so that it can run the tests on a separate machine. The RemoteWebDriver is composed of two pieces: a client and a server.
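As a rough sketch of the client half, the snippet below builds (but does not send) the HTTP request a WebDriver client issues to start a browser session on a remote server, following the W3C WebDriver "New Session" command. It uses only the JDK's `java.net.http` API (Java 11+) instead of the Selenium client library. The server address `http://localhost:4444` and the class name `NewSessionRequestSketch` are assumptions for illustration.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class NewSessionRequestSketch {

    // Builds the "New Session" request: POST {server}/session with the
    // requested browser expressed as JSON capabilities.
    static HttpRequest newSession(String serverUrl, String browserName) {
        String body = "{\"capabilities\":{\"alwaysMatch\":{\"browserName\":\""
                + browserName + "\"}}}";
        return HttpRequest.newBuilder(URI.create(serverUrl + "/session"))
                .header("Content-Type", "application/json; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical server address; nothing is sent over the network here.
        HttpRequest request = newSession("http://localhost:4444", "firefox");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

In practice the Selenium client bindings build and send these requests for you; the point is only that the test code and the browser communicate over HTTP, which is why they can live on different machines.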

PhantomJS

PhantomJS is a headless browser used for automating web page interaction. It provides a JavaScript API enabling automated navigation, screenshots, user behavior and assertions, which makes it a common tool for running browser-based unit tests in a headless system such as a continuous integration environment. It is based on WebKit.

PhantomJSDriver (or Ghostdriver)

PhantomJSDriver is a project that provides Selenium WebDriver bindings for Java. It controls a PhantomJS instance running in Remote WebDriver mode. To use PhantomJS with Selenium, one has to use GhostDriver.

NoraUi (NOn-Regression Automation for User Interfaces)

NoraUi (NOn-Regression Automation for User Interfaces) is a Java framework based on the Selenium, Cucumber and Gherkin stack for creating GUI testing projects that can be included in the continuous integration chain of single- or multi-application web solution builds. It ensures non-regression of applications throughout their lifetimes, taking into account code evolution and the acceptance criteria defined by the business client.

Sources:
https://noraui.github.io/
https://dzone.com/articles/htmlunit-vs-jsoup-html-parsing
https://www.innoq.com/en/blog/webscraping/
http://stackoverflow.com/questions/3152138/what-are-the-pros-and-cons-of-the-leading-java-html-parsers
http://mph-web.de/web-scraping-jaunt-vs-jsoup/
http://stackoverflow.com/questions/814757/headless-internet-browser
https://seleniumhq.github.io/docs/remote.html
http://www.assertselenium.com/headless-testing/getting-started-with-ghostdriver-phantomjs/
http://www.guru99.com/selenium-with-htmlunit-driver-phantomjs.html
http://stackoverflow.com/questions/28008825/htmlunitdriver-htmlunit-vs-ghostdriver-phantomjs
