Joe joskid

Web Crawling: Data Scraping vs. Data Crawling

HTML parsers: https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers

crawler4j : popular web crawler - https://github.com/yasserg/crawler4j
jSoup : Java HTML parser - https://jsoup.org/
Jaunt : Java web scraping & JSON querying - http://jaunt-api.com/

JSoup

JSoup is an HTML parser: it cannot control or interact with a web page, only parse its content. Element selection is built around jQuery-like CSS selectors, and a concise API lets you traverse the HTML DOM tree to reach the elements of interest. This DOM traversal is JSoup's major strength. It can be used inside web applications.

HtmlUnit

HtmlUnit is a "GUI-Less browser for Java programs". It can simulate the behaviour of Chrome, Firefox or Internet Explorer. It is a lightweight solution with few dependencies. It generally supports JavaScript and cookies, although it can fail on some pages. HtmlUnit is used for testing and web scraping, and is the basis for other tools. You can simulate most things a browser can do, such as click and submit events. It is much more than an HTML parser, and is well suited to automated unit testing of web applications. Supports XPath, but the problem starts when you try to extrac

Source: http://christonard.com/12-free-data-mining-books/

An Introduction to Statistical Learning with Applications in R by James, Witten, Hastie & Tibshirani – This book is fantastic and has helped me quite a bit. It provides an overview of several methods, along with the R code for how to complete them. 426 Pages.
The Elements of Statistical Learning by Hastie, Tibshirani & Friedman – This is an in-depth overview of methods, complete with theory, derivations & code. I’d definitely consider this a graduate level text. I’d also consider it one of the best books available on the topic of data mining. 745 Pages.
A Programmer’s Guide to Data Mining by Ron Zacharski – This one is an online book, each chapter downloadable as a PDF. It’s also still in progress, with chapters being added a few times each year.
Probabilistic Programming & Bayesian Methods for Hackers by Cam Davidson-Pilon – This book is absolutely fantastic. The author explains Bayesian statistics, provides several diverse examples of how to a

The program below takes one or more plain text files as input. It works with both Python 2 and Python 3.

Let's say we have two files that may contain email addresses.

file_a.txt

foo bar
ok [email protected] sup
 [email protected],wyd
hello world!
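
The program itself is not included in these notes. As a rough illustration only, the sketch below shows one way such a script could look. The regular expression, the function names `extract_emails` and `extract_from_files`, and the choice to print unique addresses in first-seen order are all assumptions, not the original author's code. The pattern is deliberately simple and will not match every valid address.

```python
import re
import sys

# Deliberately simple pattern (an assumption, not a full RFC 5322 matcher).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return the email-like substrings of text, in order of appearance."""
    return EMAIL_RE.findall(text)


def extract_from_files(paths):
    """Collect unique addresses across files, keeping first-seen order."""
    seen = []
    for path in paths:
        with open(path) as handle:
            for line in handle:
                for address in extract_emails(line):
                    if address not in seen:
                        seen.append(address)
    return seen


if __name__ == "__main__":
    # Hypothetical usage: python extract_emails.py file_a.txt file_b.txt
    for address in extract_from_files(sys.argv[1:]):
        print(address)
```

Scanning line by line keeps memory use flat on large files, at the cost of missing an address that is split across two lines.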

	# (C) Kyle Kastner, June 2014
	# License: BSD 3 clause

	import numpy as np
	from scipy import sparse


	def minibatch_indices(X, minibatch_size):
	    # Start offset of each minibatch, plus len(X) as the final end boundary.
	    minibatch_indices = np.arange(0, len(X), minibatch_size)
	    minibatch_indices = np.asarray(list(minibatch_indices) + [len(X)])
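
The function above is cut off after it builds the boundary array. One plausible way to use such boundaries is to pair each one with the next, giving `(start, end)` slices. The sketch below does that with plain Python lists instead of NumPy so it runs without dependencies. The name `minibatch_slices` and the pairing step are assumptions, not the original continuation.

```python
def minibatch_slices(n_samples, minibatch_size):
    """Return (start, end) index pairs that cover range(n_samples)."""
    # Same boundary construction as the snippet above, using range().
    boundaries = list(range(0, n_samples, minibatch_size)) + [n_samples]
    # Pair each boundary with the next one (assumed continuation).
    return list(zip(boundaries[:-1], boundaries[1:]))


data = list(range(10))
for start, end in minibatch_slices(len(data), 4):
    # The last slice is shorter when the size does not divide evenly.
    print(data[start:end])
```

Appending the total length as a final boundary is what lets the last, shorter batch fall out of the same pairing logic without a special case.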

	# -*- coding: utf-8 -*-
	"""
	Created on Sun Jun 23 22:20:58 2019

	@author: himansh
	"""
	# import libraries
	import sys
	import pandas as pd
	import numpy as np

	import numpy as np
	import re
	import itertools
	from collections import Counter


	def clean_str(string):
	    """
	    Tokenization/string cleaning for all datasets except for SST.
	    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
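
The snippet above stops inside its docstring, so the cleaning logic is not shown. As a hedged illustration of what regex-based tokenization and string cleaning can look like, here is a minimal sketch. The function name `clean_text` and the specific rules are assumptions chosen for the example and are not taken from the referenced file.

```python
import re


def clean_text(text):
    """Lowercase text and split basic punctuation into separate tokens."""
    # Replace anything outside letters, digits and basic punctuation with a space.
    text = re.sub(r"[^A-Za-z0-9(),!?'`]", " ", text)
    # Put spaces around punctuation so it becomes its own token.
    text = re.sub(r"([(),!?])", r" \1 ", text)
    # Collapse runs of whitespace into a single space.
    text = re.sub(r"\s{2,}", " ", text)
    return text.strip().lower()


print(clean_text("Hello, World!").split(" "))
```

Normalizing to single spaces means a plain `split(" ")` afterwards is enough to tokenize.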

	# -*- coding: utf-8 -*-
	"""Page Rank Algorithm.ipynb

	Automatically generated by Colaboratory.

	Original file is located at
	https://colab.research.google.com/drive/1oUC_418I6e2nv_2xBQ0sgXZtDfA98zuH
	"""

	cd /content/drive/My Drive/medium blogs/Page Rank Algorithm

	import pytesseract
	import os
	import sys


	def read_image(img_path, lang='eng'):
	    """
	    Performs OCR on a single image

	    :img_path: str, path to the image file

	# coding:utf-8

	from elasticsearch import Elasticsearch
	import json

	# Define config
	host = "127.0.0.1"
	port = 9200
	timeout = 1000
	index = "index"