@joskid
joskid / web-scraping-java-jsoup-htmlunit-jaunt-uij-selenium-phantomjs.md
Created March 6, 2021 04:04
Web Scraping with Java: JSoup - HtmlUnit - Jaunt - ui4j - Selenium - PhantomJS

JSoup

JSoup is an HTML parser: it cannot control a web page, only parse its content. It supports only CSS selectors, letting you select elements with jQuery-like syntax, and it provides a clean API for traversing the HTML DOM tree to reach the elements of interest. This DOM traversal is JSoup's major strength. It can be used in web applications.

HtmlUnit

HtmlUnit is a "GUI-less browser for Java programs". It can simulate the behaviour of Chrome, Firefox or Internet Explorer. It is a lightweight solution without many dependencies. It generally supports JavaScript and cookies, but may fail in some cases. HtmlUnit is used for testing and web scraping, and is the basis for other tools. You can simulate pretty much anything a browser can do, such as click and submit events. It is much more than an HTML parser and is ideal for automated unit testing of web applications. It supports XPath, but the problem starts when you try to extrac

joskid / Data Mining Books.md
Created March 3, 2021 02:25 — forked from dweinstein/Data Mining Books.md
Free Data Mining books

Source: http://christonard.com/12-free-data-mining-books/

  • An Introduction to Statistical Learning with Applications in R by James, Witten, Hastie & Tibshirani – This book is fantastic and has helped me quite a bit. It provides an overview of several methods, along with the R code for how to complete them. 426 Pages.
  • The Elements of Statistical Learning by Hastie, Tibshirani & Friedman – This is an in-depth overview of methods, complete with theory, derivations & code. I’d definitely consider this a graduate level text. I’d also consider it one of the best books available on the topic of data mining. 745 Pages.
  • A Programmer’s Guide to Data Mining by Ron Zacharski – This one is an online book, each chapter downloadable as a PDF. It’s also still in progress, with chapters being added a few times each year.
  • Probabilistic Programming & Bayesian Methods for Hackers by Cam Davidson-Pilon – This book is absolutely fantastic. The author explains Bayesian statistics, provides several diverse examples of how to a
joskid / matrix_factorization.py
Created February 28, 2021 05:30 — forked from kastnerkyle/matrix_factorization.py
Matrix factorization code related to matrix completion
# (C) Kyle Kastner, June 2014
# License: BSD 3 clause
import numpy as np
from scipy import sparse
def minibatch_indices(X, minibatch_size):
    minibatch_indices = np.arange(0, len(X), minibatch_size)
    minibatch_indices = np.asarray(list(minibatch_indices) + [len(X)])
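The preview above stops after building an array of minibatch boundaries (0, `minibatch_size`, 2 × `minibatch_size`, ..., `len(X)`). As a rough illustration of how such boundaries are typically used, here is a plain-Python sketch that pairs each boundary with the next one to get `(start, end)` slices. The name `minibatch_bounds` and the pairing step are assumptions for illustration; the rest of the original function is not visible here.

```python
def minibatch_bounds(n_items, minibatch_size):
    """Return (start, end) index pairs that together cover range(n_items).

    Hypothetical helper: mirrors the boundary computation shown in the
    preview, then pairs consecutive boundaries (an assumed next step).
    """
    # Boundaries at 0, minibatch_size, 2 * minibatch_size, ... with
    # n_items appended as the final edge so the last batch may be short.
    edges = list(range(0, n_items, minibatch_size)) + [n_items]
    # Pair each edge with its successor to form half-open slices.
    return list(zip(edges[:-1], edges[1:]))


if __name__ == "__main__":
    data = list("abcdefghij")
    for start, end in minibatch_bounds(len(data), 4):
        print(start, end, data[start:end])
```

Appending the total length as a final edge is what lets the last minibatch be shorter than `minibatch_size` instead of being dropped.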
joskid / ALS_implementation.py
Created February 28, 2021 05:29 — forked from himanshk96/ALS_implementation.py
Recommendation using ALS for implicit data. Code for Medium Blog
# -*- coding: utf-8 -*-
"""
Created on Sun Jun 23 22:20:58 2019
@author: himansh
"""
#import libraries
import sys
import pandas as pd
import numpy as np
import re
import itertools
from collections import Counter
def clean_str(string):
    """
    Tokenization/string cleaning for all datasets except for SST.
    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
joskid / Page Rank Algorithm.py
Created February 18, 2021 23:08 — forked from ksdkamesh99/Page Rank Algorithm.py
Implementation of pagerank algorithm using python networkx library
# -*- coding: utf-8 -*-
"""Page Rank Algorithm.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1oUC_418I6e2nv_2xBQ0sgXZtDfA98zuH
"""
# Colab shell command, commented out so the file parses as plain Python:
# cd /content/drive/My Drive/medium blogs/Page Rank Algorithm
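The gist's description says it computes PageRank with the networkx library, but the preview ends before any of that code appears. For readers who want to see the idea itself, the following is a minimal power-iteration sketch in plain Python. It is not the gist's code and does not use networkx; the function name, the dictionary-based graph format, and the example graph are all assumptions made for illustration.

```python
def pagerank(graph, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a dict {node: [nodes it links to]}.

    Illustrative sketch only. Every link target must also appear as a key.
    """
    nodes = list(graph)
    n = len(nodes)
    # Start from a uniform distribution over all nodes.
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes with no outgoing links is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not graph[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node in nodes:
            targets = graph[node]
            if targets:
                # Each node splits its damped rank equally among its links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


if __name__ == "__main__":
    # Hypothetical four-page web: "c" is linked to by every other page.
    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(pagerank(links).items()):
        print(page, round(score, 4))
```

The damping factor of 0.85 is the conventional choice; the scores always sum to 1, so they can be read as the share of time a random surfer spends on each page.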
joskid / ocr.py
Created February 17, 2021 16:28 — forked from SouravJohar/ocr.py
import pytesseract
import os
import sys
def read_image(img_path, lang='eng'):
    """
    Performs OCR on a single image
    :img_path: str, path to the image file
joskid / example.md
Created February 3, 2021 20:15 — forked from dideler/example.md
A Python script for extracting email addresses from text files. You can pass it multiple files. It prints the email addresses to stdout, one address per line. For ease of use, remove the .py extension and place it in your $PATH (e.g. /usr/local/bin/) to run it like a built-in command.
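The script itself is not shown in this preview, so the following is only a sketch of what such a tool might look like, not the gist's actual code. The regular expression is deliberately simple and will miss some valid addresses; the names `EMAIL_RE` and `extract_emails`, the lowercasing, and the de-duplication are all assumptions.

```python
import re
import sys

# Simplified pattern (an assumption): local part, "@", domain, dot, TLD.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return the unique email addresses in text, lowercased, in order seen."""
    seen = set()
    found = []
    for match in EMAIL_RE.findall(text):
        address = match.lower()
        if address not in seen:
            seen.add(address)
            found.append(address)
    return found


if __name__ == "__main__":
    # Treat every command-line argument as a file path; print one address per line.
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for address in extract_emails(handle.read()):
                print(address)
```

Run as `python extract_emails.py notes.txt inbox.txt` (file names hypothetical), this prints each file's addresses to stdout, which matches the behaviour the description outlines.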
joskid / scroll.py
Created February 2, 2021 14:35 — forked from hmldd/scroll.py
Example of Elasticsearch scrolling using Python client
# coding:utf-8
from elasticsearch import Elasticsearch
import json
# Define config
host = "127.0.0.1"
port = 9200
timeout = 1000
index = "index"