Florian Leitner fnl

Using Selenium and Python to screenshot a javascript-heavy page

As websites become more JavaScript heavy, it's harder to automate things like screenshotting for archival purposes. I've seen examples and suggestions to use PhantomJS for visual testing/archiving of websites, but have run into issues such as the non-rendering of webfonts. I've never tried out Selenium until today...and while I'm not thinking about performance implications yet, Selenium seems far more accurate than PhantomJS...which makes sense since it actually opens a real browser. And it's not too hard to script to do complex interactions: here's an [example of how to log in to Twitter, write a tweet, upload an image, and send a tweet via Selenium and DOM element selection](https://gist.github.com/dannguyen/8a6fa49253c1d6a0eb92

General Background and Overview

Probabilistic Data Structures for Web Analytics and Data Mining : A great overview of the space of probabilistic data structures and how they are used in approximation algorithm implementation.
Models and Issues in Data Stream Systems
Philippe Flajolet’s contribution to streaming algorithms : A presentation by Jérémie Lumbroso that visits some of the hostorical perspectives and how it all began with Flajolet
Approximate Frequency Counts over Data Streams by Gurmeet Singh Manku & Rajeev Motwani : One of the early papers on the subject.
[Methods for Finding Frequent Items in Data Streams](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.187.9800&rep=rep1&t

For ETS's SKLL project, we found out the hard way that Travis-CI's support for numpy and scipy is pretty abysmal. There are pre-installed versions of numpy for some versions of Python, but those are seriously out of date, and scipy is not there are at all. The two most popular approaches for working around this are to (1) build everything from scratch, or (2) use apt-get to install more recent (but still out of date) versions of numpy and scipy. Both of these approaches lead to longer build times, and with the second approach, you still don't have the most recent versions of anything. To circumvent these issues, we've switched to using Miniconda (Anaconda's lightweight cousin) to install everything.

A template for installing a simple Python package that relies on numpy and scipy using Miniconda is provided below. Since it's a common s

	#!/usr/bin/env python
	# -- coding: utf-8 --
	#
	# Copyright (C) 2019 RARE Technologies s.r.o.
	# Authors: Radim Rehurek <[email protected]>
	# MIT License

	"""
	Find private/shared memory of one or more processes, identified by their process ids (PIDs).

	#!/usr/bin/env python3
	"""
	Description
	"""
	from argparse import ArgumentParser
	import logging
	import os
	import sys

	from somewhere import main

	package spark.example

	import org.apache.spark.SparkContext
	import org.apache.spark.SparkContext._
	import org.apache.spark.SparkConf

	object SparkGrep {
	def main(args: Array[String]) {
	if (args.length < 3) {
	System.err.println("Usage: SparkGrep <host> <input_file> <match_term>")