Eric Harley ericharley

ericharley / gist:bd653fcf8228cba43979c97d6efcf8da

Created September 19, 2022 16:15

Quickly extract all links from a web page using the browser console

	// source https://towardsdatascience.com/quickly-extract-all-links-from-a-web-page-using-javascript-and-the-browser-console-49bb6f48127b
	// Collect every anchor element on the page.
	var x = document.querySelectorAll("a");
	var myarray = [];
	for (var i = 0; i < x.length; i++) {
		// Collapse runs of whitespace so each link label fits on one line.
		var nametext = x[i].textContent;
		var cleantext = nametext.replace(/\s+/g, ' ').trim();
		var cleanlink = x[i].href;
		myarray.push([cleantext, cleanlink]);
	}
	function make_table() {

ericharley / doit.py

Created November 9, 2018 20:00

Python for Common Crawl

	import csv
	import gzip
	import requests
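	try:
		from StringIO import StringIO  # Python 2
	except ImportError:
		from io import BytesIO as StringIO  # Python 3: gzip data is bytes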

	# Parameters
	prefix = 'https://commoncrawl.s3.amazonaws.com/'
	fileout_extension = "pdf"

	def get_file(warc_filename, warc_record_offset, warc_record_length, content_digest):
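The body of get_file is cut off in this preview, so the following is only a sketch of the usual way to pull a single record out of a Common Crawl WARC file, not the author's implementation. It assumes the offset and length come from a Common Crawl index row, and that the record is requested with an HTTP Range header and then decompressed on its own, which works because each WARC record is stored as a separate gzip member. The names byte_range_header, decompress_record and fetch_record are hypothetical; only the prefix value is taken from the gist.

```python
import gzip
import io

# Same prefix as in the gist above.
PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def byte_range_header(offset, length):
    """Build an HTTP Range header covering exactly one WARC record.

    offset and length may be strings (for example, read from a CSV row).
    """
    start = int(offset)
    end = start + int(length) - 1  # the end of an HTTP byte range is inclusive
    return {"Range": "bytes={}-{}".format(start, end)}


def decompress_record(raw_bytes):
    """Decompress one gzip member, i.e. one WARC record."""
    with gzip.GzipFile(fileobj=io.BytesIO(raw_bytes)) as fh:
        return fh.read()


def fetch_record(warc_filename, offset, length):
    """Download and decompress a single record (hypothetical helper)."""
    import requests  # third-party; imported here so the helpers above stay stdlib-only

    resp = requests.get(
        PREFIX + warc_filename,
        headers=byte_range_header(offset, length),
        timeout=60,
    )
    resp.raise_for_status()
    return decompress_record(resp.content)
```

The two helpers can be checked without network access: byte_range_header(100, 50) gives the header for bytes 100 through 149, and decompress_record round-trips anything produced by gzip.compress.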