Maciej Januszewski miodeqqq
@miodeqqq
miodeqqq / pyspark_basics.py
Created December 1, 2016 19:24
PySpark basics: filtering, mapping, count
# -*- coding: utf-8 -*-
from pyspark import SparkContext, SparkConf
LOG_FILE = "hdfs://grid223-20:9000/input/taglogsbig/huge10g.log"
USERNAME = 'bob'
conf = SparkConf().setAppName("Maciej Januszewski").setMaster("spark://grid223-20:7077").set("spark.executor.memory", "3g").set("spark.driver.cores", 4)
@miodeqqq
miodeqqq / md5_hash_decrypt.py
Last active February 21, 2024 13:38
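Python MD5 "decrypt" (MD5 is a one-way hash, so a digest cannot be decrypted, only matched against the hashes of candidate inputs).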
# -*- coding: utf-8 -*-
import hashlib
import sys
import time
# Using: ./hash.py hashcode
# For example: ./hash.py 9743a66f914cc249efca164485a19c5c
def timing(f):
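
Because MD5 is a one-way hash, recovering an input means hashing candidates and comparing digests. The following is a minimal sketch of that idea, assuming a wordlist-style search; `find_md5_preimage` and the sample wordlist are hypothetical names for illustration and are not taken from the gist.

```python
import hashlib


def find_md5_preimage(target_hash, candidates):
    """Return the first candidate whose MD5 hex digest equals target_hash, else None."""
    target = target_hash.strip().lower()
    for candidate in candidates:
        digest = hashlib.md5(candidate.encode("utf-8")).hexdigest()
        if digest == target:
            return candidate
    return None


if __name__ == "__main__":
    # Hypothetical wordlist; a real run would read candidates from a file.
    words = ["admin", "hello", "password"]
    # The digest below is the MD5 of "hello".
    print(find_md5_preimage("5d41402abc4b2a76b9719d911017c592", words))
```

The search cost grows with the size of the candidate set, so this only works for inputs that appear in the wordlist.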
@miodeqqq
miodeqqq / validate_robots.py
Created December 1, 2016 19:37
Python BS4 sitemap validator - checks HTTP Response for all links inside <loc> .. </loc> tags
#! /usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import time
from time import sleep
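
The description above amounts to two steps: collect every URL inside `<loc>` tags, then record the HTTP status for each. The sketch below shows those steps using only the standard library's `xml.etree.ElementTree` instead of BeautifulSoup; `extract_loc_urls`, `check_urls`, and the `fetch_status` callable are hypothetical names, not the gist's own.

```python
import xml.etree.ElementTree as ET


def extract_loc_urls(sitemap_xml):
    """Return the text of every <loc> element, ignoring XML namespaces."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for element in root.iter():
        # Tags look like "{namespace}loc" when the sitemap declares a namespace.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text:
            urls.append(element.text.strip())
    return urls


def check_urls(urls, fetch_status):
    """Map each URL to the status code returned by fetch_status(url)."""
    return {url: fetch_status(url) for url in urls}
```

With `requests` installed, `fetch_status` could be something like `lambda url: requests.head(url).status_code`; passing it in as a parameter keeps the parsing step testable without network access.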
@miodeqqq
miodeqqq / get_captchas_with_selenium.py
Last active October 13, 2022 23:49
Python (Selenium) bot for downloading Google's captcha images (Google shows a captcha after too many queries). The collected images would make a good machine-learning training set for character recognition.
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import os
import sys
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
@miodeqqq
miodeqqq / timing_function.py
Created December 8, 2016 20:04
Python decorator for measuring a function's execution time.
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import time


def timing(f):
    # Wrap f so that each call prints how long it took.
    def wrap(*args, **kwargs):
        time1 = time.time()
        ret = f(*args, **kwargs)
        time2 = time.time()
        print('Function {} took --> {:0.1f} seconds'.format(f.__name__, (time2 - time1)))
        return ret
    return wrap
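
A decorator like this is applied with the `@` syntax. The variant below is a sketch assuming Python 3: it swaps in `time.perf_counter` (a monotonic clock suited to measuring intervals) and `functools.wraps`, neither of which appears in the gist, and `slow_sum` is a hypothetical function used only to exercise it.

```python
import functools
import time


def timing(f):
    """Print how long each call to f takes, then return f's result."""
    @functools.wraps(f)  # keep f.__name__ and the docstring on the wrapper
    def wrap(*args, **kwargs):
        start = time.perf_counter()
        ret = f(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print('Function {} took --> {:0.1f} seconds'.format(f.__name__, elapsed))
        return ret
    return wrap


@timing
def slow_sum(n):
    # Hypothetical workload used only to exercise the decorator.
    return sum(range(n))
```

Calling `slow_sum(10)` prints one timing line and still returns the function's own result, because the wrapper passes the return value through.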
@miodeqqq
miodeqqq / use_agents.py
Created December 8, 2016 20:06
Verified User-Agent strings for use when scraping/parsing pages.
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0) Gecko/16.0 Firefox/16.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1',
    'Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20130401 Firefox/31.0',
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36',
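
A list like this is typically used by picking one entry per request. The sketch below assumes that usage; `random_headers` is a hypothetical helper, and the shortened list repeats two entries from the gist only so the block runs on its own.

```python
import random

# Two entries copied from the gist's list; the full list would be used in practice.
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0',
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
]


def random_headers(user_agents=USER_AGENT_LIST):
    """Build request headers with a randomly chosen User-Agent."""
    return {'User-Agent': random.choice(user_agents)}
```

The returned dictionary can be passed as the `headers` argument of most HTTP clients.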
@miodeqqq
miodeqqq / pymongo_db.py
Created December 8, 2016 20:11
Basic PyMongo setup for connecting to a database. Example: find all PDF files and download them to a local drive.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
from gridfs import GridFS
from pymongo import MongoClient
from bson.objectid import ObjectId
@miodeqqq
miodeqqq / check_is_empty_tar_gz.py
Created December 8, 2016 20:14
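Python function to check whether a *.tar.gz file is empty.
def is_nonempty_gz_file(self, tarfile):
    # True if the file holds at least one byte. This checks the file on disk,
    # not whether the archive inside it has any members.
    with open(tarfile, 'rb') as f:
        try:
            file_content = f.read(1)
            return len(file_content) > 0
        except (IOError, OSError):
            # Unreadable file: report it as empty instead of returning None.
            return False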
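
Reading one byte only shows that the file itself is not zero-length; a gzip-compressed tar archive with no members still contains header bytes. The sketch below is an alternative check using the standard `tarfile` module, not the gist's method; `tar_gz_has_members` is a hypothetical name.

```python
import tarfile


def tar_gz_has_members(path):
    """Return True if the archive at path contains at least one member."""
    try:
        with tarfile.open(path, "r:gz") as archive:
            # Iterating stops at the first member, so large archives are not fully read.
            return any(True for _ in archive)
    except (tarfile.TarError, OSError, EOFError):
        # Missing, unreadable, or corrupt archive: report it as having no members.
        return False
```

Treating unreadable archives as empty is a design choice for this sketch; callers that need to tell "empty" from "broken" should let the exception propagate instead.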
@miodeqqq
miodeqqq / request_with_proxies.py
Created December 8, 2016 20:22
Python Request (urllib2) with proxies.
#! /usr/bin/env python
# -*- coding: utf-8 -*-
from urllib2 import Request, URLError, urlopen, build_opener, ProxyHandler, install_opener
def request_with_proxy():
    proxy = ProxyHandler(
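
`urllib2` exists only in Python 2; in Python 3 the same names live in `urllib.request`. The sketch below shows the proxy pattern under that assumption. The proxy address is a placeholder, and `build_proxy_opener` and `fetch_with_proxy` are hypothetical helpers, not the gist's code.

```python
from urllib.request import ProxyHandler, build_opener

# Placeholder proxy address (hypothetical); replace with a real proxy.
PROXIES = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}


def build_proxy_opener(proxies=PROXIES):
    """Return an opener that routes requests through the given proxies."""
    return build_opener(ProxyHandler(proxies))


def fetch_with_proxy(url, proxies=PROXIES, timeout=10):
    """Fetch url through the proxies and return the response body as bytes."""
    opener = build_proxy_opener(proxies)
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Using a dedicated opener, instead of calling `install_opener`, keeps the proxy settings local to these calls and leaves the process-wide default untouched.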
@miodeqqq
miodeqqq / docker_clean_images.sh
Last active December 9, 2016 23:19
Docker - remove all the dangling/unused images.
#!/bin/bash
# Remove all the dangling images
DANGLING_IMAGES=$(docker images -qf "dangling=true")
if [[ -n $DANGLING_IMAGES ]]; then
    # Left unquoted on purpose: word splitting passes each image ID as its own argument.
    docker rmi $DANGLING_IMAGES
fi
# Get all the images currently in use
USED_IMAGES=($( \