@vinovator
vinovator / forbes2kMiner.py
Last active March 2, 2018 22:46
Scrape a JS-rendered website using Selenium, PhantomJS and BeautifulSoup, and wrangle the data using pandas. Extract the Forbes Global 2000 list, process it, and export it to a CSV file.
# forbes2kMiner.py
# Python 3.4
"""
Extracts the Forbes Global 2000 list of companies and exports it to a CSV file
Since Forbes is a JS-rendered site, selenium is used to mimic user actions
BeautifulSoup is used to scrape the HTML content
Since selenium is used, Firefox is needed as the webdriver
"""
vinovator / jsonToCsv2.py
Last active November 3, 2015 13:59
Scans a JSON file and extracts the key-value pairs to a CSV file
# jsonToCSV.py
# Python 2.7.6
'''
Place all the JSON payloads as separate text files in the base folder
The program extracts each payload and generates a single CSV file
The CSV file will have the key-value pairs in separate columns
'''
import json
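The preview above stops at the first import, so the following is only a minimal sketch of the idea the docstring describes, written for Python 3 rather than the gist's Python 2.7.6. The helper names `flatten` and `json_folder_to_csv`, the dotted key format, and the `file, key, value` column layout are assumptions for illustration, not the author's code.

```python
import csv
import json
from pathlib import Path


def flatten(payload, prefix=""):
    """Yield (key, value) pairs, joining nested keys and list indexes with dots."""
    if isinstance(payload, dict):
        for key, value in payload.items():
            yield from flatten(value, f"{prefix}{key}.")
    elif isinstance(payload, list):
        for index, value in enumerate(payload):
            yield from flatten(value, f"{prefix}{index}.")
    else:
        # Leaf value: drop the trailing dot left by the parent level
        yield prefix.rstrip("."), payload


def json_folder_to_csv(base_folder, csv_path):
    """Write one (file, key, value) row per leaf value in each *.txt payload."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "key", "value"])
        for path in sorted(Path(base_folder).glob("*.txt")):
            payload = json.loads(path.read_text(encoding="utf-8"))
            for key, value in flatten(payload):
                writer.writerow([path.name, key, value])
```

Calling `json_folder_to_csv("payloads", "payloads.csv")` (both paths hypothetical) would read every `.txt` file in `payloads` and write a single CSV file with one row per key-value pair.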
vinovator / timeZoneExplorer.py
Last active October 9, 2015 20:39
Simple query to fetch all common time zones and their current time
# Python 2.7.6
# timeZoneExplorer.py
from pytz import timezone, common_timezones # import all_timezones for more exhaustive list
from datetime import datetime
import os
# Log file will be created in the same folder as the python script
my_path = "."
log_path = os.path.join(my_path, "loc_log.txt")
vinovator / portScanner.py
Created October 8, 2015 15:39
Simple Python socket program to scan TCP ports
# python 2.7.6.
# portScanner.py
import socket
from datetime import datetime
import sys
# Here we are scanning the local machine
# Replace this with gethostbyname("host") to scan a remote host
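The preview ends before the scanning loop, so here is a hedged Python 3 sketch of how a TCP connect scan of this kind is commonly written with the standard `socket` module. The function name `scan_ports` and the 0.5 second timeout are assumptions for illustration, not taken from the gist.

```python
import socket


def scan_ports(host, ports, timeout=0.5):
    """Return the ports on host that accept a TCP connection."""
    open_ports = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success and an error code otherwise,
            # so a closed port does not raise an exception
            if sock.connect_ex((host, port)) == 0:
                open_ports.append(port)
    return open_ports
```

For example, `scan_ports("127.0.0.1", range(20, 1025))` checks the well-known ports on the local machine. Only scan hosts you are authorized to test.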
vinovator / pdfTextMiner.py
Last active April 20, 2023 03:47
Sample code that uses the pdfminer module to extract text from PDF files
# pdfTextMiner.py
# Python 2.7.6
# For Python 3.x use pdfminer3k module
# This link has useful information on components of the program
# https://euske.github.io/pdfminer/programming.html
# http://denis.papathanasiou.org/posts/2010.08.04.post.html
''' Important classes to remember
PDFParser - fetches data from pdf file
vinovator / fileExplorer.py
Last active October 2, 2015 19:44
Loop through a folder path and extract all files and sub-folders. Get count of files by extension.
# fileExplorer.py
# python 2.7.6
import os
# defaultdict creates a key on first access if it doesn't exist, and updates its value if it does
from collections import defaultdict
folder_count = 0
file_count = 0
loop_count = 0
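Only the counters are visible in the preview, so the following is a minimal sketch, under assumptions, of the traversal the description implies: `os.walk` visits every sub-folder while a `defaultdict` tallies files by extension. The function name `count_by_extension`, the lower-casing of extensions, and the `"(none)"` label for files without an extension are illustrative choices, not the author's.

```python
import os
from collections import defaultdict


def count_by_extension(root):
    """Walk root and return (folder_count, file_count, counts per extension)."""
    folder_count = 0
    file_count = 0
    by_extension = defaultdict(int)  # missing extensions start at 0
    for _current, folders, files in os.walk(root):
        folder_count += len(folders)
        file_count += len(files)
        for name in files:
            # splitext returns "" for names without an extension
            extension = os.path.splitext(name)[1].lower() or "(none)"
            by_extension[extension] += 1
    return folder_count, file_count, dict(by_extension)
```

`count_by_extension(".")` would report the totals for the current working directory and everything beneath it.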
vinovator / getHttpHeader.py
Last active October 2, 2015 14:24
Get the request headers and response headers from an HTTP request-response sequence. Assumes that the URL accepts digest authentication.
# getHttpHeader.py
# Python 2.7.6
import requests
from requests.auth import HTTPDigestAuth
import getpass # To mask the password typed in
# Replace with the correct URL
url = "http://some_url"
vinovator / RestfulPostClient.py
Last active October 2, 2015 14:04
A sample RESTful client for the POST operation; assumes digest authentication
# RestfulPostClient.py
# Python 2.7.6
import requests
from requests.auth import HTTPDigestAuth
# import json # The json module is not required as we are passing JSON directly to requests
# Replace with the correct URL
url = "http://api_url"
vinovator / RestfulGetClient.py
Last active January 24, 2025 18:52
Sample code to invoke the GET method of a RESTful API with digest authentication
#Python 2.7.6
#RestfulClient.py
import requests
from requests.auth import HTTPDigestAuth
import json
# Replace with the correct URL
url = "http://api_url"
vinovator / pdfInvoiceMiner.py
Created July 26, 2015 19:03
From a set of invoice PDF files within a folder, extract the invoice number and client information and place them in an Excel file
__author__ = 'Vinoth_Subramanian'
# Python3
# pdfInvoiceMiner.py
# Program to extract the client info and invoice no from a batch of invoice PDF files
# The pdfminer3k library is used to extract text from the PDFs
# The PyPDF2 library does not extract the text from these PDFs properly
# Place all the invoice PDF files within a folder named "INVOICE"
# Place an Excel file named "invoice_info.xlsx" in the parent folder of "INVOICE"
# First column - invoice no; Second column - client details