Corey Hermanson coreyhermanson

  • Dallas, TX
@coreyhermanson
coreyhermanson / easyURLconcat.py
Created June 30, 2016 16:22
Python script to generate search result page URLs
#!/usr/bin/env python
import pyperclip
import sys
"""
Example output for below configuration, pasted to your Clipboard:
http://www.exampledomain.com/all-stories/?page=1
http://www.exampledomain.com/all-stories/?page=2
http://www.exampledomain.com/all-stories/?page=3
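The example output above follows a simple pattern: a fixed base URL followed by an incrementing page number. A minimal sketch of that concatenation is shown here. The helper name `build_page_urls` and its parameters are hypothetical (the original script body is not visible in this preview), and the pyperclip clipboard step is left out so the block runs on the standard library alone.

```python
def build_page_urls(base_url, first_page, last_page):
    """Return one URL per page number, first_page through last_page inclusive."""
    # Hypothetical helper: the original script's own function names are not shown.
    return [f"{base_url}{page}" for page in range(first_page, last_page + 1)]


urls = build_page_urls("http://www.exampledomain.com/all-stories/?page=", 1, 3)
print("\n".join(urls))
```

Joining the list with newlines gives text that pastes as one URL per row, which is presumably what the script copies to the clipboard.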
coreyhermanson / regex_in_list.py
Created June 30, 2016 16:40
Python script that combines a list of regular expressions into a single master regex, tests each line of an input file against it, and writes the matching lines to an output file.
#!/usr/bin/env python
import re
input_file = 'infile.txt' # enter full file path, precede string with 'r' (r'PATH') if using Windows
output_file = 'outfile.txt' # enter full file path, precede string with 'r' (r'PATH') if using Windows
delete_counter = 0
# list of individual regex, which will be combined into a single regex in the next step
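One way the combining step could look is sketched below. The pattern list, the sample lines, and the helper name `filter_lines` are all hypothetical, since the original list is cut off in this preview. The sketch also assumes `search` semantics (a match anywhere in the line), which the original may or may not use.

```python
import re

# Hypothetical patterns; the original list is not visible in the preview.
regex_list = [r"\d{3}-\d{4}", r"unsubscribe", r"^\s*$"]

# Wrap each pattern in a non-capturing group so the alternation
# cannot change the meaning of an individual pattern.
master_regex = re.compile("|".join(f"(?:{pattern})" for pattern in regex_list))


def filter_lines(lines, pattern):
    """Keep only the lines where the pattern matches somewhere in the line."""
    return [line for line in lines if pattern.search(line)]


sample_lines = ["call 555-1234 now", "plain text", "Click to unsubscribe"]
matched = filter_lines(sample_lines, master_regex)
print(matched)
```

Compiling the alternation once means each line is tested against a single pattern rather than being looped over every regex in the list.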
coreyhermanson / list_to_clipboard.py
Created June 30, 2016 16:52
Takes a list of strings and copies them to the Clipboard, one per line, ready to paste. Useful when you need to copy a list of results into Excel or Notepad++. Uses Python and the pyperclip module.
#!/usr/bin/env python
import pyperclip
example_list = ["Line 1", "Line 2", "Line 3", "forever and ever"]
def list_to_clipboard(output_list):
    """ Check if len(list) > 0, then copy to clipboard """
    if len(output_list) > 0:
        pyperclip.copy('\n'.join(output_list))
coreyhermanson / brightplanet_harvestAPI_examples.JSON
Last active September 29, 2016 18:55
Example JSON requests for Website, Deep Web, and RSS harvests using the BrightPlanet Harvest API
EXAMPLE JSON PAYLOADS FOR BRIGHTPLANET HARVEST API
=================================================
1. Website harvest - scraping search results pages
2. Website harvest - harvesting a list of URLs, includes Xpath overwrite and Date-finding Xpath
3. Website harvest - scheduled harvest to monitor new documents
4. Deep Web harvest - query search engines (USE SPARINGLY - rate limits)
5. Deep Web harvest - query sources from multiple source groups
6. RSS harvest - monitor new documents daily using RSS feeds, includes Xpath overwrite and Date-finding Xpath
7. XPATH expressions - use these xpaths to manipulate which text is harvested from a web page
=================================================
coreyhermanson / python_codebook.md
Last active June 9, 2021 12:06
Python CodeBook
coreyhermanson / bp_twitterharvest
Last active March 22, 2017 19:02
Harvest API - Twitter Harvest
import requests
infile = r'C:\Users\Account\PythonFiles\generic_infile.txt' # full path to any file inside quotes
# Harvest Event Variables
api_key = "123abc" # STRING - 1 API key per Harvest API schema
searchable_items_per_event = 100 # INT - max queries OR max screenNames
name_of_event = "NewYork_Politics" # STRING - Program will prepend "TW_" and append "_#" to the name
filterQuery = None # STRING - ex: "nuclear AND (war OR energy)"
event_tags = ["source_Politics", "New York"] # LIST
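The comments above imply two steps: splitting the input into groups of at most `searchable_items_per_event` items, and naming each resulting event with a "TW_" prefix and a numeric suffix. A rough sketch of both follows. The helper names `chunk_items` and `build_event_names` are hypothetical, the suffix is assumed to start at 1, and no Harvest API call is made.

```python
searchable_items_per_event = 100
name_of_event = "NewYork_Politics"


def chunk_items(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_event_names(base_name, count):
    """Name each event TW_<base_name>_<n>; numbering from 1 is an assumption."""
    return [f"TW_{base_name}_{n}" for n in range(1, count + 1)]


# Hypothetical input: 250 screen names would need three events.
screen_names = [f"user{n}" for n in range(250)]
groups = chunk_items(screen_names, searchable_items_per_event)
event_names = build_event_names(name_of_event, len(groups))
print(list(zip(event_names, (len(group) for group in groups))))
```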
coreyhermanson / deepweb_examples.md
Created April 4, 2017 17:47
BrightPlanet Harvest API: Deep Web Project Examples

BrightPlanet Harvest API: Deep Web harvest examples

One-Time and Scheduled harvest examples

  1. Unscheduled: Deep Web harvest will execute immediately (no delay parameter), and run once (scheduleType="ONCE" and no interval parameter)
 {
  "id": "string",
  "harvestEventType": "DEEP",
coreyhermanson / normalize_company.py
Created May 10, 2017 15:04
Strip company indicators from company terms
#!/usr/bin/env python
import pyperclip
import re
from list_clipboard_manipulations import list_to_clipboard
delete_counter = 0
good_list = list()
sort_alpha = False
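As an illustration of what stripping company indicators might involve, the sketch below removes a trailing legal suffix with one regex. The indicator list, the sample names, and the function name `strip_indicator` are hypothetical, not taken from the original script. The clipboard import is left out so the block runs on its own.

```python
import re

# Hypothetical indicator list; the original script's list is not shown.
INDICATORS = ["Inc", "LLC", "Ltd", "Corp", "Co"]

# Match a separator (spaces or commas), one indicator, an optional
# trailing period, and the end of the string.
indicator_regex = re.compile(
    r"[\s,]+(?:" + "|".join(re.escape(word) for word in INDICATORS) + r")\.?\s*$",
    re.IGNORECASE,
)


def strip_indicator(name):
    """Remove one trailing company indicator, if present."""
    return indicator_regex.sub("", name).strip()


sort_alpha = False
company_terms = ["Acme, Inc.", "Globex LLC", "Initech", "Stark Corp"]
good_list = [strip_indicator(term) for term in company_terms]
if sort_alpha:
    good_list.sort()
print(good_list)
```

Requiring a separator before the indicator keeps names such as "Initech" intact even though they end in letters that resemble a suffix.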
coreyhermanson / gist:85defceac4e5cd6548aef7e32ed89584
Created November 27, 2017 17:01
BrightPlanet Harvest API: Create RSS harvests from a spreadsheet of sources
import requests
import csv
input_file = r'YOUR_FULL_FILEPATH_HERE'
var_scheduled = "RECURRING"
var_initial_delay = 1.0 # float
var_time_between_scheduled_events = 12.0 # float
var_max_depth = 1
var_depth_external = 0
var_max_docsize = -1
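The spreadsheet-driven approach can be sketched as reading one source per CSV row and pairing it with the shared schedule settings. Everything specific here is an assumption: the column names `name` and `url`, the dictionary keys, and the helper `rows_to_settings` are hypothetical and do not represent the Harvest API's request schema. No request is sent.

```python
import csv
import io

var_scheduled = "RECURRING"
var_initial_delay = 1.0  # float
var_time_between_scheduled_events = 12.0  # float

# Hypothetical spreadsheet contents, standing in for the real input file.
sample_csv = "name,url\nExample Feed,http://www.exampledomain.com/feed.xml\n"


def rows_to_settings(csv_file):
    """Build one plain settings dict per spreadsheet row (keys are illustrative)."""
    reader = csv.DictReader(csv_file)
    return [
        {
            "name": row["name"],
            "url": row["url"],
            "schedule": var_scheduled,
            "initial_delay": var_initial_delay,
            "interval": var_time_between_scheduled_events,
        }
        for row in reader
    ]


settings = rows_to_settings(io.StringIO(sample_csv))
print(settings)
```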