@coreyhermanson
Created June 30, 2016 16:40
Python script that combines a list of regexes into a single master regex. The script reads an input file, tests each line against the master regex, and writes every matching line to an output file.
#!/usr/bin/env python3
import re

input_file = 'infile.txt'  # enter full file path; use a raw string (r'PATH') on Windows
output_file = 'outfile.txt'  # enter full file path; use a raw string (r'PATH') on Windows
match_counter = 0

# list of individual regexes, which will be combined into a single regex in the next step
# (raw strings keep the backslash escapes intact)
regexes = [
    r"\.fortress\.com",
    r"\.hf\.com",
    r"\.co\.uk",
    r"\.com\.mx",
    r"newhollandcapital\.com\.au",
    r"avenuecapital\.com",
]

# Make a master regex which combines all the individual regexes in the regexes list
combined_regexes = "(" + ")|(".join(regexes) + ")"

# Go line by line, apply the combined regex to each line, and write the line to outfile if it matches
with open(input_file, 'r', encoding='UTF8') as inFile, open(output_file, 'w', newline='', encoding='UTF8') as outFile:
    line_counter = 0
    for line in inFile:
        line_counter += 1
        if re.search(combined_regexes, line):
            match_counter += 1
            outFile.write(line)
        if line_counter % 1000 == 0:
            print("Currently on line: {}".format(line_counter))

print(str(match_counter) + " matching lines were written to the output file.")
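The same filtering logic can be sketched as a small reusable function that works on any iterable of lines, which makes it easy to try out without touching files. This is only an illustrative variant, not part of the original script: the helper names `build_master_regex` and `filter_lines` and the sample lines are hypothetical, and the sketch swaps in non-capturing groups and a precompiled pattern in place of the plain string join.

```python
import re


def build_master_regex(patterns):
    """Join individual patterns into one compiled alternation."""
    # Non-capturing groups keep each pattern self-contained
    # without creating numbered capture groups.
    return re.compile("|".join("(?:{})".format(p) for p in patterns))


def filter_lines(lines, master):
    """Yield only the lines in which the master regex matches anywhere."""
    for line in lines:
        if master.search(line):
            yield line


# Hypothetical subset of the pattern list, written as raw strings.
patterns = [r"\.fortress\.com", r"\.co\.uk", r"avenuecapital\.com"]
master = build_master_regex(patterns)

# Made-up sample input standing in for the lines of the input file.
sample = [
    "alice@mail.fortress.com\n",
    "bob@example.org\n",
    "carol@shop.co.uk\n",
    "dave@avenuecapital.com\n",
]

kept = list(filter_lines(sample, master))
print("".join(kept), end="")
```

Because the patterns are unanchored and `search` is used, a line is kept whenever any pattern occurs anywhere in it. Compiling once up front also avoids passing the pattern string to `re.search` on every line.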