Last active: August 26, 2024 15:24
Some one-liners to extract URLs, email addresses, and a dictionary from a text file
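# extract all URLs from a text file
# (https is normalized to http and trailing punctuation is stripped so variants of a URL are counted together)
cat file.txt | grep -E -o 'https?://[^ ]+' | sed -e 's/^https/http/' | sed -E -e 's/\W+$//' | sort | uniq -c | sort -bnr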
# extract domains from URLs found in a text file
cat file.txt | grep -E -o 'https?://[^ ]+' | sed -e 's/^https/http/' | sed -E -e 's/\W+$//' | sed -e 's/^http:\/\///' | sed -e 's/\/.*$//' | sort | uniq -c | sort -bnr
# extract email addresses
# (the top-level domain is matched as two or more letters, so longer TLDs are not truncated)
cat file.txt | grep -i -o '[A-Z0-9._%+-]\+@[A-Z0-9.-]\+\.[A-Z]\{2,\}' | sort | uniq -c | sort -bnr
# list all words in a text file
cat file.txt | tr '[:space:]' '[\n*]' | grep -v "^\s*$" | sort | uniq -c | sort -bnr
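
Every one-liner above ends with the same `sort | uniq -c | sort -bnr` tail, which counts duplicate lines and lists the most frequent first. A minimal sketch of how that tail and the URL extraction might be wrapped in reusable functions follows. The names `tally` and `extract_urls` are hypothetical and not part of the original gist, and the trailing-punctuation pattern uses a POSIX character class in place of `\W`, on the assumption that portability beyond GNU sed is wanted.

```shell
#!/bin/sh

# tally (hypothetical helper): count identical input lines and
# print them with the most frequent first.
tally() {
    sort | uniq -c | sort -bnr
}

# extract_urls (hypothetical helper): print each URL found in the file
# named by $1, with https normalized to http and any trailing
# punctuation removed so that variants of a URL compare equal.
extract_urls() {
    grep -E -o 'https?://[^ ]+' "$1" |
        sed -E -e 's/^https/http/' -e 's/[^[:alnum:]_]+$//'
}

# Usage, assuming file.txt exists in the current directory:
#   extract_urls file.txt | tally
```

Keeping the tally step separate means the same function can follow the email or word pipelines unchanged, for example `tr '[:space:]' '[\n*]' < file.txt | grep -v "^\s*$" | tally`.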