exploring targets by word frequency from masscan output

Doing a big big big masscan and grabbing banners. I already have targets in mind for a project, but I wanted a way to explore the other stuff active on the same ports. I used this deeply terrible one-liner to split the HTTP banners into tokens and then count token frequency.

fgrep -i http masscan.json | sed 's/[,]$//'\
 | jq -s ".[].ports[].service.banner" | sed 's/[";:,<>()]//g'\
 | sed "s/[']//g" | sed -E 's/([\\]r|[\\]n)+/ /g'\
 | sed 's/[\/=]/ /g' | awk '{ for (i=1; i<=NF; i++) { print $i}}'\
 | tr '[:upper:]' '[:lower:]' | grep -E '^.{4,}$'\
 | grep -Ev '^[0-9.]+$' | sort | uniq -c | sort -n

fgrep -i http masscan.json - only care about records that found HTTP
| sed 's/[,]$//' - records have a trailing comma I need to strip
| jq -s ".[].ports[].service.banner" - slurp the records into one array, iterate over them and extract just the HTTP banner
| sed 's/[";:,<>()]//g' - strip quotes and other junk that mess with the frequency counting
| sed "s/[']//g" - strip out single quotes (not sure how to escape this in the line above lol)
| sed -E 's/([\\]r|[\\]n)+/ /g' - convert the escaped \r and \n sequences (jq prints them as literal text) into spaces
| sed 's/[\/=]/ /g' - convert forward slashes and equals signs into spaces
| awk '{ for (i=1; i<=NF; i++) { print $i}}' - print each token on a new line
| tr '[:upper:]' '[:lower:]' - convert each token to lowercase
| grep -E '^.{4,}$' - dump any tokens shorter than 4 characters
| grep -Ev '^[0-9.]+$' - dump any tokens that are just numeric (looking at you, version numbers)
| sort | uniq -c | sort -n - sort the tokens, count the # of occurrences of each unique token, sort the results by frequency
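
To poke at the tokenizing half without a real scan handy, here is a minimal sketch that wraps the stages after jq in a shell function and feeds it two banners. The function name count_tokens and both sample banners are made up for illustration. The banners are written the way jq would print them: wrapped in double quotes, with \r\n as literal backslash sequences.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original one-liner): the stages after
# jq, reading one quoted banner string per line on stdin.
count_tokens() {
  sed 's/[";:,<>()]//g' \
    | sed "s/[']//g" \
    | sed -E 's/([\\]r|[\\]n)+/ /g' \
    | sed 's/[\/=]/ /g' \
    | awk '{ for (i = 1; i <= NF; i++) print $i }' \
    | tr '[:upper:]' '[:lower:]' \
    | grep -E '^.{4,}$' \
    | grep -Ev '^[0-9.]+$' \
    | sort | uniq -c | sort -n
}

# Made-up sample banners. printf %s leaves the backslashes alone, so the
# function sees a literal \r\n, same as in jq output.
printf '%s\n' \
  '"HTTP/1.1 200 OK\r\nServer: nginx/1.18.0\r\nContent-Type: text/html"' \
  '"HTTP/1.0 404 Not Found\r\nServer: Apache/2.4.41 (Ubuntu)"' \
  | count_tokens
```

With these two banners, http and server each get a count of 2 and end up at the bottom, the other surviving tokens (nginx, apache, ubuntu and so on) get 1, and the version numbers and short tokens like "ok" and "not" are dropped. The exact column padding from uniq -c varies between implementations. On the single-quote question: closing the quoted string around an escaped quote should let the two quote-stripping seds merge into one, something like sed 's/[";:,<>()'\'']//g', though I have only reasoned that through for a POSIX-style shell.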

smidgedy/the_endless_pain.md