Doing a big big big masscan and grabbing headers. I currently have targets in mind for a project, but wanted a way to explore the other stuff active on the same ports, so I used this deeply terrible one-liner to split the HTTP banners into tokens and then count token frequency.
fgrep -i http masscan.json | sed 's/[,]$//'\
| jq -s ".[].ports[].service.banner" | sed 's/[";:,<>()]//g'\
| sed "s/[']//g" | sed -E 's/([\\]r|[\\]n)+/ /g'\
| sed 's/[\/=]/ /g' | awk '{ for (i=1; i<=NF; i++) { print $i}}'\
| tr '[:upper:]' '[:lower:]' | grep -E '^.{4,}$'\
| grep -Ev '^[0-9.]+$' | sort | uniq -c | sort -n
fgrep -i http masscan.json
- only care about records that found HTTP
| sed 's/[,]$//'
- records have a trailing comma I need to strip
| jq -s '.[].ports[].service.banner'
- iterate the records and extract just the HTTP banner
| sed 's/[";:,<>()]//g'
- strip quotes and other junk that mess with the frequency counting
| sed "s/[']//g"
- strip out single quotes (not sure how to escape this in the line above lol)
| sed -E 's/([\\]r|[\\]n)+/ /g'
- convert the escaped \r and \n sequences (the banner's line breaks) into spaces
| sed 's/[\/=]/ /g'
- convert forward slashes and equals signs into spaces
| awk '{ for (i=1; i<=NF; i++) { print $i}}'
- print each token on a new line
| tr '[:upper:]' '[:lower:]'
- convert each token to lowercase
| grep -E '^.{4,}$'
- dump any tokens shorter than 4 characters
| grep -Ev '^[0-9.]+$'
- dump any tokens that are just numeric (looking at you, version numbers)
| sort | uniq -c | sort -n
- sort the tokens, count the # of occurrences of each unique token, sort the results by frequency
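
To poke at the tokenizing steps without a full scan file, here is a rough sketch that wraps them in a shell function and runs them on two made-up banners. The function name `banner_tokens` and the sample banners are my own inventions for illustration, not part of the original one-liner. It also folds the single-quote strip into the first sed by closing the quoted string, adding an escaped quote, and reopening it (`'\''`), which is one common way around that escaping problem.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original one-liner): reads banner text
# on stdin and prints a token frequency table, least common tokens first.
banner_tokens() {
  # '\'' ends the single-quoted string, adds a literal ', then reopens it,
  # so the single quote can live in the same character class as the rest.
  sed 's/[";:,<>()'\'']//g' \
    | sed -E 's/([\\]r|[\\]n)+/ /g' \
    | sed 's/[\/=]/ /g' \
    | awk '{ for (i = 1; i <= NF; i++) print $i }' \
    | tr '[:upper:]' '[:lower:]' \
    | grep -E '^.{4,}$' \
    | grep -Ev '^[0-9.]+$' \
    | sort | uniq -c | sort -n
}

# Made-up sample banners, shaped like jq's output: JSON-quoted strings
# where line breaks show up as the literal characters \r\n.
printf '%s\n' \
  '"HTTP/1.1 200 OK\r\nServer: nginx/1.18.0\r\nContent-Type: text/html"' \
  '"HTTP/1.0 404 Not Found\r\nServer: Apache\r\nContent-Type: text/html"' \
  | banner_tokens
```

With real data, the output of the jq step would be piped into `banner_tokens` in place of the `printf`. Since `sort -n` puts the most frequent tokens last, tacking `| tail` on the end is a quick way to see just the top of the list.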