Doing a big big big masscan and grabbing headers. I already have targets in mind for a project, but wanted a way to explore the other stuff active on the same ports. Used this deeply terrible one-liner to split the HTTP banners into tokens and then count token frequency:
fgrep -i http masscan.json | sed 's/[,]$//'\
| jq -s ".[].ports[].service.banner" | sed 's/[";:,<>()]//g'\
| sed "s/[']//g" | sed -E 's/([\\]r|[\\]n)+/ /g'\
| sed 's/[\/=]/ /g' | awk '{ for (i=1; i<=NF; i++) { print $i}}'\
| tr '[:upper:]' '[:lower:]' | grep -E '^.{4,}$'\
| grep -Ev '^[0-9.]+$' | sort | uniq -c | sort -n

fgrep -i http masscan.json - only care about records that found HTTP
| sed 's/[,]$//' - records have a trailing comma I need to strip
| jq -s ".[].ports[].service.banner" - iterate the records and extract just the HTTP banner
| sed 's/[";:,<>()]//g' - strip quotes and other junk that mess with the frequency counting
| sed "s/[']//g" - strip out single quotes (not sure how to escape this in the line above lol)
| sed -E 's/([\\]r|[\\]n)+/ /g' - convert the escaped \r and \n sequences in the banner into spaces
| sed 's/[\/=]/ /g' - convert forward slashes and equals signs into spaces
| awk '{ for (i=1; i<=NF; i++) { print $i}}' - print each token on a new line
| tr '[:upper:]' '[:lower:]' - convert each token to lowercase
| grep -E '^.{4,}$' - dump any tokens shorter than 4 characters
| grep -Ev '^[0-9.]+$' - dump any tokens that are just numeric (looking at you, version numbers)
| sort | uniq -c | sort -n - sort the tokens, count the # of occurrences of each unique token, sort the results by frequency
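
For what it's worth, the tokenizing half of the pipeline could probably be folded into one sed call with several -e expressions, which also sidesteps the single-quote escaping question by giving that one expression its own double-quoted argument. This is only a rough sketch: the function name count_tokens is made up for illustration, and it assumes the input is the JSON-quoted banner strings that jq prints, where \r and \n show up as literal backslash sequences.

```shell
#!/bin/sh
# count_tokens (hypothetical name): reads JSON-quoted banner strings on stdin,
# prints "count token" lines sorted by ascending frequency.
count_tokens() {
  # -e 1: strip quotes and punctuation that would pollute the counts
  # -e 2: strip single quotes (double-quoted so no escaping is needed)
  # -e 3: turn runs of literal \r and \n escapes into a single space
  # -e 4: turn slashes and equals signs into spaces
  sed -E \
      -e 's/[";:,<>()]//g' \
      -e "s/'//g" \
      -e 's/(\\r|\\n)+/ /g' \
      -e 's#[/=]# #g' \
  | awk '{ for (i = 1; i <= NF; i++) print $i }' \
  | tr '[:upper:]' '[:lower:]' \
  | grep -E '^.{4,}$' \
  | grep -Ev '^[0-9.]+$' \
  | sort | uniq -c | sort -n
}
```

The front of the original pipeline stays as it was, so usage would look something like: fgrep -i http masscan.json | sed 's/[,]$//' | jq -s ".[].ports[].service.banner" | count_tokens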