tsellers-r7 · June 29, 2017 13:44
diff --git a/http_get_extract_by_title b/http_get_extract_by_title
 The following pipeline:
 - decompresses the file into a pipe
 - decodes the base64-encoded 'data' field using the DAP 'transform' filter
 - uses the DAP 'decode_http_reply' filter to decode the HTTP response
 - emits the result as JSON
 - uses jq to select records whose 'data.http_title' field is "Invalid URL" (sorry for the dots in the name)
 - uses pigz to compress the stream into an output file, so you never have to store uncompressed data

 pigz -dc <filename> | \
  dap json + transform data=base64decode + decode_http_reply data + json | \
  jq -c '. | select(."data.http_title"=="Invalid URL")' | \
  pigz -c > http_title_results.gz
  
 The above could be optimized by:
 - using parallel, since there is no requirement to maintain state between records
 - using the DAP filters 'remove', 'flatten', etc. to drop data you don't need
 - using grep before jq so that only records containing 'data.http_title' are sent to jq
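
 The grep suggestion could look roughly like this. It is only a sketch: it assumes dap writes the key literally as "data.http_title" in its JSON output, and the function name prefilter_http_title is made up for illustration.

```shell
# Hypothetical helper: pass through only records that mention the http_title key.
# -F makes grep match a fixed string, so the dots are not treated as regex wildcards.
# Assumption: the key appears in the JSON exactly as "data.http_title".
prefilter_http_title() {
  grep -F '"data.http_title"'
}
```

 In the pipeline above it would sit between the dap stage and jq, i.e. `... + json | prefilter_http_title | jq -c ...`, so jq only has to parse records that actually carry a title.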
 
 DAP GitHub link: https://github.com/rapid7/dap