#!/bin/bash
# This script will delete *all* documents in a CloudSearch domain.
# USE WITH EXTREME CAUTION
# Note: depends on the AWS CLI being installed, as well as jq.
# For jq, see: https://stedolan.github.io/jq/ and https://jqplay.org/
# --output json is passed explicitly so the script works even if the
# CLI's configured default output format is not JSON.

if [[ ! $# -eq 2 || $1 != "--doc-domain" || ! $2 =~ ^https://.*$ ]]; then
   echo "Must define --doc-domain argument (e.g. --doc-domain https://somedomain.aws.com)"
   exit 1
fi

CS_DOMAIN=$2
TMP_DELETE_FILE=/tmp/delete-all-cloudsearch-documents.json
TMP_RESULTS_FILE=/tmp/delete-all-cloudsearch-documents-tmp-results.json

while true; do
   # Fetch a batch of up to 10,000 documents matching everything
   aws cloudsearchdomain search \
      --endpoint-url="${CS_DOMAIN}" \
      --output json \
      --size=10000 \
      --query-parser=structured \
      --search-query="matchall" > "${TMP_RESULTS_FILE}"

   # Convert each hit into a delete operation in the batch format
   # that upload-documents expects
   jq '[.hits.hit[] | {type: "delete", id: .id}]' "${TMP_RESULTS_FILE}" > "${TMP_DELETE_FILE}"

   CNT_TOTAL=$(jq '.hits.found' "${TMP_RESULTS_FILE}")
   CNT_DOCS=$(jq 'length' "${TMP_DELETE_FILE}")

   if [[ $CNT_DOCS -gt 0 ]]; then
      echo "About to delete ${CNT_DOCS} documents of ${CNT_TOTAL} total in index"
      aws cloudsearchdomain upload-documents \
         --endpoint-url="${CS_DOMAIN}" \
         --content-type='application/json' \
         --documents="${TMP_DELETE_FILE}"
   else
      echo "No more docs to delete"
      exit 0
   fi
done
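For reference, the jq filter in the script turns the search hits into the batch-delete format that upload-documents consumes. A minimal sketch with a made-up two-hit response (the document ids here are hypothetical; the field layout matches what the script reads from the results file):

```shell
# Hypothetical trimmed CloudSearch search response (ids are invented)
RESPONSE='{"hits":{"found":2,"start":0,"hit":[{"id":"doc-1"},{"id":"doc-2"}]}}'

# Same transform the script uses: one {"type":"delete","id":...} object
# per hit, collected into a JSON array (-c prints it on one line)
echo "$RESPONSE" | jq -c '[.hits.hit[] | {type: "delete", id: .id}]'
# → [{"type":"delete","id":"doc-1"},{"type":"delete","id":"doc-2"}]
```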
I tried the script (only the search part), but it did not work because the results file was not in JSON format. jq says "Invalid numeric literal at line 1, column 5", and the file starts with:
HITS 8581 0
HIT b187f653b61b08e5ee5f54c662b280e4ad368f5c1d631e32ce3b2cbf31c81ae4ba4b39360fc859fa364da32788549a5543fd4efb734f12438e3a4b4238bc5212
BOOK Studio ASDoc
BOOST 0.029440219
So, how do I get CloudSearch to deliver a JSON response?
@mafritsch did you try aws help and set the --output json option?
@jthomerson Thank you very much. --output json was the missing option!
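For anyone else hitting this: besides passing --output json on each call, the AWS CLI's output format can be set once. Both of these are standard AWS CLI configuration mechanisms, not specific to this script:

```shell
aws configure set output json    # persist the default in ~/.aws/config
export AWS_DEFAULT_OUTPUT=json   # or override for the current shell only
```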
Thanks so much for making and publishing this script. Very helpful! In my case I added a filter query to target specific data.
I am deleting millions of records and am finding that it stops periodically, saying there are no more records to delete. I haven't researched the cause yet, but suspect it's an "eventual consistency" issue, thinking the same record ids are being returned that were just deleted. Was wondering if paginating the results with a cursor might help.
> I am deleting millions of records and am finding that it stops periodically, saying there are no more records to delete.

Results looked like:

"hits": {
    "found": 9803889,
    "start": 0,
    "hit": []
}

(no records listed in the hit list, even though millions were found)
I found that adding a random sort really helped (--sort="_rand asc"). There may be a better solution, but this was an easy change that helped in my case.
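With that change, the search call in the script's loop would look like the sketch below (same command as above, plus the extra flag; per this thread, _rand gives a randomized ordering so the batch isn't dominated by just-deleted ids):

```shell
aws cloudsearchdomain search \
   --endpoint-url="${CS_DOMAIN}" \
   --output json \
   --size=10000 \
   --query-parser=structured \
   --sort="_rand asc" \
   --search-query="matchall" > "${TMP_RESULTS_FILE}"
```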
worked like a charm. thank you so much 👍