#!/bin/bash

# This script will delete *all* documents in a CloudSearch domain.
# USE WITH EXTREME CAUTION
# Note: depends on the AWS CLI SDK being installed, as well as jq
# For jq, see: https://stedolan.github.io/jq/ and https://jqplay.org/

if [[ ! $# -eq 2 || $1 != "--doc-domain" || ! $2 =~ ^https://.*$ ]]; then
   echo "Must define --doc-domain argument (e.g. --doc-domain https://somedomain.aws.com)";
   exit 1;
fi

CS_DOMAIN=$2
TMP_DELETE_FILE=/tmp/delete-all-cloudsearch-documents.json
TMP_RESULTS_FILE=/tmp/delete-all-cloudsearch-documents-tmp-results.json

while [ 1 -eq 1 ]; do
   aws cloudsearchdomain search \
      --endpoint-url=$CS_DOMAIN \
      --size=10000 \
      --query-parser=structured \
      --search-query="matchall" > ${TMP_RESULTS_FILE}

   cat ${TMP_RESULTS_FILE} | jq '[.hits.hit[] | {type: "delete", id: .id}]' > ${TMP_DELETE_FILE}

   CNT_TOTAL=$(cat ${TMP_RESULTS_FILE} | jq '.hits.found')
   CNT_DOCS=$(cat ${TMP_DELETE_FILE} | jq '. | length')

   if [[ $CNT_DOCS -gt 0 ]]; then
      echo "About to delete ${CNT_DOCS} documents of ${CNT_TOTAL} total in index"

      aws cloudsearchdomain upload-documents \
         --endpoint-url=$CS_DOMAIN \
         --content-type='application/json' \
         --documents=${TMP_DELETE_FILE}
   else
      echo "No more docs to delete"
      exit 0
   fi
done
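To make the jq step in the script above concrete, here is a small sketch that runs the same filter over a hand-written sample response. The function name `build_delete_batch` and the document ids (`doc-1`, `doc-2`) are made up for illustration; real search responses carry more fields per hit.

```shell
#!/bin/bash
# Build a CloudSearch delete batch from a search response read on stdin.
# Uses the same jq filter as the script above; -c prints compact JSON.
build_delete_batch() {
  jq -c '[.hits.hit[] | {type: "delete", id: .id}]'
}

# Hand-written sample response; the ids are illustrative only.
SAMPLE='{"hits":{"found":2,"start":0,"hit":[{"id":"doc-1"},{"id":"doc-2"}]}}'

# Only run the demo when jq is available.
if command -v jq >/dev/null 2>&1; then
  echo "$SAMPLE" | build_delete_batch
fi
```

With jq installed this prints `[{"type":"delete","id":"doc-1"},{"type":"delete","id":"doc-2"}]`, which is the shape of the batch the script passes to `upload-documents`.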
@jthomerson Thank you very much. --output json was the missing option!
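For readers hitting the same problem: the script pipes the CLI response into jq, so the response has to be JSON. If the configured default output format is something else (text or table), jq cannot parse it. A minimal sketch of two ways to force JSON follows; the wrapper name `aws_json` is hypothetical, while `--output` and `AWS_DEFAULT_OUTPUT` are standard AWS CLI settings.

```shell
# Hypothetical wrapper: force JSON output for every AWS CLI call made
# through it, regardless of the default format in ~/.aws/config.
aws_json() {
  aws --output json "$@"
}

# Alternative: set the output format once for the whole shell session.
export AWS_DEFAULT_OUTPUT=json
```

Either replace the `aws` calls in the script with `aws_json`, or export the variable before running the script unchanged.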
Thanks so much for making and publishing this script. Very helpful! In my case I added a filter query to target specific data.
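The comment does not say which filter was used, so the following is only a sketch of how a filter query could be added to the script's search call. The function name `search_filtered`, the field `category`, and the value `archived` are all hypothetical; `--filter-query` is the AWS CLI option for the search filter.

```shell
# Sketch: same search as in the script, restricted by a filter query.
# $1 = search endpoint URL, $2 = filter expression in structured syntax.
search_filtered() {
  local endpoint=$1 filter=$2
  aws cloudsearchdomain search \
    --endpoint-url="$endpoint" \
    --size=10000 \
    --query-parser=structured \
    --search-query="matchall" \
    --filter-query="$filter" \
    --output json
}
```

A call such as `search_filtered "$CS_DOMAIN" "category:'archived'" > "$TMP_RESULTS_FILE"` would then stand in for the unfiltered search, so only matching documents end up in the delete batch.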
I am deleting millions of records and am finding that it stops periodically, saying there are no more records to delete. I haven't researched the cause yet, but I suspect an "eventual consistency" issue in which the search returns the ids of records that were just deleted. I wonder whether paginating the results with a cursor might help.
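Whether a cursor actually avoids the problem is not confirmed anywhere in this thread, but the idea can be sketched as follows. The function name `list_all_ids` is hypothetical; the sketch assumes the documented cursor behaviour of the search API (start with `--cursor=initial`, then pass the `hits.cursor` value from each response into the next request) and requires jq.

```shell
# Sketch: walk the whole result set with a cursor and print one document
# id per line. The page size of 10000 mirrors the original script.
list_all_ids() {
  local endpoint=$1 cursor=initial page count
  while :; do
    page=$(aws cloudsearchdomain search \
      --endpoint-url="$endpoint" \
      --size=10000 \
      --query-parser=structured \
      --search-query="matchall" \
      --cursor="$cursor" \
      --output json) || return 1
    count=$(jq '.hits.hit | length' <<<"$page")
    # An empty page means the cursor has reached the end of the result set.
    [[ $count -gt 0 ]] || break
    jq -r '.hits.hit[].id' <<<"$page"
    # Carry the cursor from this response into the next request.
    cursor=$(jq -r '.hits.cursor' <<<"$page")
  done
}
```

One possible use is to collect all ids first (`list_all_ids "$CS_DOMAIN" > ids.txt`) and build the delete batches from that file afterwards, so deletions do not race with the searches that feed them.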
> I am deleting millions of records and am finding that it stops periodically, saying there are no more records to delete.
Results looked like:
"hits": {
    "found": 9803889,
    "start": 0,
    "hit": [ ]
(no records listed in the hit list)
I found that adding a random sort really helped (--sort="_rand asc"). There may be a better solution, but this was an easy change that helped in my case.
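A sketch of that change, plus one possible guard against the premature exit described above. The `--sort="_rand asc"` value is taken from the comment as-is; the function names `search_random_order` and `next_action` are hypothetical, and the retry idea is an assumption about how the loop could react when `found` is non-zero but the hit list is empty. Requires jq.

```shell
# Sketch: the script's search call with the random sort added.
search_random_order() {
  local endpoint=$1
  aws cloudsearchdomain search \
    --endpoint-url="$endpoint" \
    --size=10000 \
    --query-parser=structured \
    --search-query="matchall" \
    --sort="_rand asc" \
    --output json
}

# Hypothetical helper: reads a search response on stdin and prints
#   "done"   when hits.found is 0 (the index is empty),
#   "retry"  when hits.found is non-zero but the hit list is empty,
#   "delete" when there are hits to delete.
next_action() {
  jq -r 'if .hits.found == 0 then "done"
         elif (.hits.hit | length) == 0 then "retry"
         else "delete" end'
}
```

In the main loop, `retry` could translate to a short `sleep` followed by another search instead of `exit 0`, so the script only stops once the index reports zero documents.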
@mafritsch did you try aws help and set the --output json option?