#!/bin/bash
#
# Written by Chris Arceneaux
# GitHub: https://github.com/carceneaux
# Email: [email protected]
# Website: http://arsano.ninja
#
# Note: This code is a stop-gap to erase Job Artifacts for a project. I HIGHLY recommend you leverage
# "artifacts:expire_in" in your .gitlab-ci.yml
#
# https://docs.gitlab.com/ee/ci/yaml/#artifactsexpire_in
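# For example, a minimal sketch of that setting in .gitlab-ci.yml (the job
# name, paths, and duration here are illustrative, not taken from any
# particular project):
#
#   build:
#     script: make build
#     artifacts:
#       paths:
#         - dist/
#       expire_in: 1 week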
#
# Software Requirements: curl, jq
#
# This code has been released under the terms of the Apache-2.0 license
# http://opensource.org/licenses/Apache-2.0

# project_id, find it here: https://gitlab.com/[organization name]/[repository name] at the top underneath the repository name
project_id="207"
# token, find it here: https://gitlab.com/profile/personal_access_tokens
token="9hjGYpwmsMfBxT-Ghuu7"
server="gitlab.com"

# Retrieving Jobs list page count
total_pages=$(curl -sD - -o /dev/null -X GET \
  "https://$server/api/v4/projects/$project_id/jobs?per_page=100" \
  -H "PRIVATE-TOKEN: ${token}" | grep -Fi X-Total-Pages | sed 's/[^0-9]*//g')

# Creating a list of the Job IDs with Artifacts for the specified Project
job_ids=()
echo ""
echo "Creating list of all Jobs that currently have Artifacts..."
echo "Total Pages: ${total_pages}"
for ((i=2;i<=${total_pages};i++)) # starting with page 2, skipping the most recent 100 Jobs
do
    echo "Processing Page: ${i}/${total_pages}"
    response=$(curl -s -X GET \
      "https://$server/api/v4/projects/$project_id/jobs?per_page=100&page=${i}" \
      -H "PRIVATE-TOKEN: ${token}")
    length=$(echo "$response" | jq '. | length')
    for ((j=0;j<${length};j++))
    do
        # -gt performs the intended numeric comparison; > would compare strings
        if [[ $(echo "$response" | jq ".[${j}].artifacts_file | length") -gt 0 ]]; then
            echo "Job found: $(echo "$response" | jq ".[${j}].id")"
            job_ids+=($(echo "$response" | jq ".[${j}].id"))
        fi
    done
done

# Loop through each Job erasing the Artifact(s)
echo ""
echo "${#job_ids[@]} Jobs found. Commencing removal of Artifacts..."
for job_id in "${job_ids[@]}"
do
    response=$(curl -s -X DELETE \
      -H "PRIVATE-TOKEN: ${token}" \
      "https://$server/api/v4/projects/$project_id/jobs/$job_id/artifacts")
    echo "Processing Job ID: ${job_id} - Status: $(echo "$response" | jq '.status')"
done
The response can contain JSON with literal line breaks (\n). Consider stripping them before parsing:

response=${response//\\n/}
length=$(echo "$response" | jq '. | length')

Also, you can easily simulate finding the last page by checking [ $length -ne 0 ] and letting the page loop run to 1000 or more.
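A minimal sketch of that idea, reusing the server, project_id, and token variables from the script above (the 1000-page cap is an arbitrary upper bound, not anything the API requires):

for ((i=1;i<=1000;i++))
do
    response=$(curl -s -H "PRIVATE-TOKEN: ${token}" \
      "https://$server/api/v4/projects/$project_id/jobs?per_page=100&page=${i}")
    length=$(echo "$response" | jq '. | length')
    # an empty page means we have walked past the last job
    [ "$length" -ne 0 ] || break
    # ... process the jobs on this page ...
done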
I made the following Python script, which works for over 10k jobs, too:
#!/usr/bin/env python3

import time

import requests

project_id = '...'
token = '...'
server = 'gitlab.com'

print("Creating list of all jobs that currently have artifacts...")

# We skip the first page.
url = f"https://{server}/api/v4/projects/{project_id}/jobs?per_page=100&page=2"
while url:
    print(f"Processing page: {url}")
    response = requests.get(
        url,
        headers={
            'private-token': token,
        },
    )
    if response.status_code in [500, 429]:
        print(f"Status {response.status_code}, retrying.")
        time.sleep(10)
        continue
    response.raise_for_status()
    response_json = response.json()
    for job in response_json:
        if job.get('artifacts_file', None):
            job_id = job['id']
            delete_response = requests.delete(
                f"https://{server}/api/v4/projects/{project_id}/jobs/{job_id}/artifacts",
                headers={
                    'private-token': token,
                },
            )
            print(f"Processing job ID: {job_id} - status: {delete_response.status_code}")
    url = response.links.get('next', {}).get('url', None)
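In case response.links is unfamiliar: requests parses the HTTP Link header, which the GitLab API sends for pagination, into that dictionary. A minimal sketch (the token and project id are placeholders):

import requests

response = requests.get(
    "https://gitlab.com/api/v4/projects/207/jobs?per_page=100",
    headers={'private-token': '...'},
)
# response.links is e.g. {'next': {'url': '...', 'rel': 'next'}, ...} when
# the server sends a Link header, and an empty dict otherwise.
print(response.links.get('next', {}).get('url', None))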
The if job.get('artifacts_file', None): needs to be changed to if job.get('artifacts', None): in the current version of the API; at least, I don't see artifacts_file in any of the JSON responses.

I see it here: https://docs.gitlab.com/ee/api/jobs.html

I don't know why, but none of the jobs on our server had artifacts_file; they had artifacts instead, where the artifacts were listed including their sizes etc.

artifacts_file worked for me, but it's trivial to support both. I also tweaked the output so you can see which job failed, if any, and made it start at the first page:
#!/usr/bin/env python3

import time

import requests

project_id = '...'
token = '...'
server = 'gitlab.com'
start_page = 1

print("Creating list of all jobs that currently have artifacts...")

# Start at start_page (1 = the first page).
url = f"https://{server}/api/v4/projects/{project_id}/jobs?per_page=100&page={start_page}"
while url:
    print(f"Processing page: {url}")
    response = requests.get(
        url,
        headers={
            'private-token': token,
        },
    )
    if response.status_code in [500, 429]:
        print(f"Status {response.status_code}, retrying.")
        time.sleep(10)
        continue
    response.raise_for_status()
    response_json = response.json()
    for job in response_json:
        if job.get('artifacts_file', None) or job.get('artifacts', None):
            job_id = job['id']
            print(f"Processing job ID: {job_id}", end="")
            delete_response = requests.delete(
                f"https://{server}/api/v4/projects/{project_id}/jobs/{job_id}/artifacts",
                headers={
                    'private-token': token,
                },
            )
            print(f" - status: {delete_response.status_code}")
    url = response.links.get('next', {}).get('url', None)
While the script deletes jobs' artifacts, you can also delete a project's artifacts by adding this code:

url = f"https://{server}/api/v4/projects/{project_id}/artifacts"
delete_response = requests.delete(
    url,
    headers={
        'private-token': token,
    },
)
print(f" - status: {delete_response.status_code}")
This does not work if your project has more than 10000 jobs, due to the removal of the X-Total-Pages header from the GitLab API responses.

Yes, I just found out that the X-Total-Pages header is now missing for performance reasons. Fortunately, when a page number is too high, an empty JSON list ([]) is returned, so it is quite easy to use a loop such as this (here in bash):
PER_PAGE=100
PAGE=1
while JOBS=$(curl -s --header "PRIVATE-TOKEN: $TOKEN" "$GITLAB_INSTANCE/$PROJECT_ID/jobs?per_page=$PER_PAGE&page=$PAGE&sort=asc") && [ "$JOBS" != "[]" ]
do
    for JOB in $(echo "$JOBS" | jq .[].id)
    do
        [...]
    done
    PAGE=$((PAGE+1))
done
Here's my slightly improved version for the 'do it in python' section (it ignores job.log files, which seem to be non-deletable, and uses command line arguments to load the settings):
#!/usr/bin/env python3

import sys
import time

import requests

server = sys.argv[1]
project_id = sys.argv[2]
token = sys.argv[3]
start_page = sys.argv[4]

print("Creating list of all jobs that currently have artifacts...")

url = f"https://{server}/api/v4/projects/{project_id}/jobs?per_page=100&page={start_page}"
while url:
    print(f"Processing page: {url}")
    response = requests.get(
        url,
        headers={
            'private-token': token,
        },
    )
    if response.status_code in [500, 429]:
        print(f"Status {response.status_code}, retrying.")
        time.sleep(10)
        continue
    response.raise_for_status()
    response_json = response.json()
    for job in response_json:
        artifacts = job.get('artifacts_file', None)
        if not artifacts:
            artifacts = job.get('artifacts', None)
        has_artifacts = False
        for artifact in artifacts:
            if artifact['filename'] != 'job.log':
                has_artifacts = True
                break
        if has_artifacts:
            job_id = job['id']
            print(f"Processing job ID: {job_id}", end="")
            delete_response = requests.delete(
                f"https://{server}/api/v4/projects/{project_id}/jobs/{job_id}/artifacts",
                headers={
                    'private-token': token,
                },
            )
            print(f" - status: {delete_response.status_code}")
    url = response.links.get('next', {}).get('url', None)
I get this error:
remove_artifacts.py", line 38, in <module>
if artifact['filename'] != 'job.log':
~~~~~~~~^^^^^^^^^^^^
TypeError: string indices must be integers, not 'str'
@Tim-Schwalbe: Apologies, yes, I overlooked this case. I have amended the script to ignore artifacts_file, as this file seems to be contained in artifacts anyway.
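For reference, a minimal sketch of the amended check as a standalone helper (the job dict shape is the one returned by the jobs API, as discussed above):

def has_deletable_artifacts(job):
    # Only consult 'artifacts': 'artifacts_file' is a single dict rather
    # than a list, and the file it describes appears in 'artifacts' too.
    artifacts = job.get('artifacts') or []
    return any(artifact['filename'] != 'job.log' for artifact in artifacts)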
I have improved my version a bit: it now automatically selects expired artifacts for deletion that (in my opinion) should be deleted in the first place, because they belong to jobs that were run on:
- merge requests that have been merged or closed;
- branches that have been merged.

It will also take a list of project ids as the last arguments, making it easy to use in a cron job: Usage: {sys.argv[0]} <server> <token> <project id>...
#!/usr/bin/env python3

import re
import sys
import time
from datetime import datetime, timezone

import requests
from dateutil import parser

if len(sys.argv) < 4:
    print(f'Usage: {sys.argv[0]} <server> <token> <project id>...')
    exit(1)

server = sys.argv[1]
token = sys.argv[2]
project_ids = []
for i in range(3, len(sys.argv)):
    project_ids.append(sys.argv[i])

now = datetime.now(timezone.utc)

overall_space_savings = 0
for project_id in project_ids:
    print(f'Processing project {project_id}:')

    merge_request_url = f"https://{server}/api/v4/projects/{project_id}/merge_requests?scope=all&per_page=100&page=1"
    merge_requests = {}
    while merge_request_url:
        response = requests.get(
            merge_request_url,
            headers={
                'private-token': token,
            },
        )
        if response.status_code in [500, 429]:
            print(f"Status {response.status_code}, retrying.")
            time.sleep(10)
            continue
        response.raise_for_status()
        response_json = response.json()
        for merge_request in response_json:
            iid = merge_request.get('iid', None)
            if iid:
                merge_requests[int(iid)] = merge_request['state']
        merge_request_url = response.links.get('next', {}).get('url', None)

    branch_url = f"https://{server}/api/v4/projects/{project_id}/repository/branches?per_page=100&page=1"
    unmerged_branches = []
    while branch_url:
        response = requests.get(
            branch_url,
            headers={
                'private-token': token,
            },
        )
        if response.status_code in [500, 429]:
            print(f"Status {response.status_code}, retrying.")
            time.sleep(10)
            continue
        response.raise_for_status()
        response_json = response.json()
        for branch in response_json:
            is_merged = branch['merged']
            if not is_merged:
                unmerged_branches.append(branch['name'])
        branch_url = response.links.get('next', {}).get('url', None)

    url = f"https://{server}/api/v4/projects/{project_id}/jobs?per_page=100&page=1"
    job_count = 0
    artifact_count = 0
    artifact_size = 0
    deleted_artifact_count = 0
    deleted_artifact_size = 0
    while url:
        response = requests.get(
            url,
            headers={
                'private-token': token,
            },
        )
        if response.status_code in [500, 429]:
            print(f"Status {response.status_code}, retrying.")
            time.sleep(10)
            continue
        response.raise_for_status()
        response_json = response.json()
        for job in response_json:
            job_count += 1
            artifacts = job.get('artifacts', None)
            artifacts_expire_at_string = job.get('artifacts_expire_at', None)
            artifacts_expire_at = None
            if artifacts_expire_at_string:
                artifacts_expire_at = parser.parse(artifacts_expire_at_string)
            has_expired_artifacts = False
            deleted_job_artifact_count = 0
            deleted_job_artifact_size = 0
            if artifacts:
                for artifact in artifacts:
                    if artifact['filename'] != 'job.log':
                        size = artifact['size']
                        artifact_count += 1
                        artifact_size += size
                        if not artifacts_expire_at or artifacts_expire_at < now:
                            has_expired_artifacts = True
                            deleted_job_artifact_count += 1
                            deleted_job_artifact_size += size
            delete_artifacts = False
            if has_expired_artifacts:
                ref = job['ref']
                merge_request_iid_match = re.search(r'refs\/merge-requests\/(\d+)\/head', ref)
                if merge_request_iid_match:
                    merge_request_iid = merge_request_iid_match.group(1)
                    if merge_request_iid:
                        merge_request_status = merge_requests.get(int(merge_request_iid))
                        if merge_request_status in ['merged', 'closed', None]:
                            delete_artifacts = True
                            deleted_artifact_count += deleted_job_artifact_count
                            deleted_artifact_size += deleted_job_artifact_size
                elif ref not in unmerged_branches:
                    delete_artifacts = True
                    deleted_artifact_count += deleted_job_artifact_count
                    deleted_artifact_size += deleted_job_artifact_size
            if delete_artifacts:
                job_id = job['id']
                print(f"Processing job ID: {job_id}", end="")
                delete_response = requests.delete(
                    f"https://{server}/api/v4/projects/{project_id}/jobs/{job_id}/artifacts",
                    headers={
                        'private-token': token,
                    },
                )
                print(f" - status: {delete_response.status_code}\033[K", end="\r")
        print(f'Processed page {url}.\033[K', end="\r")
        url = response.links.get('next', {}).get('url', None)

    overall_space_savings += deleted_artifact_size

    print()
    print(f'Jobs analysed: {job_count}')
    print(f'Pre artifact count: {artifact_count}')
    print(f'Pre artifact size [MB]: {artifact_size / (1024 * 1024)}')
    print(f'Post artifact count: {artifact_count - deleted_artifact_count}')
    print(f'Post artifact size [MB]: {(artifact_size - deleted_artifact_size) / (1024 * 1024)}')

print()
print(f'Overall savings [MB]: {overall_space_savings / (1024 * 1024)}')
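For anyone trying it out, a hypothetical invocation (the script name is a placeholder, and 207 and 208 stand in for real project ids); note that it needs the third-party packages requests and python-dateutil:

pip install requests python-dateutil
python3 clean_artifacts.py gitlab.com <token> 207 208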
@mikeller I suggest you write your script code in a gist, or even fork this one here and replace it with your Python code =)
Each gist indicates which forks have activity, making it easy to find interesting changes from others.

@voiski: Good point, done: https://gist.github.com/mikeller/ee7a668a83e4b9bc61646bddb4a2ade6
New version that takes a GitLab group id as a parameter and then cleans up all repositories in the group: https://gist.github.com/mikeller/7034d99bc27c361fc6a2df84e19c36ff
@Atarity Check the source code and you will find the hint

# starting with page 2, skipping the most recent 100 Jobs

so it is intended that the artifacts on the first page are not removed.