```bash
#!/bin/bash
#
# Written by Chris Arceneaux
# GitHub: https://github.com/carceneaux
# Email: [email protected]
# Website: http://arsano.ninja
#
# Note: This code is a stop-gap to erase Job Artifacts for a project. I HIGHLY recommend you leverage
# "artifacts:expire_in" in your .gitlab-ci.yml
#
# https://docs.gitlab.com/ee/ci/yaml/#artifactsexpire_in
#
# Software Requirements: curl, jq
#
# This code has been released under the terms of the Apache-2.0 license
# http://opensource.org/licenses/Apache-2.0

# project_id, find it here: https://gitlab.com/[organization name]/[repository name] at the top underneath the repository name
project_id="207"
# token, create it here: https://gitlab.com/profile/personal_access_tokens
token="9hjGYpwmsMfBxT-Ghuu7"
server="gitlab.com"

# Retrieving Jobs list page count
total_pages=$(curl -sD - -o /dev/null -X GET \
  "https://$server/api/v4/projects/$project_id/jobs?per_page=100" \
  -H "PRIVATE-TOKEN: ${token}" | grep -Fi X-Total-Pages | sed 's/[^0-9]*//g')

# Creating list of Job IDs for the Project specified with Artifacts
job_ids=()
echo ""
echo "Creating list of all Jobs that currently have Artifacts..."
echo "Total Pages: ${total_pages}"
for ((i=2; i<=total_pages; i++)) # starting with page 2, skipping the most recent 100 Jobs
do
  echo "Processing Page: ${i}/${total_pages}"
  response=$(curl -s -X GET \
    "https://$server/api/v4/projects/$project_id/jobs?per_page=100&page=${i}" \
    -H "PRIVATE-TOKEN: ${token}")
  length=$(echo "$response" | jq '. | length')
  for ((j=0; j<length; j++))
  do
    if [[ $(echo "$response" | jq ".[${j}].artifacts_file | length") -gt 0 ]]; then
      echo "Job found: $(echo "$response" | jq ".[${j}].id")"
      job_ids+=($(echo "$response" | jq ".[${j}].id"))
    fi
  done
done

# Loop through each Job erasing the Artifact(s)
echo ""
echo "${#job_ids[@]} Jobs found. Commencing removal of Artifacts..."
for job_id in "${job_ids[@]}"
do
  response=$(curl -s -X DELETE \
    -H "PRIVATE-TOKEN: ${token}" \
    "https://$server/api/v4/projects/$project_id/jobs/$job_id/artifacts")
  echo "Processing Job ID: ${job_id} - Status: $(echo "$response" | jq '.status')"
done
```
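For context on the recommendation in the header: `artifacts:expire_in` is set per job in `.gitlab-ci.yml`, under the job's `artifacts:` key, with a duration value such as `1 week`; GitLab then expires the artifacts automatically, which avoids the need for cleanup scripts like this one.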
For some reason it did not remove artifacts from the first page of my pipelines list. It also missed the `.status` attribute in the console log. The rest works as advertised, thanks!
@Atarity Check the source code and you will find the hint `# starting with page 2, skipping the most recent 100 Jobs`, thus it is intended that the artifacts on the first page are not removed.
The response can contain JSON with line breaks (`\n`). Consider removing them like this:

```bash
response=${response//\\n/}
length=$(echo "$response" | jq '. | length')
```

Also, you can easily simulate pagination by checking `[ $length -ne 0 ]` and letting the page loop run to 1000 or more.
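The same "loop until an empty page" idea in Python might look like this (a minimal sketch; `get_jobs_page` is a hypothetical helper standing in for the `curl` call above):

```python
import requests

server = 'gitlab.com'
project_id = '...'
token = '...'

def get_jobs_page(page):
    # Hypothetical helper: fetch one page of jobs (100 per page).
    response = requests.get(
        f"https://{server}/api/v4/projects/{project_id}/jobs",
        params={'per_page': 100, 'page': page},
        headers={'private-token': token},
    )
    response.raise_for_status()
    return response.json()

page = 1
while True:
    jobs = get_jobs_page(page)
    if not jobs:  # an empty list means we ran past the last page
        break
    for job in jobs:
        ...  # process each job here
    page += 1
```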
I made the following Python script, which works for over 10k jobs, too:
```python
#!/usr/bin/env python3

import time

import requests

project_id = '...'
token = '...'
server = 'gitlab.com'

print("Creating list of all jobs that currently have artifacts...")

# We skip the first page.
url = f"https://{server}/api/v4/projects/{project_id}/jobs?per_page=100&page=2"
while url:
    print(f"Processing page: {url}")
    response = requests.get(
        url,
        headers={
            'private-token': token,
        },
    )
    if response.status_code in [500, 429]:
        print(f"Status {response.status_code}, retrying.")
        time.sleep(10)
        continue
    response.raise_for_status()

    response_json = response.json()
    for job in response_json:
        if job.get('artifacts_file', None):
            job_id = job['id']
            delete_response = requests.delete(
                f"https://{server}/api/v4/projects/{project_id}/jobs/{job_id}/artifacts",
                headers={
                    'private-token': token,
                },
            )
            print(f"Processing job ID: {job_id} - status: {delete_response.status_code}")

    url = response.links.get('next', {}).get('url', None)
```
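(`response.links` is the `Link` response header parsed by `requests`, so the loop simply follows GitLab's `next` page links until there are none left and never needs to know the total page count up front.)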
The `if job.get('artifacts_file', None):` needs to be changed to `if job.get('artifacts', None):` in the current version of the API; at least I don't see `artifacts_file` in any of the JSON responses.
I see it here: https://docs.gitlab.com/ee/api/jobs.html
I don't know why, but none of the jobs on our server had `artifacts_file`; they had `artifacts` instead, where the artifacts were listed, including their sizes etc.
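For reference, a job entry with artifacts looks roughly like this (a sketch based on the jobs API docs linked above; the values are made up):

```python
# Illustrative shape of a job entry returned by GET /projects/:id/jobs
job = {
    "id": 7,
    "ref": "main",
    "artifacts_file": {"filename": "artifacts.zip", "size": 1000},
    "artifacts": [
        {"file_type": "archive", "size": 1000,
         "filename": "artifacts.zip", "file_format": "zip"},
        {"file_type": "trace", "size": 1500,
         "filename": "job.log", "file_format": None},
    ],
    "artifacts_expire_at": "2024-01-01T00:00:00.000Z",
}
```

Note that the job trace (`job.log`) shows up as one of the `artifacts` entries, which becomes relevant further down in this thread.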
"artifacts_file"
worked for me, but it's trivial to support both, I also tweaked the output so you can see what job failed if any, and made it start at the first page:
```python
#!/usr/bin/env python3

import time

import requests

project_id = '...'
token = '...'
server = 'gitlab.com'
start_page = 1

print("Creating list of all jobs that currently have artifacts...")

# Set start_page to 2 to skip the most recent 100 jobs.
url = f"https://{server}/api/v4/projects/{project_id}/jobs?per_page=100&page={start_page}"
while url:
    print(f"Processing page: {url}")
    response = requests.get(
        url,
        headers={
            'private-token': token,
        },
    )
    if response.status_code in [500, 429]:
        print(f"Status {response.status_code}, retrying.")
        time.sleep(10)
        continue
    response.raise_for_status()

    response_json = response.json()
    for job in response_json:
        if job.get('artifacts_file', None) or job.get('artifacts', None):
            job_id = job['id']
            print(f"Processing job ID: {job_id}", end="")
            delete_response = requests.delete(
                f"https://{server}/api/v4/projects/{project_id}/jobs/{job_id}/artifacts",
                headers={
                    'private-token': token,
                },
            )
            print(f" - status: {delete_response.status_code}")

    url = response.links.get('next', {}).get('url', None)
```
While the script deletes jobs' artifacts, you can also delete the project's artifacts by adding this code:

```python
url = f"https://{server}/api/v4/projects/{project_id}/artifacts"
delete_response = requests.delete(
    url,
    headers={
        'private-token': token,
    },
)
print(f" - status: {delete_response.status_code}")
```
This does not work if your project has more than 10000 jobs, due to the removal of the `X-Total-Pages` header from the GitLab API responses.
Yes, I just found out that the `X-Total-Pages` header is now missing for performance reasons. Fortunately, when a page number is too high an empty JSON list (`[]`) is returned, so it is quite easy to use a loop such as this (here in bash):

```bash
PER_PAGE=100
PAGE=1
while JOBS=$(curl -s --header "PRIVATE-TOKEN: $TOKEN" "$GITLAB_INSTANCE/$PROJECT_ID/jobs?per_page=$PER_PAGE&page=$PAGE&sort=asc") && [ "$JOBS" != "[]" ]
do
    for JOB in $(echo $JOBS | jq .[].id)
    do
        [...]
    done
    PAGE=$((PAGE+1))
done
```
Here's my slightly improved version for the 'do it in Python' section (it ignores `job.log` files, which seem to be non-deletable, and uses command line arguments to load the settings):
```python
#!/usr/bin/env python3

import sys
import time

import requests

server = sys.argv[1]
project_id = sys.argv[2]
token = sys.argv[3]
start_page = sys.argv[4]

print("Creating list of all jobs that currently have artifacts...")

# Start at the page given on the command line.
url = f"https://{server}/api/v4/projects/{project_id}/jobs?per_page=100&page={start_page}"
while url:
    print(f"Processing page: {url}")
    response = requests.get(
        url,
        headers={
            'private-token': token,
        },
    )
    if response.status_code in [500, 429]:
        print(f"Status {response.status_code}, retrying.")
        time.sleep(10)
        continue
    response.raise_for_status()

    response_json = response.json()
    for job in response_json:
        artifacts = job.get('artifacts_file', None)
        if not artifacts:
            artifacts = job.get('artifacts', None)
        has_artifacts = False
        for artifact in artifacts:
            if artifact['filename'] != 'job.log':
                has_artifacts = True
                break
        if has_artifacts:
            job_id = job['id']
            print(f"Processing job ID: {job_id}", end="")
            delete_response = requests.delete(
                f"https://{server}/api/v4/projects/{project_id}/jobs/{job_id}/artifacts",
                headers={
                    'private-token': token,
                },
            )
            print(f" - status: {delete_response.status_code}")

    url = response.links.get('next', {}).get('url', None)
```
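Invoked, for example, as `python3 remove_artifacts.py gitlab.com 207 $TOKEN 1` (illustrative values, in the order server, project id, token, start page).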
I get this error:

```
remove_artifacts.py", line 38, in <module>
    if artifact['filename'] != 'job.log':
       ~~~~~~~~^^^^^^^^^^^^
TypeError: string indices must be integers, not 'str'
```
@Tim-Schwalbe: Apologies, yes, I overlooked this case. I have amended the script to ignore `artifacts_file`, as this file seems to be contained in `artifacts` anyway.
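In essence, the amended check becomes something like this (a sketch of the idea, not the exact diff):

```python
# Consider only the "artifacts" list (which also contains the archive
# from "artifacts_file") and skip the non-deletable job log:
artifacts = job.get('artifacts') or []
has_artifacts = any(
    artifact['filename'] != 'job.log' for artifact in artifacts
)
```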
I have improved my version a bit: it now automatically selects expired artifacts for deletion that (in my opinion) should be deleted in the first place, because they belong to jobs that were run on:

- merge requests that have been merged or closed;
- branches that have been merged.

It will also take a list of project ids as the last arguments, making it easy to use in a cron job: `Usage: {sys.argv[0]} <server> <token> <project id>...`
```python
#!/usr/bin/env python3

import re
import sys
import time
from datetime import datetime, timezone

import requests
from dateutil import parser

if len(sys.argv) < 4:
    print(f'Usage: {sys.argv[0]} <server> <token> <project id>...')
    exit(1)

server = sys.argv[1]
token = sys.argv[2]
project_ids = []
for i in range(3, len(sys.argv)):
    project_ids.append(sys.argv[i])

now = datetime.now(timezone.utc)

overall_space_savings = 0

for project_id in project_ids:
    print(f'Processing project {project_id}:')

    # Collect the state of all merge requests, keyed by iid.
    merge_request_url = f"https://{server}/api/v4/projects/{project_id}/merge_requests?scope=all&per_page=100&page=1"
    merge_requests = {}
    while merge_request_url:
        response = requests.get(
            merge_request_url,
            headers={
                'private-token': token,
            },
        )
        if response.status_code in [500, 429]:
            print(f"Status {response.status_code}, retrying.")
            time.sleep(10)
            continue
        response.raise_for_status()

        response_json = response.json()
        for merge_request in response_json:
            iid = merge_request.get('iid', None)
            if iid:
                merge_requests[int(iid)] = merge_request['state']

        merge_request_url = response.links.get('next', {}).get('url', None)

    # Collect the names of all branches that have not been merged.
    branch_url = f"https://{server}/api/v4/projects/{project_id}/repository/branches?per_page=100&page=1"
    unmerged_branches = []
    while branch_url:
        response = requests.get(
            branch_url,
            headers={
                'private-token': token,
            },
        )
        if response.status_code in [500, 429]:
            print(f"Status {response.status_code}, retrying.")
            time.sleep(10)
            continue
        response.raise_for_status()

        response_json = response.json()
        for branch in response_json:
            is_merged = branch['merged']
            if not is_merged:
                unmerged_branches.append(branch['name'])

        branch_url = response.links.get('next', {}).get('url', None)

    url = f"https://{server}/api/v4/projects/{project_id}/jobs?per_page=100&page=1"
    job_count = 0
    artifact_count = 0
    artifact_size = 0
    deleted_artifact_count = 0
    deleted_artifact_size = 0
    while url:
        response = requests.get(
            url,
            headers={
                'private-token': token,
            },
        )
        if response.status_code in [500, 429]:
            print(f"Status {response.status_code}, retrying.")
            time.sleep(10)
            continue
        response.raise_for_status()

        response_json = response.json()
        for job in response_json:
            job_count += 1
            artifacts = job.get('artifacts', None)
            artifacts_expire_at_string = job.get('artifacts_expire_at', None)
            artifacts_expire_at = None
            if artifacts_expire_at_string:
                artifacts_expire_at = parser.parse(artifacts_expire_at_string)
            has_expired_artifacts = False
            deleted_job_artifact_count = 0
            deleted_job_artifact_size = 0
            if artifacts:
                for artifact in artifacts:
                    if artifact['filename'] != 'job.log':
                        size = artifact['size']
                        artifact_count += 1
                        artifact_size += size
                        if not artifacts_expire_at or artifacts_expire_at < now:
                            has_expired_artifacts = True
                            deleted_job_artifact_count += 1
                            deleted_job_artifact_size += size
            delete_artifacts = False
            if has_expired_artifacts:
                ref = job['ref']
                merge_request_iid_match = re.search(r'refs\/merge-requests\/(\d+)\/head', ref)
                if merge_request_iid_match:
                    merge_request_iid = merge_request_iid_match.group(1)
                    if merge_request_iid:
                        merge_request_status = merge_requests.get(int(merge_request_iid))
                        if merge_request_status in ['merged', 'closed', None]:
                            delete_artifacts = True
                            deleted_artifact_count += deleted_job_artifact_count
                            deleted_artifact_size += deleted_job_artifact_size
                elif ref not in unmerged_branches:
                    delete_artifacts = True
                    deleted_artifact_count += deleted_job_artifact_count
                    deleted_artifact_size += deleted_job_artifact_size
            if delete_artifacts:
                job_id = job['id']
                print(f"Processing job ID: {job_id}", end="")
                delete_response = requests.delete(
                    f"https://{server}/api/v4/projects/{project_id}/jobs/{job_id}/artifacts",
                    headers={
                        'private-token': token,
                    },
                )
                print(f" - status: {delete_response.status_code}\033[K", end="\r")

        print(f'Processed page {url}.\033[K', end="\r")
        url = response.links.get('next', {}).get('url', None)

    overall_space_savings += deleted_artifact_size

    print()
    print(f'Jobs analysed: {job_count}')
    print(f'Pre artifact count: {artifact_count}')
    print(f'Pre artifact size [MB]: {artifact_size / (1024 * 1024)}')
    print(f'Post artifact count: {artifact_count - deleted_artifact_count}')
    print(f'Post artifact size [MB]: {(artifact_size - deleted_artifact_size) / (1024 * 1024)}')

print()
print(f'Overall savings [MB]: {overall_space_savings / (1024 * 1024)}')
```
@mikeller I suggest you write your script code in a gist, or even fork this one here and replace it with your Python code =) Each gist indicates which forks have activity, making it easy to find interesting changes from others.
@voiski: Good point, done: https://gist.github.com/mikeller/ee7a668a83e4b9bc61646bddb4a2ade6
New version that takes a GitLab group id as a parameter and then cleans up all repositories in the group: https://gist.github.com/mikeller/7034d99bc27c361fc6a2df84e19c36ff