@carceneaux
Last active July 17, 2024 09:39
Script for removing GitLab Job Artifacts.
#!/bin/bash
#
# Written by Chris Arceneaux
# GitHub: https://github.com/carceneaux
# Email: [email protected]
# Website: http://arsano.ninja
#
# Note: This code is a stop-gap to erase Job Artifacts for a project. I HIGHLY recommend you leverage
# "artifacts:expire_in" in your .gitlab-ci.yml
#
# https://docs.gitlab.com/ee/ci/yaml/#artifactsexpire_in
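# For example, a minimal .gitlab-ci.yml sketch (job name and paths are illustrative):
#
#   build:
#     script: make build
#     artifacts:
#       paths:
#         - dist/
#       expire_in: 1 week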
#
# Software Requirements: curl, jq
#
# This code has been released under the terms of the Apache-2.0 license
# http://opensource.org/licenses/Apache-2.0
# project_id, find it here: https://gitlab.com/[organization name]/[repository name] at the top underneath repository name
project_id="207"
# token, find it here: https://gitlab.com/profile/personal_access_tokens
token="9hjGYpwmsMfBxT-Ghuu7"
server="gitlab.com"
# Retrieving Jobs list page count
total_pages=$(curl -sD - -o /dev/null -X GET \
"https://$server/api/v4/projects/$project_id/jobs?per_page=100" \
-H "PRIVATE-TOKEN: ${token}" | grep -Fi X-Total-Pages | sed 's/[^0-9]*//g')
# Creating list of Job IDs for the Project specified with Artifacts
job_ids=()
echo ""
echo "Creating list of all Jobs that currently have Artifacts..."
echo "Total Pages: ${total_pages}"
for ((i=2;i<=total_pages;i++)) # starting with page 2, skipping the most recent 100 Jobs
do
  echo "Processing Page: ${i}/${total_pages}"
  response=$(curl -s -X GET \
    "https://$server/api/v4/projects/$project_id/jobs?per_page=100&page=${i}" \
    -H "PRIVATE-TOKEN: ${token}")
  length=$(echo "$response" | jq '. | length')
  for ((j=0;j<length;j++))
  do
    # A numeric comparison (-gt) is needed here; ">" inside [[ ]] compares strings.
    if [[ $(echo "$response" | jq ".[${j}].artifacts_file | length") -gt 0 ]]; then
      echo "Job found: $(echo "$response" | jq ".[${j}].id")"
      job_ids+=($(echo "$response" | jq ".[${j}].id"))
    fi
  done
done
# Loop through each Job erasing the Artifact(s)
echo ""
echo "${#job_ids[@]} Jobs found. Commencing removal of Artifacts..."
for job_id in "${job_ids[@]}"
do
  response=$(curl -s -X DELETE \
    -H "PRIVATE-TOKEN: ${token}" \
    "https://$server/api/v4/projects/$project_id/jobs/$job_id/artifacts")
  echo "Processing Job ID: ${job_id} - Status: $(echo "$response" | jq '.status')"
done
@kbaran1998

While the script deletes jobs' artifacts, you can also delete the project's artifacts in bulk by adding this code:

import requests  # server, project_id, and token are assumed to be defined as in the script above

url = f"https://{server}/api/v4/projects/{project_id}/artifacts"
delete_response = requests.delete(
    url,
    headers={
        'private-token': token,
    }
)
print(f" - status: {delete_response.status_code}")

@Muffinman

This does not work if your project has more than 10,000 jobs, due to the removal of the X-Total-Pages header from the GitLab API responses.
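You can see the missing header for yourself with the same header dump the main script uses (a sketch reusing its $server, $project_id, and $token variables; the grep prints nothing once GitLab omits the header):

curl -sD - -o /dev/null \
  -H "PRIVATE-TOKEN: ${token}" \
  "https://$server/api/v4/projects/$project_id/jobs?per_page=100" \
  | grep -i x-total-pages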

@cmuller

cmuller commented Jul 7, 2023

Yes, I just found out that the X-Total-Pages header is now missing for performance reasons. Fortunately, when a page number is too high, an empty JSON list ([]) is returned, so it is quite easy to use a loop such as this one (here in bash):

PER_PAGE=100
PAGE=1
while JOBS=$(curl -s --header "PRIVATE-TOKEN: $TOKEN" "$GITLAB_INSTANCE/$PROJECT_ID/jobs?per_page=$PER_PAGE&page=$PAGE&sort=asc") && [ "$JOBS" != "[]" ]
do
   for JOB in $(echo "$JOBS" | jq '.[].id')
   do
      [...]
   done
   PAGE=$((PAGE+1))
done
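The elided loop body would presumably issue the same per-job DELETE call as the main script; a sketch using the variables already defined in this loop ($GITLAB_INSTANCE is assumed to already include the /api/v4/projects prefix, as in the URL above):

      curl -s -X DELETE \
         -H "PRIVATE-TOKEN: $TOKEN" \
         "$GITLAB_INSTANCE/$PROJECT_ID/jobs/$JOB/artifacts"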

@mikeller

mikeller commented Dec 7, 2023

Here's my slightly improved version for the 'do it in python' section (it ignores job.log files, which seem to be non-deletable, and uses command line arguments to load the settings):

#!/usr/bin/env python3

import time
import requests
import sys

server = sys.argv[1]
project_id = sys.argv[2]
token = sys.argv[3]
start_page = sys.argv[4]

print("Creating list of all jobs that currently have artifacts...")
# We skip the first page.
url = f"https://{server}/api/v4/projects/{project_id}/jobs?per_page=100&page={start_page}"
while url:
    print(f"Processing page: {url}")
    response = requests.get(
        url,
        headers={
            'private-token': token,
        },
    )

    if response.status_code in [500, 429]:
        print(f"Status {response.status_code}, retrying.")
        time.sleep(10)
        continue

    response.raise_for_status()
    response_json = response.json()
    for job in response_json:
        artifacts = job.get('artifacts_file', None)
        if not artifacts:
            artifacts = job.get('artifacts', None)

        has_artifacts = False
        for artifact in artifacts:
            if artifact['filename'] != 'job.log':
                has_artifacts = True
                break

        if has_artifacts:
            job_id = job['id']
            print(f"Processing job ID: {job_id}", end="")
            delete_response = requests.delete(
                f"https://{server}/api/v4/projects/{project_id}/jobs/{job_id}/artifacts",
                headers={
                    'private-token': token,
                },
            )
            print(f" - status: {delete_response.status_code}")

    url = response.links.get('next', {}).get('url', None)
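For reference, a hypothetical invocation (the script name and values are illustrative; the fourth argument is the page to start from):

./remove_artifacts.py gitlab.com 207 <token> 2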

@Tim-Schwalbe

(quoting @mikeller's script above)

I get this error:

remove_artifacts.py", line 38, in <module>
    if artifact['filename'] != 'job.log':
       ~~~~~~~~^^^^^^^^^^^^
TypeError: string indices must be integers, not 'str'

@mikeller

@Tim-Schwalbe: Apologies, yes, I overlooked this case. I have amended the script to ignore artifacts_file, as this file seems to be contained in artifacts anyway.
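For context: artifacts_file is a single JSON object rather than a list, so iterating over it yields its string keys, which is what produced the TypeError above. The artifacts field is a list of objects, roughly like this (an abridged, illustrative sample):

[
  {"file_type": "archive", "filename": "artifacts.zip", "size": 106365},
  {"file_type": "trace", "filename": "job.log", "size": 2048}
]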

I have improved my version a bit: it now automatically selects for deletion expired artifacts that (in my opinion) should be deleted in the first place, because they belong to jobs that were run on:

  • merge requests that have been merged or closed;
  • branches that have been merged.

It also takes a list of project IDs as the final arguments, making it easy to use in a cron job (see the example after the script): Usage: {sys.argv[0]} <server> <token> <project id>...

#!/usr/bin/env python3

import time
import requests
import sys
from datetime import datetime, timezone
from dateutil import parser
import re

if len(sys.argv) < 4:
    print(f'Usage: {sys.argv[0]} <server> <token> <project id>...')

    exit(1)

server = sys.argv[1]
token = sys.argv[2]
project_ids = sys.argv[3:]


now = datetime.now(timezone.utc)

overall_space_savings = 0
for project_id in project_ids:
    print(f'Processing project {project_id}:')

    merge_request_url = f"https://{server}/api/v4/projects/{project_id}/merge_requests?scope=all&per_page=100&page=1"
    merge_requests = {}
    while merge_request_url:
        response = requests.get(
            merge_request_url,
            headers={
                'private-token': token,
            },
        )

        if response.status_code in [500, 429]:
            print(f"Status {response.status_code}, retrying.")
            time.sleep(10)
            continue

        response.raise_for_status()
        response_json = response.json()

        for merge_request in response_json:
            iid = merge_request.get('iid', None)
            if iid:
                merge_requests[int(iid)] = merge_request['state']

        merge_request_url = response.links.get('next', {}).get('url', None)

    branch_url = f"https://{server}/api/v4/projects/{project_id}/repository/branches?per_page=100&page=1"
    unmerged_branches = []
    while branch_url:
        response = requests.get(
            branch_url,
            headers={
                'private-token': token,
            },
        )

        if response.status_code in [500, 429]:
            print(f"Status {response.status_code}, retrying.")
            time.sleep(10)
            continue

        response.raise_for_status()
        response_json = response.json()

        for branch in response_json:
            is_merged = branch['merged']
            if not is_merged:
                unmerged_branches.append(branch['name'])

        branch_url = response.links.get('next', {}).get('url', None)


    url = f"https://{server}/api/v4/projects/{project_id}/jobs?per_page=100&page=1"

    job_count = 0
    artifact_count = 0
    artifact_size = 0
    deleted_artifact_count = 0
    deleted_artifact_size = 0
    while url:
        response = requests.get(
            url,
            headers={
                'private-token': token,
            },
        )

        if response.status_code in [500, 429]:
            print(f"Status {response.status_code}, retrying.")
            time.sleep(10)
            continue

        response.raise_for_status()
        response_json = response.json()
        for job in response_json:
            job_count += 1

            artifacts = job.get('artifacts', None)
            artifacts_expire_at_string = job.get('artifacts_expire_at', None)
            artifacts_expire_at = None
            if artifacts_expire_at_string:
                artifacts_expire_at = parser.parse(artifacts_expire_at_string)

            has_expired_artifacts = False
            deleted_job_artifact_count = 0
            deleted_job_artifact_size = 0
            if artifacts:
                for artifact in artifacts:
                    if artifact['filename'] != 'job.log':
                        size = artifact['size']

                        artifact_count += 1
                        artifact_size += size

                        if not artifacts_expire_at or artifacts_expire_at < now:
                            has_expired_artifacts = True
                            deleted_job_artifact_count += 1
                            deleted_job_artifact_size += size


            delete_artifacts = False
            if has_expired_artifacts:
                ref = job['ref']
                merge_request_iid_match = re.search(r'refs\/merge-requests\/(\d+)\/head', ref)
                if merge_request_iid_match:
                    merge_request_iid = merge_request_iid_match.group(1)
                    if merge_request_iid:
                        merge_request_status = merge_requests.get(int(merge_request_iid))
                        if merge_request_status in ['merged', 'closed', None]:
                            delete_artifacts = True
                            deleted_artifact_count += deleted_job_artifact_count
                            deleted_artifact_size += deleted_job_artifact_size

                elif ref not in unmerged_branches:
                    delete_artifacts = True
                    deleted_artifact_count += deleted_job_artifact_count
                    deleted_artifact_size += deleted_job_artifact_size

            if delete_artifacts:
                job_id = job['id']
                print(f"Processing job ID: {job_id}", end="")
                delete_response = requests.delete(
                    f"https://{server}/api/v4/projects/{project_id}/jobs/{job_id}/artifacts",
                    headers={
                        'private-token': token,
                    },
                )
                print(f" - status: {delete_response.status_code}\033[K", end = "\r")


        print(f'Processed page {url}.\033[K', end = "\r")

        url = response.links.get('next', {}).get('url', None)

    overall_space_savings += deleted_artifact_size

    print()
    print(f'Jobs analysed: {job_count}')
    print(f'Pre artifact count: {artifact_count}')
    print(f'Pre artifact size [MB]: {artifact_size / (1024 * 1024)}')
    print(f'Post artifact count: {artifact_count - deleted_artifact_count}')
    print(f'Post artifact size [MB]: {(artifact_size - deleted_artifact_size) / (1024 * 1024)}')
    print()

print(f'Overall savings [MB]: {overall_space_savings / (1024 * 1024)}')
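For the cron use case mentioned above, an illustrative crontab entry (path, schedule, token, and project IDs are all hypothetical):

0 3 * * 0 /usr/local/bin/clean_artifacts.py gitlab.com <token> 207 208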

@voiski

voiski commented Dec 13, 2023

@mikeller I suggest you write your script code in a gist, or even fork this one here and replace it with your Python code =)

Each gist indicates which forks have activity, making it easy to find interesting changes from others.

@mikeller

New version that takes a GitLab group ID as a parameter and then cleans up all repositories in the group: https://gist.github.com/mikeller/7034d99bc27c361fc6a2df84e19c36ff
