@carceneaux
Last active July 17, 2024 09:39
Script for removing GitLab Job Artifacts.
#!/bin/bash
#
# Written by Chris Arceneaux
# GitHub: https://github.com/carceneaux
# Email: [email protected]
# Website: http://arsano.ninja
#
# Note: This code is a stop-gap to erase Job Artifacts for a project. I HIGHLY recommend you leverage
# "artifacts:expire_in" in your .gitlab-ci.yml
#
# https://docs.gitlab.com/ee/ci/yaml/#artifactsexpire_in
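# For example, a minimal .gitlab-ci.yml sketch (job name and paths are illustrative):
#
#   build:
#     script: make build
#     artifacts:
#       paths:
#         - dist/
#       expire_in: 1 week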
#
# Software Requirements: curl, jq
#
# This code has been released under the terms of the Apache-2.0 license
# http://opensource.org/licenses/Apache-2.0
# project_id, find it here: https://gitlab.com/[organization name]/[repository name] at the top underneath repository name
project_id="207"
# token, find it here: https://gitlab.com/profile/personal_access_tokens
token="9hjGYpwmsMfBxT-Ghuu7"
server="gitlab.com"
# Retrieving Jobs list page count
total_pages=$(curl -sD - -o /dev/null -X GET \
"https://$server/api/v4/projects/$project_id/jobs?per_page=100" \
-H "PRIVATE-TOKEN: ${token}" | grep -Fi X-Total-Pages | sed 's/[^0-9]*//g')
# Creating list of Job IDs for the Project specified with Artifacts
job_ids=()
echo ""
echo "Creating list of all Jobs that currently have Artifacts..."
echo "Total Pages: ${total_pages}"
for ((i=2;i<=total_pages;i++)) # starting with page 2, skipping the most recent 100 Jobs
do
  echo "Processing Page: ${i}/${total_pages}"
  response=$(curl -s -X GET \
    "https://$server/api/v4/projects/$project_id/jobs?per_page=100&page=${i}" \
    -H "PRIVATE-TOKEN: ${token}")
  length=$(echo "$response" | jq '. | length')
  for ((j=0;j<length;j++))
  do
    # A numeric comparison (-gt) is needed here; ">" inside [[ ]] compares strings.
    if [[ $(echo "$response" | jq ".[${j}].artifacts_file | length") -gt 0 ]]; then
      echo "Job found: $(echo "$response" | jq ".[${j}].id")"
      job_ids+=($(echo "$response" | jq ".[${j}].id"))
    fi
  done
done
# Loop through each Job erasing the Artifact(s)
echo ""
echo "${#job_ids[@]} Jobs found. Commencing removal of Artifacts..."
for job_id in "${job_ids[@]}"
do
  response=$(curl -s -X DELETE \
    -H "PRIVATE-TOKEN: ${token}" \
    "https://$server/api/v4/projects/$project_id/jobs/$job_id/artifacts")
  echo "Processing Job ID: ${job_id} - Status: $(echo "$response" | jq '.status')"
done
@kbaran1998

While the script deletes jobs' artifacts, you can also delete the project's artifacts in bulk by adding this code:

import requests  # server, project_id, and token are assumed to be defined as in the script above

url = f"https://{server}/api/v4/projects/{project_id}/artifacts"
delete_response = requests.delete(
    url,
    headers={
        'private-token': token,
    }
)
print(f" - status: {delete_response.status_code}")

@Muffinman

This does not work if your project has more than 10,000 jobs, due to the removal of the X-Total-Pages header from the GitLab API responses.
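You can see the missing header for yourself with the same header dump the main script uses (a sketch reusing its $server, $project_id, and $token variables; the grep prints nothing once GitLab omits the header):

curl -sD - -o /dev/null \
  -H "PRIVATE-TOKEN: ${token}" \
  "https://$server/api/v4/projects/$project_id/jobs?per_page=100" \
  | grep -i x-total-pages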

@cmuller

cmuller commented Jul 7, 2023

Yes, I just found out that the X-Total-Pages header is now missing for performance reasons. Fortunately, when a page number is too high, an empty JSON list ([]) is returned, so it is quite easy to use a loop such as this one (here in bash):

PER_PAGE=100
PAGE=1
while JOBS=$(curl -s --header "PRIVATE-TOKEN: $TOKEN" "$GITLAB_INSTANCE/$PROJECT_ID/jobs?per_page=$PER_PAGE&page=$PAGE&sort=asc") && [ "$JOBS" != "[]" ]
do
   for JOB in $(echo "$JOBS" | jq '.[].id')
   do
      [...]
   done
   PAGE=$((PAGE+1))
done
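The elided loop body would presumably issue the same per-job DELETE call as the main script; a sketch using the variables already defined in this loop ($GITLAB_INSTANCE is assumed to already include the /api/v4/projects prefix, as in the URL above):

      curl -s -X DELETE \
         -H "PRIVATE-TOKEN: $TOKEN" \
         "$GITLAB_INSTANCE/$PROJECT_ID/jobs/$JOB/artifacts"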

@mikeller

mikeller commented Dec 7, 2023

Here's my slightly improved version for the 'do it in python' section (it ignores job.log files, which seem to be non-deletable, and uses command line arguments to load the settings):

#!/usr/bin/env python3

import time
import requests
import sys

server = sys.argv[1]
project_id = sys.argv[2]
token = sys.argv[3]
start_page = sys.argv[4]

print("Creating list of all jobs that currently have artifacts...")
# We skip the first page.
url = f"https://{server}/api/v4/projects/{project_id}/jobs?per_page=100&page={start_page}"
while url:
    print(f"Processing page: {url}")
    response = requests.get(
        url,
        headers={
            'private-token': token,
        },
    )

    if response.status_code in [500, 429]:
        print(f"Status {response.status_code}, retrying.")
        time.sleep(10)
        continue

    response.raise_for_status()
    response_json = response.json()
    for job in response_json:
        artifacts = job.get('artifacts_file', None)
        if not artifacts:
            artifacts = job.get('artifacts', None)

        has_artifacts = False
        for artifact in artifacts:
            if artifact['filename'] != 'job.log':
                has_artifacts = True
                break

        if has_artifacts:
            job_id = job['id']
            print(f"Processing job ID: {job_id}", end="")
            delete_response = requests.delete(
                f"https://{server}/api/v4/projects/{project_id}/jobs/{job_id}/artifacts",
                headers={
                    'private-token': token,
                },
            )
            print(f" - status: {delete_response.status_code}")

    url = response.links.get('next', {}).get('url', None)
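For reference, a hypothetical invocation (the script name and values are illustrative; the fourth argument is the page to start from):

./remove_artifacts.py gitlab.com 207 <token> 2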

@Tim-Schwalbe

(quoting @mikeller's script above)

I get this error:

remove_artifacts.py", line 38, in <module>
    if artifact['filename'] != 'job.log':
       ~~~~~~~~^^^^^^^^^^^^
TypeError: string indices must be integers, not 'str'

@mikeller

@Tim-Schwalbe: Apologies, yes, I overlooked this case. I have amended the script to ignore artifacts_file, as this file seems to be contained in artifacts anyway.
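For context: artifacts_file is a single JSON object rather than a list, so iterating over it yields its string keys, which is what produced the TypeError above. The artifacts field is a list of objects, roughly like this (an abridged, illustrative sample):

[
  {"file_type": "archive", "filename": "artifacts.zip", "size": 106365},
  {"file_type": "trace", "filename": "job.log", "size": 2048}
]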

I have improved my version a bit: it now automatically selects for deletion expired artifacts that (in my opinion) should be deleted in the first place, because they belong to jobs that were run on:

  • merge requests that have been merged or closed;
  • branches that have been merged.

It also takes a list of project IDs as the final arguments, making it easy to use in a cron job (see the example after the script): Usage: {sys.argv[0]} <server> <token> <project id>...

#!/usr/bin/env python3

import time
import requests
import sys
from datetime import datetime, timezone
from dateutil import parser
import re

if len(sys.argv) < 4:
    print(f'Usage: {sys.argv[0]} <server> <token> <project id>...')

    exit(1)

server = sys.argv[1]
token = sys.argv[2]
project_ids = sys.argv[3:]


now = datetime.now(timezone.utc)

overall_space_savings = 0
for project_id in project_ids:
    print(f'Processing project {project_id}:')

    merge_request_url = f"https://{server}/api/v4/projects/{project_id}/merge_requests?scope=all&per_page=100&page=1"
    merge_requests = {}
    while merge_request_url:
        response = requests.get(
            merge_request_url,
            headers={
                'private-token': token,
            },
        )

        if response.status_code in [500, 429]:
            print(f"Status {response.status_code}, retrying.")
            time.sleep(10)
            continue

        response.raise_for_status()
        response_json = response.json()

        for merge_request in response_json:
            iid = merge_request.get('iid', None)
            if iid:
                merge_requests[int(iid)] = merge_request['state']

        merge_request_url = response.links.get('next', {}).get('url', None)

    branch_url = f"https://{server}/api/v4/projects/{project_id}/repository/branches?per_page=100&page=1"
    unmerged_branches = []
    while branch_url:
        response = requests.get(
            branch_url,
            headers={
                'private-token': token,
            },
        )

        if response.status_code in [500, 429]:
            print(f"Status {response.status_code}, retrying.")
            time.sleep(10)
            continue

        response.raise_for_status()
        response_json = response.json()

        for branch in response_json:
            is_merged = branch['merged']
            if not is_merged:
                unmerged_branches.append(branch['name'])

        branch_url = response.links.get('next', {}).get('url', None)


    url = f"https://{server}/api/v4/projects/{project_id}/jobs?per_page=100&page=1"

    job_count = 0
    artifact_count = 0
    artifact_size = 0
    deleted_artifact_count = 0
    deleted_artifact_size = 0
    while url:
        response = requests.get(
            url,
            headers={
                'private-token': token,
            },
        )

        if response.status_code in [500, 429]:
            print(f"Status {response.status_code}, retrying.")
            time.sleep(10)
            continue

        response.raise_for_status()
        response_json = response.json()
        for job in response_json:
            job_count += 1

            artifacts = job.get('artifacts', None)
            artifacts_expire_at_string = job.get('artifacts_expire_at', None)
            artifacts_expire_at = None
            if artifacts_expire_at_string:
                artifacts_expire_at = parser.parse(artifacts_expire_at_string)

            has_expired_artifacts = False
            deleted_job_artifact_count = 0
            deleted_job_artifact_size = 0
            if artifacts:
                for artifact in artifacts:
                    if artifact['filename'] != 'job.log':
                        size = artifact['size']

                        artifact_count += 1
                        artifact_size += size

                        if not artifacts_expire_at or artifacts_expire_at < now:
                            has_expired_artifacts = True
                            deleted_job_artifact_count += 1
                            deleted_job_artifact_size += size


            delete_artifacts = False
            if has_expired_artifacts:
                ref = job['ref']
                merge_request_iid_match = re.search(r'refs\/merge-requests\/(\d+)\/head', ref)
                if merge_request_iid_match:
                    merge_request_iid = merge_request_iid_match.group(1)
                    if merge_request_iid:
                        merge_request_status = merge_requests.get(int(merge_request_iid))
                        if merge_request_status in ['merged', 'closed', None]:
                            delete_artifacts = True
                            deleted_artifact_count += deleted_job_artifact_count
                            deleted_artifact_size += deleted_job_artifact_size

                elif ref not in unmerged_branches:
                    delete_artifacts = True
                    deleted_artifact_count += deleted_job_artifact_count
                    deleted_artifact_size += deleted_job_artifact_size

            if delete_artifacts:
                job_id = job['id']
                print(f"Processing job ID: {job_id}", end="")
                delete_response = requests.delete(
                    f"https://{server}/api/v4/projects/{project_id}/jobs/{job_id}/artifacts",
                    headers={
                        'private-token': token,
                    },
                )
                print(f" - status: {delete_response.status_code}\033[K", end = "\r")


        print(f'Processed page {url}.\033[K', end = "\r")

        url = response.links.get('next', {}).get('url', None)

    overall_space_savings += deleted_artifact_size

    print()
    print(f'Jobs analysed: {job_count}')
    print(f'Pre artifact count: {artifact_count}')
    print(f'Pre artifact size [MB]: {artifact_size / (1024 * 1024)}')
    print(f'Post artifact count: {artifact_count - deleted_artifact_count}')
    print(f'Post artifact size [MB]: {(artifact_size - deleted_artifact_size) / (1024 * 1024)}')
    print()

print(f'Overall savings [MB]: {overall_space_savings / (1024 * 1024)}')
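For the cron use case mentioned above, an illustrative crontab entry (path, schedule, token, and project IDs are all hypothetical):

0 3 * * 0 /usr/local/bin/clean_artifacts.py gitlab.com <token> 207 208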

@voiski

voiski commented Dec 13, 2023

@mikeller I suggest you write your script code in a gist, or even fork this one here and replace it with your Python code =)

Each gist indicates which forks have activity, making it easy to find interesting changes from others.

@mikeller

New version that takes a GitLab group ID as a parameter and then cleans up all repositories in the group: https://gist.github.com/mikeller/7034d99bc27c361fc6a2df84e19c36ff
