#!/bin/bash
#
# Written by Chris Arceneaux
# GitHub: https://github.com/carceneaux
# Email: [email protected]
# Website: http://arsano.ninja
#
# Note: This code is a stop-gap to erase Job Artifacts for a project. I HIGHLY recommend you leverage
# "artifacts:expire_in" in your .gitlab-ci.yml
#
# https://docs.gitlab.com/ee/ci/yaml/#artifactsexpire_in
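# For example, a minimal sketch of that setting in .gitlab-ci.yml (the job
# name, paths, and duration here are illustrative, not taken from any
# particular project):
#
#   build:
#     script: make build
#     artifacts:
#       paths:
#         - dist/
#       expire_in: 1 week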
#
# Software Requirements: curl, jq
#
# This code has been released under the terms of the Apache-2.0 license
# http://opensource.org/licenses/Apache-2.0

# project_id, find it here: https://gitlab.com/[organization name]/[repository name] at the top underneath the repository name
project_id="207"
# token, find it here: https://gitlab.com/profile/personal_access_tokens
token="9hjGYpwmsMfBxT-Ghuu7"
server="gitlab.com"

# Retrieving Jobs list page count
total_pages=$(curl -sD - -o /dev/null -X GET \
  "https://$server/api/v4/projects/$project_id/jobs?per_page=100" \
  -H "PRIVATE-TOKEN: ${token}" | grep -Fi X-Total-Pages | sed 's/[^0-9]*//g')

# Creating a list of the Job IDs with Artifacts for the specified Project
job_ids=()
echo ""
echo "Creating list of all Jobs that currently have Artifacts..."
echo "Total Pages: ${total_pages}"
for ((i=2;i<=${total_pages};i++)) # starting with page 2, skipping the most recent 100 Jobs
do
    echo "Processing Page: ${i}/${total_pages}"
    response=$(curl -s -X GET \
      "https://$server/api/v4/projects/$project_id/jobs?per_page=100&page=${i}" \
      -H "PRIVATE-TOKEN: ${token}")
    length=$(echo "$response" | jq '. | length')
    for ((j=0;j<${length};j++))
    do
        # -gt performs the intended numeric comparison; > would compare strings
        if [[ $(echo "$response" | jq ".[${j}].artifacts_file | length") -gt 0 ]]; then
            echo "Job found: $(echo "$response" | jq ".[${j}].id")"
            job_ids+=($(echo "$response" | jq ".[${j}].id"))
        fi
    done
done

# Loop through each Job erasing the Artifact(s)
echo ""
echo "${#job_ids[@]} Jobs found. Commencing removal of Artifacts..."
for job_id in "${job_ids[@]}"
do
    response=$(curl -s -X DELETE \
      -H "PRIVATE-TOKEN: ${token}" \
      "https://$server/api/v4/projects/$project_id/jobs/$job_id/artifacts")
    echo "Processing Job ID: ${job_id} - Status: $(echo "$response" | jq '.status')"
done
The response can contain JSON with literal line breaks (\n). Consider stripping them before parsing:

response=${response//\\n/}
length=$(echo "$response" | jq '. | length')

Also, you can easily simulate finding the last page by checking [ $length -ne 0 ] and letting the page loop run to 1000 or more.
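A minimal sketch of that idea, reusing the server, project_id, and token variables from the script above (the 1000-page cap is an arbitrary upper bound, not anything the API requires):

for ((i=1;i<=1000;i++))
do
    response=$(curl -s -H "PRIVATE-TOKEN: ${token}" \
      "https://$server/api/v4/projects/$project_id/jobs?per_page=100&page=${i}")
    length=$(echo "$response" | jq '. | length')
    # an empty page means we have walked past the last job
    [ "$length" -ne 0 ] || break
    # ... process the jobs on this page ...
done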
I made the following Python script, which works for over 10k jobs, too:
#!/usr/bin/env python3

import time

import requests

project_id = '...'
token = '...'
server = 'gitlab.com'

print("Creating list of all jobs that currently have artifacts...")

# We skip the first page.
url = f"https://{server}/api/v4/projects/{project_id}/jobs?per_page=100&page=2"
while url:
    print(f"Processing page: {url}")
    response = requests.get(
        url,
        headers={
            'private-token': token,
        },
    )
    if response.status_code in [500, 429]:
        print(f"Status {response.status_code}, retrying.")
        time.sleep(10)
        continue
    response.raise_for_status()
    response_json = response.json()
    for job in response_json:
        if job.get('artifacts_file', None):
            job_id = job['id']
            delete_response = requests.delete(
                f"https://{server}/api/v4/projects/{project_id}/jobs/{job_id}/artifacts",
                headers={
                    'private-token': token,
                },
            )
            print(f"Processing job ID: {job_id} - status: {delete_response.status_code}")
    url = response.links.get('next', {}).get('url', None)
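In case response.links is unfamiliar: requests parses the HTTP Link header, which the GitLab API sends for pagination, into that dictionary. A minimal sketch (the token and project id are placeholders):

import requests

response = requests.get(
    "https://gitlab.com/api/v4/projects/207/jobs?per_page=100",
    headers={'private-token': '...'},
)
# response.links is e.g. {'next': {'url': '...', 'rel': 'next'}, ...} when
# the server sends a Link header, and an empty dict otherwise.
print(response.links.get('next', {}).get('url', None))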
The if job.get('artifacts_file', None): needs to be changed to if job.get('artifacts', None): in the current version of the API; at least, I don't see artifacts_file in any of the JSON responses.

I see it here: https://docs.gitlab.com/ee/api/jobs.html

I don't know why, but none of the jobs on our server had artifacts_file; they had artifacts instead, where the artifacts were listed including their sizes etc.

artifacts_file worked for me, but it's trivial to support both. I also tweaked the output so you can see which job failed, if any, and made it start at the first page:
#!/usr/bin/env python3

import time

import requests

project_id = '...'
token = '...'
server = 'gitlab.com'
start_page = 1

print("Creating list of all jobs that currently have artifacts...")

# Start at start_page (1 = the first page).
url = f"https://{server}/api/v4/projects/{project_id}/jobs?per_page=100&page={start_page}"
while url:
    print(f"Processing page: {url}")
    response = requests.get(
        url,
        headers={
            'private-token': token,
        },
    )
    if response.status_code in [500, 429]:
        print(f"Status {response.status_code}, retrying.")
        time.sleep(10)
        continue
    response.raise_for_status()
    response_json = response.json()
    for job in response_json:
        if job.get('artifacts_file', None) or job.get('artifacts', None):
            job_id = job['id']
            print(f"Processing job ID: {job_id}", end="")
            delete_response = requests.delete(
                f"https://{server}/api/v4/projects/{project_id}/jobs/{job_id}/artifacts",
                headers={
                    'private-token': token,
                },
            )
            print(f" - status: {delete_response.status_code}")
    url = response.links.get('next', {}).get('url', None)
While the script deletes jobs' artifacts, you can also delete a project's artifacts by adding this code:

url = f"https://{server}/api/v4/projects/{project_id}/artifacts"
delete_response = requests.delete(
    url,
    headers={
        'private-token': token,
    },
)
print(f" - status: {delete_response.status_code}")
This does not work if your project has more than 10000 jobs, due to the removal of the X-Total-Pages header from the GitLab API responses.

Yes, I just found out that the X-Total-Pages header is now missing for performance reasons. Fortunately, when a page number is too high, an empty JSON list ([]) is returned, so it is quite easy to use a loop such as this (here in bash):
PER_PAGE=100
PAGE=1
while JOBS=$(curl -s --header "PRIVATE-TOKEN: $TOKEN" "$GITLAB_INSTANCE/$PROJECT_ID/jobs?per_page=$PER_PAGE&page=$PAGE&sort=asc") && [ "$JOBS" != "[]" ]
do
    for JOB in $(echo "$JOBS" | jq .[].id)
    do
        [...]
    done
    PAGE=$((PAGE+1))
done
Here's my slightly improved version for the 'do it in python' section (it ignores job.log files, which seem to be non-deletable, and uses command line arguments to load the settings):
#!/usr/bin/env python3

import sys
import time

import requests

server = sys.argv[1]
project_id = sys.argv[2]
token = sys.argv[3]
start_page = sys.argv[4]

print("Creating list of all jobs that currently have artifacts...")

url = f"https://{server}/api/v4/projects/{project_id}/jobs?per_page=100&page={start_page}"
while url:
    print(f"Processing page: {url}")
    response = requests.get(
        url,
        headers={
            'private-token': token,
        },
    )
    if response.status_code in [500, 429]:
        print(f"Status {response.status_code}, retrying.")
        time.sleep(10)
        continue
    response.raise_for_status()
    response_json = response.json()
    for job in response_json:
        artifacts = job.get('artifacts_file', None)
        if not artifacts:
            artifacts = job.get('artifacts', None)
        has_artifacts = False
        for artifact in artifacts:
            if artifact['filename'] != 'job.log':
                has_artifacts = True
                break
        if has_artifacts:
            job_id = job['id']
            print(f"Processing job ID: {job_id}", end="")
            delete_response = requests.delete(
                f"https://{server}/api/v4/projects/{project_id}/jobs/{job_id}/artifacts",
                headers={
                    'private-token': token,
                },
            )
            print(f" - status: {delete_response.status_code}")
    url = response.links.get('next', {}).get('url', None)
I get this error:
remove_artifacts.py", line 38, in <module>
if artifact['filename'] != 'job.log':
~~~~~~~~^^^^^^^^^^^^
TypeError: string indices must be integers, not 'str'
@Tim-Schwalbe: Apologies, yes, I overlooked this case. I have amended the script to ignore artifacts_file, as this file seems to be contained in artifacts anyway.
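For reference, a minimal sketch of the amended check as a standalone helper (the job dict shape is the one returned by the jobs API, as discussed above):

def has_deletable_artifacts(job):
    # Only consult 'artifacts': 'artifacts_file' is a single dict rather
    # than a list, and the file it describes appears in 'artifacts' too.
    artifacts = job.get('artifacts') or []
    return any(artifact['filename'] != 'job.log' for artifact in artifacts)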
I have improved my version a bit: it now automatically selects expired artifacts for deletion that (in my opinion) should be deleted in the first place, because they belong to jobs that were run on:
- merge requests that have been merged or closed;
- branches that have been merged.

It will also take a list of project ids as the last arguments, making it easy to use in a cron job: Usage: {sys.argv[0]} <server> <token> <project id>...
#!/usr/bin/env python3

import re
import sys
import time
from datetime import datetime, timezone

import requests
from dateutil import parser

if len(sys.argv) < 4:
    print(f'Usage: {sys.argv[0]} <server> <token> <project id>...')
    exit(1)

server = sys.argv[1]
token = sys.argv[2]
project_ids = []
for i in range(3, len(sys.argv)):
    project_ids.append(sys.argv[i])

now = datetime.now(timezone.utc)

overall_space_savings = 0
for project_id in project_ids:
    print(f'Processing project {project_id}:')

    merge_request_url = f"https://{server}/api/v4/projects/{project_id}/merge_requests?scope=all&per_page=100&page=1"
    merge_requests = {}
    while merge_request_url:
        response = requests.get(
            merge_request_url,
            headers={
                'private-token': token,
            },
        )
        if response.status_code in [500, 429]:
            print(f"Status {response.status_code}, retrying.")
            time.sleep(10)
            continue
        response.raise_for_status()
        response_json = response.json()
        for merge_request in response_json:
            iid = merge_request.get('iid', None)
            if iid:
                merge_requests[int(iid)] = merge_request['state']
        merge_request_url = response.links.get('next', {}).get('url', None)

    branch_url = f"https://{server}/api/v4/projects/{project_id}/repository/branches?per_page=100&page=1"
    unmerged_branches = []
    while branch_url:
        response = requests.get(
            branch_url,
            headers={
                'private-token': token,
            },
        )
        if response.status_code in [500, 429]:
            print(f"Status {response.status_code}, retrying.")
            time.sleep(10)
            continue
        response.raise_for_status()
        response_json = response.json()
        for branch in response_json:
            is_merged = branch['merged']
            if not is_merged:
                unmerged_branches.append(branch['name'])
        branch_url = response.links.get('next', {}).get('url', None)

    url = f"https://{server}/api/v4/projects/{project_id}/jobs?per_page=100&page=1"
    job_count = 0
    artifact_count = 0
    artifact_size = 0
    deleted_artifact_count = 0
    deleted_artifact_size = 0
    while url:
        response = requests.get(
            url,
            headers={
                'private-token': token,
            },
        )
        if response.status_code in [500, 429]:
            print(f"Status {response.status_code}, retrying.")
            time.sleep(10)
            continue
        response.raise_for_status()
        response_json = response.json()
        for job in response_json:
            job_count += 1
            artifacts = job.get('artifacts', None)
            artifacts_expire_at_string = job.get('artifacts_expire_at', None)
            artifacts_expire_at = None
            if artifacts_expire_at_string:
                artifacts_expire_at = parser.parse(artifacts_expire_at_string)
            has_expired_artifacts = False
            deleted_job_artifact_count = 0
            deleted_job_artifact_size = 0
            if artifacts:
                for artifact in artifacts:
                    if artifact['filename'] != 'job.log':
                        size = artifact['size']
                        artifact_count += 1
                        artifact_size += size
                        if not artifacts_expire_at or artifacts_expire_at < now:
                            has_expired_artifacts = True
                            deleted_job_artifact_count += 1
                            deleted_job_artifact_size += size
            delete_artifacts = False
            if has_expired_artifacts:
                ref = job['ref']
                merge_request_iid_match = re.search(r'refs\/merge-requests\/(\d+)\/head', ref)
                if merge_request_iid_match:
                    merge_request_iid = merge_request_iid_match.group(1)
                    if merge_request_iid:
                        merge_request_status = merge_requests.get(int(merge_request_iid))
                        if merge_request_status in ['merged', 'closed', None]:
                            delete_artifacts = True
                            deleted_artifact_count += deleted_job_artifact_count
                            deleted_artifact_size += deleted_job_artifact_size
                elif ref not in unmerged_branches:
                    delete_artifacts = True
                    deleted_artifact_count += deleted_job_artifact_count
                    deleted_artifact_size += deleted_job_artifact_size
            if delete_artifacts:
                job_id = job['id']
                print(f"Processing job ID: {job_id}", end="")
                delete_response = requests.delete(
                    f"https://{server}/api/v4/projects/{project_id}/jobs/{job_id}/artifacts",
                    headers={
                        'private-token': token,
                    },
                )
                print(f" - status: {delete_response.status_code}\033[K", end="\r")
        print(f'Processed page {url}.\033[K', end="\r")
        url = response.links.get('next', {}).get('url', None)

    overall_space_savings += deleted_artifact_size

    print()
    print(f'Jobs analysed: {job_count}')
    print(f'Pre artifact count: {artifact_count}')
    print(f'Pre artifact size [MB]: {artifact_size / (1024 * 1024)}')
    print(f'Post artifact count: {artifact_count - deleted_artifact_count}')
    print(f'Post artifact size [MB]: {(artifact_size - deleted_artifact_size) / (1024 * 1024)}')

print()
print(f'Overall savings [MB]: {overall_space_savings / (1024 * 1024)}')
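For anyone trying it out, a hypothetical invocation (the script name is a placeholder, and 207 and 208 stand in for real project ids); note that it needs the third-party packages requests and python-dateutil:

pip install requests python-dateutil
python3 clean_artifacts.py gitlab.com <token> 207 208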
@mikeller I suggest you write your script code in a gist, or even fork this one here and replace it with your Python code =)
Each gist indicates which forks have activity, making it easy to find interesting changes from others.

@voiski: Good point, done: https://gist.github.com/mikeller/ee7a668a83e4b9bc61646bddb4a2ade6
New version that takes a GitLab group id as a parameter and then cleans up all repositories in the group: https://gist.github.com/mikeller/7034d99bc27c361fc6a2df84e19c36ff
@Atarity Check the source code and you will find the hint

# starting with page 2, skipping the most recent 100 Jobs

so it is intended that the artifacts on the first page are not removed.