[jfrog] Data Usage Info [clickhouse + jfrog]

JFrog Data Overuse Analysis

1. Audit JFrog Usage

1.1 Check JFrog Storage and Transfer Reports

  • Navigate to JFrog Platform UI → Administration → Storage or Transfer.
  • Identify which repositories and artifacts consumed the most bandwidth.

1.2 Enable JFrog Logs & Insights

  • If not enabled, activate JFrog Insight to get usage statistics on frequently pulled artifacts.
  • Use JFrog Access Logs to track downloads.
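
A quick sketch for surfacing the busiest download days from the access log. The log path and the "ACCEPTED DOWNLOAD" entry format are assumptions based on common Artifactory defaults; verify them against your instance first:

# Count download events per day recorded in the access log.
grep 'ACCEPTED DOWNLOAD' /opt/jfrog/artifactory/var/log/artifactory-access.log \
  | awk '{print $1}' | sort | uniq -c | sort -nr | head -10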

1.3 Analyze API Requests & Automated Jobs

  • Review automated jobs in GitHub Actions or other CI/CD pipelines for excessive data pulls.
  • Check cron jobs or scripts that might be executing unnecessary downloads.
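
A rough first pass is to grep the workflow definitions for direct Artifactory pulls; the paths and host pattern below are placeholders for your repository layout:

# Flag workflow steps that download from the JFrog instance.
grep -rn --include='*.yml' --include='*.yaml' 'jfrog-instance' .github/workflows/ \
  | grep -Ei 'curl|wget|download'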

2. Optimize Data Pulling & Retention Policies

2.1 Implement Download Throttling & Caching

  • Use Artifactory remote repositories (aggregated behind virtual repositories where convenient) so external artifacts are cached locally instead of being re-fetched on every build.
  • Set up a proxy cache (Azure Blob Storage, Cloudflare, or Nginx cache) for frequently accessed artifacts.
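
Creating a caching remote repository is a single call to the Repositories REST API. A minimal sketch; the repository key, package type, and upstream URL are placeholders:

# Create a remote repository that caches pulls from the upstream registry.
curl -u "username:password" -X PUT \
  "https://jfrog-instance/api/repositories/npm-remote" \
  -H "Content-Type: application/json" \
  -d '{"rclass": "remote", "packageType": "npm", "url": "https://registry.npmjs.org"}'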

2.2 Review Artifact Retention Policies

  • Configure JFrog to auto-delete old artifacts that are no longer used.
  • Set policies to retain only the last N versions of a package.
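
To find deletion candidates before enforcing a policy, an AQL query can list artifacts with no recent downloads. A hedged sketch; the 180-day window and repository name are illustrative:

# List artifacts in the repo that have not been downloaded in ~180 days.
curl -u "username:password" -X POST "https://jfrog-instance/api/search/aql" \
  -H "Content-Type: text/plain" \
  -d 'items.find({"repo":"b2bc-connect-DevOps","stat.downloaded":{"$before":"180d"}})'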

2.3 Enforce Smarter CI/CD Behavior

  • Modify GitHub Actions & CI jobs to avoid unnecessary downloads.
  • Utilize artifact caching (actions/cache, Confluent Cloud Schema Registry, or ClickHouse for metadata tracking).
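
For jobs that cannot use actions/cache directly, even a shell-level cache avoids re-pulling the same artifact. Illustrative only; the cache directory and artifact path are hypothetical:

# Pull from JFrog only when no cached copy exists.
CACHE_DIR="${RUNNER_TEMP:-/tmp}/artifact-cache"
ARTIFACT="b2bc-connect-DevOps/app/app-1.2.3.tgz"
mkdir -p "$CACHE_DIR/$(dirname "$ARTIFACT")"
if [ ! -f "$CACHE_DIR/$ARTIFACT" ]; then
  curl -u "username:password" -fsS -o "$CACHE_DIR/$ARTIFACT" \
    "https://jfrog-instance/$ARTIFACT"
fi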

2.4 Enable Storage Optimization Features

  • If JFrog garbage collection is not running, enable it to remove orphaned artifacts.
  • Compress artifacts before storage (if applicable).

3. Implement Monitoring & Cost Controls

3.1 Enable JFrog Webhooks for Monitoring

  • Set up notifications when data transfer exceeds a defined threshold.
  • Use Dynatrace or Azure Monitor to track excessive API requests.

3.2 Use JFrog API for Real-time Analytics

  • Retrieve storage usage using JFrog's REST API:
    curl -u "user:password" -X GET "https://jfrog-instance/api/storageinfo"
  • Identify the largest repositories and artifacts being accessed.
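
To rank repositories straight from that payload, pipe it through jq. The field names (repositoriesSummaryList, repoKey, filesCount, usedSpace) match recent Artifactory versions but should be checked against your instance's raw JSON:

# Top 10 repositories by file count, tab-separated.
curl -su "user:password" "https://jfrog-instance/api/storageinfo" \
  | jq -r '.repositoriesSummaryList | sort_by(-.filesCount)[:10][]
           | "\(.repoKey)\t\(.usedSpace)\t\(.filesCount)"'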

3.3 Set Up Azure Budget Alerts

  • If billing is through Azure, configure cost alerts when JFrog storage/egress spikes.
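
A hedged sketch using the Azure CLI's consumption module; the budget name, amount, and dates are placeholders, and flag availability depends on your CLI version:

# Create a monthly cost budget; alerts are configured on top of it.
az consumption budget create \
  --budget-name jfrog-egress-budget \
  --amount 500 \
  --category cost \
  --time-grain monthly \
  --start-date 2025-03-01 \
  --end-date 2026-03-01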

4. Generate a Detailed JFrog Report

4.1 Retrieve JFrog Storage and Transfer Metrics

A. Get Storage Information

curl -u "username:password" -X GET "https://jfrog-instance/api/storageinfo"

B. List Rarely Used Artifacts

The usage search requires a notUsedSince timestamp in epoch milliseconds (the value below is illustrative):

curl -u "username:password" -X GET "https://jfrog-instance/api/search/usage?notUsedSince=1704067200000&repos=b2bc-connect-DevOps"

C. Get Download Statistics

curl -u "username:password" -X GET "https://jfrog-instance/api/storage/downloads?limit=100"

4.2 Analyze JFrog Logs for Unnecessary Data Pulls

A. Extract Most Requested Artifacts

# Assumes the pipe-delimited request.log layout; field positions vary by
# Artifactory version, so check with `head -1 request.log` first.
awk -F'|' '{print $7}' request.log | sort | uniq -c | sort -nr | head -20

B. Find Top Users/Pipelines

# The username also lives in the request log (field 5 in the legacy
# pipe-delimited layout); access.log's first field is a timestamp, not a user.
awk -F'|' '{print $5}' request.log | sort | uniq -c | sort -nr | head -10

4.3 Convert Data into CSV for Further Analysis

curl -u "username:password" -X GET "https://jfrog-instance/api/storageinfo" | jq -r '. | @csv' > jfrog_storage_report.csv

5. Automate Reports & Alerts

5.1 Schedule Reports Using a Cron Job

echo "curl -u 'username:password' -X GET 'https://jfrog-instance/api/storageinfo' > jfrog_report.json" | crontab -

5.2 Alert If a User Exceeds 10TB in a Day

SELECT username, sum(size_bytes) / 1e12 AS egress_TB
FROM artifactory_logs
WHERE method = 'GET' AND status_code = 200 AND toDate(timestamp) = today()
GROUP BY username
HAVING egress_TB > 10;
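
To act on this automatically, a small wrapper can run the query and notify only when rows come back. A sketch; the webhook URL is a placeholder for your alerting endpoint:

# Page only when at least one user crossed the 10 TB daily threshold.
rows=$(clickhouse-client --query="
  SELECT username, sum(size_bytes) / 1e12 AS egress_TB
  FROM artifactory_logs
  WHERE method = 'GET' AND status_code = 200 AND toDate(timestamp) = today()
  GROUP BY username
  HAVING egress_TB > 10
  FORMAT TSV")
if [ -n "$rows" ]; then
  curl -X POST --data-binary "JFrog egress alert: $rows" \
    "https://hooks.example.com/jfrog-egress"
fi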

6. ClickHouse Queries for Artifactory Egress Reports

6.1 Ensure Artifactory Logs are Ingested into ClickHouse

A. Create a ClickHouse Table for Logs

CREATE TABLE artifactory_logs (
    timestamp DateTime,
    ip_address String,
    method String,
    url_path String,
    status_code UInt16,
    size_bytes UInt64,
    username String
) ENGINE = MergeTree()
ORDER BY timestamp;

6.2 Ingest Logs into ClickHouse

# Works only if request.log is already tab-separated in the table's column
# order; for raw logs, see the transform sketch below.
cat request.log | clickhouse-client --query="INSERT INTO artifactory_logs FORMAT TSV"
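
Raw request logs need reshaping first. A hedged transform, assuming the legacy pipe-delimited layout (timestamp|duration|type|ip|user|method|path|protocol|status|bytes); verify the field positions and timestamp format on your instance before running:

# Reorder fields into (timestamp, ip, method, path, status, bytes, user)
# to match the artifactory_logs columns; best_effort relaxes DateTime parsing.
awk -F'|' 'BEGIN{OFS="\t"} {print $1, $4, $6, $7, $9, $10, $5}' request.log \
  | clickhouse-client --date_time_input_format=best_effort \
      --query="INSERT INTO artifactory_logs FORMAT TSV"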

7. Query Artifactory Egress Usage

7.1 Total Egress Data Over Time

SELECT toDate(timestamp) AS date, sum(size_bytes) / 1e12 AS egress_TB
FROM artifactory_logs
WHERE method = 'GET' AND status_code = 200
GROUP BY date
ORDER BY date DESC;

7.2 Egress Breakdown by User

SELECT username, sum(size_bytes) / 1e12 AS egress_TB
FROM artifactory_logs
WHERE method = 'GET' AND status_code = 200
GROUP BY username
ORDER BY egress_TB DESC
LIMIT 10;

7.3 Egress Breakdown by Repository

SELECT
    -- splitByChar is 1-based; with a leading '/', element [2] is the first
    -- path segment (the repo name). Adjust the index if your paths carry an
    -- /artifactory prefix.
    splitByChar('/', url_path)[2] AS repo_name,
    sum(size_bytes) / 1e12 AS egress_TB
FROM artifactory_logs
WHERE method = 'GET' AND status_code = 200
GROUP BY repo_name
ORDER BY egress_TB DESC
LIMIT 10;

8. Automate Egress Reporting

8.1 Create a Materialized View for Faster Queries

CREATE MATERIALIZED VIEW egress_summary_mv ENGINE = SummingMergeTree()
-- ORDER BY must use columns the view actually emits; timestamp is
-- aggregated away into date.
ORDER BY (date, username, ip_address)
AS SELECT 
    toDate(timestamp) AS date,
    username,
    ip_address,
    sum(size_bytes) AS total_bytes
FROM artifactory_logs
WHERE method = 'GET' AND status_code = 200
GROUP BY date, username, ip_address;

8.2 Schedule Reports Using a Cron Job

echo "SELECT date, username, sum(total_bytes) / 1e12 AS egress_TB FROM egress_summary_mv GROUP BY date, username ORDER BY date DESC" | clickhouse-client > daily_egress_report.csv

9. Set Up Alerts for High Egress Usage

9.1 Alert If a Repository Causes a Spike

WITH last_7_days AS (
    SELECT
        toDate(timestamp) AS date,
        -- repo_name is derived, not a table column (same extraction as 7.3)
        splitByChar('/', url_path)[2] AS repo_name,
        sum(size_bytes) / 1e12 AS egress_TB
    FROM artifactory_logs
    WHERE method = 'GET' AND status_code = 200
      AND timestamp >= now() - INTERVAL 7 DAY
    GROUP BY date, repo_name
)
SELECT repo_name, avg(egress_TB) AS avg_egress, max(egress_TB) AS max_egress
FROM last_7_days
GROUP BY repo_name
HAVING max_egress > 2 * avg_egress;