Realistic GPU Requirements: 1K, 5K, 10K Active Users

Step 1: H100 Concurrent Request Capacity

Single H100 Physical Limits:

  • Memory constraint: 80GB VRAM total
  • Model footprint: ~62GB (MXFP4 quantized GPT-OSS-120B)
  • System overhead: ~3GB
  • Available for KV cache: ~15GB

KV Cache Requirements (with GQA optimization):

  • 4K context: ~1.2GB per request → 12 concurrent requests
  • 8K context: ~2.4GB per request → 6 concurrent requests
  • 16K context: ~4.8GB per request → 3 concurrent requests
  • 32K context: ~9.6GB per request → 1-2 concurrent requests
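
These capacities are straight division; a minimal sketch of the arithmetic, using the ~15GB KV budget and the per-request estimates above (planning assumptions, not measurements):

```python
# Concurrent request capacity per H100, from the KV-cache budget above.
KV_BUDGET_GB = 80 - 62 - 3  # total VRAM - model weights - overhead = 15 GB

# Approximate KV-cache footprint per request (GB) with GQA, by context length
KV_PER_REQUEST_GB = {"4K": 1.2, "8K": 2.4, "16K": 4.8, "32K": 9.6}

for context, kv_gb in KV_PER_REQUEST_GB.items():
    capacity = KV_BUDGET_GB / kv_gb
    print(f"{context} context: ~{capacity:.1f} concurrent requests")
# 4K: ~12.5, 8K: ~6.2, 16K: ~3.1, 32K: ~1.6 (rounded down in the list above)
```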

Step 2: Enterprise User Behavior Analysis

Realistic Interaction Cycle:

Query → Generation → Reading → Thinking → Next Query
  ↓         ↓           ↓          ↓         ↓
Send     5-12 sec    15-45 sec  60-180 sec  Repeat

Detailed Timing Breakdown:

  • Request processing: 5-12 seconds (GPU utilization time)
  • User reading: 15-45 seconds (response consumption)
  • User thinking: 60-180 seconds (formulating next query)
  • Total cycle time: 80-237 seconds (average: ~150 seconds)

GPU Utilization Rate:

  • Active GPU time: 8 seconds average
  • Total cycle time: 150 seconds average
  • Utilization per user: 8 ÷ 150 ≈ 5.3%
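
A worked check of that duty-cycle arithmetic (the 8 s and 150 s averages are this section's assumptions):

```python
# Duty-cycle view: the GPU is busy only while generating a response.
active_gpu_seconds = 8   # average generation time per query (above)
cycle_seconds = 150      # average query -> read -> think cycle (above)

utilization = active_gpu_seconds / cycle_seconds     # ~0.053 -> 5.3%
users_per_slot = cycle_seconds / active_gpu_seconds  # ~19 users
print(f"{utilization:.1%} GPU time per user; "
      f"~{users_per_slot:.0f} users can share one concurrent request slot")
```

That ~19:1 sharing ratio is why thousands of active users map to only tens of concurrent requests.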

Step 3: Peak Usage Pattern Analysis

Enterprise Usage Distribution (Starting from Active Users):

  • Active user base: 100% (baseline)
  • Peak 4-hour window: 45-60% of active users online
  • Peak simultaneous usage: 20-30% of peak window users
  • Concurrent query rate: 12-18% of simultaneous users

Simultaneous Peak Calculations:

| Active Users | Peak Window Online | Peak Simultaneous | Concurrent Queries |
|---|---|---|---|
| 1,000 | 450-600 | 90-180 | 11-32 |
| 5,000 | 2,250-3,000 | 450-900 | 54-162 |
| 10,000 | 4,500-6,000 | 900-1,800 | 108-324 |
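
A short sketch reproducing the funnel behind this table, using the rate ranges above:

```python
# Active users -> peak window online -> simultaneous -> concurrent queries.
RATES_LOW = (0.45, 0.20, 0.12)   # lower bound of each funnel stage
RATES_HIGH = (0.60, 0.30, 0.18)  # upper bound of each funnel stage

def concurrent_queries(active_users: int) -> tuple[int, int]:
    low = high = active_users
    for r_lo, r_hi in zip(RATES_LOW, RATES_HIGH):
        low, high = low * r_lo, high * r_hi
    return round(low), round(high)

for users in (1_000, 5_000, 10_000):
    print(f"{users:,} active -> {concurrent_queries(users)} concurrent queries")
# (11, 32), (54, 162), (108, 324) -- matching the table rows
```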

Step 4: Base GPU Requirements

Standard Context (8K) Scenario:

| Active Users | Concurrent Queries | H100 Capacity | Base GPUs Needed |
|---|---|---|---|
| 1,000 | 11-32 | 6 requests | 2-6 GPUs |
| 5,000 | 54-162 | 6 requests | 9-27 GPUs |
| 10,000 | 108-324 | 6 requests | 18-54 GPUs |

Extended Context (32K) Scenario:

| Active Users | Concurrent Queries | H100 Capacity | Base GPUs Needed |
|---|---|---|---|
| 1,000 | 11-32 | 1.5 requests | 7-21 GPUs |
| 5,000 | 54-162 | 1.5 requests | 36-108 GPUs |
| 10,000 | 108-324 | 1.5 requests | 72-216 GPUs |
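
Both tables reduce to one line of arithmetic; a sketch using the Step 1 capacities:

```python
import math

# Base GPUs = concurrent queries / per-GPU request capacity, rounded up.
CAPACITY = {"4K": 12, "8K": 6, "16K": 3, "32K": 1.5}  # requests per H100

def base_gpus(concurrent_queries: int, context: str) -> int:
    return math.ceil(concurrent_queries / CAPACITY[context])

print(base_gpus(162, "8K"))   # 27  -> 5,000-user row, high end, 8K
print(base_gpus(324, "32K"))  # 216 -> 10,000-user row, high end, 32K
```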

Step 5: Enterprise Reliability Multipliers

Production Environment Requirements:

  • Failover redundancy: 1.5x (50% backup capacity)
  • Geographic distribution: 1.3x (multi-region latency optimization)
  • Traffic burst handling: 1.4x (unexpected load spikes)
  • Maintenance windows: 1.1x (rolling updates)
  • Performance buffer: 1.2x (load balancing inefficiencies)

Total Enterprise Multiplier: 1.5 × 1.3 × 1.4 × 1.1 × 1.2 ≈ 3.6x (the deployment tables below apply a slightly lower ~3.4x)
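
A sketch composing the multiplier with the base counts; the ~$5,400/H100-month rate is the one implied by the cost columns in Step 6:

```python
import math

FACTORS = [1.5, 1.3, 1.4, 1.1, 1.2]  # reliability multipliers above
print(f"{math.prod(FACTORS):.2f}x")  # 3.60x; the tables below apply ~3.4x

GPU_MONTH_USD = 5_400  # implied by the monthly-cost columns in Step 6

def production_sizing(base_gpus: int, multiplier: float = 3.4):
    gpus = round(base_gpus * multiplier)
    return gpus, gpus * GPU_MONTH_USD

print(production_sizing(2), production_sizing(6))
# (7, 37800) (20, 108000) -> the "7-20 GPUs, $38K-$108K" row in Step 6
```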

Step 6: Production Deployment Requirements

Standard Enterprise Deployment (8K Context)

| Active Users | Base GPUs | Production GPUs | Monthly Cost* |
|---|---|---|---|
| 1,000 | 2-6 | 7-20 | $38K-$108K |
| 5,000 | 9-27 | 31-92 | $167K-$497K |
| 10,000 | 18-54 | 61-184 | $330K-$993K |

Knowledge-Intensive Deployment (32K Context)

| Active Users | Base GPUs | Production GPUs | Monthly Cost* |
|---|---|---|---|
| 1,000 | 7-21 | 24-71 | $130K-$383K |
| 5,000 | 36-108 | 122-367 | $659K-$1.98M |
| 10,000 | 72-216 | 245-735 | $1.32M-$3.97M |

*Based on ~$5,400 per H100 per month (enterprise cloud pricing)

Step 7: Workload-Specific Scenarios

Light Usage Pattern (Customer Support)

  • Query frequency: 3-5 queries/user/day
  • Concurrent rate: 8-12% of active users
  • Context needs: 4K-8K tokens

Medium Usage Pattern (Knowledge Work)

  • Query frequency: 8-15 queries/user/day
  • Concurrent rate: 15-20% of active users
  • Context needs: 8K-16K tokens

Heavy Usage Pattern (Development/Research)

  • Query frequency: 20-40 queries/user/day
  • Concurrent rate: 25-35% of active users
  • Context needs: 16K-32K tokens
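
These profiles can be kept as data for what-if planning; a small sketch (the values are copied from the bullets above, the structure is illustrative):

```python
# Workload profiles from this step, as (low, high) planning ranges.
PROFILES = {
    "light (support)":      {"queries_per_user_day": (3, 5),   "concurrent_rate": (0.08, 0.12)},
    "medium (knowledge)":   {"queries_per_user_day": (8, 15),  "concurrent_rate": (0.15, 0.20)},
    "heavy (dev/research)": {"queries_per_user_day": (20, 40), "concurrent_rate": (0.25, 0.35)},
}

active_users = 1_000
for name, profile in PROFILES.items():
    lo, hi = profile["queries_per_user_day"]
    print(f"{name}: {active_users * lo:,}-{active_users * hi:,} queries/day")
```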

Step 8: Realistic Deployment Scenarios

Conservative Enterprise (1K Active Users)

Use Case: Internal productivity tool, light-medium usage

  • Context: 8K tokens
  • Usage pattern: Medium (15% concurrent rate)
  • Deployment: 7-20 H100 GPUs
  • Annual Cost: $456K-$1.30M

Growth Enterprise (5K Active Users)

Use Case: Customer-facing application, medium usage

  • Context: 8K-16K tokens
  • Usage pattern: Medium-Heavy (20% concurrent rate)
  • Deployment: 31-180 H100 GPUs
  • Architecture: Multi-region deployment
  • Annual Cost: $2.00M-$11.6M

Large Enterprise (10K Active Users)

Use Case: Mission-critical platform, heavy usage

  • Context: 16K-32K tokens
  • Usage pattern: Heavy (25-30% concurrent rate)
  • Deployment: 184-735 H100 GPUs
  • Architecture: Global deployment, maximum reliability
  • Annual Cost: $11.9M-$47.6M

Step 9: Optimization Impact Analysis

Memory Optimization Stack Effects:

  • Grouped Query Attention (GQA): Reduces GPU requirements by 60-75%
  • PagedAttention: Increases effective capacity by 40-60%
  • Dynamic batching: Improves throughput by 150-250%
  • INT8 quantization: Additional 30-40% memory savings

Optimized Requirements (with full optimization stack):

| Active Users | Context | Optimized GPUs | Cost Reduction |
|---|---|---|---|
| 1,000 | 8K | 3-8 GPUs | 70-80% |
| 5,000 | 8K | 12-37 GPUs | 65-75% |
| 10,000 | 8K | 24-74 GPUs | 60-70% |
| 10,000 | 32K | 98-294 GPUs | 60-70% |
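
In GPU-count terms the table amounts to a single effective reduction applied to the Step 6 production figures; a sketch (the ~60% effective value that reproduces the first row is an inference, lower than the headline 70-80% cost reduction):

```python
# Effective GPU-count reduction applied to production sizing (illustrative).
def optimized_gpus(production_gpus: int, reduction: float = 0.60) -> int:
    return max(1, round(production_gpus * (1 - reduction)))

print(optimized_gpus(7), optimized_gpus(20))  # 3, 8 -> the 1,000-user 8K row
```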

Final Recommendation Matrix

| Deployment Type | 1K Active | 5K Active | 10K Active |
|---|---|---|---|
| Minimal Viable (4K-8K) | 3-8 GPUs | 12-37 GPUs | 24-74 GPUs |
| Standard Enterprise (8K-16K) | 7-35 GPUs | 31-180 GPUs | 61-350 GPUs |
| Premium/Global (16K-32K) | 24-71 GPUs | 122-550 GPUs | 245-735 GPUs |


Realistic Infrastructure Planning: 1K, 5K, 10K Peak Concurrent Users

Step 1: Peak Concurrent User Infrastructure Requirements

Target Peak Concurrent Loads:

  • 1,000 peak concurrent users
  • 5,000 peak concurrent users
  • 10,000 peak concurrent users

H100 Capacity per Context Length:

  • 4K context: 12 concurrent requests per H100
  • 8K context: 6 concurrent requests per H100
  • 16K context: 3 concurrent requests per H100
  • 32K context: 1.5 concurrent requests per H100

Step 2: Base GPU Requirements for Peak Load

Infrastructure Needed for Peak Concurrent Users:

| Peak Concurrent | 8K Context | 16K Context | 32K Context |
|---|---|---|---|
| 1,000 users | 167 H100s | 334 H100s | 667 H100s |
| 5,000 users | 834 H100s | 1,667 H100s | 3,334 H100s |
| 10,000 users | 1,667 H100s | 3,334 H100s | 6,667 H100s |
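
The same one-liner as before, now driven directly by peak concurrent users:

```python
import math

# H100s needed to hold N concurrent requests open, per Step 1 capacities.
CAPACITY = {"8K": 6, "16K": 3, "32K": 1.5}  # concurrent requests per H100

def h100s_for_peak(peak_concurrent: int, context: str) -> int:
    return math.ceil(peak_concurrent / CAPACITY[context])

print(h100s_for_peak(1_000, "8K"),    # 167
      h100s_for_peak(5_000, "16K"),   # 1,667
      h100s_for_peak(10_000, "32K"))  # 6,667
```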

Step 3: Enterprise Production Multipliers

Production Environment Requirements:

  • Failover redundancy: 1.5x (geographic backup)
  • Traffic burst handling: 1.4x (demand spikes beyond peak)
  • Maintenance windows: 1.2x (rolling updates)
  • Performance buffer: 1.3x (load balancing overhead)

Total Enterprise Multiplier: 1.5 × 1.4 × 1.2 × 1.3 ≈ 3.3x

Step 4: Production Deployment Infrastructure

Enterprise-Ready Infrastructure Requirements:

| Peak Concurrent | 8K Context | 16K Context | 32K Context |
|---|---|---|---|
| 1,000 users | 551 H100s | 1,102 H100s | 2,201 H100s |
| 5,000 users | 2,752 H100s | 5,501 H100s | 11,002 H100s |
| 10,000 users | 5,501 H100s | 11,002 H100s | 22,001 H100s |

Step 5: Reverse Calculation - Total Active User Capacity

Enterprise Usage Patterns (Reverse Analysis):

  • Peak concurrent rate: 12-18% of simultaneous users
  • Peak simultaneous rate: 20-30% of peak window users
  • Peak window rate: 45-60% of active users online

Aggressive Calculation (Upper-Bound Rates):

Active Users = Peak Concurrent ÷ (0.18 × 0.30 × 0.60)
Active Users ≈ Peak Concurrent ÷ 0.032
Active Users ≈ Peak Concurrent × 31.25

Conservative Calculation (Lower-Bound Rates):

Active Users = Peak Concurrent ÷ (0.12 × 0.20 × 0.45)
Active Users ≈ Peak Concurrent ÷ 0.011
Active Users ≈ Peak Concurrent × 91
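
Both formulas in one sketch:

```python
# Reverse funnel: active users supported per unit of peak-concurrent capacity.
def active_users(peak_concurrent: int, rates: tuple[float, float, float]) -> int:
    concurrent, simultaneous, window = rates
    return round(peak_concurrent / (concurrent * simultaneous * window))

AGGRESSIVE = (0.18, 0.30, 0.60)    # high concurrency -> fewer users per slot
CONSERVATIVE = (0.12, 0.20, 0.45)  # low concurrency -> more users per slot

print(active_users(1_000, AGGRESSIVE))    # 30,864 (the text rounds to 31,250)
print(active_users(1_000, CONSERVATIVE))  # 92,593 (the text rounds to 91,000)
```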

Step 6: Active User Capacity Analysis

Total Active Users Supported by Infrastructure:

| Peak Concurrent | Conservative Estimate | Aggressive Estimate | Realistic Range |
|---|---|---|---|
| 1,000 users | 91,000 active users | 31,250 active users | 30K-90K active |
| 5,000 users | 455,000 active users | 156,250 active users | 150K-450K active |
| 10,000 users | 910,000 active users | 312,500 active users | 300K-900K active |

Step 7: Infrastructure Investment Scenarios

Scenario A: Support 1K Peak Concurrent Users

Standard Deployment (8K Context):

  • Production GPUs: 551 H100s
  • Monthly Cost: $2.98M
  • Active User Capacity: 30,000-90,000 users
  • Cost per Active User: $33-$99/month

Knowledge-Intensive (32K Context):

  • Production GPUs: 2,201 H100s
  • Monthly Cost: $11.9M
  • Active User Capacity: 30,000-90,000 users
  • Cost per Active User: $132-$396/month

Scenario B: Support 5K Peak Concurrent Users

Standard Deployment (8K Context):

  • Production GPUs: 2,752 H100s
  • Monthly Cost: $14.9M
  • Active User Capacity: 150,000-450,000 users
  • Cost per Active User: $33-$99/month

Knowledge-Intensive (32K Context):

  • Production GPUs: 11,002 H100s
  • Monthly Cost: $59.4M
  • Active User Capacity: 150,000-450,000 users
  • Cost per Active User: $132-$396/month

Scenario C: Support 10K Peak Concurrent Users

Standard Deployment (8K Context):

  • Production GPUs: 5,501 H100s
  • Monthly Cost: $29.7M
  • Active User Capacity: 300,000-900,000 users
  • Cost per Active User: $33-$99/month

Knowledge-Intensive (32K Context):

  • Production GPUs: 22,001 H100s
  • Monthly Cost: $119M
  • Active User Capacity: 300,000-900,000 users
  • Cost per Active User: $132-$396/month

Step 8: Business Model Implications

Revenue Requirements for Viability:

| Peak Concurrent | Context | Monthly Cost | Break-even per User* | Users Needed** |
|---|---|---|---|---|
| 1K | 8K | $2.98M | $99 | 30,000+ |
| 1K | 32K | $11.9M | $396 | 30,000+ |
| 5K | 8K | $14.9M | $99 | 150,000+ |
| 5K | 32K | $59.4M | $396 | 150,000+ |
| 10K | 8K | $29.7M | $99 | 300,000+ |
| 10K | 32K | $119M | $396 | 300,000+ |

*Conservative pricing for break-even.
**Minimum active users for economic viability.
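
The break-even column is simply cost spread over the minimum user base; a sketch:

```python
# Break-even monthly price per user = monthly infra cost / minimum active users.
def break_even_price(monthly_cost_usd: float, min_active_users: int) -> float:
    return monthly_cost_usd / min_active_users

print(f"${break_even_price(2_980_000, 30_000):.0f}")   # $99  (1K peak, 8K)
print(f"${break_even_price(11_900_000, 30_000):.0f}")  # $397 (1K peak, 32K; table rounds to $396)
```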

Step 9: Optimization Impact on Capacity

Memory Optimization Stack Benefits:

  • Grouped Query Attention (GQA): 4x capacity increase
  • PagedAttention: 60% efficiency improvement
  • Dynamic batching: 2.5x throughput improvement
  • Combined optimization: 6-8x capacity increase
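
A sketch of how the combined factor scales the capacity ranges (6-8x is this section's combined estimate, deliberately below the naive product 4 × 1.6 × 2.5 = 16x, presumably because the gains overlap):

```python
# Scale the standard active-user capacity range by the combined 6-8x factor.
def optimized_capacity(lo: int, hi: int, f_lo: int = 6, f_hi: int = 8):
    return lo * f_lo, hi * f_hi

print(optimized_capacity(30_000, 90_000))  # (180000, 720000) -> 180K-720K row
```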

Optimized Active User Capacity:

| Peak Concurrent | Standard Capacity | Optimized Capacity | Improvement |
|---|---|---|---|
| 1K | 30K-90K users | 180K-720K users | 6-8x |
| 5K | 150K-450K users | 900K-3.6M users | 6-8x |
| 10K | 300K-900K users | 1.8M-7.2M users | 6-8x |

Step 10: Practical Deployment Recommendations

Conservative Enterprise Approach

Target: 1K peak concurrent, 8K context

  • Infrastructure: 551 H100s ($2.98M/month)
  • User capacity: 30,000-90,000 active users
  • Business model: $50-100/user/month subscription

Growth-Oriented Approach

Target: 5K peak concurrent, 8K context

  • Infrastructure: 2,752 H100s ($14.9M/month)
  • User capacity: 150,000-450,000 active users
  • Business model: $35-99/user/month subscription

Scale Platform Approach

Target: 10K peak concurrent, optimized deployment

  • Infrastructure: 1,375 H100s with optimization ($7.4M/month)
  • User capacity: 1.8M-7.2M active users
  • Business model: $5-25/user/month freemium model

Final Infrastructure Planning Matrix

| Target Peak | Recommended Context | Production GPUs | Monthly Cost | Active User Capacity |
|---|---|---|---|---|
| 1K Concurrent | 8K-16K | 550-1,100 | $3.0M-$5.9M | 30K-90K active |
| 5K Concurrent | 8K-16K | 2,750-5,500 | $14.9M-$29.7M | 150K-450K active |
| 10K Concurrent | 8K-16K | 5,500-11,000 | $29.7M-$59.4M | 300K-900K active |