LRU Cache Operational Runbook
Version: 1.0
Date: 2026-04-07
Status: Production Ready
Quick Reference
Configuration
Environment Variable:
export EKA_CI_GRAPH_LRU_CAPACITY=100000
Config File (~/.config/ekaci/ekaci.toml):
graph_lru_capacity = 100000
Default: 100,000 nodes
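As a rough illustration of how the default applies, the sketch below resolves a capacity from an optional raw string and falls back to 100,000 when the value is missing, not a number, or zero. The names `resolve_capacity` and `capacity_from_env` and the fallback-on-invalid behaviour are assumptions for illustration; this runbook does not state how eka-ci validates the setting or which source wins when both the variable and the config file are set.

```rust
/// Default capacity documented in this runbook.
const DEFAULT_LRU_CAPACITY: usize = 100_000;

/// Hypothetical helper: turn the raw value of EKA_CI_GRAPH_LRU_CAPACITY
/// into a capacity, using the default when it is unset, not a number, or zero.
fn resolve_capacity(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(DEFAULT_LRU_CAPACITY)
}

/// Reads the variable from the process environment (assumed lookup order:
/// environment only; the config file is not consulted in this sketch).
fn capacity_from_env() -> usize {
    let raw = std::env::var("EKA_CI_GRAPH_LRU_CAPACITY").ok();
    resolve_capacity(raw.as_deref())
}
```

Keeping the parsing separate from the environment lookup makes the fallback rules easy to check without touching process state.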
Key Metrics
| Metric | Description | Healthy Range |
|---|---|---|
| eka_ci_graph_cache_utilization | Cache fullness (0.0-1.0) | 0.5 - 0.8 |
| eka_ci_graph_cache_reloads_total | Cache misses (counter) | < 100/day |
| eka_ci_graph_pinned_nodes_total | Protected nodes | 50 - 500 |
| eka_ci_graph_nodes_total | Total nodes | < capacity |
Log Messages
Normal Operation:
INFO Cache status: 45000/100000 nodes (45.0% utilized), 123 pinned
Warning (80% utilization):
WARN Cache utilization elevated (82.3%): Monitor for potential capacity issues
Critical (90% utilization):
WARN Cache utilization HIGH (93.1%): Consider increasing EKA_CI_GRAPH_LRU_CAPACITY (current: 100000)
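When collecting logs for support it can help to pull the numbers out of the status line. The sketch below parses a line shaped like the "Normal Operation" sample above. The struct and function names are hypothetical, and the parser assumes the exact layout of that sample; it is not taken from eka-ci.

```rust
/// Fields carried by a "Cache status" log line.
#[derive(Debug, PartialEq)]
struct CacheStatus {
    nodes: u64,
    capacity: u64,
    pinned: u64,
}

impl CacheStatus {
    /// Utilization as a fraction in 0.0..=1.0 (0.0 when capacity is zero).
    fn utilization(&self) -> f64 {
        if self.capacity == 0 {
            0.0
        } else {
            self.nodes as f64 / self.capacity as f64
        }
    }
}

/// Parses a line such as
/// "INFO Cache status: 45000/100000 nodes (45.0% utilized), 123 pinned".
/// Returns None if the line does not match that shape.
fn parse_cache_status(line: &str) -> Option<CacheStatus> {
    let rest = line.split("Cache status: ").nth(1)?;
    let (counts, tail) = rest.split_once(" nodes")?;
    let (nodes, capacity) = counts.split_once('/')?;
    // The pinned count is the last word before the trailing " pinned".
    let pinned = tail.trim_end().strip_suffix(" pinned")?.rsplit(' ').next()?;
    Some(CacheStatus {
        nodes: nodes.trim().parse().ok()?,
        capacity: capacity.trim().parse().ok()?,
        pinned: pinned.parse().ok()?,
    })
}
```

Lines that do not contain "Cache status: ", such as the WARN messages, yield `None` instead of an error.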
Monitoring
Grafana Dashboard
Panel 1: Cache Utilization (Gauge)
eka_ci_graph_cache_utilization * 100
- Unit: Percent
- Thresholds:
- Green: < 70%
- Yellow: 70-85%
- Red: > 85%
Panel 2: Cache Size (Graph)
sum(eka_ci_graph_nodes_total)
- Unit: Nodes
- Show: Current, Max capacity
Panel 3: Cache Reload Rate (Graph)
rate(eka_ci_graph_cache_reloads_total[5m]) * 60
- Unit: Reloads/min
- Alert: > 10/min for 15 minutes
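The Panel 3 query converts a per-second counter rate into reloads per minute. A simplified version of that arithmetic, for two raw counter samples, might look like the sketch below. The function names are hypothetical, and the reset handling is a simplification; Prometheus `rate()` also extrapolates to the window edges. The "for 15 minutes" part of the alert is not modelled.

```rust
/// Simplified per-minute rate between two samples of a monotonically
/// increasing counter. If the counter went backwards (for example after a
/// process restart), the current value is taken as the increase since the reset.
fn reloads_per_minute(prev: u64, curr: u64, elapsed_secs: f64) -> Option<f64> {
    if elapsed_secs <= 0.0 {
        return None;
    }
    let increase = if curr >= prev { curr - prev } else { curr };
    Some(increase as f64 * 60.0 / elapsed_secs)
}

/// Panel 3 alert threshold from this runbook: more than 10 reloads/min.
fn exceeds_alert_threshold(per_minute: f64) -> bool {
    per_minute > 10.0
}
```

For example, a counter that moves from 1000 to 1100 over a 5-minute window corresponds to 20 reloads/min, which is above the alert threshold.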
Panel 4: Reload Latency (Graph)
histogram_quantile(0.50, rate(eka_ci_graph_cache_reload_duration_seconds_bucket[5m]))
histogram_quantile(0.90, rate(eka_ci_graph_cache_reload_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(eka_ci_graph_cache_reload_duration_seconds_bucket[5m]))
- Unit: Seconds
- Labels: p50, p90, p99
Panel 5: Pinned Nodes (Stat)
eka_ci_graph_pinned_nodes_total
- Unit: Nodes
- Description: Active builds
Panel 6: Eviction Candidates by Tier (Stacked Graph)
eka_ci_graph_eviction_candidates_total{tier="tier1_transitive_failure"}
eka_ci_graph_eviction_candidates_total{tier="tier2_completed_failure"}
eka_ci_graph_eviction_candidates_total{tier="tier3_completed_success"}
Key Performance Indicators (KPIs)
Healthy System:
- Utilization: 50-70%
- Reload rate: < 5/min
- Reload latency (p99): < 50ms
- Pinned nodes: 50-200
Concerning:
- Utilization: > 80%
- Reload rate: > 10/min
- Reload latency (p99): > 100ms
- Pinned nodes: > 1000
Critical:
- Utilization: > 90%
- Reload rate: > 50/min
- Reload latency (p99): > 500ms
- Cache thrashing
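The three KPI bands above can be expressed as a small classifier. This is a sketch under assumptions: the type and function names are hypothetical, any single metric crossing a threshold is enough to escalate, and values that fall between the "Healthy" ranges and the "Concerning" thresholds (for example 75% utilization) are treated as healthy because the runbook does not assign them a band.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Health {
    Healthy,
    Concerning,
    Critical,
}

/// One point-in-time reading of the cache metrics.
struct Snapshot {
    utilization: f64,     // fraction, 0.0..=1.0
    reloads_per_min: f64, // reload rate
    reload_p99_ms: f64,   // p99 reload latency in milliseconds
    pinned_nodes: u64,    // currently pinned nodes
}

/// Applies the thresholds listed under "Concerning" and "Critical".
fn classify(s: &Snapshot) -> Health {
    if s.utilization > 0.90 || s.reloads_per_min > 50.0 || s.reload_p99_ms > 500.0 {
        Health::Critical
    } else if s.utilization > 0.80
        || s.reloads_per_min > 10.0
        || s.reload_p99_ms > 100.0
        || s.pinned_nodes > 1000
    {
        Health::Concerning
    } else {
        Health::Healthy
    }
}
```

A snapshot with 85% utilization and otherwise normal values is classified as `Concerning`; a reload rate of 60/min is `Critical` regardless of utilization.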
Capacity Tuning
Determining Optimal Capacity
Formula:
Optimal Capacity = (Peak Node Count × 1.5) + Buffer
Example:
- Peak node count: 60,000
- Optimal capacity: 60,000 × 1.5 = 90,000
- Add buffer: 90,000 + 10,000 = 100,000
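The formula and example above can be checked with a few lines of integer arithmetic. The function name `optimal_capacity` is hypothetical; rounding half a node up for odd peaks is an assumption of this sketch.

```rust
/// Optimal Capacity = (Peak Node Count × 1.5) + Buffer, in whole nodes.
/// peak × 1.5 is computed as peak + ceil(peak / 2) to stay in integers.
fn optimal_capacity(peak_nodes: u64, buffer: u64) -> u64 {
    peak_nodes + (peak_nodes + 1) / 2 + buffer
}
```

With a peak of 60,000 nodes and a 10,000-node buffer this reproduces the 100,000 figure from the example.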
Capacity Sizing Guide
| Workload | Node Count | Recommended Capacity | Memory Usage |
|---|---|---|---|
| Small | < 10k | 20,000 | ~22 MB |
| Medium | 10k - 50k | 75,000 | ~83 MB |
| Large | 50k - 100k | 150,000 | ~165 MB |
| Very Large | 100k - 200k | 300,000 | ~330 MB |
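The memory column in the table works out to roughly 1.1 KB per node of capacity (22 MB / 20,000 nodes). The sketch below encodes that ratio and the table lookup. The per-node constant is derived from the table, not measured; the function names are hypothetical; and how the band boundaries are assigned (for example, exactly 50,000 nodes counting as "Large") is an assumption.

```rust
/// Approximate bytes per node of capacity, derived from the sizing table
/// (22 MB for 20,000 nodes). An estimate, not a measured value.
const BYTES_PER_NODE_ESTIMATE: u64 = 1_100;

/// Estimated memory use in MB (1 MB = 1,000,000 bytes) for a given capacity.
fn estimated_memory_mb(capacity: u64) -> f64 {
    (capacity * BYTES_PER_NODE_ESTIMATE) as f64 / 1_000_000.0
}

/// Recommended capacity for a workload's node count, following the table.
/// Returns None above 200k nodes, which the table does not cover.
fn recommended_capacity(node_count: u64) -> Option<u64> {
    match node_count {
        0..=9_999 => Some(20_000),
        10_000..=49_999 => Some(75_000),
        50_000..=99_999 => Some(150_000),
        100_000..=200_000 => Some(300_000),
        _ => None,
    }
}
```

For a capacity of 75,000 the estimate is 82.5 MB, which the table rounds to ~83 MB.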
Increasing Capacity
When to increase:
- Utilization consistently > 80%
- Reload rate > 10/min
- Warnings in logs every 5 minutes
How to increase:
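1. Calculate new capacity:
   New Capacity = Current Capacity × 1.5
2. Set environment variable:
   export EKA_CI_GRAPH_LRU_CAPACITY=150000
   Note: a variable exported in an interactive shell is not inherited by a systemd-managed service. If eka-ci runs under systemd, set the value in the unit's environment (for example an Environment= line) or in the config file instead.
3. Restart service:
   systemctl restart eka-ci
4. Monitor for 1 hour:
   eka_ci_graph_cache_utilization
5. Verify:
   - Utilization < 70%
   - Reload rate < 5/min
   - No warnings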
Decreasing Capacity
When to decrease:
- Utilization consistently < 30%
- Memory usage high (> 200 MB)
- Zero cache reloads for 24+ hours
How to decrease:
1. Calculate new capacity:
   New Capacity = Peak Node Count × 1.3
2. Set environment variable:
   export EKA_CI_GRAPH_LRU_CAPACITY=75000
3. Restart service:
   systemctl restart eka-ci
4. Monitor closely for 24 hours:
   - Watch reload rate (should stay < 10/min)
   - Monitor utilization (should be 50-70%)
Troubleshooting
Problem 1: High Utilization (> 90%)
Symptoms:
- Log warnings every 5 minutes
- Potential cache thrashing
- Slow build dispatch
Diagnosis:
# Check utilization
eka_ci_graph_cache_utilization
# Check growth over the last hour (treats eka_ci_graph_nodes_total as a gauge)
sum(delta(eka_ci_graph_nodes_total[1h]))
Solution:
1. Immediate: Increase capacity by 50%
   export EKA_CI_GRAPH_LRU_CAPACITY=150000
   systemctl restart eka-ci
2. Long-term: Calculate proper capacity based on workload
Prevention:
- Set alert for 85% utilization
- Review capacity quarterly
Problem 2: High Reload Rate (> 10/min)
Symptoms:
- Frequent cache misses
- Elevated database load
- Slow API responses
Diagnosis:
# Reload rate
rate(eka_ci_graph_cache_reloads_total[5m]) * 60
# Which nodes are being reloaded?
# Check logs for "Cache miss: reloading"
Possible Causes:
Cause 1: Capacity Too Small
- Utilization > 85%
- Solution: Increase capacity
Cause 2: Workload Pattern Changed
- Many terminal nodes evicted, then accessed again
- Solution: Increase tier age thresholds
Cause 3: Hot Path Not Protected
- is_buildable() nodes being evicted
- Solution: Ensure touch_buildable_check() is called
Problem 3: High Memory Usage
Symptoms:
- Process memory > 500 MB
- OOM risk
- Swap usage
Diagnosis:
# Memory estimate
eka_ci_graph_memory_bytes_estimate
# Utilization
eka_ci_graph_cache_utilization
Solutions:
If utilization < 50%:
- Cause: Capacity too large
- Fix: Decrease capacity to match peak workload
If utilization > 80%:
- Cause: Legitimate high usage
- Fix: Add more RAM or optimize elsewhere
Problem 4: Zero Reloads Despite Low Utilization
Symptoms:
- Utilization < 30%
- Zero cache reloads for days
- High memory usage
Diagnosis:
# Reload count
eka_ci_graph_cache_reloads_total
# Utilization
eka_ci_graph_cache_utilization
Cause: Capacity oversized
Solution:
- Decrease capacity to improve efficiency
- Free up memory for other services
Problem 5: Slow Reload Latency (p99 > 100ms)
Symptoms:
- High reload latency
- Slow API responses
- Database contention
Diagnosis:
# Reload latency
histogram_quantile(0.99, rate(eka_ci_graph_cache_reload_duration_seconds_bucket[5m]))
# Reload rate
rate(eka_ci_graph_cache_reloads_total[5m])
Possible Causes:
Cause 1: High Reload Rate
- Too many concurrent reloads
- Database overwhelmed
- Solution: Increase capacity to reduce reload frequency
Cause 2: Database Slow
- Check database metrics
- Optimize queries
- Add indexes if needed
Cause 3: Large Nodes
- Nodes with many dependencies
- Solution: Optimize edge loading (future work)
Alerts
Prometheus Alert Rules
groups:
  - name: lru_cache_alerts
    rules:
      # Critical: High utilization
      - alert: LRUCacheUtilizationHigh
        expr: eka_ci_graph_cache_utilization > 0.90
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "LRU cache utilization is high ({{ $value | humanizePercentage }})"
          description: "Cache is {{ $value | humanizePercentage }} full. Consider increasing capacity."

      # Warning: Elevated utilization
      - alert: LRUCacheUtilizationElevated
        expr: eka_ci_graph_cache_utilization > 0.80
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "LRU cache utilization is elevated ({{ $value | humanizePercentage }})"
          description: "Cache is {{ $value | humanizePercentage }} full. Monitor for growth."

      # Critical: High reload rate
      - alert: LRUCacheReloadRateHigh
        expr: rate(eka_ci_graph_cache_reloads_total[5m]) * 60 > 10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Cache reload rate is high ({{ $value }} reloads/min)"
          description: "Frequent cache misses detected. Capacity may be too small."

      # Warning: Slow reloads
      - alert: LRUCacheReloadSlow
        expr: histogram_quantile(0.99, rate(eka_ci_graph_cache_reload_duration_seconds_bucket[5m])) > 0.1
        for: 15m
        labels:
          severity: info
        annotations:
          summary: "Cache reloads are slow (p99: {{ $value }}s)"
          description: "Database may be under load or capacity is causing thrashing."

      # Info: Many pinned nodes
      - alert: LRUCacheManyPinnedNodes
        expr: eka_ci_graph_pinned_nodes_total > 1000
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "Many nodes pinned ({{ $value }})"
          description: "High number of active builds. This is normal during large builds."
Performance Optimization
Best Practices
1. Set Capacity to 1.5× Peak Usage
   - Provides headroom for growth
   - Minimizes reload rate
   - Optimal utilization: 60-70%
2. Call touch_buildable_check() After is_buildable()
   - Protects hot path nodes
   - Prevents thrashing on active builds

   if graph_handle.is_buildable(&drv_id) {
       graph_handle.touch_buildable_check(&drv_id);
       // ... dispatch build ...
   }
3. Monitor Utilization Trends
   - Review every quarter
   - Adjust capacity as workload changes
   - Plan for growth
4. Avoid Frequent Restarts
   - LRU cache is warmed up over time
   - Restarts cause cold cache (100% reload rate initially)
   - Allow 1 hour for warmup
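To see why touching a node after a buildable check protects it, consider a toy LRU. This is an illustrative sketch only: `ToyLru` is a hypothetical type with a plain recency queue, and it does not model eka-ci's tiers, pinning, or actual data structures.

```rust
use std::collections::VecDeque;

/// Toy LRU over string keys: front = least recently used, back = most recent.
/// Not eka-ci's implementation; it only illustrates the effect of touching.
struct ToyLru {
    capacity: usize,
    order: VecDeque<String>,
}

impl ToyLru {
    fn new(capacity: usize) -> Self {
        Self { capacity, order: VecDeque::new() }
    }

    /// Marks `key` as most recently used, inserting it if absent.
    /// Returns the evicted key when the insert pushes the cache over capacity.
    fn touch(&mut self, key: &str) -> Option<String> {
        if let Some(pos) = self.order.iter().position(|k| k.as_str() == key) {
            self.order.remove(pos);
        }
        self.order.push_back(key.to_string());
        if self.order.len() > self.capacity {
            self.order.pop_front()
        } else {
            None
        }
    }

    fn contains(&self, key: &str) -> bool {
        self.order.iter().any(|k| k.as_str() == key)
    }
}
```

With capacity 2, inserting "a" and "b", touching "a" again, and then inserting "c" evicts "b" and keeps "a". Without the extra touch, "a" would have been the oldest entry and would have been evicted instead.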
Capacity Planning
Formula for Growth:
Future Capacity = Current Peak × Growth Factor × Headroom
Where:
- Growth Factor = Expected growth (1.2 = 20% growth)
- Headroom = Safety margin (1.5 = 50% headroom)
Example:
- Current peak: 50,000 nodes
- Expected 20% growth: 50,000 × 1.2 = 60,000
- With 50% headroom: 60,000 × 1.5 = 90,000
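The growth formula can be written as a one-line function to rerun the example with other inputs. The name `future_capacity` is hypothetical, and rounding to the nearest whole node is an assumption of this sketch.

```rust
/// Future Capacity = Current Peak × Growth Factor × Headroom,
/// rounded to the nearest whole node.
fn future_capacity(current_peak: u64, growth_factor: f64, headroom: f64) -> u64 {
    (current_peak as f64 * growth_factor * headroom).round() as u64
}
```

Calling `future_capacity(50_000, 1.2, 1.5)` gives the 90,000 nodes shown in the example.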
Common Scenarios
Scenario 1: Large Build (200k drvs)
Expected Behavior:
- Utilization rises to 80-90%
- Pinned nodes: 500-2000 (active builds)
- Reload rate: 5-10/min (terminal nodes evicted)
- Warnings logged (normal)
Action: Monitor, no action needed unless reload rate > 20/min
Scenario 2: Idle System
Expected Behavior:
- Utilization: 5-10% (only completed builds)
- Pinned nodes: 0-5
- Reload rate: 0/min
- No warnings
Action: Consider decreasing capacity to save memory
Scenario 3: Continuous Integration
Expected Behavior:
- Utilization: 40-60% (steady state)
- Pinned nodes: 50-200 (concurrent builds)
- Reload rate: < 5/min
- No warnings
Action: Optimal state, no action needed
Maintenance
Quarterly Review
1. Check peak utilization (last 90 days):
   max_over_time(eka_ci_graph_cache_utilization[90d])
2. Check reload rate:
   avg_over_time(rate(eka_ci_graph_cache_reloads_total[1h])[90d:1h]) * 60
3. Adjust capacity if needed:
   - If peak > 80%: Increase by 50%
   - If peak < 40%: Decrease by 25%
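The quarterly adjustment rule can be sketched as a small decision function. The enum and function names are hypothetical, and leaving capacity unchanged for peaks between 40% and 80% is an assumption; the runbook only specifies the two outer cases.

```rust
#[derive(Debug, PartialEq)]
enum Adjustment {
    Increase(u64), // new capacity after a 50% increase
    Decrease(u64), // new capacity after a 25% decrease
    Keep,
}

/// Applies the quarterly review rule to the 90-day peak utilization
/// (a fraction in 0.0..=1.0).
fn quarterly_adjustment(current_capacity: u64, peak_utilization: f64) -> Adjustment {
    if peak_utilization > 0.80 {
        Adjustment::Increase(current_capacity + current_capacity / 2)
    } else if peak_utilization < 0.40 {
        Adjustment::Decrease(current_capacity - current_capacity / 4)
    } else {
        Adjustment::Keep
    }
}
```

With the default capacity of 100,000, a peak of 85% suggests 150,000 and a peak of 30% suggests 75,000.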
Version Upgrades
Before upgrade:
- Note current capacity setting
- Export metrics for comparison
After upgrade:
- Verify capacity setting persists
- Compare metrics (should be similar)
- Monitor for 24 hours
Emergency Procedures
Cache Thrashing (Reload Rate > 50/min)
Immediate Action:
1. Double capacity:
   export EKA_CI_GRAPH_LRU_CAPACITY=200000
   systemctl restart eka-ci
2. Monitor for 15 minutes
3. If still thrashing, double again
Follow-up:
- Investigate root cause
- Review workload patterns
- Consider permanent capacity increase
Out of Memory
Immediate Action:
1. Restart service (clears cache):
   systemctl restart eka-ci
2. Reduce capacity by 50% and restart so the new value takes effect:
   export EKA_CI_GRAPH_LRU_CAPACITY=50000
   systemctl restart eka-ci
3. Monitor memory usage
Follow-up:
- Identify memory leak (if any)
- Right-size capacity for available RAM
- Consider adding more RAM
Support
Logs to Collect
# Cache status logs (last hour)
journalctl -u eka-ci --since "1 hour ago" | grep "Cache status"
# Warnings (last 24 hours)
journalctl -u eka-ci --since "1 day ago" | grep -E "WARN|ERROR"
# Cache misses (last hour)
journalctl -u eka-ci --since "1 hour ago" | grep "Cache miss"
Metrics to Export
# Current state
curl http://localhost:8080/metrics | grep eka_ci_graph
# Or via Prometheus query
eka_ci_graph_cache_utilization
eka_ci_graph_cache_reloads_total
eka_ci_graph_nodes_total
Summary
Key Takeaways:
- Monitor utilization - Keep between 50-80%
- Watch reload rate - Should be < 5/min normally
- Tune capacity - 1.5× peak usage is optimal
- Set alerts - For 85% utilization and high reload rate
- Review quarterly - Adjust as workload changes
Healthy System Checklist:
- ✅ Utilization: 50-70%
- ✅ Reload rate: < 5/min
- ✅ No warnings in logs
- ✅ Pinned nodes: 50-500
- ✅ Reload latency (p99): < 50ms