Intelligent Caching System¶
Overview¶
RustyBT implements a two-tier intelligent caching system that dramatically accelerates backtests by storing fetched data locally. The system automatically manages cache entries, tracks backtest linkages, and provides detailed statistics.
Architecture¶
Two-Tier Cache Design¶
- Hot Cache (In-Memory)
    - Storage: LRU cache with Polars DataFrames in memory
    - Size Limit: Configurable (default: 1GB)
    - Performance: <0.01s access time
    - Eviction: Least Recently Used (LRU)
- Cold Cache (Disk)
    - Storage: Parquet files with Snappy compression
    - Size Limit: Configurable (default: 10GB)
    - Performance: <1s access time
    - Eviction: LRU, size-based, or hybrid
Cache Flow¶
Request Data
↓
Check Hot Cache (in-memory)
↓ miss
Check Cold Cache (disk)
↓ miss
Fetch from Data Adapter
↓
Store in Hot + Cold Cache
↓
Return Data
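The same flow in miniature: a hedged sketch of the lookup order, where `hot_cache`, `cold_cache`, and `adapter` are hypothetical placeholder objects, not the actual CacheManager internals.

# Minimal sketch of the two-tier lookup order. hot_cache, cold_cache,
# and adapter are placeholder objects assumed to expose get/put/fetch;
# this is illustrative, not the real CacheManager implementation.
def get_or_fetch(cache_key, hot_cache, cold_cache, adapter):
    # 1. Hot cache: in-memory, sub-millisecond access
    df = hot_cache.get(cache_key)
    if df is not None:
        return df

    # 2. Cold cache: Parquet files on disk
    df = cold_cache.get(cache_key)
    if df is not None:
        hot_cache.put(cache_key, df)  # assumed promotion to the hot tier
        return df

    # 3. Miss in both tiers: fetch from the data adapter and store in both
    df = adapter.fetch(cache_key)
    hot_cache.put(cache_key, df)
    cold_cache.put(cache_key, df)
    return df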
Usage¶
Basic Usage¶
from rustybt.data.polars.cache_manager import CacheManager

# Initialize cache manager
cache = CacheManager(
    db_path="data/bundles/quandl/metadata.db",
    cache_directory="data/bundles/quandl/cache",
    hot_cache_size_mb=1024,    # 1GB hot cache
    cold_cache_size_mb=10240,  # 10GB cold cache
    eviction_policy="lru",
)

# Generate cache key
cache_key = cache.generate_cache_key(
    symbols=["AAPL", "MSFT"],
    start_date="2023-01-01",
    end_date="2023-12-31",
    resolution="1d",
    data_source="yfinance",
)

# Check cache (returns None on a miss)
df = cache.get_cached_data(cache_key)

if df is None:
    # Fetch from data source
    df = fetch_data_from_yfinance(...)

    # Store in cache with backtest linkage
    cache.put_cached_data(
        cache_key,
        df,
        dataset_id=1,
        backtest_id="backtest-001",
    )
Cache Statistics¶
# Get cache statistics
stats = cache.get_cache_statistics()
print(f"Hit Rate: {stats['hit_rate']:.2%}")
print(f"Total Size: {stats['total_size_mb']:.2f} MB")
print(f"Entry Count: {stats['entry_count']}")
print(f"Average Access Count: {stats['avg_access_count']:.2f}")
# Session statistics
print(f"Session Hits: {stats['session_stats']['hit_count']}")
print(f"Session Misses: {stats['session_stats']['miss_count']}")
print(f"Hot Cache Hits: {stats['session_stats']['hot_hits']}")
print(f"Cold Cache Hits: {stats['session_stats']['cold_hits']}")
# Query statistics for date range
stats = cache.get_cache_statistics(
    start_date="2023-01-01",
    end_date="2023-12-31",
)
Cache Management¶
# Clear specific cache entry
cache.clear_cache(cache_key="abc123def456")
# Clear all entries for a backtest
cache.clear_cache(backtest_id="backtest-001")
# Clear entire cache
cache.clear_cache()
# Record daily statistics
cache.record_daily_statistics()
Configuration¶
Cache Manager Options¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `db_path` | str | Required | Path to SQLite metadata database |
| `cache_directory` | str | Required | Directory for cached Parquet files |
| `hot_cache_size_mb` | int | 1024 | Hot cache size in MB (1GB) |
| `cold_cache_size_mb` | int | 10240 | Cold cache size in MB (10GB) |
| `eviction_policy` | str | "lru" | Eviction policy: "lru", "size", or "hybrid" |
Eviction Policies¶
- LRU (Least Recently Used)
    - Evicts entries with the oldest `last_accessed` timestamp
    - Best for: Temporal access patterns (recent data accessed frequently)
- Size-Based
    - Evicts the largest entries first
    - Best for: Maximizing the number of cache entries
- Hybrid
    - Combines size-based and LRU
    - Evicts by size first, then by LRU if needed
    - Best for: General-purpose caching
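To make the policies concrete, here is a small sketch of how an eviction candidate might be chosen under each policy. The entry fields mirror the cache_entries schema later on this page; this is illustrative, not the CacheManager's actual eviction code.

# Illustrative candidate selection for each eviction policy.
# Entries are dicts with last_accessed and size_bytes fields,
# mirroring the cache_entries schema; not the real implementation.
def pick_eviction_candidate(entries, policy="lru"):
    if policy == "lru":
        # Oldest last_accessed timestamp is evicted first
        return min(entries, key=lambda e: e["last_accessed"])
    if policy == "size":
        # Largest entry is evicted first
        return max(entries, key=lambda e: e["size_bytes"])
    if policy == "hybrid":
        # Prefer the largest; among equals, the least recently used
        return max(entries, key=lambda e: (e["size_bytes"], -e["last_accessed"]))
    raise ValueError(f"unknown eviction policy: {policy}")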
Cache Key Generation¶
Cache keys are deterministic SHA256 hashes (first 16 characters) of:
{
    "symbols": sorted(["AAPL", "MSFT"]),  # Sorted for consistency
    "start_date": "2023-01-01",
    "end_date": "2023-12-31",
    "resolution": "1d",
    "data_source": "yfinance"
}
Properties:
- Deterministic: Same parameters → same cache key
- Order-independent: ["AAPL", "MSFT"] = ["MSFT", "AAPL"]
- Unique: Different parameters → different cache keys
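A key with these properties can be reproduced with the standard library alone. Below is a minimal sketch of the technique (canonical JSON serialization, SHA256, truncated to 16 hex characters); the exact field layout RustyBT hashes internally is an assumption here.

import hashlib
import json

# Sketch of deterministic key generation: canonicalize the request
# parameters, serialize with sorted keys, hash, and truncate.
# The exact internal field layout is assumed, not confirmed.
def make_cache_key(symbols, start_date, end_date, resolution, data_source):
    params = {
        "symbols": sorted(symbols),  # sort for order-independence
        "start_date": start_date,
        "end_date": end_date,
        "resolution": resolution,
        "data_source": data_source,
    }
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

# Same key regardless of symbol order:
assert make_cache_key(["AAPL", "MSFT"], "2023-01-01", "2023-12-31", "1d", "yfinance") == \
       make_cache_key(["MSFT", "AAPL"], "2023-01-01", "2023-12-31", "1d", "yfinance")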
Cache Invalidation¶
Checksum-Based Validation¶
Every cached Parquet file has a SHA256 checksum stored in metadata. On cache hit:
1. Calculate the checksum of the Parquet file
2. Compare it with the stored checksum
3. On mismatch: delete the cache entry and treat the lookup as a cache miss
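The check itself needs only the standard library. A minimal sketch, assuming the stored checksum is the full SHA256 hex digest of the file contents:

import hashlib
from pathlib import Path

# Sketch of the validation step: hash the Parquet file on disk and
# compare with the checksum recorded in metadata. Helper name and
# the full-hex-digest assumption are illustrative.
def is_entry_valid(parquet_path: Path, stored_checksum: str) -> bool:
    actual = hashlib.sha256(parquet_path.read_bytes()).hexdigest()
    return actual == stored_checksum

# On a hit: if is_entry_valid(...) is False, the entry is deleted and
# the lookup falls through to the cache-miss path (re-fetch from source).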
Manual Invalidation¶
# Clear specific entry
cache.clear_cache(cache_key="abc123def456")
# Clear by backtest
cache.clear_cache(backtest_id="backtest-001")
# Clear all
cache.clear_cache()
Performance Targets¶
| Cache Type | Target | Typical Performance |
|---|---|---|
| Hot Cache Hit | <0.01s | ~0.0008s (0.8ms) |
| Cold Cache Hit | <1s | ~0.0007s (0.7ms) |
| Cache Miss + Fetch | Varies | Depends on data source |
Database Schema¶
cache_entries Table¶
| Column | Type | Description |
|---|---|---|
| `cache_key` | TEXT | Primary key (SHA256 hash, 16 chars) |
| `dataset_id` | INTEGER | Foreign key to datasets table |
| `parquet_path` | TEXT | Relative path to cached Parquet file |
| `checksum` | TEXT | SHA256 checksum of Parquet file |
| `created_at` | INTEGER | Unix timestamp |
| `last_accessed` | INTEGER | Unix timestamp |
| `access_count` | INTEGER | Number of cache hits |
| `size_bytes` | INTEGER | File size in bytes |
backtest_cache_links Table¶
| Column | Type | Description |
|---|---|---|
| `id` | INTEGER | Primary key |
| `backtest_id` | TEXT | User-provided backtest identifier |
| `cache_key` | TEXT | Foreign key to cache_entries |
| `linked_at` | INTEGER | Unix timestamp |
cache_statistics Table¶
| Column | Type | Description |
|---|---|---|
| `id` | INTEGER | Primary key |
| `stat_date` | INTEGER | Unix timestamp (day granularity) |
| `hit_count` | INTEGER | Number of cache hits |
| `miss_count` | INTEGER | Number of cache misses |
| `total_size_mb` | REAL | Total cache size in MB |
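Because the metadata lives in SQLite, these tables can also be inspected directly, outside the CacheManager API. A hedged sketch using the standard sqlite3 module, assuming the table and column names listed above:

import sqlite3

# Read-only inspection of the metadata tables described above.
# Assumes the schema matches the column listings in this section.
con = sqlite3.connect("data/bundles/quandl/metadata.db")
con.row_factory = sqlite3.Row

# Print the last 7 recorded daily statistics with derived hit rates
for row in con.execute(
    "SELECT stat_date, hit_count, miss_count, total_size_mb "
    "FROM cache_statistics ORDER BY stat_date DESC LIMIT 7"
):
    total = row["hit_count"] + row["miss_count"]
    rate = row["hit_count"] / total if total else 0.0
    print(f"{row['stat_date']}: hit_rate={rate:.2%}, size={row['total_size_mb']:.1f} MB")

con.close()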
Backtest Linkage¶
Every cache entry can be linked to one or more backtests:
# Store data with backtest linkage
cache.put_cached_data(
    cache_key,
    df,
    dataset_id=1,
    backtest_id="backtest-001",
)

# Query which backtests used specific data
import sqlalchemy as sa
from sqlalchemy.orm import Session

with Session(cache.metadata_catalog.engine) as session:
    stmt = sa.select(cache.metadata_catalog.backtest_cache_links).where(
        cache.metadata_catalog.backtest_cache_links.c.cache_key == cache_key
    )
    results = session.execute(stmt).fetchall()
    for row in results:
        print(f"Backtest: {row.backtest_id}, Linked: {row.linked_at}")
Troubleshooting¶
Cache Miss When Expected Hit¶
Symptoms: Data is fetched from source even though it should be cached.
Possible Causes:

1. Cache entry evicted due to size limit
2. Checksum mismatch (corrupted file)
3. Cache key parameters don't match exactly
Solutions:
# Check if cache entry exists
cache_entry = cache.lookup_cache(cache_key)
if cache_entry is None:
    print("Cache entry not found (miss or evicted)")
else:
    print(f"Cache entry found: {cache_entry}")

# Increase cache size limits
cache.cold_cache_size_mb = 20480  # 20GB
Slow Cache Hits¶
Symptoms: Cache hits take >1 second.
Possible Causes:

1. Cold cache access (disk I/O)
2. Large Parquet files
3. Disk performance issues
Solutions:
# Check cache statistics
stats = cache.get_cache_statistics()
print(f"Hot Hits: {stats['session_stats']['hot_hits']}")
print(f"Cold Hits: {stats['session_stats']['cold_hits']}")
# If mostly cold hits, increase hot cache size
cache.hot_cache = LRUCache(max_size_bytes=2 * 1024 * 1024 * 1024) # 2GB
Checksum Mismatch Errors¶
Symptoms: Logs show cache_checksum_mismatch errors.
Possible Causes:

1. File corruption
2. Manual file modification
3. Filesystem issues
Solutions:
# Clear corrupted entries (automatic on mismatch)
# Or manually clear entire cache
cache.clear_cache()
# Re-fetch data
df = fetch_data_from_source(...)
cache.put_cached_data(cache_key, df, dataset_id)
Cache Size Grows Unbounded¶
Symptoms: Cache directory grows beyond configured limit.
Possible Causes:

1. Eviction not triggered
2. Many large DataFrames
Solutions:
# Manually trigger eviction
cache._check_cold_cache_eviction()
# Check total size
total_size_mb = cache._get_total_cache_size_mb()
print(f"Total Cache Size: {total_size_mb:.2f} MB")
# Adjust eviction policy
cache.eviction_policy = "size" # Evict largest entries first
Best Practices¶
- Set Appropriate Cache Sizes
    - Hot cache: 10-20% of available RAM
    - Cold cache: Based on disk space availability
- Use Backtest Linkage
    - Always provide `backtest_id` for traceability
    - Makes it easy to clear cache per backtest
- Monitor Cache Statistics
    - Track hit rate to optimize cache size
    - Record daily statistics for historical analysis
- Choose the Right Eviction Policy
    - LRU: For temporal access patterns
    - Size: For maximizing cache entries
    - Hybrid: For general use
- Regular Maintenance (see the sketch after this list)
    - Clear old backtests periodically
    - Monitor disk space usage
    - Review cache statistics
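The maintenance practices above can be combined into a small periodic routine. A sketch composed from the CacheManager calls documented earlier on this page; `old_backtest_ids` is assumed to come from your own bookkeeping of completed runs.

# Periodic maintenance sketch, built from CacheManager calls shown
# above. old_backtest_ids is a hypothetical list of retired run IDs.
def run_cache_maintenance(cache, old_backtest_ids):
    # Clear cache entries linked to retired backtests
    for backtest_id in old_backtest_ids:
        cache.clear_cache(backtest_id=backtest_id)

    # Persist today's hit/miss counters for historical analysis
    cache.record_daily_statistics()

    # Review headline numbers: a low hit rate suggests the cache is
    # undersized or keys are not being reused across runs
    stats = cache.get_cache_statistics()
    print(f"Hit Rate: {stats['hit_rate']:.2%}")
    print(f"Total Size: {stats['total_size_mb']:.2f} MB")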
Examples¶
Example 1: Basic Backtest Caching¶
import yfinance

from rustybt.data.polars.cache_manager import CacheManager

# Initialize cache
cache = CacheManager(
    db_path="data/metadata.db",
    cache_directory="data/cache",
)

# Backtest loop
for backtest_id in ["bt-001", "bt-002", "bt-003"]:
    symbols = ["AAPL", "MSFT", "GOOGL"]
    for symbol in symbols:
        # Generate cache key
        cache_key = cache.generate_cache_key(
            symbols=[symbol],
            start_date="2023-01-01",
            end_date="2023-12-31",
            resolution="1d",
            data_source="yfinance",
        )

        # Check cache
        df = cache.get_cached_data(cache_key)
        if df is None:
            # Cache miss: fetch from source
            df = yfinance.download(symbol, start="2023-01-01", end="2023-12-31")

            # Store in cache
            cache.put_cached_data(
                cache_key,
                df,
                dataset_id=1,
                backtest_id=backtest_id,
            )

        # Run backtest with df
        run_backtest(df, backtest_id)

# Print statistics
stats = cache.get_cache_statistics()
print(f"Cache Hit Rate: {stats['hit_rate']:.2%}")
Example 2: Multi-Resolution Caching¶
# Cache different resolutions of the same symbol
resolutions = ["1m", "5m", "1h", "1d"]
symbol = "AAPL"

for resolution in resolutions:
    cache_key = cache.generate_cache_key(
        symbols=[symbol],
        start_date="2023-01-01",
        end_date="2023-01-31",
        resolution=resolution,
        data_source="yfinance",
    )

    df = cache.get_cached_data(cache_key)
    if df is None:
        # Fetch and aggregate to resolution
        df = fetch_and_resample(symbol, resolution)
        cache.put_cached_data(cache_key, df, dataset_id=1)
Example 3: Cache Statistics Dashboard¶
import time

# Run backtests
for i in range(100):
    cache_key = cache.generate_cache_key(
        symbols=["AAPL"],
        start_date="2023-01-01",
        end_date="2023-12-31",
        resolution="1d",
        data_source="yfinance",
    )

    start = time.time()
    df = cache.get_cached_data(cache_key)
    latency = (time.time() - start) * 1000  # ms

    print(f"Backtest {i+1}: Latency = {latency:.2f}ms")

# Record statistics
cache.record_daily_statistics()

# Print dashboard
stats = cache.get_cache_statistics()
print("\n=== Cache Statistics ===")
print(f"Hit Rate: {stats['hit_rate']:.2%}")
print(f"Total Hits: {stats['hit_count']}")
print(f"Total Misses: {stats['miss_count']}")
print(f"Cache Size: {stats['total_size_mb']:.2f} MB")
print(f"Entry Count: {stats['entry_count']}")
print(f"Avg Access Count: {stats['avg_access_count']:.2f}")
print("\n=== Session Statistics ===")
print(f"Hot Cache Hits: {stats['session_stats']['hot_hits']}")
print(f"Cold Cache Hits: {stats['session_stats']['cold_hits']}")