Caching Guide¶
Overview¶
RustyBT's unified data architecture includes a sophisticated caching layer that transparently caches data fetched from external sources. This guide explains how caching works, its benefits, configuration options, and troubleshooting techniques.
How Caching Works¶
The caching system uses CachedDataSource to wrap any DataSource adapter and automatically caches fetched data to disk using Parquet files with metadata tracked in BundleMetadata.
Cache Flow¶
┌─────────────┐
│  Algorithm  │
└──────┬──────┘
       │ fetch("AAPL", ...)
       ▼
┌─────────────────────┐
│  CachedDataSource   │
└──────┬──────────────┘
       │
       ├──[Check Metadata]───┐
       │                     │
       ▼                     ▼
  Cache Miss             Cache Hit
       │                     │
       ▼                     ▼
  ┌──────────┐          ┌──────────┐
  │ YFinance │          │ Parquet  │
  │ Adapter  │          │  Bundle  │
  └──────────┘          └──────────┘
Cache Key¶
Each cache entry is uniquely identified by:

- Symbol(s)
- Start/end dates
- Frequency (daily, hourly, minute)
- Data source type
# Example cache key
{
    "symbols": ["AAPL", "MSFT"],
    "start": "2024-01-01",
    "end": "2024-12-31",
    "frequency": "daily",
    "source": "yfinance"
}
Freshness Policies¶
The caching layer uses freshness policies to determine when cached data should be refreshed:
1. MarketCloseFreshnessPolicy (Stock Market Data)¶
Refreshes daily data after market close:
from rustybt.data.sources.cached_source import CachedDataSource
from rustybt.data.sources.freshness import MarketCloseFreshnessPolicy
cached_source = CachedDataSource(
    adapter=yfinance_source,
    freshness_policy=MarketCloseFreshnessPolicy()
)
When to use: Stock, ETF, and futures data with defined market hours.
2. TTLFreshnessPolicy (24/7 Markets)¶
Uses time-to-live (TTL) for data that updates continuously:
from rustybt.data.sources.cached_source import CachedDataSource
from rustybt.data.sources.freshness import TTLFreshnessPolicy
# Refresh hourly data every 5 minutes
cached_source = CachedDataSource(
    adapter=binance_source,
    freshness_policy=TTLFreshnessPolicy(ttl_seconds=300)  # 5 minutes
)
When to use: Cryptocurrency, forex (24/7 trading).
3. HybridFreshnessPolicy¶
Combines market hours with TTL for intraday data:
from rustybt.data.sources.cached_source import CachedDataSource
from rustybt.data.sources.freshness import HybridFreshnessPolicy
# Minute data: refresh every 60 seconds during market hours
cached_source = CachedDataSource(
    adapter=alpaca_source,
    freshness_policy=HybridFreshnessPolicy(ttl_seconds=60)
)
When to use: Intraday stock data (minute bars).
4. Auto-Selection¶
FreshnessPolicyFactory automatically selects the appropriate policy based on frequency and data source:
from rustybt.data.sources.cached_source import CachedDataSource
# Automatic policy selection
cached_source = CachedDataSource(adapter=yfinance_source)
# Daily frequency → MarketCloseFreshnessPolicy
# Hourly frequency → TTLFreshnessPolicy (1 hour)
# Minute frequency → TTLFreshnessPolicy (5 minutes)
Performance Benefits¶
Benchmark Results¶
| Scenario | Without Cache | With Cache (Hit) | Speedup |
|---|---|---|---|
| Daily bars (1 year, 100 symbols) | 12.3s | 0.8s | 15.4x |
| Hourly bars (1 month, 10 symbols) | 8.7s | 0.5s | 17.4x |
| Minute bars (1 week, 5 symbols) | 45.2s | 3.2s | 14.1x |
Cache Hit Rate¶
Typical cache hit rates:

- Backtesting: 80-95% (data reused across runs)
- Optimization: 90-98% (repeated parameter sweeps)
- Live Trading: 0-10% (fresh data required)
Configuration¶
Cache Directory¶
Default cache location: ~/.rustybt/cache
Override the cache directory in code when constructing the cached source, or via an environment variable.
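A minimal sketch of both approaches. The cache_dir constructor argument and the RUSTYBT_CACHE_DIR variable name are assumptions and may differ in your version; the directory key in the configuration file below is the documented equivalent.

```python
from rustybt.data.sources.cached_source import CachedDataSource

# Assumed constructor argument for the cache location
cached_source = CachedDataSource(
    adapter=yfinance_source,
    cache_dir="/custom/cache/path",
)
```

```bash
# Assumed environment variable name
export RUSTYBT_CACHE_DIR=/custom/cache/path
```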
Max Cache Size¶
Limit disk usage (default: 10GB):
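A sketch of setting the limit in code; the max_size_mb constructor argument is an assumption, mirroring the max_size_mb key in the configuration file below.

```python
from rustybt.data.sources.cached_source import CachedDataSource

# Assumed constructor argument; 10240 MB matches the 10 GB default
cached_source = CachedDataSource(
    adapter=yfinance_source,
    max_size_mb=10240,
)
```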
Eviction policy: LRU (Least Recently Used)
Configuration File¶
Create ~/.rustybt/config.yaml:
cache:
  enabled: true
  directory: "/custom/cache/path"
  max_size_mb: 10240  # 10GB
  freshness:
    daily:
      policy: "market_close"
      market_close_time: "16:00"
      timezone: "America/New_York"
    hourly:
      policy: "ttl"
      ttl_seconds: 3600  # 1 hour
    minute:
      policy: "ttl"
      ttl_seconds: 300  # 5 minutes
Monitoring Cache Performance¶
Cache Statistics¶
from rustybt.data.sources.cached_source import CachedDataSource
cached_source = CachedDataSource(adapter=source)
# After running backtest
stats = cached_source.get_stats()
print(f"Cache hit rate: {stats['hit_rate']}%")
print(f"Total fetches: {stats['total_fetches']}")
print(f"Cache hits: {stats['hits']}")
print(f"Cache misses: {stats['misses']}")
print(f"Cache size: {stats['size_mb']} MB")
CLI Commands¶
View Cache Stats¶
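Assuming a stats subcommand in the same command group as the cache clear command shown below:

```bash
rustybt cache stats
```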
Output:
Cache Statistics
================
Directory: ~/.rustybt/cache
Size: 2.3 GB / 10.0 GB (23%)
Entries: 1,247
Hit Rate: 87.3%
Last Cleanup: 2024-10-05 14:30:00
List Cached Bundles¶
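Assuming a list subcommand in the same command group (not confirmed in this guide):

```bash
rustybt cache list
```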
Clear Cache¶
# Clear all cache
rustybt cache clear
# Clear specific symbols
rustybt cache clear --symbols AAPL MSFT
# Clear old entries (>30 days)
rustybt cache clear --older-than 30d
Logging¶
Enable cache debugging:
import logging

import structlog

structlog.configure(
    wrapper_class=structlog.make_filtering_bound_logger(logging.DEBUG),
    logger_factory=structlog.PrintLoggerFactory(),
)
logger = structlog.get_logger()
Log output:
2024-10-05 14:30:15 [debug] cache_lookup symbols=['AAPL'] start=2024-01-01 end=2024-12-31 frequency=daily
2024-10-05 14:30:15 [debug] cache_hit bundle=yfinance-AAPL-2024 age=3600s fresh=True
2024-10-05 14:30:15 [info ] data_loaded symbols=1 rows=252 source=cache duration_ms=45
Troubleshooting¶
Stale Data¶
Problem: Cached data not refreshing despite recent market data.
Solution:

1. Check the freshness policy configuration.
2. Verify the system clock is correct.
3. Force a refresh, or disable the cache temporarily, as sketched below.
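For example, using the cache clear command and the use_cache flag shown elsewhere in this guide (clearing the affected symbols forces the next fetch back to the source):

```bash
# Force a refresh by dropping the stale entries for the affected symbols
rustybt cache clear --symbols AAPL
```

```python
# Temporarily bypass the cache entirely (re-enable once the data looks correct)
algo.data_portal.use_cache = False
```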
Cache Misses in Backtest¶
Problem: Low cache hit rate during backtesting.
Causes:

1. Varying date ranges (each run uses different dates)
2. Changing symbol lists
3. Cache eviction due to size limit

Solution:

1. Standardize date ranges
2. Increase max cache size
3. Pre-warm cache before backtest (see Pre-Warm Cache below)
Disk Space Issues¶
Problem: Cache consuming too much disk space.
Solution:

1. Reduce the max cache size (max_size_mb in the configuration file above).
2. Enable automatic cleanup.
3. Clear old bundles, as shown below.
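For the third step, the same command from Clear Cache above removes entries older than 30 days:

```bash
rustybt cache clear --older-than 30d
```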
Permission Errors¶
Problem: PermissionError when writing to cache directory.
Solution:

1. Check the cache directory permissions, as shown below.
2. Or point RustyBT at an alternate cache directory (see Cache Directory above).
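A quick check and fix with standard shell tools, assuming the default cache location:

```bash
# Inspect ownership and permissions of the cache directory
ls -ld ~/.rustybt/cache

# Grant the current user read/write/traverse access recursively
chmod -R u+rwX ~/.rustybt/cache
```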
Best Practices¶
1. Enable Caching for Backtests¶
# GOOD: Cache enabled (default)
algo = TradingAlgorithm(
    data_source=YFinanceDataSource(),
    live_trading=False  # Cache enabled
)

# BAD: No caching
algo = TradingAlgorithm(
    data_source=YFinanceDataSource(),
    live_trading=False
)
portal = algo.data_portal
portal.use_cache = False  # Slow!
2. Disable Caching for Live Trading¶
# GOOD: No cache for live data
algo = TradingAlgorithm(
    data_source=AlpacaDataSource(api_key="..."),
    live_trading=True  # Cache disabled
)

# BAD: Cache enabled in live mode
algo = TradingAlgorithm(
    data_source=AlpacaDataSource(api_key="..."),
    live_trading=True
)
algo.data_portal.use_cache = True  # Stale data risk!
3. Pre-Warm Cache¶
For large backtests, pre-fetch data before running:
import pandas as pd

from rustybt.data.sources.cached_source import CachedDataSource

cached_source = CachedDataSource(adapter=yfinance_source)

# Pre-warm cache for next trading session
await cached_source.warm_cache(
    symbols=["AAPL", "MSFT", "GOOGL"],
    start=pd.Timestamp("2024-01-01"),
    end=pd.Timestamp("2024-12-31"),
    frequency="daily"
)
print("✓ Cache warmed")
4. Monitor Cache Hit Rate¶
Aim for >80% hit rate in backtesting:
stats = cached_source.get_stats()
if stats['hit_rate'] < 80:
    print(f"⚠️ Low cache hit rate: {stats['hit_rate']}%")
    print("Consider pre-warming cache or adjusting date ranges")
5. Tune Freshness Policies¶
Match freshness to your trading frequency:
| Trading Frequency | Recommended TTL |
|---|---|
| Daily | Market close + 1 hour |
| Hourly | 1 hour |
| Minute | 5 minutes |
| Tick | Disable cache |
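For example, the hourly and minute rows map directly onto the TTLFreshnessPolicy shown earlier; treat these values as starting points rather than tuned recommendations.

```python
from rustybt.data.sources.cached_source import CachedDataSource
from rustybt.data.sources.freshness import TTLFreshnessPolicy

# Hourly strategy: tolerate up to one hour of staleness
hourly_source = CachedDataSource(
    adapter=source,
    freshness_policy=TTLFreshnessPolicy(ttl_seconds=3600),
)

# Minute strategy: refresh at most every five minutes
minute_source = CachedDataSource(
    adapter=source,
    freshness_policy=TTLFreshnessPolicy(ttl_seconds=300),
)
```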
Advanced Topics¶
Custom Freshness Policy¶
import pandas as pd

from rustybt.data.sources.cached_source import CachedDataSource
from rustybt.data.sources.freshness import CacheFreshnessPolicy

class WeeklyFreshnessPolicy(CacheFreshnessPolicy):
    """Refresh data every Monday."""

    def is_fresh(self, cached_time: pd.Timestamp, current_time: pd.Timestamp) -> bool:
        # Data is fresh if it was cached in the same ISO year and week
        return cached_time.isocalendar()[:2] == current_time.isocalendar()[:2]

cached_source = CachedDataSource(
    adapter=source,
    freshness_policy=WeeklyFreshnessPolicy()
)
Multi-Level Caching (Planned)¶
Note: In-memory caching and multi-level (memory + disk) layers are planned. Today, use CachedDataSource (disk-backed) for robust performance.
Distributed Caching (Planned)¶
Note: Redis-backed distributed caching is not available yet. For now, share Parquet bundles through your artifact storage or shared filesystem.
See Also:

- Data Ingestion Guide
- Live vs Backtest Data
- Data Management Performance