Daily Bar Readers¶
Daily bar readers provide access to end-of-day OHLCV data. RustyBT supports multiple storage formats with Parquet/Polars as the recommended modern implementation.
Overview¶
Daily bar readers load one bar per trading day for each asset. They are used for:
- Daily frequency backtests
- End-of-day strategy execution
- Historical analysis and research
- Data rolled up from minute bars
Available Implementations¶
| Reader | Format | Precision | Status | Recommended |
|---|---|---|---|---|
| PolarsParquetDailyReader | Parquet | Decimal | Active | ✅ Yes |
| BcolzDailyBarReader | Bcolz | float64 | Legacy | ⚠️ Existing projects only |
| HDF5DailyBarReader | HDF5 | float64 | Deprecated | ❌ No |
PolarsParquetDailyReader¶
Modern, recommended implementation with Decimal precision and Polars performance.
Features¶
- Decimal Precision: Financial-grade arithmetic with Python Decimal
- Lazy Loading: Efficient partition pruning for fast queries
- Partitioned Storage: Year/month partitioning for scalability
- Built-in Validation: OHLCV relationship validation
- Caching: Optional in-memory caching for hot data
- Metadata Integration: Tracks data quality and provenance
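To make the Decimal Precision point concrete, here is a minimal sketch using only Python's standard `decimal` module. It does not touch the reader, and the amounts are made-up values chosen for illustration.

```python
from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly,
# so a small error appears even in a simple sum.
float_total = 0.1 + 0.2
decimal_total = Decimal("0.1") + Decimal("0.2")

print(float_total == 0.3)               # False
print(decimal_total == Decimal("0.3"))  # True

# Hypothetical example: accumulating a one-cent fee 100 times stays exact.
decimal_fees = sum(Decimal("0.01") for _ in range(100))
print(decimal_fees == Decimal("1.00"))  # True
```

Constructing Decimals from strings, as above, avoids inheriting the float's representation error.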
Storage Structure¶
data/bundles/<bundle_name>/daily_bars/
├── year=2022/
│ ├── month=01/
│ │ └── data.parquet
│ ├── month=02/
│ │ └── data.parquet
│ └── ...
├── year=2023/
│ ├── month=01/
│ │ └── data.parquet
│ └── ...
└── year=2024/
└── ...
Benefits of Partitioning:
- Fast queries (only scans relevant partitions)
- Scalable to decades of data
- Easy data management (add/remove years)
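The layout above implies that a date range maps onto a small set of `year=/month=` directories. The sketch below shows that mapping in plain Python. `partitions_for_range` is a hypothetical helper written for this illustration, not part of RustyBT, and the reader's actual pruning is delegated to Polars.

```python
from datetime import date
from pathlib import PurePosixPath


def partitions_for_range(root: str, start: date, end: date) -> list[PurePosixPath]:
    """Return the year=/month= partition files that overlap [start, end]."""
    if start > end:
        raise ValueError("start must not be after end")
    paths = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        paths.append(
            PurePosixPath(root) / f"year={year}" / f"month={month:02d}" / "data.parquet"
        )
        # Advance to the next calendar month, rolling over the year.
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return paths


# A query from mid-November 2023 to mid-February 2024 touches four partitions.
paths = partitions_for_range(
    "data/bundles/my_bundle/daily_bars", date(2023, 11, 15), date(2024, 2, 10)
)
for path in paths:
    print(path)
```

Every other partition can be skipped without opening the file, which is where the query speedup comes from.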
API Reference¶
Class: PolarsParquetDailyReader¶
Location: rustybt.data.polars.parquet_daily_bars
Constructor¶
PolarsParquetDailyReader(
    bundle_path: str,
    enable_cache: bool = True,
    enable_metadata_catalog: bool = True
)
Parameters:
- bundle_path (str): Path to bundle directory (e.g., "data/bundles/quandl")
- enable_cache (bool, default=True): Enable in-memory caching for frequently accessed data
- enable_metadata_catalog (bool, default=True): Enable metadata catalog integration
Example:
from rustybt.data.polars import PolarsParquetDailyReader
# Initialize with defaults (caching enabled)
reader = PolarsParquetDailyReader("data/bundles/my_bundle")
# Initialize without caching (for live data)
live_reader = PolarsParquetDailyReader(
"data/bundles/live_bundle",
enable_cache=False
)
Method: load_daily_bars()¶
Load daily bars for the given assets over a date range.
def load_daily_bars(
    sids: list[int],
    start_date: date,
    end_date: date,
    fields: list[str] | None = None
) -> pl.DataFrame
Parameters:
- sids (list[int]): Asset IDs to load
- start_date (date): Start date (inclusive)
- end_date (date): End date (inclusive)
- fields (list[str], optional): Columns to load (default: all OHLCV)
Returns: Polars DataFrame with schema:
- date: pl.Date
- sid: pl.Int64
- open: pl.Decimal(18, 8)
- high: pl.Decimal(18, 8)
- low: pl.Decimal(18, 8)
- close: pl.Decimal(18, 8)
- volume: pl.Decimal(18, 8)
Raises:
- FileNotFoundError: If bundle directory not found
- DataError: If no data found or validation fails
Example:
from datetime import date
# Load 1 month of data for 3 assets
df = reader.load_daily_bars(
sids=[1, 2, 3],
start_date=date(2024, 1, 1),
end_date=date(2024, 1, 31)
)
print(df.head())
# ┌────────────┬─────┬──────────┬──────────┬──────────┬──────────┬──────────┐
# │ date ┆ sid ┆ open ┆ high ┆ low ┆ close ┆ volume │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ date ┆ i64 ┆ decimal ┆ decimal ┆ decimal ┆ decimal ┆ decimal │
# ╞════════════╪═════╪══════════╪══════════╪══════════╪══════════╪══════════╡
# │ 2024-01-02 ┆ 1 ┆ 185.28 ┆ 186.74 ┆ 184.35 ┆ 185.64 ┆ 82000000 │
# │ 2024-01-02 ┆ 2 ┆ 140.23 ┆ 141.05 ┆ 139.87 ┆ 140.93 ┆ 35000000 │
# │ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
# └────────────┴─────┴──────────┴──────────┴──────────┴──────────┴──────────┘
# Load only close prices (optimized)
close_df = reader.load_daily_bars(
sids=[1, 2, 3],
start_date=date(2024, 1, 1),
end_date=date(2024, 1, 31),
fields=["close"] # Only load close column
)
Method: load_spot_value()¶
Load spot values for the given assets on a specific date.
Parameters:
- sids (list[int]): Asset IDs
- target_date (date): Target date
- field (str, default="close"): Field to retrieve
Returns: DataFrame with sid and field columns
Example:
# Get closing prices for assets on specific date
prices = reader.load_spot_value(
sids=[1, 2, 3],
target_date=date(2024, 1, 15),
field="close"
)
# ┌─────┬──────────┐
# │ sid ┆ close │
# │ --- ┆ --- │
# │ i64 ┆ decimal │
# ╞═════╪══════════╡
# │ 1 ┆ 185.64 │
# │ 2 ┆ 140.93 │
# │ 3 ┆ 412.35 │
# └─────┴──────────┘
# Get volumes
volumes = reader.load_spot_value(
sids=[1, 2, 3],
target_date=date(2024, 1, 15),
field="volume"
)
Method: get_last_available_date()¶
Get last available trading date for an asset.
Parameters:
- sid (int): Asset ID
Returns: Last available date or None if no data
Example:
# Check when asset last traded
last_date = reader.get_last_available_date(sid=1)
if last_date:
    print(f"Last data available: {last_date}")
else:
    print("No data found for asset")
Method: get_first_available_date()¶
Get first available trading date for an asset.
Parameters:
- sid (int): Asset ID
Returns: First available date or None if no data
Example:
# Check data coverage
first = reader.get_first_available_date(sid=1)
last = reader.get_last_available_date(sid=1)
print(f"Data coverage: {first} to {last}")
Usage Patterns¶
Pattern 1: Basic Daily Data Loading¶
from datetime import date
import polars as pl
from rustybt.data.polars import PolarsParquetDailyReader
# Initialize reader
reader = PolarsParquetDailyReader("data/bundles/equities")
# Load 1 year of daily data
df = reader.load_daily_bars(
sids=[1, 2, 3, 4, 5],
start_date=date(2023, 1, 1),
end_date=date(2023, 12, 31)
)
# Calculate returns
returns = df.with_columns([
pl.col("close").pct_change().over("sid").alias("returns")
])
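For readers unfamiliar with what `pct_change().over("sid")` computes, the following stdlib-only sketch reproduces the same per-asset simple return, `close_t / close_{t-1} - 1`, on a handful of made-up rows. The row layout `(date, sid, close)` is an assumption for illustration. Real code would stay in Polars.

```python
from decimal import Decimal
from itertools import groupby
from operator import itemgetter

# Hypothetical rows shaped like the reader's output: (date, sid, close).
rows = [
    ("2024-01-02", 1, Decimal("100.00")),
    ("2024-01-03", 1, Decimal("102.00")),
    ("2024-01-04", 1, Decimal("99.96")),
    ("2024-01-02", 2, Decimal("50.00")),
    ("2024-01-03", 2, Decimal("49.00")),
]

returns = {}
# Sort by (sid, date) so each asset's bars are contiguous and in time order.
for sid, bars in groupby(sorted(rows, key=itemgetter(1, 0)), key=itemgetter(1)):
    closes = [close for _, _, close in bars]
    # The first bar of each asset has no prior close, so its return is None.
    returns[sid] = [None] + [cur / prev - 1 for prev, cur in zip(closes, closes[1:])]

print(returns[1])
print(returns[2])
```

Grouping by asset before differencing matters: without it, the first bar of one asset would be compared against the last bar of another.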
Pattern 2: Efficient Column Selection¶
# Load only the columns you need
close_only = reader.load_daily_bars(
sids=sids,
start_date=start,
end_date=end,
fields=["close"] # 5x faster than loading all columns
)
# Load OHLC without volume
ohlc = reader.load_daily_bars(
sids=sids,
start_date=start,
end_date=end,
fields=["open", "high", "low", "close"]
)
Pattern 3: Spot Value Queries¶
# Get latest prices for portfolio
current_prices = reader.load_spot_value(
sids=[1, 2, 3, 4, 5],
target_date=date.today(),
field="close"
)
# Calculate portfolio value
portfolio_value = sum(
row["close"] * holdings[row["sid"]]
for row in current_prices.iter_rows(named=True)
)
Pattern 4: Data Coverage Checks¶
def check_data_coverage(reader, sids):
    """Check data availability for assets."""
    coverage = []
    for sid in sids:
        first = reader.get_first_available_date(sid)
        last = reader.get_last_available_date(sid)
        coverage.append({
            "sid": sid,
            "first_date": first,
            "last_date": last,
            "days": (last - first).days if first and last else 0
        })
    return pl.DataFrame(coverage)
# Check coverage
coverage_df = check_data_coverage(reader, [1, 2, 3, 4, 5])
print(coverage_df)
Pattern 5: Caching for Performance¶
# Enable caching for repeated queries
reader = PolarsParquetDailyReader(
"data/bundles/my_data",
enable_cache=True
)
# First load (reads from disk)
df1 = reader.load_daily_bars(sids=[1, 2], start_date=start, end_date=end)  # ~100ms
# Second load (uses cache)
df2 = reader.load_daily_bars(sids=[1, 2], start_date=start, end_date=end)  # ~1ms
# Cache is automatically managed (LRU eviction)
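The LRU eviction mentioned above can be sketched with `collections.OrderedDict`. This is an illustration of the general technique only: `LRUQueryCache`, its key layout, and `max_entries` are hypothetical names, and the reader's internal cache may be organized differently.

```python
from collections import OrderedDict
from datetime import date


class LRUQueryCache:
    """Tiny least-recently-used cache keyed by query parameters."""

    def __init__(self, max_entries: int = 2):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    @staticmethod
    def make_key(sids, start_date, end_date, fields=None):
        # Tuples are hashable; sorting makes [2, 1] and [1, 2] share an entry.
        return (tuple(sorted(sids)), start_date, end_date, tuple(fields or ()))

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used


cache = LRUQueryCache(max_entries=2)
jan = cache.make_key([1, 2], date(2024, 1, 1), date(2024, 1, 31))
feb = cache.make_key([1, 2], date(2024, 2, 1), date(2024, 2, 29))
mar = cache.make_key([1, 2], date(2024, 3, 1), date(2024, 3, 31))

cache.put(jan, "january bars")
cache.put(feb, "february bars")
cache.get(jan)                # touch January so February becomes the oldest
cache.put(mar, "march bars")  # exceeds max_entries, so February is evicted

print(cache.get(feb))  # None
print(cache.get(jan))  # january bars
```

Because the key includes the date range and fields, two queries only share a cache entry when all of their parameters match.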
Pattern 6: Integration with DataPortal¶
import pandas as pd
from rustybt.data.polars import PolarsParquetDailyReader
from rustybt.data.polars.data_portal import PolarsDataPortal
# Reader used internally by portal
portal = PolarsDataPortal(
daily_reader=PolarsParquetDailyReader("data/bundles/equities")
)
# Access via portal API
prices = portal.get_spot_value(
assets=[asset1, asset2],
field="close",
dt=pd.Timestamp("2024-01-15"),
data_frequency="daily"
)
Pattern 7: Multi-Asset Analysis¶
# Load data for portfolio
portfolio_sids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
df = reader.load_daily_bars(
sids=portfolio_sids,
start_date=date(2023, 1, 1),
end_date=date(2024, 1, 1),
fields=["close"]
)
# Pivot for correlation analysis
pivot_df = df.pivot(
values="close",
index="date",
columns="sid"
)
# Calculate correlation matrix (drop the date column and cast Decimal to float first)
corr_matrix = pivot_df.drop("date").cast(pl.Float64).corr()
Performance Optimization¶
1. Partition Pruning¶
Polars automatically prunes partitions based on date filters:
# Only scans year=2024/month=01 partition
jan_data = reader.load_daily_bars(
sids=sids,
start_date=date(2024, 1, 1),
end_date=date(2024, 1, 31)
)
# Scans multiple partitions (slower but still efficient)
year_data = reader.load_daily_bars(
sids=sids,
start_date=date(2024, 1, 1),
end_date=date(2024, 12, 31)
)
2. Column Selection¶
Load only needed columns:
# GOOD: Load only close (fast)
close_df = reader.load_daily_bars(
    sids=sids, start_date=start, end_date=end, fields=["close"]
)
# GOOD: Load OHLC without volume (faster)
ohlc_df = reader.load_daily_bars(
    sids=sids, start_date=start, end_date=end,
    fields=["open", "high", "low", "close"]
)
# AVOID: Loading all columns when you only need one
all_df = reader.load_daily_bars(sids=sids, start_date=start, end_date=end)  # Slower
close = all_df.select("close")  # Should have used the fields parameter
3. Batch Asset Queries¶
Query multiple assets at once:
# GOOD: Single batch query
df = reader.load_daily_bars(sids=[1, 2, 3, 4, 5], start_date=start, end_date=end)
# AVOID: Individual queries
dfs = []
for sid in [1, 2, 3, 4, 5]:
    df = reader.load_daily_bars(sids=[sid], start_date=start, end_date=end)  # Inefficient!
    dfs.append(df)
4. Cache Management¶
Use caching for repeated queries:
# Enable caching for backtests (repeated data access)
backtest_reader = PolarsParquetDailyReader(
bundle_path="data/bundles/backtest",
enable_cache=True # Cache hot data
)
# Disable caching for one-off queries
analysis_reader = PolarsParquetDailyReader(
bundle_path="data/bundles/analysis",
enable_cache=False # Don't waste memory
)
Data Validation¶
Automatic OHLCV Validation¶
All data is validated on load:
# Validates automatically
df = reader.load_daily_bars(sids=[1], start_date=start, end_date=end)
# Checks performed:
# 1. high >= low
# 2. high >= open
# 3. high >= close
# 4. low <= open
# 5. low <= close
# 6. volume >= 0
# Raises DataError if validation fails
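The six checks listed above can be expressed as a small standalone function. This is a sketch for a single bar represented as a dict: `ohlcv_violations` is a hypothetical helper, not the library's validator, which works on whole DataFrames and raises `DataError`.

```python
from decimal import Decimal


def ohlcv_violations(bar: dict) -> list[str]:
    """Return the names of the OHLCV checks that a single bar fails."""
    checks = {
        "high >= low": bar["high"] >= bar["low"],
        "high >= open": bar["high"] >= bar["open"],
        "high >= close": bar["high"] >= bar["close"],
        "low <= open": bar["low"] <= bar["open"],
        "low <= close": bar["low"] <= bar["close"],
        "volume >= 0": bar["volume"] >= 0,
    }
    return [name for name, passed in checks.items() if not passed]


good_bar = {
    "open": Decimal("185.28"), "high": Decimal("186.74"),
    "low": Decimal("184.35"), "close": Decimal("185.64"),
    "volume": Decimal("82000000"),
}
# Made-up bad bar: the high sits below the open and the volume is negative.
bad_bar = {
    "open": Decimal("100.00"), "high": Decimal("99.00"),
    "low": Decimal("98.00"), "close": Decimal("98.50"),
    "volume": Decimal("-1"),
}

print(ohlcv_violations(good_bar))  # []
print(ohlcv_violations(bad_bar))   # ['high >= open', 'volume >= 0']
```

Returning the names of the failed checks, rather than a single boolean, makes it easier to report which relationship a bad bar violated.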
Manual Validation¶
from rustybt.data.polars.validation import validate_ohlcv_relationships, DataError
try:
    df = reader.load_daily_bars(sids=[1], start_date=start, end_date=end)
except DataError as e:
    print(f"Invalid OHLCV data: {e}")
    # Handle bad data (skip, fix, alert, etc.)
Exceptions¶
DataError¶
Raised when data issues are detected.
Common causes:
- No data found for date range
- OHLCV validation failed
- Corrupt Parquet files
Example:
from rustybt.data.polars.validation import DataError
try:
    df = reader.load_daily_bars(
        sids=[999],  # Asset doesn't exist
        start_date=date(2024, 1, 1),
        end_date=date(2024, 1, 31)
    )
except DataError as e:
    print(f"Data error: {e}")
FileNotFoundError¶
Raised when the bundle directory is not found.
Example:
try:
    reader = PolarsParquetDailyReader("nonexistent/bundle")
    df = reader.load_daily_bars(sids=[1], start_date=start, end_date=end)
except FileNotFoundError as e:
    print(f"Bundle not found: {e}")
Migration from Legacy Readers¶
From BcolzDailyBarReader¶
Before:
from rustybt.data.bcolz_daily_bars import BcolzDailyBarReader
reader = BcolzDailyBarReader("bundles/legacy_data")
arrays = reader.load_raw_arrays(
columns=["close"],
start_date=start,
end_date=end,
assets=[1, 2, 3]
) # Returns numpy arrays with float64
After:
from rustybt.data.polars import PolarsParquetDailyReader
reader = PolarsParquetDailyReader("bundles/new_data")
df = reader.load_daily_bars(
sids=[1, 2, 3],
start_date=start.date(),
end_date=end.date(),
fields=["close"]
) # Returns Polars DataFrame with Decimal
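When moving float64 values from a legacy reader into Decimal storage, how the conversion is done affects the result. The sketch below is one possible approach using only the standard library. `float_to_decimal` is a hypothetical helper, and both the 8-digit scale (chosen to match `pl.Decimal(18, 8)`) and the rounding mode are assumptions, not RustyBT's documented ingestion behavior.

```python
from decimal import Decimal, ROUND_HALF_EVEN

EIGHT_PLACES = Decimal("0.00000001")  # same scale as pl.Decimal(18, 8)


def float_to_decimal(value: float) -> Decimal:
    """Convert a float64 price to a Decimal with 8 fractional digits."""
    # Going through str() uses the float's shortest repr ("185.64"),
    # not its full binary expansion.
    return Decimal(str(value)).quantize(EIGHT_PLACES, rounding=ROUND_HALF_EVEN)


legacy_close = 185.64  # made-up float64 price from a legacy array

# Direct construction exposes the float's binary representation error.
print(Decimal(legacy_close) == Decimal("185.64"))           # False
# Converting via str() and quantizing recovers the intended price.
print(float_to_decimal(legacy_close) == Decimal("185.64"))  # True
```

If the legacy data carries more than 8 meaningful fractional digits, quantizing will round it, so check the source precision before converting in bulk.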
See Also¶
- PolarsDataPortal - High-level data access using daily readers
- Bar Readers - Bar reader interface and dispatch
- Bundle System - Data bundle management
- Data Ingestion - Creating daily bar bundles