Daily Bar Readers¶

Daily bar readers provide access to end-of-day OHLCV data. RustyBT supports multiple storage formats, with Parquet/Polars as the recommended implementation.

Overview¶

Daily bar readers load one bar per trading day for each asset. They are used for:

Daily frequency backtests
End-of-day strategy execution
Historical analysis and research
Data rolled up from minute bars

Available Implementations¶

Reader	Format	Precision	Status	Recommended
`PolarsParquetDailyReader`	Parquet	Decimal	Active	✅ Yes
`BcolzDailyBarReader`	Bcolz	float64	Legacy	⚠️ Existing projects only
`HDF5DailyBarReader`	HDF5	float64	Deprecated	❌ No

PolarsParquetDailyReader¶

Modern, recommended implementation with Decimal precision and Polars performance.

Features¶

Decimal Precision: Financial-grade arithmetic with Python Decimal
Lazy Loading: Efficient partition pruning for fast queries
Partitioned Storage: Year/month partitioning for scalability
Built-in Validation: OHLCV relationship validation
Caching: Optional in-memory caching for hot data
Metadata Integration: Tracks data quality and provenance
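
To illustrate why Decimal precision matters for price data, the following minimal sketch compares binary float arithmetic with Python's `decimal.Decimal`. It uses only the standard library and is not RustyBT code; the price and share count are made up for illustration.

```python
from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly,
# so small errors appear even in simple sums.
float_sum = 0.1 + 0.2
print(float_sum == 0.3)  # False

# Decimal values built from strings keep exact decimal digits.
decimal_sum = Decimal("0.1") + Decimal("0.2")
print(decimal_sum == Decimal("0.3"))  # True

# Hypothetical position: 3 shares at 185.64 each.
notional = Decimal("185.64") * 3
print(notional)  # 556.92
```

Building Decimals from strings (rather than from floats) is what preserves the exact digits.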

Storage Structure¶

data/bundles/<bundle_name>/daily_bars/
├── year=2022/
│   ├── month=01/
│   │   └── data.parquet
│   ├── month=02/
│   │   └── data.parquet
│   └── ...
├── year=2023/
│   ├── month=01/
│   │   └── data.parquet
│   └── ...
└── year=2024/
    └── ...

Benefits of Partitioning:

- Fast queries (only scans relevant partitions)
- Scalable to decades of data
- Easy data management (add/remove years)
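
As a rough illustration of how the `year=`/`month=` directory names encode partition keys, the sketch below walks a directory tree shaped like the one above and recovers the (year, month) pairs. The helper `list_partitions` is hypothetical and not part of RustyBT; the tree is a temporary directory with empty placeholder files.

```python
import tempfile
from pathlib import Path


def list_partitions(daily_bars_dir: Path) -> list[tuple[int, int]]:
    """Return sorted (year, month) pairs found under daily_bars_dir."""
    partitions = []
    for parquet_file in daily_bars_dir.glob("year=*/month=*/data.parquet"):
        month_dir = parquet_file.parent
        # Directory names look like "year=2022" and "month=01".
        year = int(month_dir.parent.name.split("=", 1)[1])
        month = int(month_dir.name.split("=", 1)[1])
        partitions.append((year, month))
    return sorted(partitions)


# Build a throwaway tree that mirrors the layout above (empty files only).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "daily_bars"
    for year, month in [(2022, 1), (2022, 2), (2023, 1)]:
        part = root / f"year={year}" / f"month={month:02d}"
        part.mkdir(parents=True)
        (part / "data.parquet").touch()
    found = list_partitions(root)

print(found)  # [(2022, 1), (2022, 2), (2023, 1)]
```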

API Reference¶

Class: `PolarsParquetDailyReader`¶

Location: rustybt.data.polars.parquet_daily_bars

Constructor¶

PolarsParquetDailyReader(
    bundle_path: str,
    enable_cache: bool = True,
    enable_metadata_catalog: bool = True
)

Parameters:

- bundle_path (str): Path to bundle directory (e.g., "data/bundles/quandl")
- enable_cache (bool, default=True): Enable in-memory caching for frequently accessed data
- enable_metadata_catalog (bool, default=True): Enable metadata catalog integration

Example:

from rustybt.data.polars import PolarsParquetDailyReader

# Initialize with defaults (caching enabled)
reader = PolarsParquetDailyReader("data/bundles/my_bundle")

# Initialize without caching (for live data)
live_reader = PolarsParquetDailyReader(
    "data/bundles/live_bundle",
    enable_cache=False
)

Method: `load_daily_bars()`¶

Load daily bars for the given assets over a date range.

def load_daily_bars(
    sids: list[int],
    start_date: date,
    end_date: date,
    fields: list[str] | None = None
) -> pl.DataFrame

Parameters:

- sids (list[int]): Asset IDs to load
- start_date (date): Start date (inclusive)
- end_date (date): End date (inclusive)
- fields (list[str], optional): Columns to load (default: all OHLCV)

Returns: Polars DataFrame with schema:

- date: pl.Date
- sid: pl.Int64
- open: pl.Decimal(18, 8)
- high: pl.Decimal(18, 8)
- low: pl.Decimal(18, 8)
- close: pl.Decimal(18, 8)
- volume: pl.Decimal(18, 8)

Raises:

- FileNotFoundError: If bundle directory not found
- DataError: If no data found or validation fails

Example:

from datetime import date

# Load 1 month of data for 3 assets
df = reader.load_daily_bars(
    sids=[1, 2, 3],
    start_date=date(2024, 1, 1),
    end_date=date(2024, 1, 31)
)

print(df.head())
# ┌────────────┬─────┬──────────┬──────────┬──────────┬──────────┬──────────┐
# │ date       ┆ sid ┆ open     ┆ high     ┆ low      ┆ close    ┆ volume   │
# │ ---        ┆ --- ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      │
# │ date       ┆ i64 ┆ decimal  ┆ decimal  ┆ decimal  ┆ decimal  ┆ decimal  │
# ╞════════════╪═════╪══════════╪══════════╪══════════╪══════════╪══════════╡
# │ 2024-01-02 ┆ 1   ┆ 185.28   ┆ 186.74   ┆ 184.35   ┆ 185.64   ┆ 82000000 │
# │ 2024-01-02 ┆ 2   ┆ 140.23   ┆ 141.05   ┆ 139.87   ┆ 140.93   ┆ 35000000 │
# │ ...        ┆ ... ┆ ...      ┆ ...      ┆ ...      ┆ ...      ┆ ...      │
# └────────────┴─────┴──────────┴──────────┴──────────┴──────────┴──────────┘

# Load only close prices (optimized)
close_df = reader.load_daily_bars(
    sids=[1, 2, 3],
    start_date=date(2024, 1, 1),
    end_date=date(2024, 1, 31),
    fields=["close"]  # Only load close column
)

Method: `load_spot_value()`¶

Load spot values for the given assets on a specific date.

def load_spot_value(
    sids: list[int],
    target_date: date,
    field: str = "close"
) -> pl.DataFrame

Parameters:

- sids (list[int]): Asset IDs
- target_date (date): Target date
- field (str, default="close"): Field to retrieve

Returns: DataFrame with sid and field columns

Example:

# Get closing prices for assets on specific date
prices = reader.load_spot_value(
    sids=[1, 2, 3],
    target_date=date(2024, 1, 15),
    field="close"
)

# ┌─────┬──────────┐
# │ sid ┆ close    │
# │ --- ┆ ---      │
# │ i64 ┆ decimal  │
# ╞═════╪══════════╡
# │ 1   ┆ 185.64   │
# │ 2   ┆ 140.93   │
# │ 3   ┆ 412.35   │
# └─────┴──────────┘

# Get volumes
volumes = reader.load_spot_value(
    sids=[1, 2, 3],
    target_date=date(2024, 1, 15),
    field="volume"
)

Method: `get_last_available_date()`¶

Get last available trading date for an asset.

def get_last_available_date(sid: int) -> date | None

Parameters:

- sid (int): Asset ID

Returns: Last available date or None if no data

Example:

# Check when asset last traded
last_date = reader.get_last_available_date(sid=1)

if last_date:
    print(f"Last data available: {last_date}")
else:
    print("No data found for asset")

Method: `get_first_available_date()`¶

Get first available trading date for an asset.

def get_first_available_date(sid: int) -> date | None

Parameters:

- sid (int): Asset ID

Returns: First available date or None if no data

Example:

# Check data coverage
first = reader.get_first_available_date(sid=1)
last = reader.get_last_available_date(sid=1)

print(f"Data coverage: {first} to {last}")

Usage Patterns¶

Pattern 1: Basic Daily Data Loading¶

from datetime import date

import polars as pl
from rustybt.data.polars import PolarsParquetDailyReader

# Initialize reader
reader = PolarsParquetDailyReader("data/bundles/equities")

# Load 1 year of daily data
df = reader.load_daily_bars(
    sids=[1, 2, 3, 4, 5],
    start_date=date(2023, 1, 1),
    end_date=date(2023, 12, 31)
)

# Calculate returns
returns = df.with_columns([
    pl.col("close").pct_change().over("sid").alias("returns")
])

Pattern 2: Efficient Column Selection¶

# Load only the columns you need
close_only = reader.load_daily_bars(
    sids=sids,
    start_date=start,
    end_date=end,
    fields=["close"]  # 5x faster than loading all columns
)

# Load OHLC without volume
ohlc = reader.load_daily_bars(
    sids=sids,
    start_date=start,
    end_date=end,
    fields=["open", "high", "low", "close"]
)

Pattern 3: Spot Value Queries¶

# Get latest prices for portfolio
current_prices = reader.load_spot_value(
    sids=[1, 2, 3, 4, 5],
    target_date=date.today(),
    field="close"
)

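# Calculate portfolio value
# (holdings maps sid to share count; use int or Decimal values,
# since multiplying a Decimal price by a float raises TypeError)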
portfolio_value = sum(
    row["close"] * holdings[row["sid"]]
    for row in current_prices.iter_rows(named=True)
)
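
The arithmetic in this pattern can be tried without a bundle. The sketch below stands in plain dictionaries for the rows returned by `iter_rows(named=True)` and uses a hypothetical `holdings` mapping; the prices and share counts are illustrative. It also shows that mixing a Decimal price with a float quantity fails.

```python
from decimal import Decimal

# Stand-in for current_prices.iter_rows(named=True); values are illustrative.
price_rows = [
    {"sid": 1, "close": Decimal("185.64")},
    {"sid": 2, "close": Decimal("140.93")},
]

# Hypothetical holdings: sid -> share count (int or Decimal).
holdings = {1: 10, 2: Decimal("2.5")}

# Start the sum from Decimal("0") so the result is always a Decimal.
portfolio_value = sum(
    (row["close"] * holdings[row["sid"]] for row in price_rows),
    Decimal("0"),
)
print(portfolio_value)  # 2208.725

# Decimal and float do not mix in arithmetic.
try:
    Decimal("185.64") * 0.5
    float_mix_allowed = True
except TypeError:
    float_mix_allowed = False
print(float_mix_allowed)  # False
```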

Pattern 4: Data Coverage Checks¶

def check_data_coverage(reader, sids):
    """Check data availability for assets."""
    coverage = []

    for sid in sids:
        first = reader.get_first_available_date(sid)
        last = reader.get_last_available_date(sid)

        coverage.append({
            "sid": sid,
            "first_date": first,
            "last_date": last,
            "days": (last - first).days if first and last else 0
        })

    return pl.DataFrame(coverage)

# Check coverage
coverage_df = check_data_coverage(reader, [1, 2, 3, 4, 5])
print(coverage_df)

Pattern 5: Caching for Performance¶

# Enable caching for repeated queries
reader = PolarsParquetDailyReader(
    "data/bundles/my_data",
    enable_cache=True
)

# First load (reads from disk)
df1 = reader.load_daily_bars(sids=[1, 2], start_date=start, end_date=end)  # ~100ms

# Second load (uses cache)
df2 = reader.load_daily_bars(sids=[1, 2], start_date=start, end_date=end)  # ~1ms

# Cache is automatically managed (LRU eviction)
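
For readers unfamiliar with LRU eviction, here is a minimal sketch of the idea using `collections.OrderedDict`. It is not RustyBT's cache implementation; the class `LRUQueryCache`, its `max_entries` parameter, and the `fake_load` function are all hypothetical. In practice a key could be built from the query arguments, for example `(tuple(sids), start_date, end_date)`.

```python
from collections import OrderedDict


class LRUQueryCache:
    """Tiny LRU cache keyed by query parameters (illustrative only)."""

    def __init__(self, max_entries: int = 2):
        self._entries = OrderedDict()
        self._max_entries = max_entries

    def get_or_load(self, key, loader):
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        value = loader()  # cache miss: run the expensive load
        self._entries[key] = value
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used
        return value


loads = []  # records every real (non-cached) load


def fake_load(name):
    loads.append(name)
    return f"frame-{name}"


cache = LRUQueryCache(max_entries=2)
cache.get_or_load(("a",), lambda: fake_load("a"))  # miss
cache.get_or_load(("a",), lambda: fake_load("a"))  # hit, no load
cache.get_or_load(("b",), lambda: fake_load("b"))  # miss
cache.get_or_load(("c",), lambda: fake_load("c"))  # miss, evicts "a"
cache.get_or_load(("a",), lambda: fake_load("a"))  # miss again
print(loads)  # ['a', 'b', 'c', 'a']
```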

Pattern 6: Integration with DataPortal¶

import pandas as pd

from rustybt.data.polars import PolarsParquetDailyReader
from rustybt.data.polars.data_portal import PolarsDataPortal

# Reader used internally by portal
portal = PolarsDataPortal(
    daily_reader=PolarsParquetDailyReader("data/bundles/equities")
)

# Access via portal API (asset1 and asset2 are asset objects obtained elsewhere)
prices = portal.get_spot_value(
    assets=[asset1, asset2],
    field="close",
    dt=pd.Timestamp("2024-01-15"),
    data_frequency="daily"
)

Pattern 7: Multi-Asset Analysis¶

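import polars as pl

# Load data for portfolio
portfolio_sids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

df = reader.load_daily_bars(
    sids=portfolio_sids,
    start_date=date(2023, 1, 1),
    end_date=date(2024, 1, 1),
    fields=["close"]
)

# Pivot to one column per sid for correlation analysis
pivot_df = df.pivot(
    values="close",
    index="date",
    columns="sid"  # this parameter is named `on` in Polars 1.0 and later
)

# Calculate correlation matrix on the price columns only:
# drop the date column and cast Decimal to Float64 first
corr_matrix = pivot_df.drop("date").cast(pl.Float64).corr()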

Performance Optimization¶

1. Partition Pruning¶

Polars automatically prunes partitions based on date filters:

# Only scans year=2024/month=01 partition
jan_data = reader.load_daily_bars(
    sids=sids,
    start_date=date(2024, 1, 1),
    end_date=date(2024, 1, 31)
)

# Scans multiple partitions (slower but still efficient)
year_data = reader.load_daily_bars(
    sids=sids,
    start_date=date(2024, 1, 1),
    end_date=date(2024, 12, 31)
)
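
To make the pruning concrete, the sketch below computes which `year=`/`month=` partitions overlap a date range. It illustrates the idea only and is not how Polars or RustyBT implement pruning; `partitions_for_range` is a hypothetical helper.

```python
from datetime import date


def partitions_for_range(start_date: date, end_date: date) -> list[str]:
    """List the year=/month= partitions that overlap [start_date, end_date]."""
    if start_date > end_date:
        raise ValueError("start_date must not be after end_date")
    paths = []
    year, month = start_date.year, start_date.month
    # Step month by month until we pass the end date's month.
    while (year, month) <= (end_date.year, end_date.month):
        paths.append(f"year={year}/month={month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return paths


print(partitions_for_range(date(2024, 1, 1), date(2024, 1, 31)))
# ['year=2024/month=01']
print(len(partitions_for_range(date(2024, 1, 1), date(2024, 12, 31))))
# 12
```

A one-month query touches a single partition, while a full-year query touches twelve, which is why narrower date ranges read less data.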

2. Column Selection¶

Load only needed columns:

# GOOD: Load only close (fast)
close_df = reader.load_daily_bars(
    sids=sids, start_date=start, end_date=end, fields=["close"]
)

# GOOD: Load OHLC without volume (faster than loading all columns)
ohlc_df = reader.load_daily_bars(
    sids=sids, start_date=start, end_date=end,
    fields=["open", "high", "low", "close"]
)

# AVOID: Loading all columns when you only need one
all_df = reader.load_daily_bars(sids=sids, start_date=start, end_date=end)  # Slower
close = all_df.select("close")  # Should have used the fields parameter

3. Batch Asset Queries¶

Query multiple assets at once:

# GOOD: Single batch query
df = reader.load_daily_bars(sids=[1, 2, 3, 4, 5], start_date=start, end_date=end)

# AVOID: Individual queries
dfs = []
for sid in [1, 2, 3, 4, 5]:
    df = reader.load_daily_bars(sids=[sid], start_date=start, end_date=end)  # Inefficient!
    dfs.append(df)

4. Cache Management¶

Use caching for repeated queries:

# Enable caching for backtests (repeated data access)
backtest_reader = PolarsParquetDailyReader(
    bundle_path="data/bundles/backtest",
    enable_cache=True  # Cache hot data
)

# Disable caching for one-off queries
analysis_reader = PolarsParquetDailyReader(
    bundle_path="data/bundles/analysis",
    enable_cache=False  # Don't waste memory
)

Data Validation¶

Automatic OHLCV Validation¶

All data is validated on load:

# Validates automatically
df = reader.load_daily_bars(sids=[1], start_date=start, end_date=end)

# Checks performed:
# 1. high >= low
# 2. high >= open
# 3. high >= close
# 4. low <= open
# 5. low <= close
# 6. volume >= 0

# Raises DataError if validation fails

Manual Validation¶

from rustybt.data.polars.validation import validate_ohlcv_relationships, DataError

try:
    df = reader.load_daily_bars(sids=[1], start_date=start, end_date=end)
except DataError as e:
    print(f"Invalid OHLCV data: {e}")
    # Handle bad data (skip, fix, alert, etc.)

Exceptions¶

`DataError`¶

Raised when data issues are detected.

Common causes:

- No data found for date range
- OHLCV validation failed
- Corrupt Parquet files

Example:

from rustybt.data.polars.validation import DataError

try:
    df = reader.load_daily_bars(
        sids=[999],  # Asset doesn't exist
        start_date=date(2024, 1, 1),
        end_date=date(2024, 1, 31)
    )
except DataError as e:
    print(f"Data error: {e}")

`FileNotFoundError`¶

Raised when the bundle directory is not found.

Example:

try:
    reader = PolarsParquetDailyReader("nonexistent/bundle")
    df = reader.load_daily_bars(sids=[1], start_date=start, end_date=end)
except FileNotFoundError as e:
    print(f"Bundle not found: {e}")

Migration from Legacy Readers¶

From BcolzDailyBarReader¶

Before:

from rustybt.data.bcolz_daily_bars import BcolzDailyBarReader

reader = BcolzDailyBarReader("bundles/legacy_data")
arrays = reader.load_raw_arrays(
    columns=["close"],
    start_date=start,
    end_date=end,
    assets=[1, 2, 3]
)  # Returns numpy arrays with float64

After:

from rustybt.data.polars import PolarsParquetDailyReader

reader = PolarsParquetDailyReader("bundles/new_data")
df = reader.load_daily_bars(
    sids=[1, 2, 3],
    start_date=start.date(),
    end_date=end.date(),
    fields=["close"]
)  # Returns Polars DataFrame with Decimal
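
When comparing values across the two readers, note that a float64 price and its Decimal counterpart are not interchangeable. The sketch below shows one possible way to convert a legacy float to a Decimal with eight decimal places, matching the scale of `pl.Decimal(18, 8)` in the schema above. It is an assumption about how such a conversion could be done, not RustyBT's migration tooling; `float_to_decimal` is a hypothetical helper.

```python
from decimal import Decimal

# Eight decimal places, matching the scale of pl.Decimal(18, 8).
SCALE = Decimal("0.00000001")


def float_to_decimal(value: float) -> Decimal:
    """Convert a float64 price to a Decimal with eight decimal places."""
    # Going through str() uses the float's shortest repr ("185.64")
    # instead of its full binary expansion.
    return Decimal(str(value)).quantize(SCALE)


legacy_close = 185.64  # illustrative float64 value from a legacy reader
converted = float_to_decimal(legacy_close)
print(converted)  # 185.64000000

# Constructing a Decimal directly from the float keeps the binary error.
print(Decimal(legacy_close) == Decimal("185.64"))  # False
```

Whether rounding through the shortest string representation is appropriate depends on how the legacy data was originally written.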
