CSV Data Import Guide¶

This guide explains how to import custom data from CSV files using RustyBT's CSVAdapter.

Overview¶

The CSVAdapter provides flexible CSV import capabilities with:

Schema mapping: Map custom column names to OHLCV fields
Delimiter detection: Auto-detect comma, tab, semicolon, pipe delimiters
Date parsing: Support for multiple date formats (ISO8601, US, European, epoch)
Timezone handling: Convert timestamps to UTC
Missing data strategies: Skip, interpolate, or fail on missing values
Decimal precision: Financial-grade precision for price data

Quick Start¶

Standard CSV Format¶

For CSV files with standard column names:

from rustybt.data.adapters import CSVAdapter, CSVConfig, SchemaMapping
import pandas as pd

# Configure CSV adapter
config = CSVConfig(
    file_path="data/ohlcv.csv",
    schema_mapping=SchemaMapping(
        date_column="Date",
        open_column="Open",
        high_column="High",
        low_column="Low",
        close_column="Close",
        volume_column="Volume"
    )
)

# Create adapter and fetch data
adapter = CSVAdapter(config)
df = await adapter.fetch(
    symbols=[],
    start_date=pd.Timestamp('2023-01-01'),
    end_date=pd.Timestamp('2023-12-31'),
    resolution='1d'
)

CSV File Format¶

Date,Open,High,Low,Close,Volume
2023-01-01,100.50,101.00,100.00,100.80,1000000
2023-01-02,100.80,102.00,100.50,101.50,1100000
2023-01-03,101.50,103.00,101.00,102.50,1200000

Configuration Options¶

Schema Mapping¶

Map your CSV column names to standard OHLCV fields:

from rustybt.data.adapters import SchemaMapping

# Custom column names
mapping = SchemaMapping(
    date_column="trade_date",      # Your date column name
    open_column="o",                # Your open price column
    high_column="h",                # Your high price column
    low_column="l",                 # Your low price column
    close_column="c",               # Your close price column
    volume_column="vol",            # Your volume column
    symbol_column="ticker"          # Optional: symbol/ticker column
)

Notes: - Column matching is case-insensitive - All columns except symbol_column are required - If no symbol column exists, data is labeled as "CSV_DATA"

Date Formats¶

The adapter supports multiple date formats:

ISO8601 (Recommended):

config = CSVConfig(
    file_path="data/ohlcv.csv",
    schema_mapping=SchemaMapping(),
    date_format="%Y-%m-%d"  # 2023-01-15
)

ISO8601 with Time:

config = CSVConfig(
    file_path="data/ohlcv.csv",
    schema_mapping=SchemaMapping(),
    date_format="%Y-%m-%d %H:%M:%S"  # 2023-01-15 10:30:00
)

US Format:

config = CSVConfig(
    file_path="data/ohlcv.csv",
    schema_mapping=SchemaMapping(),
    date_format="%m/%d/%Y"  # 01/15/2023
)

European Format:

config = CSVConfig(
    file_path="data/ohlcv.csv",
    schema_mapping=SchemaMapping(),
    date_format="%d/%m/%Y"  # 15/01/2023
)

Unix Epoch (Auto-detected):

config = CSVConfig(
    file_path="data/ohlcv.csv",
    schema_mapping=SchemaMapping(),
    date_format=None  # Auto-detect epoch timestamps
)

CSV with epoch timestamps:

timestamp,open,high,low,close,volume
1672531200,100.50,101.00,100.00,100.80,1000000
1672617600,100.80,102.00,100.50,101.50,1100000

Auto-Detection:

config = CSVConfig(
    file_path="data/ohlcv.csv",
    schema_mapping=SchemaMapping(),
    date_format=None  # Try to auto-detect format
)

The adapter will try common formats in this order: 1. ISO8601 date (%Y-%m-%d) 2. ISO8601 datetime (%Y-%m-%d %H:%M:%S) 3. US format (%m/%d/%Y) 4. European format (%d/%m/%Y) 5. Unix epoch (seconds) 6. Unix epoch (milliseconds)

Delimiter Options¶

Auto-Detection (Recommended):

config = CSVConfig(
    file_path="data/ohlcv.csv",
    schema_mapping=SchemaMapping(),
    delimiter=None  # Auto-detect delimiter
)

Explicit Delimiter:

# Comma
config = CSVConfig(..., delimiter=",")

# Tab
config = CSVConfig(..., delimiter="\t")

# Semicolon
config = CSVConfig(..., delimiter=";")

# Pipe
config = CSVConfig(..., delimiter="|")

Timezone Handling¶

Convert timestamps from source timezone to UTC:

config = CSVConfig(
    file_path="data/ohlcv.csv",
    schema_mapping=SchemaMapping(),
    timezone="America/New_York"  # Source timezone
)

Supported Timezones: - Any valid pytz timezone name - "UTC" for timestamps already in UTC - "America/New_York" for US Eastern Time - "Europe/London" for UK time - "Asia/Tokyo" for Japan time - See pytz documentation for full list

Behavior: - All timestamps are converted to UTC internally - If timezone is "UTC", no conversion is performed - Timezone-naive timestamps are assumed to be in the specified timezone

Missing Data Strategies¶

Handle missing values in your CSV:

Fail (Default):

config = CSVConfig(
    file_path="data/ohlcv.csv",
    schema_mapping=SchemaMapping(),
    missing_data_strategy="fail"  # Raise error if missing values detected
)

Skip:

config = CSVConfig(
    file_path="data/ohlcv.csv",
    schema_mapping=SchemaMapping(),
    missing_data_strategy="skip"  # Remove rows with missing values
)

Interpolate:

config = CSVConfig(
    file_path="data/ohlcv.csv",
    schema_mapping=SchemaMapping(),
    missing_data_strategy="interpolate"  # Forward-fill missing values
)

Strategy Comparison:

Strategy	Behavior	Use Case
`fail`	Raises `InvalidDataError`	Production (ensure data quality)
`skip`	Removes incomplete rows	Minor missing data acceptable
`interpolate`	Forward-fills missing values	Preserve time series continuity

Header Row Handling¶

With Headers (Default):

config = CSVConfig(
    file_path="data/ohlcv.csv",
    schema_mapping=SchemaMapping(),
    has_header=True
)

Without Headers:

# Use column indices as names
config = CSVConfig(
    file_path="data/no_headers.csv",
    schema_mapping=SchemaMapping(
        date_column="column_1",
        open_column="column_2",
        high_column="column_3",
        low_column="column_4",
        close_column="column_5",
        volume_column="column_6"
    ),
    has_header=False
)

Example Configurations¶

Example 1: Standard CSV with Headers¶

CSV File:

Date,Open,High,Low,Close,Volume
2023-01-01,100.50,101.00,100.00,100.80,1000000
2023-01-02,100.80,102.00,100.50,101.50,1100000

Configuration:

config = CSVConfig(
    file_path="data/standard.csv",
    schema_mapping=SchemaMapping(
        date_column="Date",
        open_column="Open",
        high_column="High",
        low_column="Low",
        close_column="Close",
        volume_column="Volume"
    ),
    delimiter=",",
    has_header=True,
    date_format="%Y-%m-%d",
    timezone="UTC"
)

Example 2: Custom Format with Multiple Symbols¶

CSV File:

trade_date|ticker|o|h|l|c|vol
01/01/2023|AAPL|100.50|101.00|100.00|100.80|1000000
01/02/2023|AAPL|100.80|102.00|100.50|101.50|1100000
01/01/2023|MSFT|200.50|201.00|200.00|200.80|2000000
01/02/2023|MSFT|200.80|202.00|200.50|201.50|2100000

Configuration:



href="#__codelineno-21-1">config = CSVConfig( file_path="data/custom.csv", schema_mapping=SchemaMapping( date_column="trade_date", open_column="o", high_column="h", low_column="l", close_column="c", volume_column="vol", symbol_column="ticker" ), delimiter="|", has_header=True, date_format="%m/%d/%Y", timezone="America/New_York" class="p">) class="c1"># Fetch specific symbols class="n">adapter = CSVAdapter(config) class="n">df = await adapter.fetch( symbols=["AAPL"],  # Filter for AAPL only start_date=pd.Timestamp('2023-01-01'), end_date=pd.Timestamp('2023-01-31'), resolution='1d' class="p">)
Example 3: No Headers with Epoch Timestamps¶
CSV File:
1672531200,100.50,101.00,100.00,100.80,1000000
1672617600,100.80,102.00,100.50,101.50,1100000
1672704000,101.50,103.00,101.00,102.50,1200000

Configuration:
config = CSVConfig(
    file_path="data/no_headers.csv",
    schema_mapping=SchemaMapping(
        date_column="column_1",
        open_column="column_2",
        high_column="column_3",
        low_column="column_4",
        close_column="column_5",
        volume_column="column_6"
    ),
    delimiter=",",
    has_header=False,
    date_format=None,  # Auto-detect epoch
    timezone="UTC",
    missing_data_strategy="skip"
)

Example 4: Tab-Delimited with Time Series Gaps¶
CSV File:
Date    Open    High    Low Close   Volume
2023-01-01  100.50  101.00  100.00  100.80  1000000
2023-01-02
2023-01-03  101.50  103.00  101.00  102.50  1200000

Configuration:
config = CSVConfig(
    file_path="data/gaps.csv",
    schema_mapping=SchemaMapping(),
    delimiter="\t",
    has_header=True,
    date_format="%Y-%m-%d",
    timezone="UTC",
    missing_data_strategy="skip"  # Remove incomplete rows
)

Data Validation¶
The CSVAdapter automatically validates data quality:
OHLCV Relationship Checks:
- High >= Low
- High >= Open
- High >= Close
- Low <= Open
- Low <= Close
Data Quality Checks:
- No NULL values in required columns (after missing data handling)
- Timestamps are sorted chronologically
- No duplicate timestamps per symbol
Example Error Handling:
from rustybt.data.adapters.base import ValidationError, InvalidDataError

try:
    df = await adapter.fetch(
        symbols=[],
        start_date=pd.Timestamp('2023-01-01'),
        end_date=pd.Timestamp('2023-12-31'),
        resolution='1d'
    )
except ValidationError as e:
    print(f"Data validation failed: {e}")
    if e.invalid_rows is not None:
        print(f"Invalid rows:\n{e.invalid_rows}")
except InvalidDataError as e:
    print(f"CSV format error: {e}")

Performance Tips¶

Use Explicit Configuration: Disable auto-detection by specifying delimiter and date_format
Filter Date Range: Narrow date range to only needed data
Filter Symbols: If CSV contains multiple symbols, filter to specific ones
Optimize CSV Structure: Sort by timestamp, remove unnecessary columns
Use Parquet for Large Datasets: Convert CSV to Parquet for better performance

Sample Files¶
Example CSV files are provided in examples/data/:

sample_ohlcv.csv - Standard format with comma delimiter
sample_ohlcv_tab.csv - Tab-delimited format
sample_ohlcv_no_headers.csv - No headers with epoch timestamps
sample_ohlcv_custom_format.csv - Custom column names with pipe delimiter

Troubleshooting¶
Missing Required Columns¶
Error: InvalidDataError: Missing required columns in CSV
Solution: Verify column names match schema mapping (case-insensitive)
Date Parsing Failed¶
Error: InvalidDataError: Failed to auto-detect date format
Solution: Specify date_format explicitly
Invalid OHLCV Relationships¶
Error: ValidationError: Invalid OHLCV relationships in X rows
Solution: Check data quality - high must be >= low, etc.
Timezone Conversion Failed¶
Warning: timezone_conversion_failed, assuming_utc=True
Solution: Verify timezone name is valid pytz timezone
Integration with Data Catalog¶
Use CSVAdapter with the data catalog for caching:
from rustybt.data.catalog import DataCatalog
from rustybt.data.adapters import CSVAdapter, CSVConfig, SchemaMapping

# Create adapter
csv_config = CSVConfig(
    file_path="data/custom.csv",
    schema_mapping=SchemaMapping(...)
)
csv_adapter = CSVAdapter(csv_config)

# Register with catalog
catalog = DataCatalog(cache_dir=".cache/data")
catalog.register_adapter("custom_data", csv_adapter)

# Fetch with caching
df = await catalog.get_data(
    adapter_name="custom_data",
    symbols=["AAPL"],
    start_date=pd.Timestamp('2023-01-01'),
    end_date=pd.Timestamp('2023-12-31'),
    resolution='1d'
)

See Also¶

Creating Data Adapters
Caching System