data
Concept Overview
Before any AFML workflow begins, raw market data must be loaded into a consistent schema, cleaned of duplicates and formatting issues, and aligned to a regular time grid. This module handles that ingestion layer.
It accepts CSV or Parquet files with flexible column naming (e.g., “timestamp”, “datetime”, “date” all map to “ts”; “ticker” or “asset” map to “symbol”) and produces a standardized Polars DataFrame with canonical OHLCV columns. Deduplication handles duplicate (symbol, timestamp) keys, and calendar alignment generates a regular grid with explicit gap markers.
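The alias resolution described above can be pictured with a small plain-Python sketch. It is not the module's implementation: the `ALIASES` table lists only the aliases named in the text, and `normalize_columns`, the lowercasing, and the collision check are assumptions made for illustration.

```python
# Hypothetical alias table: only the aliases mentioned in the text are listed.
ALIASES = {
    "timestamp": "ts",
    "datetime": "ts",
    "date": "ts",
    "ticker": "symbol",
    "asset": "symbol",
}


def normalize_columns(columns):
    """Map raw column names to canonical names (assumed case-insensitive)."""
    normalized = []
    for name in columns:
        key = name.strip().lower()
        normalized.append(ALIASES.get(key, key))
    # Two raw columns mapping to the same canonical name would be ambiguous.
    if len(set(normalized)) != len(normalized):
        raise ValueError(f"ambiguous columns after normalization: {normalized}")
    return normalized


print(normalize_columns(["Date", "Ticker", "Open", "High", "Low", "Close", "Volume"]))
# ['ts', 'symbol', 'open', 'high', 'low', 'close', 'volume']
```

Raising on collisions (for example a file that carries both "date" and "timestamp") is one possible design choice; the real loader may resolve such cases differently.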
The data quality report provides diagnostics (row counts, symbol counts, duplicate counts, gap intervals, and null counts) that should be inspected before feeding data into bars, labeling, or any downstream module.
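To make these diagnostics concrete, here is a minimal plain-Python sketch that computes a few of them over a list of dict rows. The keys `row_count`, `symbol_count`, and `duplicate_key_count` follow the report shown in the usage example; `null_count` and the function name `quality_summary` are hypothetical, and the exact counting rules of the real report are assumptions.

```python
from collections import Counter


def quality_summary(rows):
    """Summarize a list of dict rows that each carry 'symbol' and 'ts' keys."""
    key_counts = Counter((row["symbol"], row["ts"]) for row in rows)
    return {
        "row_count": len(rows),
        "symbol_count": len({row["symbol"] for row in rows}),
        # Assumption: count the surplus rows beyond the first for each key.
        "duplicate_key_count": sum(n - 1 for n in key_counts.values()),
        # Hypothetical key name; the text only says null counts are reported.
        "null_count": sum(value is None for row in rows for value in row.values()),
    }


# Made-up sample rows, for illustration only.
rows = [
    {"symbol": "AAPL", "ts": "2024-01-02", "close": 185.0},
    {"symbol": "AAPL", "ts": "2024-01-02", "close": 185.0},  # duplicate key
    {"symbol": "MSFT", "ts": "2024-01-02", "close": None},   # null value
]
print(quality_summary(rows))
# {'row_count': 3, 'symbol_count': 2, 'duplicate_key_count': 1, 'null_count': 1}
```

A pipeline could use such a summary as a gate, for instance stopping when the duplicate or null counts are nonzero.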
When to Use
Use this module as the first step when working with pre-aggregated OHLCV data (daily bars, minute bars from a vendor). If you have raw tick/trade data instead, use the data_structures module to construct bars first.
Prerequisites: A CSV or Parquet file, or an existing Polars DataFrame with OHLCV-like columns.
Alternatives: Direct Polars/pandas loading if you handle column normalization and cleaning yourself.
Key Parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| path | `str \| Path` | File path to CSV or Parquet OHLCV data | |
| symbol | `str \| None` | Symbol name if not present as a column in the data | |
| interval | `str` | Calendar alignment interval (e.g., `'1d'`, `'1h'`, `'5m'`) | `'1d'` |
| dedupe_keep | `str` | Which duplicate to keep: `'first'` or `'last'` | `'last'` |
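The effect of dedupe_keep can be illustrated with a short plain-Python sketch. The helper `dedupe` below is hypothetical and only mirrors the documented behavior (one row per (symbol, ts) key, sorted chronologically); it is not the module's code, and the sample rows are made up.

```python
def dedupe(rows, keep="last"):
    """Keep one row per (symbol, ts) key; `keep` picks the first or last seen."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    chosen = {}
    for row in rows:
        key = (row["symbol"], row["ts"])
        # 'last' always overwrites; 'first' only stores a key once.
        if keep == "last" or key not in chosen:
            chosen[key] = row
    # Sort by symbol, then chronologically (ISO date strings sort correctly).
    return sorted(chosen.values(), key=lambda r: (r["symbol"], r["ts"]))


rows = [
    {"symbol": "AAPL", "ts": "2024-01-02", "close": 185.0},
    {"symbol": "AAPL", "ts": "2024-01-02", "close": 185.6},  # later revision
    {"symbol": "AAPL", "ts": "2024-01-03", "close": 184.2},
]
print([r["close"] for r in dedupe(rows, keep="last")])   # [185.6, 184.2]
print([r["close"] for r in dedupe(rows, keep="first")])  # [185.0, 184.2]
```

Keeping the last row suits feeds where later rows are corrections of earlier ones; 'first' suits feeds where the original print should win.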
Usage Examples
Python
Load, clean, and inspect OHLCV data
```python
from openquant.data import load_ohlcv, data_quality_report, align_calendar

# Load from CSV/Parquet with auto column normalization
df, report = load_ohlcv("prices.csv", symbol="AAPL", return_report=True)
print(report)
# {'row_count': 5040, 'symbol_count': 1, 'duplicate_key_count': 0, ...}

# Align to regular calendar (fills gaps with nulls + is_missing_bar flag)
aligned = align_calendar(df, interval="1d")

# Quality report on any DataFrame
quality = data_quality_report(df)
```
Common Pitfalls
- Forgetting to check the quality report for gaps: missing bars silently create NaN features downstream.
- Using align_calendar with an interval shorter than the data’s actual frequency — this creates many synthetic missing-bar rows.
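One way to guard against the second pitfall is to compare the requested interval with the typical spacing of the data before aligning. The sketch below is a standalone illustration using only the standard library; `parse_interval`, `typical_spacing`, and `interval_is_too_fine` are hypothetical helpers, and the supported unit suffixes are assumed from the examples '1d', '1h', and '5m'.

```python
from datetime import datetime, timedelta
from statistics import median

# Assumed unit suffixes, taken from the interval examples in the parameter table.
UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_interval(interval):
    """Turn a string such as '5m', '1h', or '1d' into a timedelta."""
    value, unit = int(interval[:-1]), interval[-1]
    return timedelta(**{UNITS[unit]: value})


def typical_spacing(timestamps):
    """Median gap between consecutive timestamps (needs at least two)."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return median(gaps)


def interval_is_too_fine(timestamps, interval):
    """True when the grid step is shorter than the data's typical spacing."""
    return parse_interval(interval) < typical_spacing(timestamps)


# Daily timestamps with a weekend gap between the 5th and the 8th.
daily = [datetime(2024, 1, day) for day in (2, 3, 4, 5, 8)]
print(interval_is_too_fine(daily, "1h"))  # True
print(interval_is_too_fine(daily, "1d"))  # False
```

The median is used rather than the minimum or mean so that occasional long gaps, such as weekends, do not distort the estimate.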
API Reference
Python API
- data.load_ohlcv
- data.clean_ohlcv
- data.align_calendar
- data.data_quality_report
Key Functions
- load_ohlcv
- clean_ohlcv
- align_calendar
- data_quality_report
Implementation Notes
- Column aliases are resolved automatically (e.g., ‘timestamp’ → ‘ts’, ‘ticker’ → ‘symbol’).
- clean_ohlcv deduplicates by (symbol, ts) and sorts chronologically.
- align_calendar marks missing bars with is_missing_bar=True for downstream imputation logic.
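The calendar alignment described in the last note can be sketched in plain Python for a single symbol. This is an illustration under stated assumptions, not the module's algorithm: `align_to_grid` is a hypothetical helper, the grid is a fixed-step range from the first to the last timestamp, and no exchange calendar (weekends, holidays) is taken into account.

```python
from datetime import datetime, timedelta


def align_to_grid(rows, step):
    """Return one row per grid point between the first and last timestamp.

    Grid points without data get None values and is_missing_bar=True.
    """
    by_ts = {row["ts"]: row for row in rows}
    value_columns = [name for name in rows[0] if name != "ts"]
    empty = dict.fromkeys(value_columns)  # every value column set to None
    ts, end = min(by_ts), max(by_ts)
    aligned = []
    while ts <= end:
        if ts in by_ts:
            aligned.append({**by_ts[ts], "is_missing_bar": False})
        else:
            aligned.append({"ts": ts, **empty, "is_missing_bar": True})
        ts += step
    return aligned


# Made-up sample bars with no row for 2024-01-04.
bars = [
    {"ts": datetime(2024, 1, 2), "close": 185.0},
    {"ts": datetime(2024, 1, 3), "close": 184.2},
    {"ts": datetime(2024, 1, 5), "close": 181.9},
]
aligned = align_to_grid(bars, timedelta(days=1))
for row in aligned:
    print(row["ts"].date(), row["close"], row["is_missing_bar"])
```

Downstream imputation can then filter on the flag instead of guessing which nulls came from the source file and which were introduced by alignment.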