Last active
July 6, 2025 07:23
-
-
Save databento-bot/833ba14d87ffa09dbcb36a38b0c832e6 to your computer and use it in GitHub Desktop.
Interpolate OHLCV-1s on Databento to ensure exactly one row per second
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
""" | |
Databento follows the standard convention of only printing OHLCV records when | |
trades actually occur within the interval. If no trade occurs within the interval, | |
no record is printed. | |
This approach is adopted by most data vendors for two key reasons: | |
1. Multiple interpolation strategies exist and the optimal choice depends on the | |
specific use case. Client-side interpolation keeps the strategy transparent. | |
2. This reduces storage and bandwidth requirements, especially for markets with | |
many illiquid instruments like options. | |
To learn more, see: https://databento.com/docs/schemas-and-data-formats/ohlcv | |
This example demonstrates one way to interpolate OHLCV-1s data to ensure exactly | |
one row per second. The interpolation strategy used: | |
- Forward fill the close price from the last known value | |
- Set open, high, and low equal to the forward-filled close price | |
- Set volume to 0 for interpolated periods | |
- Forward fill metadata columns | |
""" | |
import pandas as pd | |
import matplotlib.pyplot as plt | |
import databento as db | |
client = db.Historical() | |
data = client.timeseries.get_range( | |
dataset="XNAS.ITCH", | |
schema="ohlcv-1s", | |
symbols="AAPL", | |
start="2021-04-09", | |
end="2021-04-10", | |
) | |
df = data.to_df() | |
# Create a complete second-by-second index using specified start/end times | |
start_time = pd.Timestamp("2021-04-09") | |
end_time = pd.Timestamp("2021-04-10") | |
complete_index = pd.date_range(start=start_time, end=end_time, freq='1s', tz='UTC', inclusive='left') | |
# Reindex to include all seconds | |
df_interp = df.reindex(complete_index) | |
# Forward fill metadata columns first, then backfill if needed | |
metadata_cols = ['symbol', 'instrument_id', 'publisher_id', 'rtype'] | |
df_interp[metadata_cols] = df_interp[metadata_cols].ffill().bfill() | |
# Forward fill close prices first (may remain NaN if no prior data exists) | |
df_interp['close'] = df_interp['close'].ffill() | |
# For seconds with no trades, set open/high/low equal to the forward-filled close | |
# and volume to 0 (close may still be NaN if no data exists to forward fill from) | |
missing_mask = df_interp['open'].isna() | |
df_interp.loc[missing_mask, 'open'] = df_interp.loc[missing_mask, 'close'] | |
df_interp.loc[missing_mask, 'high'] = df_interp.loc[missing_mask, 'close'] | |
df_interp.loc[missing_mask, 'low'] = df_interp.loc[missing_mask, 'close'] | |
df_interp.loc[missing_mask, 'volume'] = 0 | |
print(df_interp.head(10)) | |
df['open'].plot() | |
plt.show() |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment