# Reddit comment 2024-04-03

> See https://www.reddit.com/r/algotrading/comments/1bu59ql/comment/kxuil9a
There will always be some differences in the vendor's infrastructure used to process real-time vs. historical data. It takes a bit of effort to make these as identical as possible. A non-exhaustive list of the issues:
1. The most common issue I've seen is that the vendor will clean and patch their historical data ex post in ways that are not replicable in real-time. (The most obvious tell is if you report a data error and they tell you it's patched within the same day.) This is one area where Bloomberg is quite good despite doing it the "wrong" way - they have a strong data model and provenance/versioning. The "better" approach is to just give you the raw data and only apply corrections by changing the real-time parser behavior and regenerating from scratch - MayStreet, Databento, Pico, and Exegy take this approach.
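A minimal sketch of that tell, assuming you keep your own real-time capture of a session and later download the vendor's history for the same day (the file names and CSV layout here are hypothetical):

```python
import csv

def load_ticks(path):
    """Load rows keyed by sequence number so the two captures can be joined."""
    with open(path, newline="") as f:
        return {row["seq"]: (row["ts"], row["price"], row["size"])
                for row in csv.DictReader(f)}

live = load_ticks("live_capture_2024-04-03.csv")    # recorded off the feed
hist = load_ticks("vendor_history_2024-04-03.csv")  # downloaded weeks later

# Events that were silently altered, dropped, or inserted point to ex-post
# patching rather than regeneration from raw captures.
altered = {k for k in live.keys() & hist.keys() if live[k] != hist[k]}
print(f"altered: {len(altered)}, "
      f"dropped: {len(live.keys() - hist.keys())}, "
      f"inserted: {len(hist.keys() - live.keys())}")
```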
2. We've also seen vendors do an opaque mix and match of feeds and derived data, e.g. SIP historical with IEX/Nasdaq real-time, or synthetic prices. Some ATSes do this with weighted midprice, etc. (This is something that institutional providers like Databento avoid by strictly giving you the same feed or feeds. Other notable ones that are good in this regard: QuantHouse, Activ, Exegy, Pico.)
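For reference, the size-weighted midprice mentioned above is usually computed along these lines; the point is that it's a derived quantity, not a print from any feed, so a history built on it won't line up with a real-time trade feed (values here are illustrative):

```python
def weighted_mid(bid: float, bid_size: int, ask: float, ask_size: int) -> float:
    # Each side is weighted by the *opposite* side's size, so heavy resting
    # bid depth pulls the value toward the ask, and vice versa.
    return (bid * ask_size + ask * bid_size) / (bid_size + ask_size)

print(weighted_mid(100.00, 500, 100.02, 100))  # ~100.0167, skewed toward the ask
```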
3. Another egregious issue is that the vendor will backfill from secondary redistributors of drastically different quality and mix and match the results. We often see this kind of backfill-and-rebadge done with ICE/IDS, Xignite, dxFeed, IEXCloud, Quodd/Nanex, and Refinitiv data, because those sources are more liberal with historical redistribution. (You don't see Bloomberg's data getting bootlegged, since they restrict historical redistribution.) And we've always seen the rebadged data be much, much worse than the original. This is a more common issue to look out for among "newer" vendors - including us - since a vendor started in, say, 2019 obviously needs another source for data dating back to, say, 2010. The telltale sign is if the data is suspiciously cheap AND the vendor is not an official licensed distributor in the exchange directories. There's no reason good data must be expensive, but it's easier to make it cheap when you're rebadging, because secondary sources tend to be cheaper, so your margins are higher. Another way to tell is just to compare their oldest data to their newest. (This is why Databento doesn't have data going that far back - we only trust primary sources like the exchange or raw packet captures.)
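One way to run that oldest-vs-newest comparison is to fingerprint a session from each era: crude stats like message count and timestamp precision often jump at the splice point where a rebadged backfill ends and the vendor's own capture begins. Again, file names and the CSV layout are hypothetical:

```python
import csv

def fingerprint(path):
    """Return (message count, max fractional timestamp digits observed)."""
    n, digits = 0, 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            n += 1
            frac = row["ts"].partition(".")[2].rstrip("0")  # e.g. "09:30:00.123456"
            digits = max(digits, len(frac))
    return n, digits

for day in ("2010-06-01", "2019-06-03"):  # one old session, one recent one
    msgs, digits = fingerprint(f"ticks_{day}.csv")
    print(f"{day}: {msgs} msgs, timestamps to {digits} fractional digits")
```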
4. Another issue is when the timestamps are drastically different in historical vs. real-time. This is an area where the legacy Refinitiv Tick History (non-MayStreet) is ironically quite good - they address it by being equally bad in both historical and real-time, consolidating history and real-time through their Docklands hub.
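And a sketch of checking for that timestamp mismatch, assuming the same kind of hypothetical capture files as above, joined on sequence number with vendor timestamps in nanoseconds:

```python
import csv
import statistics

def load_ts(path):
    """Map sequence number -> vendor timestamp in nanoseconds."""
    with open(path, newline="") as f:
        return {row["seq"]: int(row["ts_ns"]) for row in csv.DictReader(f)}

live = load_ts("live_capture.csv")
hist = load_ts("vendor_history.csv")

# A wide or shifted delta distribution means historical and real-time don't
# share a clock or capture point, so backtests won't line up with production.
deltas = [hist[k] - live[k] for k in live.keys() & hist.keys()]
print(f"median skew: {statistics.median(deltas)} ns, "
      f"stdev: {statistics.pstdev(deltas):.0f} ns over {len(deltas)} events")
```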
I've named the firms that are decent, but you can probably come to your own conclusion about which ones are bad by omission. I don't mind naming good firms and giving credit where it's due, even to competitors, but I prefer not to name-drop ones that are egregiously bad.