Bad Floating Point
Hey,

This is going to be a bit of an esoteric ticket. I recently noticed some strange behavior when comparing Spectrum and Redshift results on the same dataset.

Redshift Data: fare.txt
Parquet Data: fare.parquet

The Parquet data was generated from fare.txt with PySpark, using convert.py on Spark 2.2.0.
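The conversion looks roughly like this (a sketch, not the verbatim convert.py; the schema and reader options in the real script may differ):

# Sketch of the fare.txt -> Parquet conversion (approximate; the real
# convert.py may use different options). Assumes comma-separated id,fare lines.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("convert").getOrCreate()

schema = StructType([
    StructField("id", StringType()),
    StructField("fare", DoubleType()),
])

df = spark.read.csv("fare.txt", schema=schema)
df.write.parquet("s3://redshift-scratch/fare_parquet/")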
Redshift Table Schema:

CREATE TABLE public.fare (
    "id"   VARCHAR,
    "fare" DOUBLE PRECISION
);

Spectrum Table Schema:

CREATE EXTERNAL TABLE spectrum.fare (
    "id"   VARCHAR,
    "fare" DOUBLE PRECISION
)
STORED AS PARQUET
LOCATION 's3://redshift-scratch/fare_parquet/';
NOTE:

I queried the data in several ways and got some very strange results. The following query gets the count of distinct ids and the sum of the fare field from both tables. Spectrum and Redshift return slightly different sums, but I initially chalked this up to floating point issues:

select
    count(distinct o.id) as redshift_count,
    count(distinct s_o.id) as spectrum_count,
    sum(o.fare) as redshift_sum,
    sum(s_o.fare) as spectrum_sum
from
    public.fare as o,
    spectrum.fare as s_o;

https://captured-pics.s3.amazonaws.com/yYPqvSHHhfn74XGRvvuVj7zFYIVCal3xgU.png
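For context on why I initially wrote the small differences off: floating point addition is not associative, so summing the same set of doubles in a different order can legitimately produce slightly different totals, and the order a database sums in depends on its execution plan. A quick illustration in Python (the values are made up, not from fare.txt):

# Floating point addition is not associative: summing the same values in a
# different order usually gives a slightly different total.
import math
import random

values = [random.uniform(0.0, 100.0) for _ in range(100000)]

forward = sum(values)
backward = sum(reversed(values))

print(forward == backward)   # usually False
print(forward - backward)    # tiny but nonzero difference
print(math.fsum(values))     # correctly rounded sum, for reference

What surprised me is not that Spectrum and Redshift disagree slightly, but that whether they disagree depends on which other columns appear in the select list.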
--------------------------------------------------------------------------------------------------
However, when only one of the distinct counts is included, the sums come back exactly equal:

select
    count(distinct o.id) as redshift_count,
    -- count(distinct s_o.id) as spectrum_count,
    sum(o.fare) as redshift_sum,
    sum(s_o.fare) as spectrum_sum
from
    public.fare as o,
    spectrum.fare as s_o;

https://captured-pics.s3.amazonaws.com/q2dlKsU1gCu0kmCy4p9pBURJwzVDJFoWiO.png

--------------------------------------------------------------------------------------------------
It doesn't matter which table's id is used for the single count; the sums match either way:

select
    -- count(distinct o.id) as redshift_count,
    count(distinct s_o.id) as spectrum_count,
    sum(o.fare) as redshift_sum,
    sum(s_o.fare) as spectrum_sum
from
    public.fare as o,
    spectrum.fare as s_o;

https://captured-pics.s3.amazonaws.com/XYbAMGIrqbnhwi9SGbId2B66il3dNiVIJZ.png

--------------------------------------------------------------------------------------------------
However, if the counts are changed from count(distinct ...) to plain count(...), the sums no longer match!

select
    count(o.*) as redshift_count,
    count(s_o.*) as spectrum_count,
    sum(o.fare) as redshift_sum,
    sum(s_o.fare) as spectrum_sum
from
    public.fare as o,
    spectrum.fare as s_o;

https://captured-pics.s3.amazonaws.com/Blw1pUfMdBm8j2ZLxFZRod4aXO6rCwLOlr.png

--------------------------------------------------------------------------------------------------
And unlike the distinct case, dropping down to a single count does not bring the sums back in line:

select
    count(o.*) as redshift_count,
    -- count(s_o.*) as spectrum_count,
    sum(o.fare) as redshift_sum,
    sum(s_o.fare) as spectrum_sum
from
    public.fare as o,
    spectrum.fare as s_o;

https://captured-pics.s3.amazonaws.com/byAIEQmXA0PntSKXATHZuyeQ8GJZ5PIW7y.png

--------------------------------------------------------------------------------------------------
The same holds with only the Spectrum-side count:

select
    -- count(o.*) as redshift_count,
    count(s_o.*) as spectrum_count,
    sum(o.fare) as redshift_sum,
    sum(s_o.fare) as spectrum_sum
from
    public.fare as o,
    spectrum.fare as s_o;

https://captured-pics.s3.amazonaws.com/hempPkPhC0frjcFJXQiKE8cWAIORnOArpN.png

--------------------------------------------------------------------------------------------------
There is a lot of weirdness going on here. I would like to be able to rely on floating point calculations being at least consistent, rather than dependent on which other columns appear in the select list. At the moment our workaround is to round results to an appropriate number of significant figures before comparing, but this is a very confusing bug to be sure. I have even occasionally seen the same query return slightly different results (out at the 10th+ decimal place) on subsequent runs, though I have not been able to reproduce that reliably yet.
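For what it's worth, our comparison workaround looks roughly like this (a sketch; the number of significant figures is something we tune per dataset, and 10 here is just an example):

# Round both sums to a fixed number of significant figures before comparing.
# The example values and the choice of 10 significant figures are
# illustrative only.
from math import floor, log10

def round_sig(x, sig=10):
    """Round x to sig significant figures."""
    if x == 0:
        return 0.0
    return round(x, sig - int(floor(log10(abs(x)))) - 1)

redshift_sum = 114297216.27999999   # example values, not from the screenshots
spectrum_sum = 114297216.28000003

assert round_sig(redshift_sum) == round_sig(spectrum_sum)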