Skip to content

Instantly share code, notes, and snippets.

@javisantana
Last active May 27, 2016 11:16
Show Gist options
  • Save javisantana/29d42534774b5de128af5228094206a7 to your computer and use it in GitHub Desktop.
Save javisantana/29d42534774b5de128af5228094206a7 to your computer and use it in GitHub Desktop.

Sampling to calculate buckets

I ws thinking on using the new TABLESAMPLE feature included in postgres 9.5 in order to calculate clusters for visualization. Obvioulsy doing sampling adds some error to the final clusters but in the other hand provides (theorically) some time improvemens (avoid full scans with SYSTEM method and reduces CPU usage).

I only tested with this table, it should be tested with different data distributions and datasizes but looks promising for some clustering methods.

timing

jenks

BERNOULLI

cartodb_dev_user_accf0647-d942-4e37-b129-8287c117e687_db=# select CDB_JenksBins(array_agg(vel::numeric), 7) from sensor_log_2013_10_27_12_01_1 TABLESAMPLE BERNOULLI (10);
                              cdb_jenksbins
-------------------------------------------------------------------------
 {4.3573446,10.612988,15.145245,18.386246,21.200254,24.348658,29.323336}
(1 row)
Time: 1617.736 ms

SYSTEM

cartodb_dev_user_accf0647-d942-4e37-b129-8287c117e687_db=# select CDB_JenksBins(array_agg(vel::numeric), 7) from sensor_log_2013_10_27_12_01_1 TABLESAMPLE SYSTEM (10);                                                         LOG:  statement: select CDB_JenksBins(array_agg(vel::numeric), 7) from sensor_log_2013_10_27_12_01_1 TABLESAMPLE SYSTEM (10);
                             cdb_jenksbins
------------------------------------------------------------------------
 {1.7491112,14.543345,17.362501,20.191946,21.745567,23.99369,28.706001}
(1 row)

Time: 1710.083 ms

100% sampling

cartodb_dev_user_accf0647-d942-4e37-b129-8287c117e687_db=# select CDB_JenksBins(array_agg(vel::numeric), 7) from sensor_log_2013_10_27_12_01_1;
                              cdb_jenksbins
-------------------------------------------------------------------------
 {5.1650224,13.926011,16.328466,18.545721,21.298002,23.700457,29.858356}
(1 row)

Time: 2261.816 ms

Quantile bins

cartodb_dev_user_accf0647-d942-4e37-b129-8287c117e687_db=# select CDB_QuantileBins(array_agg(vel::numeric), 4) from sensor_log_2013_10_27_12_01_1;
             cdb_quantilebins
-------------------------------------------
 {13.236655,17.239033,21.843311,29.858356}
(1 row)

Time: 23.375 ms
cartodb_dev_user_accf0647-d942-4e37-b129-8287c117e687_db=# select CDB_QuantileBins(array_agg(vel::numeric), 4) from sensor_log_2013_10_27_12_01_1 TABLESAMPLE BERNOULLI (10);
           cdb_quantilebins
---------------------------------------
 {13.236655,17.3162,21.8999,29.498245}
(1 row)

Time: 3.200 ms
cartodb_dev_user_accf0647-d942-4e37-b129-8287c117e687_db=# select CDB_QuantileBins(array_agg(vel::numeric), 4) from sensor_log_2013_10_27_12_01_1 TABLESAMPLE SYSTEM (10);
                  cdb_quantilebins
-----------------------------------------------------
 {11.662456,17.8255015000000000,22.712723,29.652576}
(1 row)
Time: 2.816 ms

error

avg percentage and std deviation on 10% sampling for jenks, quartiles and head and tails methods

cartodb_dev_user_accf0647-d942-4e37-b129-8287c117e687_db=# select avg(100*abs(s10 - s100)/s100), stddev(abs(s10 - s100)/s100) from unnest((select CDB_QuantileBins(array_agg(vel::numeric), 4) from sensor_log_2013_10_27_12_01_1), (select CDB_QuantileBins(array_agg(vel::numeric), 4) from sensor_log_2013_10_27_12_01_1 TABLESAMPLE SYSTEM (10))) x(s100, s10);
        avg         |                   stddev
--------------------+--------------------------------------------
 3.2430976170828547 | 0.0103033362384431124316161531298645641616
 
 cartodb_dev_user_accf0647-d942-4e37-b129-8287c117e687_db=# select avg(100*abs(s10 - s100)/s100), stddev(abs(s10 - s100)/s100) from unnest((select CDB_JenksBins(array_agg(vel::numeric), 4) from sensor_log_2013_10_27_12_01_1), (select CDB_JenksBins(array_agg(vel::numeric), 4) from sensor_log_2013_10_27_12_01_1 TABLESAMPLE SYSTEM (10))) x(s100, s10);
LOG:  statement: select avg(100*abs(s10 - s100)/s100), stddev(abs(s10 - s100)/s100) from unnest((select CDB_JenksBins(array_agg(vel::numeric), 4) from sensor_log_2013_10_27_12_01_1), (select CDB_JenksBins(array_agg(vel::numeric), 4) from sensor_log_2013_10_27_12_01_1 TABLESAMPLE SYSTEM (10))) x(s100, s10);
           avg           |                   stddev
-------------------------+--------------------------------------------
 32.37922940131385264271 | 0.5877181351833277149595473155945874389842
(1 row)

Time: 3558.188 ms

cartodb_dev_user_accf0647-d942-4e37-b129-8287c117e687_db=# select avg(100*abs(s10 - s100)/s100), stddev(abs(s10 - s100)/s100) from unnest((select CDB_HeadsTailsBins(array_agg(vel::numeric), 4) from sensor_log_2013_10_27_12_01_1), (select CDB_HeadsTailsBins(array_agg(vel::numeric), 4) from sensor_log_2013_10_27_12_01_1 TABLESAMPLE SYSTEM(10))) x(s100, s10);
LOG:  statement: select avg(100*abs(s10 - s100)/s100), stddev(abs(s10 - s100)/s100) from unnest((select CDB_HeadsTailsBins(array_agg(vel::numeric), 4) from sensor_log_2013_10_27_12_01_1), (select CDB_HeadsTailsBins(array_agg(vel::numeric), 4) from sensor_log_2013_10_27_12_01_1 TABLESAMPLE SYSTEM(10))) x(s100, s10);
          avg           |                   stddev
------------------------+--------------------------------------------
 1.85379607859246236981 | 0.0094798681590236172686651446763114294001

With BERNOULLI the results are better

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment