Case study on extended statistics
First, we need to ask ourselves the following questions:
- What are statistics in PostgreSQL?
- When we run a query, the planner needs to estimate the number of rows it
will retrieve from each table. The information used for this is stored in the
catalog table
pg_statistic
- Entries there are updated by
ANALYZE
(and by autovacuum), so they are only as fresh as the most recent run
- The statistics stored, however, describe each column on its own
- What are extended statistics?
- Extended statistics are statistics objects that a user creates explicitly
- They are useful when queries are slow because the planner misestimates the combined selectivity of correlated columns
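To make the per-column part concrete, here is a minimal Python sketch of the kind of summary ANALYZE gathers for one column (a distinct-value count and most-common values with their frequencies, akin to pg_stats.most_common_vals; real ANALYZE samples the table and stores much more, such as histograms — the column values below are purely hypothetical):

```python
from collections import Counter

# Hypothetical values of a single table column.
column = [1, 1, 2, 2, 2, 3, 3, 3, 3, 4]

# Distinct-value count, akin to pg_stats.n_distinct
n_distinct = len(set(column))

# Most-common values with their frequencies
counts = Counter(column)
mcv = [(val, cnt / len(column)) for val, cnt in counts.most_common(2)]

print(n_distinct)  # 4
print(mcv)         # [(3, 0.4), (2, 0.3)]
```

Crucially, each column is summarized in isolation: nothing here records how the values of one column relate to another, which is exactly the gap extended statistics fill.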
Example
Consider the following table
SET max_parallel_workers_per_gather = 0;
create table t1 (
a int,
b int
);
insert into t1 select i/1000, i/500 from generate_series(1, 1000000) s(i);
Things to note
- The values of columns a and b are correlated: a equals b / 2 (integer division), so b fully determines a
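The shape of that correlation can be checked quickly with a small Python sketch mirroring the generate_series expression above:

```python
# Mirror of: insert into t1 select i/1000, i/500 from generate_series(1, 1000000) s(i)
rows = [(i // 1000, i // 500) for i in range(1, 1_000_001)]

# b fully determines a: every b value maps to exactly one a value...
b_to_a = {}
for a, b in rows:
    b_to_a.setdefault(b, set()).add(a)
assert all(len(s) == 1 for s in b_to_a.values())

# ...but a does not determine b: a = 0, for instance, maps to two b values
a_to_b = {}
for a, b in rows:
    a_to_b.setdefault(a, set()).add(b)
print(sorted(a_to_b[0]))  # [0, 1]
```

In PostgreSQL terms this is a functional dependency b → a, which is precisely what a statistics object of kind `dependencies` is designed to capture.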
Without parallel queries enabled
Now let’s start with a clean table
nimalanm@/tmp:test> ANALYZE t1;
ANALYZE
Time: 0.064s
nimalanm@/tmp:test> explain analyse select * from t1 where a = 0 and b = 1;
+---------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|---------------------------------------------------------------------------------------------------|
| Seq Scan on t1 (cost=0.00..19425.75 rows=25 width=8) (actual time=0.015..82.493 rows=50 loops=1) |
| Filter: ((a = 0) AND (b = 1)) |
| Rows Removed by Filter: 999950 |
| Planning Time: 0.037 ms |
| Execution Time: 82.507 ms |
+---------------------------------------------------------------------------------------------------+
EXPLAIN
Time: 0.094s
Now let’s create a STATISTICS
object telling the planner that a and b are correlated (followed by another ANALYZE so the new statistics object is populated)
CREATE STATISTICS s1 (dependencies) on a, b from t1;
nimalanm@/tmp:test> explain analyse select * from t1 where a = 0 and b = 1;
+---------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|---------------------------------------------------------------------------------------------------|
| Seq Scan on t1 (cost=0.00..19425.00 rows=50 width=8) (actual time=0.013..70.186 rows=50 loops=1) |
| Filter: ((a = 0) AND (b = 1)) |
| Rows Removed by Filter: 999950 |
| Planning Time: 0.078 ms |
| Execution Time: 70.200 ms |
+---------------------------------------------------------------------------------------------------+
EXPLAIN
Time: 0.083s
The estimate made by the optimizer is now much closer to the actual row count (rows=50 estimated vs. rows=50 actual, against rows=25 before), and there is a minor improvement in execution time. I couldn’t find a case that showed a dramatic performance difference, though.
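A hedged sketch of where the two kinds of estimate come from: without extended statistics the planner multiplies per-column selectivities as if the columns were independent; with a `dependencies` statistics object it applies, roughly, sel(b) * (d + (1 - d) * sel(a)) for a dependency b → a of degree d. The numbers below are computed exactly from the insert expression shown earlier, so they will not match the sampled transcript figures, but the mechanics are the same:

```python
N = 1_000_000
# Same data shape as t1: a = i // 1000, b = i // 500
rows = [(i // 1000, i // 500) for i in range(1, N + 1)]

actual = sum(1 for a, b in rows if a == 0 and b == 1)
sel_a = sum(1 for a, _ in rows if a == 0) / N   # selectivity of a = 0
sel_b = sum(1 for _, b in rows if b == 1) / N   # selectivity of b = 1

# Independence assumption: multiply the per-column selectivities
naive = N * sel_a * sel_b

# Functional-dependency correction for b -> a with degree d = 1
# (b fully determines a in this data set)
d = 1.0
corrected = N * sel_b * (d + (1 - d) * sel_a)

# naive lands well below one row, while corrected matches the actual count
print(actual, naive, corrected)
```

The independence assumption underestimates by orders of magnitude here, which is why the planner's pre-statistics estimate is too low.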
With parallel queries enabled
Initially I didn’t disable parallel workers. I used Postgres 12, so we get a
parallel seq scan
, which was 2x
faster and hence messed with my benchmark
nimalanm@/tmp:test> explain analyse select * from t1 where a = 0 and b = 1;
+------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|------------------------------------------------------------------------------------------------------------------|
| Gather (cost=1000.00..11675.10 rows=1 width=8) (actual time=0.202..44.202 rows=50 loops=1) |
| Workers Planned: 2 |
| Workers Launched: 2 |
| -> Parallel Seq Scan on t1 (cost=0.00..10675.00 rows=1 width=8) (actual time=14.889..29.008 rows=17 loops=3) |
| Filter: ((a = 0) AND (b = 1)) |
| Rows Removed by Filter: 333317 |
| Planning Time: 0.110 ms |
| Execution Time: 44.223 ms |
+------------------------------------------------------------------------------------------------------------------+
EXPLAIN
Time: 0.060s
nimalanm@/tmp:test> CREATE STATISTICS s1 (dependencies) on a, b from t1;
CREATE STATISTICS
Time: 0.005s
nimalanm@/tmp:test> analyse;
ANALYZE
Time: 0.108s
nimalanm@/tmp:test> explain analyse select * from t1 where a = 0 and b = 1;
+-------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|-------------------------------------------------------------------------------------------------------------------|
| Gather (cost=1000.00..11680.00 rows=50 width=8) (actual time=0.157..43.423 rows=50 loops=1) |
| Workers Planned: 2 |
| Workers Launched: 2 |
| -> Parallel Seq Scan on t1 (cost=0.00..10675.00 rows=21 width=8) (actual time=14.645..28.308 rows=17 loops=3) |
| Filter: ((a = 0) AND (b = 1)) |
| Rows Removed by Filter: 333317 |
| Planning Time: 0.162 ms |
| Execution Time: 43.446 ms |
+-------------------------------------------------------------------------------------------------------------------+
EXPLAIN
Time: 0.059s
Footnotes
Why is this not just about extended statistics, and why is this case study also about indexes?
When we encounter a performance problem, our usual instinct is to create
an index on the column involved; that is effective, but adding more and more indexes
means every index must be updated on each write. If the
columns are correlated, we can use extended statistics instead
without compromising write performance.
This is the insert performance without an index
nimalanm@/tmp:test> insert into t1 select i/1000, i/500 from generate_series(1, 1000000) s(i);
INSERT 0 1000000
Time: 1.300s (a second), executed in: 1.300s (a second)
nimalanm@/tmp:test> drop table t1;
nimalanm@/tmp:test> create table t1 (
a int,
b int
);
insert into t1 select i/1000, i/500 from generate_series(1, 1000000) s(i);
CREATE TABLE
INSERT 0 1000000
Time: 1.368s (a second), executed in: 1.368s (a second)
nimalanm@/tmp:test> create index on t1 (a);
CREATE INDEX
Time: 0.645s
nimalanm@/tmp:test> explain analyse select * from t1 where a = 0 and b = 1;
+-----------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|-----------------------------------------------------------------------------------------------------------------------|
| Bitmap Heap Scan on t1 (cost=93.93..4804.31 rows=25 width=8) (actual time=0.129..0.198 rows=500 loops=1) |
| Recheck Cond: (a = 0) |
| Filter: (b = 1) |
| Rows Removed by Filter: 499 |
| Heap Blocks: exact=5 |
| -> Bitmap Index Scan on t1_a_idx (cost=0.00..93.92 rows=5000 width=0) (actual time=0.067..0.067 rows=999 loops=1) |
| Index Cond: (a = 0) |
| Planning Time: 2.203 ms |
| Execution Time: 0.269 ms |
+-----------------------------------------------------------------------------------------------------------------------+
EXPLAIN
Time: 0.020s
nimalanm@/tmp:test> insert into t1 select i/1000, i/500 from generate_series(1, 1000000) s(i);
INSERT 0 1000000
Time: 3.211s (3 seconds), executed in: 3.211s (3 seconds)
The index is far faster for this read, but as you can see the insert time more than doubled (1.368 s to 3.211 s).
I don’t feel this is an entirely fair comparison, though, as it pits a bitmap heap scan
against a seq scan
References
- https://www.postgresql.org/docs/12/multivariate-statistics-examples.html