The PostgreSQL Query Planner

Atri Sharma
What's this talk based on?
● Research done by hackers like Robert Haas in the past, some mailing list readings, discussions with smart people, etc.
Who's this talk for?
● Users of Postgres who want to know how the query planner's heuristics can affect their queries.
What are we going to discuss?
● A small discussion about the Postgres query planner.
● Areas where problems arise in the query planner.
● Some cases where the planner has some nifty points. Some of the cases may be well known. However, they are worthy of discussion here, if only to see how the planner processes them internally.
A little about the PostgreSQL Query Planner
● Quoting Bruce Momjian: 'The optimizer is the "brain" of the database, interpreting SQL queries and determining the fastest method of execution.'
Where can problems arise in the query planner?
● Row Count Estimates
● DBA Ideas
● Planner's Blocking Points (what the planner can't do at this point in time)
Row Count Stats
● The backend essentially holds MCVs (most common values) and a histogram of the distribution of the rest of the values.
● The number of MCVs stored is capped by a value known as the statistics target.
● Incorrect or outdated counts can lead to bad selectivity estimates, which in turn lead to poor plans.
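The statistics described above can be inspected through the pg_stats system view. The following is a minimal sketch; the schema, table and column names (public, orders, status) are hypothetical and only serve as placeholders.

```sql
-- Hypothetical table and column names, for illustration only.
SELECT attname,
       n_distinct,
       most_common_vals,   -- the MCV list
       most_common_freqs,  -- estimated frequency of each MCV
       histogram_bounds    -- histogram over the remaining values
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename  = 'orders'
  AND attname    = 'status';

-- The statistics target used for columns without a per-column override.
SHOW default_statistics_target;
```

If the MCV list or the histogram looks nothing like the data actually in the table, the statistics are probably stale.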
User Settings
● PostgreSQL depends quite a bit on the page cost settings (seq_page_cost, random_page_cost) and others. Setting them to arbitrary values can cause the query planner to choose an inefficient plan over an efficient alternative.
Planner Not-Done-Yets
● Some queries are slow to execute because their optimization depends on a feature that is not yet present in the query planner.
Selectivity Estimate Issues
● Where the planner has to fall back on arbitrary default estimates, the resulting values can be badly off.
● Cross-table correlations aren't taken into consideration.
● The planner doesn't understand expressions like (a + 0 = a).
Settings Issues
● DBAs can use some settings to tune the query planner.
● Those settings can be harmful if set to incorrect values.
Planner Cost Evaluation Parameters
● seq_page_cost = 1.0
● random_page_cost = 4.0
● cpu_tuple_cost = 0.01
● cpu_index_tuple_cost = 0.005
● cpu_operator_cost = 0.0025
● The planner uses these parameters (shown here with their default values) to estimate the costs of plans.
Planner Cost Evaluation Parameters Continued
● random_page_cost abstracts the cost of reading a random page, and seq_page_cost abstracts the same for a sequential read.
● In real-world scenarios, setting random_page_cost to a lower value can help get better plans, since modern disks have reduced the gap between random and sequential page read costs.
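One way to try this safely is to change the setting for a single session and compare plans. This is only a sketch: the value 1.1 is an illustrative choice rather than a recommendation, and the orders table with its customer_id column is hypothetical.

```sql
-- Inspect the current values.
SHOW seq_page_cost;
SHOW random_page_cost;

-- Lower the random page cost for this session only (illustrative value).
SET random_page_cost = 1.1;

-- Hypothetical table and column; compare this plan with the one
-- produced before the SET to see whether index scans become preferred.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- Return to the configured value.
RESET random_page_cost;
```

Because SET only affects the current session, other connections keep using the configured value while you experiment.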
effective_cache_size
● Sets the planner's assumption about the effective size of the disk cache that is available to a single query.
● A common starting point is about 75% of RAM on a dedicated database server.
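As a small worked example, assume a dedicated server with 16 GB of RAM (a hypothetical figure). 75% of that is 12 GB:

```sql
-- Assumes a dedicated 16 GB server (hypothetical); 0.75 * 16 GB = 12 GB.
SET effective_cache_size = '12GB';
SHOW effective_cache_size;
```

This setting does not allocate any memory. It only tells the planner how much caching to assume when it estimates the cost of index scans.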
Cases
● Some examples of situations where the planner's decisions can lead to a bad plan.
● Some cases are not strictly related to the query planner, but are fun in general!
WITH
● One of the classic cases.
● WITH is an optimization barrier.
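Why does WITH face performance issues?
● We have to understand how WITH is processed by the planner.
● The query planner intentionally runs the WITH part first. It divides the query into separate plans, which doesn't match up with the pull architecture of the executor.
● PostgreSQL 12 and later can fold a non-recursive, side-effect-free CTE that is referenced only once into the main query, unless it is marked MATERIALIZED.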
Solution
● Use JOINs instead.
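The rewrite suggested here can be sketched as follows. The customers and orders tables and their columns are hypothetical; the point is only the shape of the two queries.

```sql
-- Hypothetical tables: customers(id, region), orders(id, customer_id, total).

-- CTE form: where WITH acts as a barrier, the aggregate over all
-- customers is computed in full before the outer filter is applied.
WITH order_totals AS (
    SELECT customer_id, sum(total) AS total_spent
    FROM orders
    GROUP BY customer_id
)
SELECT c.id, t.total_spent
FROM customers c
JOIN order_totals t ON t.customer_id = c.id
WHERE c.id = 42;

-- Join form: the planner sees one query and is free to apply the
-- filter on customer 42 before aggregating.
SELECT c.id, sum(o.total) AS total_spent
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE c.id = 42
GROUP BY c.id;
```

Comparing the EXPLAIN output of both forms on your own data is the most reliable way to see whether the barrier matters for a given query.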
Feature Alert!
● The current implementation approach for GROUPING SETS uses a CTE-based architecture. Hence, it suffers from the same problem of divided query plans. If this problem is solved, we could have an efficient implementation of GROUPING SETS in Postgres, which would be very nice.
Materialized Views
● Joining a materialized view back to its parent table is expensive, since the planner does not know about any correlation between the two, even though the two relations depend on each other.
Solution
● Use intermediate tables to store the intermediate results.
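A minimal sketch of the intermediate-table approach, assuming a hypothetical parent table orders and a hypothetical materialized view order_summary_mv built over it:

```sql
-- Hypothetical names: orders is the parent table,
-- order_summary_mv is a materialized view over it.
CREATE TEMP TABLE big_spenders AS
SELECT customer_id, total_spent
FROM order_summary_mv
WHERE total_spent > 1000;

-- Collect statistics so the planner knows the size and distribution
-- of the intermediate result.
ANALYZE big_spenders;

SELECT o.*
FROM orders o
JOIN big_spenders b ON b.customer_id = o.customer_id;
```

The explicit ANALYZE matters here: a freshly created temporary table has no statistics until it is analyzed.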
Aggregates
● A call like agg(DISTINCT ...) does not allow the planner to use a HashAgg node, which leads to slower query plans.
Solution
● Use subqueries instead.
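The subquery rewrite might look like the sketch below. The orders table and customer_id column are hypothetical. Note the IS NOT NULL filter: count(DISTINCT x) ignores NULLs, so the rewritten form has to exclude them to return the same number.

```sql
-- Hypothetical table: orders(customer_id, ...).

-- DISTINCT inside the aggregate call.
SELECT count(DISTINCT customer_id) FROM orders;

-- Subquery form: the inner DISTINCT is an ordinary query level,
-- which the planner may implement with a HashAggregate node.
SELECT count(*)
FROM (SELECT DISTINCT customer_id
      FROM orders
      WHERE customer_id IS NOT NULL) AS d;
```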
EXISTS and NOT EXISTS
● Using EXISTS and NOT EXISTS in your queries is a good idea and will generally lead to better plans. Use them instead of complete queries if you only want to test for the presence or absence of specific values.
Reason
● EXISTS and NOT EXISTS allow the planner to use semi-joins and anti-joins, respectively. These are much more efficient than full joins for just testing the presence or absence of values.
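As an illustration, with hypothetical customers and orders tables, the two forms look like this. Running them under EXPLAIN should show whether the planner actually chose a semi-join or an anti-join on your data.

```sql
-- Hypothetical tables: customers(id, name), orders(id, customer_id).

-- Customers with at least one order; a candidate for a semi-join.
SELECT c.id, c.name
FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);

-- Customers with no orders; a candidate for an anti-join.
SELECT c.id, c.name
FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);
```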
Prepared Statements
● Prepared statements are a recurring source of query performance problems.
● They perform much better now, though, thanks to some recent work.
What is bad with prepared statements?
● Before 9.2, the user had a choice between using prepared statements with generic plans, or not using them and replanning every time.
● From 9.2 onwards, the user doesn't have that restriction, but the downside is that the user has no control over whether a prepared statement is replanned.
Pre 9.2
● Prepared statements are good for workloads where the selectivity of the queries stays within a small range. Widely varying selectivity values can lead to bad performance for prepared statements.
Post 9.2
● The point to look at post 9.2 is whether the planner is overlooking a better alternative plan and choosing a suboptimal one.
Example
● select * from sometable where id = any ($1); where $1 was a small array.
● The generic plan for this is of course trivial and would have worked fine.
● But the planner has to estimate the cost of the generic plan without knowing how many elements are in the array.
Example continued
● It estimated an array size for the generic plan that was much higher than the actual size of the array.
● That meant that the generic plan was estimated at about twice the cost of the custom plan, even though it was the exact same plan.
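The situation in this example can be reproduced along the following lines. This is a sketch only: sometable is the table from the slide, assumed here to have an integer id column, and by_ids is a hypothetical statement name.

```sql
-- Assumes sometable(id integer, ...) exists; by_ids is a made-up name.
PREPARE by_ids(integer[]) AS
    SELECT * FROM sometable WHERE id = ANY ($1);

-- Shows the plan chosen for this particular parameter value.
EXPLAIN EXECUTE by_ids(ARRAY[1, 2, 3]);

-- Clean up the prepared statement.
DEALLOCATE by_ids;
```

Repeating the EXPLAIN EXECUTE several times and watching whether the plan switches from showing the literal array to showing $1 is one way to tell whether a generic plan has been adopted. PostgreSQL 12 and later also provide a plan_cache_mode setting to influence that choice.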
Takeaway lesson
● You need to be wary when using prepared statements, pre 9.2 or post 9.2.
● Prepared statements are much better than before, but some surprises may still turn up.
Join Reordering
● The planner may reorder the joins present in a query based on some heuristics that it thinks will lead to better plans.
● Generally, this is a good idea.
● We should not assume that we have a better idea about the optimal join order.
● Still, if we are inclined to do so, we have a setting known as join_collapse_limit.
Gotcha!
● We have a catch here, though.
● Lowering join_collapse_limit can lead to suboptimal plans, but shorter planning times.
● If you care more about planning speed than about suboptimal plans, set join_collapse_limit to a lower value.
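The effect of the setting can be observed per session. In this sketch the tables a, b and c and their join columns are hypothetical; with join_collapse_limit set to 1, explicit JOINs are planned in the order they are written.

```sql
SHOW join_collapse_limit;

-- Planner follows the written order of explicit JOINs.
SET join_collapse_limit = 1;

-- Hypothetical tables a, b, c.
EXPLAIN
SELECT *
FROM a
JOIN b ON b.a_id = a.id
JOIN c ON c.b_id = b.id;

-- Return to the configured value.
RESET join_collapse_limit;
```

Comparing this plan with the one produced under the default setting shows whether the planner would have picked a different join order on its own.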
Genetic Query Optimizer
● The genetic query optimizer, or GEQO, is used when the number of tables involved in a join reaches a specific limit (geqo_threshold).
● GEQO is used when exhaustively searching the space of possible join orders is not feasible.
Genetic Query Optimizer Continued
● GEQO uses genetic algorithms to generate query plans and evaluate their fitness, generating better plans in each successive population.
GEQO Gotcha
● GEQO is useful for generating plans when the number of tables is high.
● It is also useful for generating plans quickly.
● However, it can lead to suboptimal plans.
● Lower geqo_threshold if you care more about planner speed than about the optimality of the query plan.
Note about join_collapse_limit and geqo_threshold
● Since the default value of geqo_threshold (12) is higher than the default value of join_collapse_limit (8), GEQO isn't used by default for explicit join syntax.
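The relationship between these settings can be checked on a running server. The value 8 used below is only an illustrative choice for a session-level experiment.

```sql
-- Current values of the settings discussed above.
SHOW geqo;
SHOW geqo_threshold;
SHOW join_collapse_limit;
SHOW from_collapse_limit;

-- Session-level experiment: let GEQO take over for smaller joins.
SET geqo_threshold = 8;

-- Return to the configured value.
RESET geqo_threshold;
```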
Lack of planner heuristics
● Refer to 'Planner Conceptual Error when Joining a Subquery -- Outer Query Condition not Pulled Into Subquery' on the performance list.
● It tells us that the planner does not understand that 'a = b' AND (b = c OR b = d) implies (a = c OR a = d).
● The planner lacks such heuristics. Another example is SELECT * FROM foo WHERE (foo.a + 0 = foo.a);
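The foo.a + 0 = foo.a case is easy to try out. The sketch below builds a throwaway table named foo (the column type and row count are arbitrary choices) and asks the planner for its estimate.

```sql
-- Throwaway table; the size is an arbitrary choice.
CREATE TEMP TABLE foo AS
SELECT g AS a FROM generate_series(1, 10000) AS g;
ANALYZE foo;

-- The condition holds for every row with a non-NULL a, yet the row
-- estimate in the plan may be far below the table's row count.
EXPLAIN SELECT * FROM foo WHERE foo.a + 0 = foo.a;
```

Comparing the estimated row count in the EXPLAIN output with the 10000 rows actually in the table shows how far off the estimate is.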
So, queries like...
SELECT *
FROM parent
INNER JOIN child ON child.parent_id = parent.id
LEFT JOIN (SELECT parent.id, parent.bin, COUNT(*) AS c
           FROM parent
           INNER JOIN child ON child.parent_id = parent.id
           WHERE child.data > 25
           GROUP BY 1, 2) agg1 ON agg1.id = parent.id AND agg1.bin = parent.bin
WHERE parent.bin IN (1, 2)
  AND agg1.c >= 3;
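Should be changed to...
SELECT *
FROM parent
INNER JOIN child ON child.parent_id = parent.id
LEFT JOIN (SELECT parent.id, parent.bin, COUNT(*) AS c
           FROM parent
           INNER JOIN child ON child.parent_id = parent.id
           WHERE child.data > 25
           GROUP BY 1, 2) agg1 ON agg1.id = parent.id AND agg1.bin = parent.bin
WHERE parent.bin = 1
  AND agg1.c >= 3;

Notice the 'bin = 1' here: a plain equality can be propagated into the subquery through the join condition agg1.bin = parent.bin, whereas the IN (1, 2) list is not. Note that this form returns only the rows for bin = 1.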
NOT IN Vs. NOT EXISTS
SELECT ProductID, ProductName
FROM Products p
WHERE p.ProductID NOT IN (
    SELECT ProductID
    FROM "Order Details");
NOT IN Vs. NOT EXISTS Continued
SELECT ProductID, ProductName
FROM Products p
WHERE NOT EXISTS (
    SELECT 1
    FROM "Order Details" od
    WHERE p.ProductId = od.ProductId);
What the planner does for NOT IN
● NOT IN has two possible plans. The first is a hashed subplan, which requires a hash table of the entire subselect, so it is only used when that would fit in work_mem.
● If the subselect does not fit into work_mem, a plain subplan is made, which is O(N^2).
What the planner does for NOT EXISTS
● The query planner generates an anti-join for NOT EXISTS if it can sufficiently disentangle a join clause from the conditions in the subquery.
● If not, then the NOT EXISTS remains as a (correlated) subplan.
● A correlated subplan is one which has to be re-run for every row.
Comparison of performance
● NOT IN can be faster when the subquery result is small, since that result set can be hashed, leading to better performance.
● Then, as your data gets larger, it might exceed work_mem, and without warning your query might get very slow.
NOT IN Gotcha
● NOT IN has some pretty well-known issues with NULL values. If the subquery returns a NULL, NOT IN can never evaluate to true, so no rows qualify. These semantics also mean more work for the executor, leading to an even bigger drop in the performance of the query.
● This is quite constraining, since we may have NULLs in our columns already or may have them in the future.
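The NULL behaviour is easiest to see on a tiny data set. The two temporary tables below are hypothetical stand-ins for the Products and Order Details tables used earlier.

```sql
-- Hypothetical stand-ins for the tables in the earlier queries.
CREATE TEMP TABLE products (product_id integer);
CREATE TEMP TABLE order_details (product_id integer);
INSERT INTO products VALUES (1), (2), (3);
INSERT INTO order_details VALUES (1), (NULL);

-- Returns no rows: for products 2 and 3 the comparison with the
-- NULL yields NULL rather than true, so they are filtered out.
SELECT product_id
FROM products
WHERE product_id NOT IN (SELECT product_id FROM order_details);

-- Returns products 2 and 3.
SELECT p.product_id
FROM products p
WHERE NOT EXISTS (SELECT 1
                  FROM order_details od
                  WHERE od.product_id = p.product_id);
```

A single NULL in the subquery result is enough to make the NOT IN form return nothing, which is usually not what the query author intended.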
So, what should we use?
● The points discussed so far lead to the conclusion that we should use NOT IN for smaller data sets, and NOT EXISTS for larger data sets.
● However, since NOT IN has the constraint of NULLs, we should generally prefer NOT EXISTS in all cases unless our case specifically needs NOT IN.
General Advice 1
● Use tables to store intermediate results and process over them wherever possible. This avoids pitfalls such as materialized view joins, UNION pitfalls and DISTINCT pitfalls, among others.
General Advice 2
● Whenever the planner's cost estimates seem to be getting less accurate, run ANALYZE. This updates the planner statistics and may lead to better results.
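A minimal sketch of this advice, using a hypothetical table named orders:

```sql
-- Hypothetical table name.
ANALYZE orders;

-- Check when statistics were last gathered, manually or by autovacuum.
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';
```

If both timestamps are old relative to how quickly the table changes, stale statistics are a plausible cause of poor estimates.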
General Advice 3
● When the planner's cost estimates don't seem right, you can consider increasing the statistics target (the upper limit on the number of MCVs stored) by using the following command:
ALTER TABLE <table name> ALTER COLUMN <column name> SET STATISTICS <value>
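Filled in with hypothetical names (an orders table with a status column) and an illustrative target of 500, the command might be used like this:

```sql
-- Hypothetical table and column; 500 is an illustrative target.
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;

-- The new target only takes effect once statistics are regathered.
ANALYZE orders;
```

A higher target makes ANALYZE slower and planning slightly more expensive, so it is usually raised only for the columns whose estimates are actually off.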
Future developments to the query planner
● The query planner still lacks some features.
● Understanding of foreign keys, inner join removal, more heuristics.
● Hopefully, we will have them all someday.
● CTE planning needs to be improved and done in line with the Postgres executor's data pull model.
Case Study – Adding foreign key understanding to the planner
● In the context of removal of inner joins.
● The main issue identified is that if a row is deleted, then in the time between the deletion of the row and the running of the FK trigger, all plans generated that depend on that row's foreign key will be invalid.
Thank You!

Talk pg conf eu 2013

  • 1. The PostgreSQL Query Planner Atri Sharma
  • 2. Whats this talk based on? ● Research done by hackers like Robert Haas in the past, some mailing list readings, discussions with smart people etc...
  • 3. Who's this talk for? ● Users of postgres, who want to know how query planner's heuristics can affect their queries.
  • 4. What are we going to discuss? ● ● ● A small discussion about the Postgres query planner. Areas where problems arise in query planner. Some cases where the planner has some nifty points. Some of the cases may be well known. However, they are worthy of discussion here if only to see how the planner processes them internally.
  • 5. A little about the PostgreSQL Query Planner ● Quoting Bruce Momjian: 'The optimizer is the "brain" of the database, interpreting SQL queries and determining the fastest method of execution.'
  • 6. Where can the problems arise in the query planner ● Row Count Estimates ● DBA Ideas ● Planner's Blocking points (What the planner cant do at this point of time)
  • 7. Row Count Stats ● ● ● The backend essentially holds MCVs (Most common values) and a histogram of the distribution of the rest of the values. The number of MCVs stored is equal to a value known as statistics target. Incorrect or outdated counts can lead to bad selectivity estimates, leading to not so good plans
  • 8. User Settings ● PostgreSQL depends quite some on the page values (seq_page_cost, random_page_cost) and others. Setting them to random values will cause the query planner to choose an inefficient plan over an efficient alternate plan.
  • 9. Planner Not-Done-Yets ● Some queries are slow in execution because their optimization depends on a feature which is not yet present in the query planner.
  • 10. Selectivity Estimate Issues ● Arbitrary estimates give a bad value. ● Cross-table correlations aren't taken into consideration. ● The planner doesn't understand expressions like (a + 0 = a).
  • 11. Settings Issues ● DBAs can use some settings to tune the query planner. ● Those settings can be harmful if set to incorrect values.
  • 12. Planner Cost Evaluation Parameters ● seq_page_cost = 1.0 ● random_page_cost = 4.0 ● cpu_tuple_cost = 0.01 ● cpu_index_tuple_cost = 0.005 ● cpu_operator_cost = 0.0025 ● These parameters are used by the planner to estimate the costs of plans.
  • 13. Planner Cost Evaluation Parameters Continued ● random_page_cost abstracts the cost of reading a random page and seq_page_cost abstracts the same for a sequential read. ● In real-world scenarios, setting random_page_cost to a lower value can help get better plans, since modern disks have reduced the gap between random and sequential page read costs.
  • 14. effective_cache_size ● Sets the planner's assumption about the effective size of the disk cache that is available to a single query. ● A common rule of thumb is to set it to about 75% of RAM.
  • 15. Cases ● Some examples of situations where the planner's decisions can lead to a bad plan. ● Some cases are not strictly related to the query planner, but are fun in general!
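
As a rough illustration of how these parameters feed into a cost, the PostgreSQL documentation describes the estimate for a plain sequential scan as (pages read × seq_page_cost) + (rows scanned × cpu_tuple_cost). The sketch below recomputes that by hand; the table `orders` and its `id` column are hypothetical, and the value 2.0 is only an illustrative setting, not a recommendation.

```sql
-- Recompute the sequential scan estimate from the catalog statistics.
SELECT relpages  * current_setting('seq_page_cost')::float8
     + reltuples * current_setting('cpu_tuple_cost')::float8 AS seq_scan_cost
FROM pg_class
WHERE relname = 'orders';

-- The total cost shown here should be close to the number above.
EXPLAIN SELECT * FROM orders;

-- Try a lower random_page_cost for this session only and compare plans.
SET random_page_cost = 2.0;
EXPLAIN SELECT * FROM orders WHERE id = 42;
RESET random_page_cost;
```

Changing the setting with `SET` affects only the current session, which makes it a low-risk way to see whether a different value changes the chosen plan.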
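
effective_cache_size is only an input to the planner's cost estimates; it does not allocate any memory. A small sketch of checking and trying a value in one session, where the 12GB figure assumes a hypothetical machine with 16GB of RAM:

```sql
SHOW effective_cache_size;

-- Roughly 75% of an assumed 16GB machine; session-local experiment.
SET effective_cache_size = '12GB';

-- Re-run EXPLAIN on the query of interest here, then undo the change.
RESET effective_cache_size;
```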
  • 16. WITH ● One of the classic cases. ● WITH is an optimization barrier.
  • 17. Why does WITH face performance issues? ● We have to understand how WITH is processed by the planner. ● The query planner intentionally runs the WITH part first. It divides the query into separate plans, which doesn't match up with the pull architecture of the executor.
  • 20. Feature Alert! ● The current implementation approach for GROUPING SETS uses a CTE-based architecture. Hence, it suffers from the same problem of divided query plans. If this problem is solved, we could have an efficient implementation of GROUPING SETS in Postgres, which would be very nice.
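
The effect of the barrier can be seen by comparing a CTE with the equivalent subquery. This is a sketch on a hypothetical table `orders(id, customer_id, total)`; on the releases this talk covers the CTE is planned separately, while newer releases may inline such a CTE, so the plans can differ by version.

```sql
-- CTE form: the aggregate over all customers is computed first,
-- and the outer filter is applied to its result afterwards.
EXPLAIN
WITH totals AS (
    SELECT customer_id, sum(total) AS spent
    FROM orders
    GROUP BY customer_id
)
SELECT * FROM totals WHERE customer_id = 42;

-- Subquery form: the planner is free to push customer_id = 42
-- below the aggregate, so far fewer rows need to be grouped.
EXPLAIN
SELECT *
FROM (
    SELECT customer_id, sum(total) AS spent
    FROM orders
    GROUP BY customer_id
) AS totals
WHERE customer_id = 42;
```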
  • 21. Materialized Views ● Joining a materialized view back to its parent table is expensive, since the planner does not know about any correlation between the two and the two tables are dependent on each other.
  • 22. Solution ● Use intermediate tables to store the intermediate results.
  • 23. Aggregates ● A function call like Agg(distinct) does not allow the planner to use a HashAgg node, hence leading to slower query plans.
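
One commonly used workaround is to move the DISTINCT into a subquery, where the planner is allowed to consider a HashAggregate. A sketch on a hypothetical table `orders(customer_id)`:

```sql
-- DISTINCT inside the aggregate call.
EXPLAIN
SELECT count(DISTINCT customer_id) FROM orders;

-- Same result with the de-duplication done in a subquery.
-- count(customer_id) rather than count(*) keeps NULLs out of the
-- count, matching what count(DISTINCT ...) does.
EXPLAIN
SELECT count(customer_id)
FROM (SELECT DISTINCT customer_id FROM orders) AS d;
```

Whether the second form is actually faster depends on the data and on work_mem, so it is worth comparing both plans.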
  • 25. EXISTS and NOT EXISTS ● Using EXISTS and NOT EXISTS in your queries is a good idea and will generally lead to better plans. Use them instead of complete queries if you only want to test presence or absence of specific values.
  • 26. Reason ● EXISTS and NOT EXISTS allow the planner to use semi-joins and anti-joins, respectively. These are much more efficient than full joins for just testing the presence or absence of values.
  • 27. Prepared Statements ● Prepared statements are a recurring source of query performance problems. ● They perform much better now, though, thanks to some recent work.
  • 28. What is bad with prepared statements? ● Before 9.2, the user had to choose between using prepared statements with generic plans, or not using them and replanning every time. ● From 9.2 on, that restriction is gone, but the downside is that the user has no control over whether a prepared statement is replanned.
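
A sketch of the difference, using hypothetical tables `customers(id, name)` and `orders(customer_id)`. In the EXPLAIN output, look for node names containing "Semi Join" or "Anti Join"; the planner may still pick another strategy if it estimates that to be cheaper.

```sql
-- Presence test: typically planned as a semi-join.
EXPLAIN
SELECT c.id, c.name
FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);

-- Absence test: typically planned as an anti-join.
EXPLAIN
SELECT c.id, c.name
FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);

-- Full join form of the presence test: every matching order row is
-- joined, and the duplicates then have to be removed again.
EXPLAIN
SELECT DISTINCT c.id, c.name
FROM customers c
JOIN orders o ON o.customer_id = c.id;
```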
  • 29. Pre 9.2 ● Prepared statements are good for workloads where the selectivity of the queries is in a small range. Widely varying selectivity values can lead to bad performance for prepared statements.
  • 30. Post 9.2 ● The point to look at post 9.2 is if the planner is overlooking a better alternate plan and choosing a sub optimal plan.
  • 31. Example ● select * from sometable where id = any ($1); where $1 was a small array. ● The generic plan for this is of course trivial and would have worked fine. ● But the planner has to estimate the costs of the generic plan without knowing how many elements are in the array.
  • 32. Example continued ● For the generic plan, it estimated an array size much higher than the actual size of the array. ● What that meant was that the generic plan was about twice the cost of the custom plan, even though it was the exact same plan.
  • 33. Take away lesson ● You need to be wary when using prepared statements, pre 9.2 or post 9.2. ● Prepared statements are much better than before, but some surprises may still turn up.
  • 34. Join Reordering ● The planner may reorder the joins present in a query based on some heuristics that it thinks will lead to better plans. ● Generally, this is a good idea. ● We should not assume that we have a better idea about the optimal join order. ● Still, if we are inclined to do so, we have a setting known as join_collapse_limit.
  • 35. Gotcha! ● We have a catch here, though. ● Setting join_collapse_limit to a lower value can lead to suboptimal plans, but shorter planning times. ● If you care more about planning speed than about plan quality, set join_collapse_limit to a lower value.
  • 36. Genetic Query Optimizer ● The genetic query optimizer, or GEQO, is used when the number of tables involved in a join exceeds a specific limit. ● GEQO is used when searching the entire search space for executing the join is not feasible.
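
The example from the slides can be reproduced roughly as follows. This is a sketch: the statement name `find_rows` is made up, `sometable` is the table from the slide, and its `id` column is assumed to be an integer. In the EXPLAIN output, a custom plan shows the actual array in the condition, while a generic plan shows `$1`.

```sql
PREPARE find_rows (int[]) AS
    SELECT * FROM sometable WHERE id = ANY ($1);

-- From 9.2 on, the first few executions are planned with the actual
-- parameter values; after that the planner compares the average
-- custom plan cost with the generic plan cost.
EXPLAIN EXECUTE find_rows('{1,2,3}');

DEALLOCATE find_rows;
```

Running the EXPLAIN EXECUTE line several times and watching for the switch from literal values to `$1` shows which kind of plan is being used.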
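
A sketch of how the setting is used, with hypothetical tables `a`, `b` and `c`. With join_collapse_limit set to 1, explicit JOINs are planned in the order they are written, so the planner does less work but also has no freedom to find a better order.

```sql
SHOW join_collapse_limit;    -- 8 by default

-- Session-local: keep explicit JOINs in their written order.
SET join_collapse_limit = 1;

EXPLAIN
SELECT *
FROM a
JOIN b ON b.a_id = a.id
JOIN c ON c.b_id = b.id;

RESET join_collapse_limit;
```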
  • 37. Genetic Query Optimizer Continued ● GEQO uses genetic algorithms to generate query plans and evaluate their fitness, generating better plans in each successive population.
  • 38. GEQO Gotcha ● GEQO is useful for generating plans when the number of tables is high. ● It is also useful for generating plans quickly. ● However, it can lead to suboptimal plans. ● Lower geqo_threshold if you care more about planner speed than about the optimality of the query plan.
  • 39. Note about join_collapse_limit and geqo_threshold ● Since the default value of geqo_threshold is higher than the default value of join_collapse_limit, GEQO isn't used by default for explicit join syntax.
  • 40. Lack of planner heuristics ● Refer to 'Planner Conceptual Error when Joining a Subquery -- Outer Query Condition not Pulled Into Subquery' on the performance list. ● It tells us that the planner does not understand that 'a = b' and (b = c OR b = d) implies (a = c OR a = d). ● The planner lacks such heuristics. Another example is SELECT * from foo WHERE (foo.a + 0 = foo.a);
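
The settings involved can be checked and changed per session. A short sketch; the value 6 is only an illustration:

```sql
SHOW geqo;                   -- on by default
SHOW geqo_threshold;         -- 12 by default
SHOW join_collapse_limit;    -- 8 by default

-- Favor planning speed: let GEQO take over for smaller joins.
SET geqo_threshold = 6;
RESET geqo_threshold;

-- Favor plan quality: always use the exhaustive search instead.
SET geqo = off;
RESET geqo;
```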
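
Both gaps can be observed, and worked around by hand, with EXPLAIN. A sketch, assuming `foo.a` is an integer column and that `t1(a)` and `t2(b, c, d)` are hypothetical tables:

```sql
-- For an integer column, a + 0 = a holds for every non-NULL a, but the
-- planner does not simplify it. Compare the estimated row counts:
EXPLAIN SELECT * FROM foo WHERE foo.a + 0 = foo.a;
EXPLAIN SELECT * FROM foo WHERE foo.a IS NOT NULL;

-- The planner will not derive (a = c OR a = d) on its own, so the
-- implied condition can be spelled out manually when it helps:
EXPLAIN
SELECT *
FROM t1
JOIN t2 ON t1.a = t2.b
WHERE (t2.b = t2.c OR t2.b = t2.d)
  AND (t1.a = t2.c OR t1.a = t2.d);   -- redundant, added by hand
```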
  • 42. ● SELECT * FROM parent INNER JOIN child ON child.parent_id = parent.id LEFT JOIN (SELECT parent.id, parent.bin, COUNT(*) AS c FROM parent INNER JOIN child ON child.parent_id = parent.id WHERE child.data > 25 GROUP BY 1, 2) agg1 ON agg1.id = parent.id AND agg1.bin = parent.bin WHERE parent.bin IN (1, 2) AND agg1.c >= 3;
  • 44. ● SELECT * FROM parent INNER JOIN child ON child.parent_id = parent.id LEFT JOIN (SELECT parent.id, parent.bin, COUNT(*) AS c FROM parent INNER JOIN child ON child.parent_id = parent.id WHERE child.data > 25 GROUP BY 1, 2) agg1 ON agg1.id = parent.id AND agg1.bin = parent.bin WHERE parent.bin = 1 AND agg1.c >= 3; Notice the 'bin = 1' here
  • 45. NOT IN Vs. NOT EXISTS ● SELECT ProductID, ProductName FROM Northwind..Products p WHERE p.ProductID NOT IN ( SELECT ProductID FROM Northwind..[Order Details])
  • 46. NOT IN Vs. NOT EXISTS Continued ● SELECT ProductID, ProductName FROM Northwind..Products p WHERE NOT EXISTS ( SELECT 1 FROM Northwind..[Order Details] od WHERE p.ProductId = od.ProductId)
  • 47. What the planner does for NOT IN ● NOT IN has two possible plans. The first is a hashed subplan, which requires a hash table of the entire subselect, so it's only done when that would fit in work_mem. ● If the subselect does not fit into work_mem, a plain subplan is made, which is O(N^2).
  • 48. What the planner does for NOT EXISTS ● The query planner generates an anti-join for NOT EXISTS if the planner can sufficiently disentangle a join clause from the conditions in the subquery. ● If not, then the NOT EXISTS remains as a (correlated) subplan. ● A correlated subplan is one which has to be re-run for every row.
  • 49. Comparison of performance ● NOT IN can be faster when the subquery result is small since that result set can be hashed, leading to better performance. ● Then as your data gets larger, it might exceed work_mem, and without warning, your query might get very slow.
  • 50. NOT IN Gotcha ● NOT IN has some pretty well known issues with NULL values. If a column has NULL values, NOT IN might lead to more work, hence leading to an even bigger drop in the performance of the query. ● This is quite constraining, since we may have NULLs in our columns already or may have them in the future.
  • 51. So, what should we use? ● The points discussed till now lead to the conclusion that we should use NOT IN for smaller data sets, and NOT EXISTS for larger data sets. ● However, since NOT IN has the constraint of NULLs, we should generally prefer NOT EXISTS in all cases unless our case specifically needs NOT IN.
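
The switch between the two plans can be provoked by changing work_mem for a session. A sketch with hypothetical tables `products(product_id)` and `order_details(product_id)`; in the EXPLAIN output, check whether the SubPlan in the filter is marked as hashed.

```sql
-- Smallest allowed work_mem: a large subselect no longer fits.
SET work_mem = '64kB';
EXPLAIN
SELECT p.product_id
FROM products p
WHERE p.product_id NOT IN (SELECT od.product_id FROM order_details od);

-- Generous work_mem: the subselect can be hashed.
SET work_mem = '64MB';
EXPLAIN
SELECT p.product_id
FROM products p
WHERE p.product_id NOT IN (SELECT od.product_id FROM order_details od);

RESET work_mem;
```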
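
Besides the extra work, NULLs also change what NOT IN returns. The self-contained sketch below uses inline VALUES lists, so no tables are needed; the aliases `t`, `s`, `x` and `y` are made up for the illustration.

```sql
-- NOT IN: the NULL in the subquery makes the test "unknown" for every
-- x that is not matched, so this returns no rows at all.
SELECT x
FROM (VALUES (1), (2), (3)) AS t(x)
WHERE x NOT IN (SELECT y FROM (VALUES (1), (NULL)) AS s(y));

-- NOT EXISTS: the NULL simply never matches, so this returns 2 and 3.
SELECT x
FROM (VALUES (1), (2), (3)) AS t(x)
WHERE NOT EXISTS (
    SELECT 1 FROM (VALUES (1), (NULL)) AS s(y) WHERE s.y = t.x
);
```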
  • 52. General Advice 1 ● Use tables to store intermediate results and process over them wherever possible. This allows for avoiding pitfalls such as materialized view joins, UNION pitfalls, DISTINCT pitfalls among others.
  • 53. General Advice 2 ● Whenever the planner's cost estimates seem to be getting less accurate, run ANALYZE. This updates the planner statistics and may lead to better results.
  • 54. General Advice 3 ● When the planner's cost estimates don't seem right, consider increasing the statistics target (the maximum number of MCVs stored) for the column with the following command: ALTER TABLE <table name> ALTER COLUMN <column name> SET STATISTICS <value>
  • 55. Future developments to the query planner ● The query planner still lacks some features. ● Understanding of foreign keys, inner join removal, more heuristics. ● Hopefully, we will have them all someday. ● CTE planning needs to be improved and done in line with the Postgres executor's data pull model.
  • 56. Case Study – Adding foreign keys understanding to planner ● In the context of removal of inner joins. ● The main issue identified is that if a row is deleted, then in the window between the deletion of the row and the running of the FK trigger, all plans generated and dependent on that row's foreign key will be invalid.
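
A sketch of refreshing the statistics for one table and checking when they were last gathered; the table name `orders` is hypothetical.

```sql
-- Recompute planner statistics for one table.
ANALYZE orders;

-- See when the table was last analyzed, manually or by autovacuum.
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';
```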
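
Spelled out in full, the per-column change looks like the sketch below. The table `orders`, the column `status` and the value 500 are placeholders; the new target only takes effect once the table is analyzed again.

```sql
SHOW default_statistics_target;    -- 100 by default

-- Raise the target for one column only.
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;

-- Gather statistics again so the larger target is actually used.
ANALYZE orders;
```

A higher target makes ANALYZE and planning somewhat slower, so it is usually raised only for columns whose estimates are visibly off.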