This is a presentation that describes how Oracle uses histograms to make decisions on SQL query execution. To see the actual webinar and demo, go to https://portal.hotsos.com/events/webinars/
Note that without properly collected statistics, the CBO will do one of two things. If no statistics exist for any object used in the SQL statement, the CBO may use rule-based optimization (prior to v10) or use dynamic sampling. If statistics exist for some objects but not others in the SQL statement, the CBO may use a set of default statistics for the objects without statistics, or use dynamic sampling.

CBO default statistics for objects without collected stats (prior to v10; in v10, dynamic sampling is typically used instead of defaults):

Table defaults:
- cardinality: number of blocks * (block size - cache layer) / average row length
- average row length: 100 bytes
- number of blocks: 100, or the actual value based on the extent map
- remote cardinality (distributed): 2,000 rows
- remote average row length: 100 bytes

Index defaults:
- levels: 1
- leaf blocks: 25
- leaf blocks/key: 1
- data blocks/key: 1
- distinct keys: 100
- clustering factor: 800
Plot A illustrates a situation in which the execution plan does not change, but the query response time varies significantly as the number of rows in the table changes. This kind of thing occurs when an application chooses a TABLE ACCESS (FULL) execution plan for a growing table. It's what causes RBO-based applications to appear fast in a small development environment, but then behave poorly in the production environment.

Plot B illustrates the marginal improvement that's achievable, for example, by distributing an inefficient application's workload more uniformly across the disks in a disk array. Notice that the execution plan (or "shape of the performance curve") isn't necessarily changed by such an operation (although, if the output of dbms_stats.gather_system_statistics changes as a result of the configuration change, then the plan might change). The performance for a given number of rows might change, however, as the plot here indicates.

Plot C illustrates what is commonly the most profound type of performance change: an execution plan change. This situation can be caused by a change to any of the CBO's inputs. For example, an accidental deletion of a segment's statistics can change a plan from a nice fast plan (depicted by the green curve, which is O(log n)) to a horrifically slow plan (depicted by the red curve, which is O(n^2)). The phenomenon illustrated in plot C is what has happened when a query that was fast last week now runs for 14 hours without completing before you finally give up and kill the session.
Since the CBO determines the selectivity of predicates that appear in queries, it is important that there be adequate information for the CBO to make its estimates properly. By gathering histogram data, the CBO can make improved selectivity estimates in the presence of data skew, resulting in better execution plans when data distributions are non-uniform. The histogram approach provides an efficient and compact way to represent data distributions. Selectivity estimates are used to decide when to use an index and the order in which to join tables. Many table columns are not uniformly distributed, so the normal calculations for selectivity may not be accurate without the use of histograms.
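To make the arithmetic concrete: without a histogram, the CBO estimates the selectivity of an equality predicate as 1/num_distinct, which is badly wrong for skewed data. A minimal sketch of gathering a histogram on a skewed column (the SALES table and STATUS column are hypothetical):

```sql
-- Hypothetical skewed column: STATUS has 2 distinct values, but
-- 99% of rows have STATUS = 0 and only 1% have STATUS = 1.
-- Without a histogram: estimated selectivity = 1/2 for either value.
-- With a frequency histogram: ~0.99 for 0 and ~0.01 for 1.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,
    tabname    => 'SALES',
    method_opt => 'FOR COLUMNS STATUS SIZE 254');
END;
/
```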
Height-balanced histograms put approximately the same number of values into each interval, so the endpoints of the intervals are determined by how many values fall into each one. Only the last (largest) value in each bucket appears as a bucket end point value. A height-balanced histogram is created when the specified number of histogram buckets ( SIZE ) is smaller than the number of distinct values in the column. Frequency histograms (sometimes called value-based histograms) are created when the number of histogram buckets ( SIZE ) specified is greater than or equal to the number of distinct column values. In a frequency histogram, every individual value in the column has a corresponding bucket, and the bucket number reflects the repetition count of each value. The type of histogram is stored in the HISTOGRAM column of the *_TAB_COL_STATISTICS views; the column can have the values HEIGHT BALANCED, FREQUENCY , or NONE . The SIZE of a histogram can be set by you or automatically by Oracle when the histogram is collected. The default SIZE (when no SIZE is specified) is 75. The maximum SIZE is 254.
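You can see which histogram type Oracle chose by querying the dictionary directly (the SALES table name is hypothetical):

```sql
-- FREQUENCY appears when SIZE >= the number of distinct values;
-- HEIGHT BALANCED appears when SIZE < the number of distinct values;
-- NONE means no histogram was collected for the column.
SELECT column_name, num_distinct, num_buckets, histogram
  FROM user_tab_col_statistics
 WHERE table_name = 'SALES';
```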
DBMS_STATS Constants

SIZE REPEAT: Causes the histograms to be created with the same options as the last time you created them. Oracle reads the data dictionary to figure out what to do.

SIZE AUTO: Oracle looks at the data and, using an undocumented (and changing) algorithm, decides for itself which columns to gather stats on and how many buckets to use. It collects histograms in memory only for those columns which are used by your applications (columns appearing in a predicate involving an equality, range, or LIKE operator). It knows that a particular column was used by an application because, at parse time, it stores workload information in the SGA. It then stores histograms in the data dictionary only for columns with skewed data (those worthy of a histogram).

SIZE SKEWONLY: When you collect histograms with the SIZE option set to SKEWONLY, Oracle collects histogram data in memory for all specified columns (if you do not specify any, all columns are used). Once an "in-memory" histogram is computed for a column, it is stored in the data dictionary only if the column has "popular" values (multiple end points with the same value, which is what is meant by "there is skew in the data").
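Each of these constants is passed through the METHOD_OPT argument. A sketch of all three (the SALES table is hypothetical):

```sql
BEGIN
  -- Let Oracle decide which columns need histograms, based on
  -- recorded workload and observed skew:
  DBMS_STATS.GATHER_TABLE_STATS(USER, 'SALES',
    method_opt => 'FOR ALL COLUMNS SIZE AUTO');

  -- Compute histograms for all columns, but store only those
  -- whose data is actually skewed:
  DBMS_STATS.GATHER_TABLE_STATS(USER, 'SALES',
    method_opt => 'FOR ALL COLUMNS SIZE SKEWONLY');

  -- Re-gather using the same histogram options as last time:
  DBMS_STATS.GATHER_TABLE_STATS(USER, 'SALES',
    method_opt => 'FOR ALL COLUMNS SIZE REPEAT');
END;
/
```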
In Oracle version 8, the use of bind variables in a predicate effectively disables the use of histograms. This is because the optimizer needs to know the value ( WHERE col = 'x' ) in order to check the histogram statistics for selectivity for that value. When a bind variable is used, it is not actually bound into the query until execution time. Since the execution plan is determined in the parse phase, the optimizer won't know the value and thus can't use the histogram to make its decision.

In Oracle version 9, the optimizer behavior regarding bind variables changed slightly: when a query is initially parsed, the optimizer will "peek" at the value of the bind variable and use the value it finds to make its decisions. Does that make the situation better or worse? It depends. Let's say that when the query is initially parsed, it has a bind variable value of 1 in the predicate. If the column has a histogram and the histogram indicates that selectivity is low for that value (few rows match), the optimizer will likely choose to use an index on that column if one is available. Everything works well, performance is sub-second, and everyone is happy. Now, what happens if the query is executed a second time but passes the value 0 in the bind variable, and the selectivity for the value 0 is high (lots of rows match)? The original plan is still used, and the query will attempt to use the same index. If there are thousands of matching records in the row source, it is likely that the index scan will perform significantly worse than simply doing a full table scan. In this case, everything works, but performance stinks and complaints arise.

So, what do you do? For some, the best solution is to avoid bind variables when a column has a limited number of values and those values are skewed, and to just hard-code the value you need. The best way to know what to do is to test different approaches and find what works best for your environment.
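The scenario above can be sketched in SQL*Plus (the table, column, and data distribution are all hypothetical):

```sql
-- Hypothetical skew: STATUS = 1 is rare (selective),
-- STATUS = 0 matches most of the rows in SALES.
VARIABLE status NUMBER

EXEC :status := 1
-- First parse: the optimizer peeks at :status = 1, sees low
-- selectivity in the histogram, and picks an index range scan.
SELECT COUNT(*) FROM sales WHERE status = :status;

EXEC :status := 0
-- Same SQL text, so the cached cursor (and its index plan) is
-- reused, even though STATUS = 0 would favor a full table scan.
SELECT COUNT(*) FROM sales WHERE status = :status;
```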
The RBO workaround is forgivable because it's all the RBO environment could offer as an option. The CBO technique shown here is particularly bad because it makes the application less flexible and therefore less able to respond appropriately to system changes. Ideally, if you (the developer) already know that the data in certain columns tends to skew, you can write code to account for it. A good guideline is to look at the number of distinct values in the column. If the column has only a few distinct values, hard-coding the value will allow the optimizer to correctly choose the plan based on histogram data. If there are many distinct values, but you know the actual skewed values in advance, you could write conditional code that uses a bind variable in all cases except when one of the known skewed values is requested. In that case, the conditional code would branch to a version of the SQL statement that hard-codes the skewed value.
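The conditional approach could be sketched in PL/SQL roughly like this (the procedure, the SALES table, and the assumption that 0 is the known skewed value are all hypothetical):

```sql
CREATE OR REPLACE PROCEDURE get_sales(
    p_status IN  NUMBER,
    p_result OUT SYS_REFCURSOR) IS
BEGIN
  IF p_status = 0 THEN
    -- Known skewed value: hard-code the literal so the optimizer
    -- can see it, consult the histogram, and pick a full scan.
    OPEN p_result FOR
      SELECT * FROM sales WHERE status = 0;
  ELSE
    -- Selective values: static SQL, so p_status is passed as a
    -- bind variable and the cursor is shared across values.
    OPEN p_result FOR
      SELECT * FROM sales WHERE status = p_status;
  END IF;
END;
/
```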