1. “Row Pattern Matching” with
Database 12c MATCH_RECOGNIZE
Beating the Best Pre-12c Solutions
Stew Ashton (stewashton.wordpress.com)
OUGN17
Can you read the following line? If not, please move closer.
It's much better when you can read the code ;)
2. Agenda
• Who am I?
• Pre-12c solutions compared to row pattern
matching with MATCH_RECOGNIZE
– For all sizes of data
– Thinking in patterns
– Syntax only as needed
• Watch out for “catastrophic backtracking”
• Other things to keep in mind (time permitting)
2
3. Who am I?
• 1967: first FORTRAN program
• 1981-2015: software engineer / architect
• 2005-present: focus on Oracle DB development
– Contribute to asktom & OTN SQL forum
– Presented at OOW, UKOUG tech, DOAG
– 2015 DEVVY awards finalist.
– Advocate of data-centric application architecture
3
4. 4 pre-12c Solutions
• Analytics
1) Tabibitosan / Fixed Difference
2) "Start of Group"
• Model
3) Bin fitting (fixed size)
4) Bin fitting (fixed number)
4
5. 1) “Fixed Difference” / Tabibitosan
PAGE
1
2
3
5
6
7
10
11
12
42
5
-1
-2
-3
-4
= 0
= 0
= 0
= 1
• Identify and group rows with
consecutive values
• My presentation: print best liked slides
• Math: subtract known consecutives (1, 2,…)
– Consecutive means A = B - 1
– If A = B - 1 (consecutive) then A - 1 = B – 2
Else (not consecutive) A - 1 <> B - 2
– Consecutive becomes equality,
non-consecutive becomes inequality
7. Pattern and Matching Rows
• PATTERN
– Uninterrupted series of input rows
– Described as list of conditions (≅ “regular expressions”)
PATTERN (A B*)
"A" : 1 row, "B*" : 0 or more rows, as many as possible
• DEFINE (at least one) row condition
[A undefined = TRUE]
B AS page = PREV(page)+1
• Each series that matches the pattern is a “match”
– "A" and "B" identify the rows that meet each condition
– There can be unmatched rows between series
7
8. Input, Processing, Output
1. Define input
2. Order input
3. Process pattern
4. using defined conditions
5. Output: rows per match
6. Output: columns per row
7. Go where after match?
8
SELECT *
FROM t
MATCH_RECOGNIZE (
ORDER BY page
MEASURES
A.page as firstpage,
LAST(page) as lastpage,
COUNT(*) cnt
ONE ROW PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (A B*)
DEFINE B AS page = PREV(page)+1
);
9. 1) Run_Stats comparison
9
For one million rows:
“Latches” are serialization devices: fewer means more scalable
Stat Pre 12c Match_R Pct
Latches 5047 4836 96%
Elapsed Time 1.1164 1.0528 94%
CPU used by this session 1.09 1.04 95%
10. 1) Execution Plans
10
Operation Used-Mem
SELECT STATEMENT
HASH GROUP BY 49M (0)
VIEW
WINDOW SORT 20M (0)
TABLE ACCESS FULL
Operation Used-Mem
SELECT STATEMENT
VIEW
MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTOMATON 20M (0)
TABLE ACCESS FULL
11. 2) “Start of Group”
• Identify group boundaries, often using LAG()
• 3 steps instead of 2:
1. For each row: if start of group, assign 1
Else assign 0
2. Running total of 1s and 0s produces a group
identifier
3. Group by the group identifier
11
12. 2) Requirement
GROUP_NAME START_TS END_TS
X 2014-01-01 00:00 2014-02-01 00:00
X 2014-03-01 00:00 2014-04-01 00:00
X 2014-04-01 00:00 2014-05-01 00:00
X 2014-06-01 00:00 2014-06-01 01:00
X 2014-06-01 01:00 2014-06-01 02:00
X 2014-06-01 02:00 2014-06-01 03:00
Y 2014-06-01 03:00 2014-06-01 04:00
Y 2014-06-01 04:00 2014-06-01 05:00
Y 2014-07-03 08:00 2014-09-29 17:00
12
Merge contiguous date ranges in same group
13. 1
2
2
3
3
3
1
1
2
X 01-01 00:00 02-01 00:00 1
X 03-01 00:00 04-01 00:00 1
X 04-01 00:00 05-01 00:00 0
X 06-01 00:00 06-01 01:00 1
X 06-01 01:00 06-01 02:00 0
X 06-01 02:00 06-01 03:00 0
Y 06-01 03:00 06-01 04:00 1
Y 06-01 04:00 06-01 05:00 0
Y 07-03 08:00 09-29 17:00 1
X 01-01 00:00 02-01 00:00
X 03-01 00:00 05-01 00:00
X 06-01 00:00 06-01 03:00
Y 06-01 03:00 06-01 05:00
Y 07-03 08:00 09-29 17:00
13
with grp_starts as (
select a.*,
case when start_ts =
lag(end_ts) over(
partition by group_name
order by start_ts
)
then 0 else 1 end grp_start
from t a
), grps as (
select b.*,
sum(grp_start) over(
partition by group_name
order by start_ts
) grp_id
from grp_starts b)
select group_name,
min(start_ts) start_ts,
max(end_ts) end_ts
from grps
group by group_name, grp_id;
14. 2) Match_Recognize
14
SELECT * FROM t
MATCH_RECOGNIZE(
PARTITION BY group_name
ORDER BY start_ts
MEASURES
A.start_ts start_ts,
end_ts end_ts
PATTERN(A B*)
DEFINE B AS start_ts = prev(end_ts)
);
New this time:
• Added PARTITION BY
• ONE ROW PER MATCH
and
SKIP PAST LAST ROW
are the defaults
One solution replaces two methods: simple!
15. 2) Run_Stats comparison
15
For 500,000 rows:
Stat Pre 12c Match_R Pct
Latches 8108 7133 88%
Elapsed Time 1.7551 0.9765 56%
CPU used by this session 1.73 0.96 55%
16. Operation Used-Mem
SELECT STATEMENT
HASH GROUP BY 22M (0)
VIEW
WINDOW BUFFER 32M (0)
VIEW
WINDOW SORT 27M (0)
TABLE ACCESS FULL
Operation Used-Mem
SELECT STATEMENT
VIEW
MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTOMATON 27M (0)
TABLE ACCESS FULL
2) Execution Plans
16
17. 2) Matching within a group
17
SELECT * FROM (
SELECT * from t
WHERE group_name = 'X'
)
MATCH_RECOGNIZE
…
);
Filter before MATCH_RECOGNIZE to avoid extra work
18. 2) Predicate pushing
18
Select * from <view> where group_name = 'X'
Operation Name A-Rows Buffers
SELECT STATEMENT 3 4
VIEW 3 4
MATCH RECOGNIZE SORT DETERMINISTIC
FINITE AUTO
3 4
TABLE ACCESS BY INDEX ROWID
BATCHED
T 6 4
INDEX RANGE SCAN TI 6 3
20. SELECT s first_site, MAX(e) last_site, MAX(sm) sum_cnt FROM (
SELECT s, e, cnt, sm FROM t
MODEL
MEASURES (study_site s, study_site e, cnt, cnt sm)
RULES (
sm[ > 1] =
CASE WHEN sm[cv() - 1] + cnt[cv()] > 65000 OR cnt[cv()] > 65000
THEN cnt[cv()]
ELSE sm[cv() - 1] + cnt[cv()]
END,
s[ > 1] =
CASE WHEN sm[cv() - 1] + cnt[cv()] > 65000 OR cnt[cv()] > 65000
THEN s[cv()]
ELSE s[cv() - 1]
END
)
)
GROUP BY s;
• DIMENSION with row_number
orders data and processing
• rn can be used like a subscript
• cv() means current row
• cv()-1 means previous row
DIMENSION BY (row_number() over(order by study_site) rn)
rn
[cv() – 1] [cv()] [cv()]
[cv()]
[cv() – 1] [cv()]
rn
[cv() - 1] [cv()] [cv()]
[cv()]
[cv() – 1]
20
21. SELECT * FROM t
MATCH_RECOGNIZE (
ORDER BY study_site
MEASURES
FIRST(study_site) first_site,
LAST(study_site) last_site,
SUM(cnt) sum_cnt
PATTERN (A+)
DEFINE A AS SUM(cnt) <= 65000
);
New this time:
• PATTERN
(A+) replaces (A B*)
means 1 or more rows
• Why? In previous
examples I used PREV(),
which returns NULL on
the first row.
One solution replaces 3 methods: simpler!
21
22. 3) Run_Stats comparison
For 150000 rows:
Stat Pre 12c Match_R Pct
Latches 3501 1312 37%
Elapsed Time 2.4036 0.1028 4%
CPU used by this session 2.39 0 0%
22
23. Id Operation Used-Mem
0 SELECT STATEMENT
1 HASH GROUP BY 1481K (0)
2 VIEW
3 SQL MODEL ORDERED 17M (0)
4 WINDOW SORT 4330K (0)
5 TABLE ACCESS FULL
Id Operation Used-Mem
0 SELECT STATEMENT
1 VIEW
2 MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO 4330K(0)
3 TABLE ACCESS FULL
3) Execution Plans
23
24. Name Val Val BIN1 BIN2 BIN3
1 1 10 10
2 2 9 10 9
3 3 8 10 9 8
4 4 7 10 9 15
5 5 6 10 15 15
6 6 5 15 15 15
7 7 4 19 15 15
8 8 3 19 18 15
9 9 2 19 18 17
10 10 1 19 18 18
4) “Bin fitting”: fixed number
24
• Requirement
– Distribute values in 3
parts as equally as
possible
• “Best fit decreasing”
– Sort values in
decreasing order
– Put each value in least
full “bin”
25. 4) Brilliant pre 12c solution
SELECT bin, Max (bin_value) bin_value
FROM (
SELECT * FROM items
MODEL
DIMENSION BY
(Row_Number() OVER
(ORDER BY item_value DESC) rn)
MEASURES (
item_name,
item_value,
Row_Number() OVER
(ORDER BY item_value DESC) bin,
item_value bin_value,
Row_Number() OVER
(ORDER BY item_value DESC) rn_m,
0 min_bin,
Count(*) OVER () - 3 - 1 n_iters
)
RULES ITERATE(100000)
UNTIL (ITERATION_NUMBER >= n_iters[1]) (
min_bin[1] = Min(rn_m) KEEP (DENSE_RANK
FIRST ORDER BY bin_value)[rn<= 3],
bin[ITERATION_NUMBER + 3 + 1] =
min_bin[1],
bin_value[min_bin[1]] =
bin_value[CV()] +
Nvl(item_value[ITERATION_NUMBER+4], 0))
)
WHERE item_name IS NOT NULL
group by bin;
25
26. SELECT * from items
MATCH_RECOGNIZE (
ORDER BY item_value desc
MEASURES
sum(bin1.item_value) bin1,
sum(bin2.item_value) bin2,
sum(bin3.item_value) bin3
PATTERN ((bin1|bin2|bin3)+)
DEFINE
bin1 AS count(bin1.*) = 1
OR sum(bin1.item_value)-bin1.item_value
<= least(
sum(bin2.item_value),
sum(bin3.item_value)
),
bin2 AS count(bin2.*) = 1
OR sum(bin2.item_value)-bin2.item_value
<= sum(bin3.item_value)
);
• ()+ = 1 or more of whatever
is inside
• '|' = alternatives,
“preferred in the order
specified”
• Bin1 condition:
• No rows here yet,
• Or this bin least full
• Bin2 condition
• No rows here yet, or
• This bin less full than 3
PATTERN ((bin1|bin2|bin3)+)
bin1 AS count(bin1.*) = 1
OR sum(bin1.item_value)-bin1.item_value
<= least(
sum(bin2.item_value),
sum(bin3.item_value)
),
bin2 AS count(bin2.*) = 1
OR sum(bin2.item_value)-bin2.item_value
<= sum(bin3.item_value)
26
27. 4) Run_Stats comparison
27
For 20,000 rows:
Stat Pre 12c Match_R Pct
Latches 99165 794 .80%
Elapsed Time 58.3251 0.028 .05%
CPU used by this session 56.75 0 .00%
28. Backtracking
• What happens when there is no match???
• “Greedy” quantifiers * + {2,}
– are not that greedy
– Take all the rows they can, BUT
give rows back if necessary – one at a time
• Regular expression engines will test all possible
combinations to find a match
28
30. Repeating conditions
select 'match' from (
select level n from dual
connect by level <= 100
)
match_recognize(
pattern(a b* c)
define b as n > prev(n)
, c as n = 0
);
Runs in 0.006 secs
select 'match' from (
select level n from dual
connect by level <= 100
)
match_recognize(
pattern(a b* b* b* c)
define b as n > prev(n)
, c as n = 0
);
Runs in 2.934 secs
30
31. SELECT * FROM Ticker
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES FIRST(tstamp) AS start_tstamp,
LAST(tstamp) AS end_tstamp
AFTER MATCH SKIP TO LAST UP
PATTERN (STRT DOWN+ UP+ DOWN+ UP+)
DEFINE DOWN AS price < PREV(price),
UP AS price > PREV(price),
STRT AS price >= nvl(PREV(PRICE),0)
);
Runs in 0.014 seconds
Imprecise Conditions
CREATE TABLE Ticker (
SYMBOL VARCHAR2(10),
tstamp DATE,
price NUMBER
);
insert into ticker
select 'ACME',
sysdate + level/24/60/60,
10000-level
from dual
connect by level <= 5000;
SELECT * FROM Ticker
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES FIRST(tstamp) AS start_tstamp,
LAST(tstamp) AS end_tstamp
AFTER MATCH SKIP TO LAST UP
PATTERN (STRT DOWN+ UP+ DOWN+ UP+)
DEFINE DOWN AS price < PREV(price),
UP AS price > PREV(price)
);
Runs in 13 seconds
INMEMORY: 7 seconds 31
32. Keep in Mind
• Backtracking
– Precise conditions
– Test data with no matches
• To debug:
Measures classifier() cl,
match_number() mn
All rows per match with
unmatched rows
• No DISTINCT, no LISTAGG
• MEASURES columns must
have aliases
• “Reluctant quantifier” = ?
= JDBC bind variable
(Fix in 12.2 JDBC driver !)
• “Pattern variables” are
identifiers, not bind
variables
32
33. Questions?
The really good stuff is tomorrow!
"Meet your Match", 10:30, Bundestag 2
33
1) consecutive
values
2) based on
starting point
3) Bin fitting:
fixed size
4) Bin fitting:
fixed number
Pre 12c Analytic function,
Group by
2 analytics,
Group by
MODEL clause
using “subscript”
Procedural
MODEL clause
Match_
Recognize
C = PREV(C)+1 C1=PREV(C2) SUM(C1) <= X 3 conditions
Elapsed 94% 56% 4% 0.05%
34. Which row do we mean?
Expression DEFINE
MEASURES
ALL ROWS… ONE ROW…
FIRST(start_ts) First row of match
start_ts current row last row of match
LAST(end_ts) current row last row of match
FINAL
LAST(end_ts)
ORA-62509 last row of match
B.start_ts most recent B row last B row
PREV(), NEXT() Physical offset from referenced row
COUNT(*) from first to current row all rows in match
COUNT(B.*) B rows including current row all B rows
34
35. Output Row “shape”
Per Match PARTITION BY ORDER BY MEASURES Other input
ONE ROW X Omitted X omitted
ALL ROWS X X X X
35
ORA-00918, anyone?