This document summarizes Stew Ashton's presentation on using Oracle Database 12c's MATCH_RECOGNIZE clause to solve various "row pattern matching" problems in a more efficient way than pre-12c solutions. The document provides examples of using MATCH_RECOGNIZE for problems involving identifying consecutive values, grouping data into fixed bins, and distributing values evenly across bins. It shows that MATCH_RECOGNIZE offers performance improvements and simpler solutions compared to earlier approaches using window functions, self-joins and the MODEL clause.
Row Pattern Matching with Database 12c MATCH_RECOGNIZE
1. “Row Pattern Matching” with Database 12c MATCH_RECOGNIZE
Beating the Best Pre-12c Solutions
Stew Ashton, UKOUG Tech 14
2. Agenda
• Who am I?
• Pre-12c solutions compared to row pattern matching with MATCH_RECOGNIZE
– For all sizes of data
– Thinking in patterns
• Watch out for “catastrophic backtracking”
• Other things to keep in mind (time permitting)
3. Who am I?
• 33 years in IT
– Developer, Technical Sales Engineer, Technical Architect
– Aeronautics, IBM, Finance
– Mainframe, client-server, Web apps
• 25 years as an American in Paris
• 9 years using Oracle database
– Performance analysis
– Replace Java with SQL
• 2 years as internal “Oracle Development Expert”
4. 1) “Fixed Difference”
• Identify and group rows with consecutive values
• My presentation: print slides to keep
• Math: subtract known consecutives
– If A-1 = B-2 then A = B-1
– Else A <> B-1
– Consecutive becomes equality, non-consecutive becomes inequality
• “Consecutive” = fixed difference of 1
PAGE
1
2
3
5
6
7
10
11
12
36
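The "subtract known consecutives" idea can be sketched outside the database. The following is a minimal Python illustration, not the presenter's code: it subtracts each page's position (the equivalent of ROW_NUMBER) from the page and groups rows that share the same difference. The function name consecutive_runs is hypothetical, and the sketch assumes the page numbers are distinct integers.

```python
from itertools import groupby


def consecutive_runs(pages):
    """Group distinct integers into runs of consecutive values.

    Returns a list of (first, last, count) tuples, one per run.
    """
    ordered = sorted(pages)
    # page - position is constant inside a run of consecutive pages
    # and changes as soon as a gap appears.
    keyed = [(page - pos, page) for pos, page in enumerate(ordered, start=1)]
    runs = []
    for _, group in groupby(keyed, key=lambda pair: pair[0]):
        members = [page for _, page in group]
        runs.append((members[0], members[-1], len(members)))
    return runs


if __name__ == "__main__":
    for first, last, cnt in consecutive_runs([1, 2, 3, 5, 6, 7, 10, 11, 12, 36]):
        print(first, last, cnt)
```

Grouping on the difference is what turns "consecutive" into a plain equality test, which is why a single GROUP BY is enough in the SQL version.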
5. 1) Pre-12c
SELECT MIN(page) firstpage,
       MAX(page) lastpage,
       COUNT(*) cnt
FROM (
  -- page minus its row number is constant within a run of consecutive pages
  SELECT page,
         page - ROW_NUMBER() OVER (ORDER BY page) AS grp_id
  FROM t
)
GROUP BY grp_id;
Inline view output:
PAGE  [RN]  GRP_ID
   1     1       0
   2     2       0
   3     3       0
   5     4       1
   6     5       1
   7     6       1
  10     7       3
  11     8       3
  12     9       3
  36    10      26
Final output:
FIRSTPAGE  LASTPAGE  CNT
        1         3    3
        5         7    3
       10        12    3
       36        36    1
6. Think “match a row pattern”
• PATTERN
– Uninterrupted series of input rows
– Described as a list of conditions (“regular expressions”)
PATTERN (A B*)
"A" : 1 row, "B" : 0 or more rows, as many as possible
• DEFINE each row condition
[A undefined = TRUE]
B AS page = PREV(page)+1
• Each series that matches the pattern is a “match”
– "A" and "B" identify the rows that meet their conditions
7. Input, Processing, Output
1. Define input
2. Order input
3. Process pattern
4. using defined conditions
5. Output: rows per match
6. Output: columns per row
7. Go where after match?
SELECT *
FROM t
MATCH_RECOGNIZE (
  ORDER BY page
  MEASURES
    A.page firstpage,
    LAST(page) lastpage,
    COUNT(*) cnt
  ONE ROW PER MATCH
  AFTER MATCH SKIP PAST LAST ROW
  PATTERN (A B*)
  DEFINE B AS page = PREV(page)+1
);
8. 1) Run_Stats comparison
For one million rows:
Stat                      Pre 12c  Match_R   Pct
Latches                      4090     4079  100%
Elapsed Time                 5.51     5.56  101%
CPU used by this session      5.5     5.55  101%
“Latches” are serialization devices: fewer means more scalable
9. 1) Execution Plans
Pre-12c:
Id  Operation          Name  Starts  E-Rows  A-Rows  A-Time       Buffers  OMem  1Mem   Used-Mem
 0  SELECT STATEMENT         1               400K    00:00:01.83  1594
 1  HASH GROUP BY            1       1000K   400K    00:00:01.83  1594     41M   5035K  40M (0)
 2  VIEW                     1       1000K   1000K   00:00:12.69  1594
 3  WINDOW SORT              1       1000K   1000K   00:00:03.46  1594     22M   1749K  20M (0)
 4  TABLE ACCESS FULL  T     1       1000K   1000K   00:00:02.53  1594
MATCH_RECOGNIZE:
Id  Operation                                       Name  Starts  E-Rows  A-Rows  A-Time       Buffers  OMem  1Mem   Used-Mem
 0  SELECT STATEMENT                                      1               400K    00:00:03.45  1594
 1  VIEW                                                  1       1000K   400K    00:00:03.45  1594
 2  MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO        1       1000K   400K    00:00:01.87  1594     22M   1749K  20M (0)
 3  TABLE ACCESS FULL                               T     1       1000K   1000K   00:00:02.09  1594
10. 2) “Start of Group”
• Identify group boundaries, often using LAG()
• 3 steps instead of 2:
1. For each row: if start of group, assign 1, else assign 0
2. Running total of 1s and 0s produces a group identifier
3. Group by the group identifier
11. 2) Requirement
GROUP_NAME START_TS END_TS
X 2014-01-01 00:00 2014-02-01 00:00
X 2014-03-01 00:00 2014-04-01 00:00
X 2014-04-01 00:00 2014-05-01 00:00
X 2014-06-01 00:00 2014-06-01 01:00
X 2014-06-01 01:00 2014-06-01 02:00
X 2014-06-01 02:00 2014-06-01 03:00
Y 2014-06-01 03:00 2014-06-01 04:00
Y 2014-06-01 04:00 2014-06-01 05:00
Y 2014-07-03 08:00 2014-09-29 17:00
Merge contiguous date ranges in same group
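The three "start of group" steps can be sketched in Python on the requirement data above. This is an illustrative sketch only, not the presenter's code: the function name merge_contiguous is hypothetical, and timestamps are kept as 'YYYY-MM-DD HH:MI' strings on the assumption that they sort correctly as text.

```python
def merge_contiguous(rows):
    """Merge contiguous ranges per group.

    rows: iterable of (group_name, start_ts, end_ts) tuples.
    Returns merged (group_name, start_ts, end_ts) tuples.
    """
    merged = {}
    prev_group, prev_end, grp_id = None, None, 0
    for group_name, start_ts, end_ts in sorted(rows):
        if group_name != prev_group:
            grp_id = 0  # the running total restarts in each partition
        # Step 1: flag the row with 1 if it starts a new group, else 0.
        grp_start = 0 if (group_name == prev_group and start_ts == prev_end) else 1
        # Step 2: the running total of the flags is the group identifier.
        grp_id += grp_start
        # Step 3: aggregate by (group_name, grp_id).
        key = (group_name, grp_id)
        first_start, last_end = merged.get(key, (start_ts, end_ts))
        merged[key] = (min(first_start, start_ts), max(last_end, end_ts))
        prev_group, prev_end = group_name, end_ts
    return [(name, s, e) for (name, _), (s, e) in sorted(merged.items())]


ROWS = [
    ("X", "2014-01-01 00:00", "2014-02-01 00:00"),
    ("X", "2014-03-01 00:00", "2014-04-01 00:00"),
    ("X", "2014-04-01 00:00", "2014-05-01 00:00"),
    ("X", "2014-06-01 00:00", "2014-06-01 01:00"),
    ("X", "2014-06-01 01:00", "2014-06-01 02:00"),
    ("X", "2014-06-01 02:00", "2014-06-01 03:00"),
    ("Y", "2014-06-01 03:00", "2014-06-01 04:00"),
    ("Y", "2014-06-01 04:00", "2014-06-01 05:00"),
    ("Y", "2014-07-03 08:00", "2014-09-29 17:00"),
]

if __name__ == "__main__":
    for row in merge_contiguous(ROWS):
        print(*row)
```

The flag, the running total, and the final aggregation correspond to LAG(), SUM() OVER, and GROUP BY in a pre-12c SQL solution.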
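12.
GROUP_NAME  START_TS     END_TS       GRP_START  GRP_ID
X           01-01 00:00  02-01 00:00  1          1
X           03-01 00:00  04-01 00:00  1          2
X           04-01 00:00  05-01 00:00  0          2
X           06-01 00:00  06-01 01:00  1          3
X           06-01 01:00  06-01 02:00  0          3
X           06-01 02:00  06-01 03:00  0          3
Y           06-01 03:00  06-01 04:00  1          1
Y           06-01 04:00  06-01 05:00  0          1
Y           07-03 08:00  09-29 17:00  1          2
Result:
GROUP_NAME  START_TS     END_TS
X           01-01 00:00  02-01 00:00
X           03-01 00:00  05-01 00:00
X           06-01 00:00  06-01 03:00
Y           06-01 03:00  06-01 05:00
Y           07-03 08:00  09-29 17:00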
with grp_starts as (
select a.*,
case when start_ts =
lag(end_ts) over(
partition by group_name
order by start_ts
)
then 0 else 1 end grp_start
from t a
), grps as (
select b.*,
sum(grp_start) over(
partition by group_name
order by start_ts
) grp_id
from grp_starts b)
select group_name,
min(start_ts) start_ts,
max(end_ts) end_ts
from grps
group by group_name, grp_id;
13. 2) Match_Recognize
SELECT * FROM t
MATCH_RECOGNIZE(
PARTITION BY group_name
ORDER BY start_ts
MEASURES
A.start_ts start_ts,
end_ts end_ts,
next(start_ts) - end_ts gap
PATTERN(A B*)
DEFINE B AS start_ts = prev(end_ts)
);
New this time:
• Added PARTITION BY
• MEASURES: added gap using a row outside the match!
• ONE ROW PER MATCH and SKIP PAST LAST ROW are the defaults
One solution replaces two methods: simple!
14. Which row do we mean?
Column name by itself = “current” row
• Define: row being evaluated
• All rows: each row being output
• One row: last row being output
START_TS  END_TS  DEFINE   MEASURES (ALL ROWS)  MEASURES (ONE ROW)
00:00     01:00   FIRST()  FIRST()              FIRST()
01:00     02:00   Current  Current              Current
02:00     03:00   LAST()   LAST()               LAST()
04:00     05:00            FINAL LAST           FINAL LAST
15. Which row do we mean?
Expression          DEFINE                        MEASURES, ALL ROWS…           MEASURES, ONE ROW…
FIRST(start_ts)     first row of match            first row of match            first row of match
start_ts            current row                   current row                   last row of match
LAST(end_ts)        current row                   current row                   last row of match
FINAL LAST(end_ts)  ORA-62509                     last row of match             last row of match
B.start_ts          most recent B row             most recent B row             last B row
PREV(), NEXT()      physical offset from referenced row (all three contexts)
COUNT(*)            from first to current row     from first to current row     all rows in match
COUNT(B.*)          B rows including current row  B rows including current row  all B rows
16. 2) Run_Stats comparison
For 500,000 rows:
Stat                      Pre 12c  Match_R  Pct
Latches                     10165     8066  79%
Elapsed Time                32.16    20.58  64%
CPU used by this session    31.94    19.67  62%
17. 2) Execution Plans
Operation Used-Mem
SELECT STATEMENT
HASH GROUP BY 20M (0)
VIEW
WINDOW BUFFER 32M (0)
VIEW
WINDOW SORT 27M (0)
TABLE ACCESS FULL
Operation Used-Mem
SELECT STATEMENT
VIEW
MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO 27M (0)
TABLE ACCESS FULL
18. 2) Matching within a group
SELECT * FROM (
  SELECT * FROM t
  WHERE group_name = 'X'
)
MATCH_RECOGNIZE (
  …
);
Filter before MATCH_RECOGNIZE to avoid extra work
19. 2) Predicate pushing
Select * from <view> where group_name = 'X'
Operation                                       Name  A-Rows  Buffers
SELECT STATEMENT                                      3       4
VIEW                                                  3       4
MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO        3       4
TABLE ACCESS BY INDEX ROWID BATCHED             T     6       4
INDEX RANGE SCAN                                TI    6       3
21.
SELECT s first_site, MAX(e) last_site, MAX(sm) sum_cnt FROM (
  SELECT s, e, cnt, sm FROM t
  MODEL
    DIMENSION BY (row_number() over(order by study_site) rn)
    MEASURES (study_site s, study_site e, cnt, cnt sm)
    RULES (
      sm[rn > 1] =
        CASE WHEN sm[cv() - 1] + cnt[cv()] > 65000 OR cnt[cv()] > 65000
             THEN cnt[cv()]
             ELSE sm[cv() - 1] + cnt[cv()]
        END,
      s[rn > 1] =
        CASE WHEN sm[cv() - 1] + cnt[cv()] > 65000 OR cnt[cv()] > 65000
             THEN s[cv()]
             ELSE s[cv() - 1]
        END
    )
)
GROUP BY s;
• DIMENSION with row_number orders data and processing
• rn can be used like a subscript
• cv() means current row
• cv()-1 means previous row
22.
SELECT * FROM t
MATCH_RECOGNIZE (
ORDER BY study_site
MEASURES
FIRST(study_site) first_site,
LAST(study_site) last_site,
SUM(cnt) sum_cnt
PATTERN (A+)
DEFINE A AS SUM(cnt) <= 65000
);
New this time:
• PATTERN (A+) replaces (A B*); it means 1 or more rows
• Why? In previous examples I used PREV(), which returns NULL on the first row.
One solution replaces 3 methods: simpler!
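The logic behind PATTERN (A+) with a running SUM(cnt) <= 65000 can be sketched in Python. This is an illustration under assumptions, not the presenter's code: the function name fixed_size_bins and the sample site numbers are hypothetical. The sketch mimics what happens when a single row exceeds the limit: A+ cannot match even one row there, so the row belongs to no bin.

```python
def fixed_size_bins(rows, capacity=65000):
    """Pack rows, in study_site order, into bins whose total stays <= capacity.

    rows: iterable of (study_site, cnt) tuples.
    Returns (first_site, last_site, sum_cnt) tuples, one per bin.
    """
    ordered = sorted(rows)
    bins = []
    i = 0
    while i < len(ordered):
        site, cnt = ordered[i]
        if cnt > capacity:
            # No match can start here: the row is left unmatched.
            i += 1
            continue
        first_site, last_site, total = site, site, cnt
        i += 1
        # Greedy: keep taking rows while the running sum still fits.
        while i < len(ordered) and total + ordered[i][1] <= capacity:
            last_site = ordered[i][0]
            total += ordered[i][1]
            i += 1
        bins.append((first_site, last_site, total))
    return bins


if __name__ == "__main__":
    sample = [(1001, 30000), (1002, 30000), (1003, 10000),
              (1004, 60000), (1005, 70000), (1006, 5000)]
    for row in fixed_size_bins(sample):
        print(*row)
```

Because the next bin always starts right after the previous one ends, the sketch also reflects the default AFTER MATCH SKIP PAST LAST ROW behavior.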
23. 3) Run_Stats comparison
For one million rows:
Stat                      Pre 12c  Match_R  Pct
Latches                    357448     4622   1%
Elapsed Time                32.85      2.9   9%
CPU used by this session    31.31     2.88   9%
24. 3) Execution Plans
Id Operation Used-Mem
0 SELECT STATEMENT
1 HASH GROUP BY 7534K (0)
2 VIEW
3 SQL MODEL ORDERED 105M (0)
4 WINDOW SORT 27M (0)
5 TABLE ACCESS FULL
Id Operation Used-Mem
0 SELECT STATEMENT
1 VIEW
2 MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO 27M (0)
3 TABLE ACCESS FULL
25. 4) “Bin fitting”: fixed number
Name  Val    Val  BIN1  BIN2  BIN3
   1    1     10    10
   2    2      9    10     9
   3    3      8    10     9     8
   4    4      7    10     9    15
   5    5      6    10    15    15
   6    6      5    15    15    15
   7    7      4    19    15    15
   8    8      3    19    18    15
   9    9      2    19    18    17
  10   10      1    19    18    18
• Requirement
– Distribute values in 3 parts as equally as possible
• “Best fit decreasing”
– Sort values in decreasing order
– Put each value in least full “bin”
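The "best fit decreasing" rule can be sketched in a few lines of Python. This is an illustrative sketch, not the presenter's code: the function name best_fit_decreasing is hypothetical, and ties between equally full bins are assumed to go to the lowest-numbered bin, as in the table above.

```python
def best_fit_decreasing(values, n_bins=3):
    """Distribute values over n_bins so that the bin totals end up close.

    Returns the list of bin totals, in bin order.
    """
    totals = [0] * n_bins
    # Largest values first, so the small ones can even things out at the end.
    for value in sorted(values, reverse=True):
        # min() returns the first minimum, so ties go to the lowest bin number.
        target = min(range(n_bins), key=lambda b: totals[b])
        totals[target] += value
    return totals


if __name__ == "__main__":
    print(best_fit_decreasing(range(1, 11)))
```

With the values 1 through 10 and three bins, the running totals follow the BIN1, BIN2, BIN3 columns of the table row by row.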
26. 4) Brilliant pre 12c solution
SELECT bin, Max (bin_value) bin_value
FROM (
SELECT * FROM items
MODEL
DIMENSION BY
(Row_Number() OVER
(ORDER BY item_value DESC) rn)
MEASURES (
item_name,
item_value,
Row_Number() OVER
(ORDER BY item_value DESC) bin,
item_value bin_value,
Row_Number() OVER
(ORDER BY item_value DESC) rn_m,
0 min_bin,
Count(*) OVER () - 3 - 1 n_iters
)
RULES ITERATE(100000)
UNTIL (ITERATION_NUMBER >= n_iters[1]) (
min_bin[1] = Min(rn_m) KEEP (DENSE_RANK
FIRST ORDER BY bin_value)[rn<= 3],
bin[ITERATION_NUMBER + 3 + 1] =
min_bin[1],
bin_value[min_bin[1]] =
bin_value[CV()] +
Nvl(item_value[ITERATION_NUMBER+4], 0))
)
WHERE item_name IS NOT NULL
group by bin;
27.
SELECT * FROM items
MATCH_RECOGNIZE (
ORDER BY item_value desc
MEASURES
sum(bin1.item_value) bin1,
sum(bin2.item_value) bin2,
sum(bin3.item_value) bin3
PATTERN ((bin1|bin2|bin3)+)
DEFINE
bin1 AS count(bin1.*) = 1
OR sum(bin1.item_value)-bin1.item_value
<= least(
sum(bin2.item_value),
sum(bin3.item_value)
),
bin2 AS count(bin2.*) = 1
OR sum(bin2.item_value)-bin2.item_value
<= sum(bin3.item_value)
);
• ()+ = 1 or more of whatever is inside
• '|' = alternatives, “preferred in the order specified”
• Bin1 condition:
  • No rows here yet,
  • Or this bin least full
• Bin2 condition:
  • No rows here yet, or
  • This bin less full than 3
28. 4) Run_Stats comparison
For 10,000 rows:
Stat                      Pre 12c  Match_R  Pct
Latches                      3124       47   2%
Elapsed Time                   28     0.02   0%
CPU used by this session    26.39     0.03   0%
29. 4) Execution Plans
Id Operation Used-Mem
0 SELECT STATEMENT
1 HASH GROUP BY 817K (0)
2 VIEW
3 SQL MODEL ORDERED 1846K (0)
4 WINDOW SORT 424K (0)
5 TABLE ACCESS FULL
Id Operation Used-Mem
0 SELECT STATEMENT
1 VIEW
2 MATCH RECOGNIZE SORT 330K (0)
3 TABLE ACCESS FULL
30. Backtracking
• What happens when there is no match?
• “Greedy” quantifiers - * + {2,}
– are not that greedy
– Take all the rows they can, BUT give rows back if necessary, one at a time
• Regular expression engines will test all possible combinations to find a match
31. Repeating conditions
select 'match' from (
select level n from dual
connect by level <= 100
)
match_recognize(
pattern(a b* c)
define b as n > prev(n)
, c as n = 0
);
Runs in 0.005 secs
select 'match' from (
select level n from dual
connect by level <= 100
)
match_recognize(
pattern(a b* b* b* c)
define b as n > prev(n)
, c as n = 0
);
Runs in 5.4 secs
32. Backtracking in action:
1. Find A
2. Find all the Bs you can
3. At the end, look for a C
4. No C? Backtrack through the Bs
5. Still no C? No Match!
123456789
A
AB
ABBB
ABBBB
ABBBBB
ABBBBBB
ABBBBBBB
ABBBBBBBC
ABBBBBBC
ABBBBBC
ABBBBC
ABBBC
ABBC
ABC
AC
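To see why repeating b* makes the cost explode, here is a toy backtracking matcher in Python that counts how many row conditions it tests. It is only a sketch of the general technique and makes no claim about how Oracle implements MATCH_RECOGNIZE: the function count_tests and the pattern encoding are hypothetical, and it counts a single attempt starting at the first row.

```python
def count_tests(pattern, rows):
    """Run a naive greedy backtracking match starting at row 0.

    pattern: list of (condition, is_star); condition(rows, i) -> bool.
    Returns (matched, number_of_condition_tests).
    """
    tests = 0

    def match(p, i):
        nonlocal tests
        if p == len(pattern):
            return True  # every pattern element has been satisfied
        condition, is_star = pattern[p]
        ok = False
        if i < len(rows):
            tests += 1
            ok = condition(rows, i)
        if is_star:
            # Greedy: take the row and stay on this element; if that
            # fails later on, give the row back and try the next element.
            if ok and match(p, i + 1):
                return True
            return match(p + 1, i)
        return ok and match(p + 1, i + 1)

    return match(0, 0), tests


def a(rows, i):
    return True  # undefined condition: always TRUE


def b(rows, i):
    return i > 0 and rows[i] > rows[i - 1]  # n > prev(n)


def c(rows, i):
    return rows[i] == 0  # never true for 1..100


SIMPLE = [(a, False), (b, True), (c, False)]
REPEATED = [(a, False), (b, True), (b, True), (b, True), (c, False)]

if __name__ == "__main__":
    rows = list(range(1, 101))
    print("a b* c       :", count_tests(SIMPLE, rows))
    print("a b* b* b* c :", count_tests(REPEATED, rows))
```

In this toy model the single b* costs a number of tests that grows linearly with the row count, while three b* in a row must try every way of splitting the same rows among them before giving up.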
33. Imprecise Conditions
SELECT * FROM Ticker
MATCH_RECOGNIZE (
  PARTITION BY symbol
  ORDER BY tstamp
  MEASURES FIRST(tstamp) AS start_tstamp,
           LAST(tstamp) AS end_tstamp
  AFTER MATCH SKIP TO LAST UP
  PATTERN (STRT DOWN+ UP+ DOWN+ UP+)
  DEFINE DOWN AS price < PREV(price),
         UP AS price > PREV(price),
         STRT AS price >= nvl(PREV(PRICE),0)
);
Runs in 0.02 seconds
Test data:
CREATE TABLE Ticker (
  SYMBOL VARCHAR2(10),
  tstamp DATE,
  price NUMBER
);
insert into ticker
select 'ACME',
       sysdate + level/24/60/60,
       10000-level
from dual
connect by level <= 5000;
With STRT left undefined (DEFINE ends at UP AS price > PREV(price)):
Runs in 24 seconds
INMEMORY: 13 seconds
34. Keep in Mind
• Backtracking
– Precise conditions
– Test data with no matches
• To debug:
– Measures classifier() cl, match_number() mn
– All rows per match with unmatched rows
• No DISTINCT, no LISTAGG
• MEASURES columns must have aliases
• “Reluctant quantifier” = ? = JDBC bind variable
• “Pattern variables” are range variables, not bind variables
35. Output Row “shape”
Per Match  PARTITION BY  ORDER BY  MEASURES  Other input
ONE ROW    X             Omitted   X         Omitted
ALL ROWS   X             X         X         X
ORA-00918, anyone?