Row Pattern Matching with Database 12c MATCH_RECOGNIZE

“Row Pattern Matching” with Database 12c MATCH_RECOGNIZE
Beating the Best Pre-12c Solutions
Stew Ashton, UKOUG Tech 14

Agenda
• Who am I?
• Pre-12c solutions compared to row pattern matching with MATCH_RECOGNIZE
– For all sizes of data
– Thinking in patterns
• Watch out for “catastrophic backtracking”
• Other things to keep in mind (time permitting)

Who am I?
• 33 years in IT
– Developer, Technical Sales Engineer, Technical Architect
– Aeronautics, IBM, Finance
– Mainframe, client-server, Web apps
• 25 years as an American in Paris
• 9 years using Oracle database
– Performance analysis
– Replace Java with SQL
• 2 years as internal “Oracle Development Expert”

1) “Fixed Difference”
• Identify and group rows with consecutive values
• My presentation: print slides to keep
• Math: subtract known consecutives
– If A-1 = B-2 then A = B-1
– Else A <> B-1
– Consecutive becomes equality,
non-consecutive becomes inequality
• “Consecutive” = fixed difference of 1
PAGE
1
2
3
5
6
7
10
11
12
36

1) Pre-12c
SELECT MIN(page) firstpage,
       MAX(page) lastpage,
       COUNT(*) cnt
FROM (
  SELECT page,
         page - ROW_NUMBER() OVER (ORDER BY page) AS grp_id
  FROM t
)
GROUP BY grp_id;
Inline view output:
PAGE  [RN]  GRP_ID
   1     1       0
   2     2       0
   3     3       0
   5     4       1
   6     5       1
   7     6       1
  10     7       3
  11     8       3
  12     9       3
  36    10      26

Final output:
FIRSTPAGE  LASTPAGE  CNT
        1         3    3
        5         7    3
       10        12    3
       36        36    1
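The fixed-difference arithmetic can be checked outside the database. The following is a minimal Python sketch, not Oracle SQL; the function name `group_consecutive` is hypothetical and the input is the PAGE column from the slide.

```python
def group_consecutive(pages):
    """Group distinct integers into runs of consecutive values.

    Mirrors the pre-12c query: page - ROW_NUMBER() is constant
    within a run, so it can serve as the GROUP BY key.
    """
    groups = {}
    for rn, page in enumerate(sorted(pages), start=1):
        grp_id = page - rn  # same value for every page of a consecutive run
        groups.setdefault(grp_id, []).append(page)
    # One (firstpage, lastpage, cnt) tuple per group, in order of appearance
    return [(g[0], g[-1], len(g)) for g in groups.values()]


pages = [1, 2, 3, 5, 6, 7, 10, 11, 12, 36]
result = group_consecutive(pages)
print(result)
```

The sketch relies on the pages being distinct, as they are in the sample data; duplicates would break the fixed difference.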

Think “match a row pattern”
• PATTERN
– Uninterrupted series of input rows
– Described as a list of conditions (“regular expressions”)
PATTERN (A B*)
"A" : 1 row, "B" : 0 or more rows, as many as possible
• DEFINE each row condition
[A undefined = TRUE]
B AS page = PREV(page)+1
• Each series that matches the pattern is a “match”
– "A" and "B" identify the rows that meet their conditions

Input, Processing, Output
1. Define input
2. Order input
3. Process pattern
4. using defined conditions
5. Output: rows per match
6. Output: columns per row
7. Go where after match?
SELECT *                               -- 6. Output: columns per row
FROM t                                 -- 1. Define input
MATCH_RECOGNIZE (
  ORDER BY page                        -- 2. Order input
  MEASURES                             -- 6. Output: columns per row
    A.page firstpage,
    LAST(page) lastpage,
    COUNT(*) cnt
  ONE ROW PER MATCH                    -- 5. Output: rows per match
  AFTER MATCH SKIP PAST LAST ROW       -- 7. Go where after match?
  PATTERN (A B*)                       -- 3. Process pattern
  DEFINE B AS page = PREV(page)+1      -- 4. using defined conditions
);
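To make the processing order concrete, here is a rough Python model of how PATTERN (A B*) with this DEFINE consumes ordered rows. It is only a sketch of the semantics described on the slide, not how Oracle implements it; `match_a_b_star` is a hypothetical name.

```python
def match_a_b_star(pages):
    """Return (firstpage, lastpage, cnt) for each match of PATTERN (A B*)."""
    rows = sorted(pages)                      # ORDER BY page
    matches = []
    i = 0
    while i < len(rows):
        start = i                             # A is undefined, so any row matches
        i += 1
        # B*: greedy, keep taking rows while page = PREV(page) + 1
        while i < len(rows) and rows[i] == rows[i - 1] + 1:
            i += 1
        # ONE ROW PER MATCH with the three MEASURES
        matches.append((rows[start], rows[i - 1], i - start))
        # AFTER MATCH SKIP PAST LAST ROW: i already points at the next row
    return matches


matches = match_a_b_star([1, 2, 3, 5, 6, 7, 10, 11, 12, 36])
print(matches)
```

Because A always matches, every row ends up in exactly one match, which is why the output covers all pages.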

1) Run_Stats comparison
For one million rows:
Stat                      Pre 12c  Match_R  Pct
Latches                      4090     4079  100%
Elapsed Time                 5.51     5.56  101%
CPU used by this session      5.5     5.55  101%
“Latches” are serialization devices: fewer means more scalable

1) Execution Plans
Pre-12c:
Id  Operation          Name  Starts  E-Rows  A-Rows  A-Time       Buffers  OMem  1Mem   Used-Mem
 0  SELECT STATEMENT         1               400K    00:00:01.83  1594
 1  HASH GROUP BY            1       1000K   400K    00:00:01.83  1594     41M   5035K  40M (0)
 2  VIEW                     1       1000K   1000K   00:00:12.69  1594
 3  WINDOW SORT              1       1000K   1000K   00:00:03.46  1594     22M   1749K  20M (0)
 4  TABLE ACCESS FULL  T     1       1000K   1000K   00:00:02.53  1594

MATCH_RECOGNIZE:
Id  Operation                                       Name  Starts  E-Rows  A-Rows  A-Time       Buffers  OMem  1Mem   Used-Mem
 0  SELECT STATEMENT                                      1               400K    00:00:03.45  1594
 1  VIEW                                                  1       1000K   400K    00:00:03.45  1594
 2  MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO        1       1000K   400K    00:00:01.87  1594     22M   1749K  20M (0)
 3  TABLE ACCESS FULL                               T     1       1000K   1000K   00:00:02.09  1594

2) “Start of Group”
• Identify group boundaries, often using LAG()
• 3 steps instead of 2:
1. For each row: if start of group, assign 1
Else assign 0
2. Running total of 1s and 0s produces a group
identifier
3. Group by the group identifier

2) Requirement
GROUP_NAME START_TS END_TS
X 2014-01-01 00:00 2014-02-01 00:00
X 2014-03-01 00:00 2014-04-01 00:00
X 2014-04-01 00:00 2014-05-01 00:00
X 2014-06-01 00:00 2014-06-01 01:00
X 2014-06-01 01:00 2014-06-01 02:00
X 2014-06-01 02:00 2014-06-01 03:00
Y 2014-06-01 03:00 2014-06-01 04:00
Y 2014-06-01 04:00 2014-06-01 05:00
Y 2014-07-03 08:00 2014-09-29 17:00
Merge contiguous date ranges in same group

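GROUP_NAME  START_TS     END_TS       GRP_START  GRP_ID
X           01-01 00:00  02-01 00:00  1          1
X           03-01 00:00  04-01 00:00  1          2
X           04-01 00:00  05-01 00:00  0          2
X           06-01 00:00  06-01 01:00  1          3
X           06-01 01:00  06-01 02:00  0          3
X           06-01 02:00  06-01 03:00  0          3
Y           06-01 03:00  06-01 04:00  1          1
Y           06-01 04:00  06-01 05:00  0          1
Y           07-03 08:00  09-29 17:00  1          2

Merged result:
GROUP_NAME  START_TS     END_TS
X           01-01 00:00  02-01 00:00
X           03-01 00:00  05-01 00:00
X           06-01 00:00  06-01 03:00
Y           06-01 03:00  06-01 05:00
Y           07-03 08:00  09-29 17:00
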
with grp_starts as (
select a.*,
case when start_ts =
lag(end_ts) over(
partition by group_name
order by start_ts
)
then 0 else 1 end grp_start
from t a
), grps as (
select b.*,
sum(grp_start) over(
partition by group_name
order by start_ts
) grp_id
from grp_starts b)
select group_name,
min(start_ts) start_ts,
max(end_ts) end_ts
from grps
group by group_name, grp_id;
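The three “start of group” steps can be traced in a few lines of Python. This is a sketch of the logic only, using the sample rows from the requirement slide; timestamps are kept as strings that sort chronologically, and `merge_ranges` is a hypothetical name.

```python
from itertools import groupby

rows = [  # (group_name, start_ts, end_ts)
    ("X", "2014-01-01 00:00", "2014-02-01 00:00"),
    ("X", "2014-03-01 00:00", "2014-04-01 00:00"),
    ("X", "2014-04-01 00:00", "2014-05-01 00:00"),
    ("X", "2014-06-01 00:00", "2014-06-01 01:00"),
    ("X", "2014-06-01 01:00", "2014-06-01 02:00"),
    ("X", "2014-06-01 02:00", "2014-06-01 03:00"),
    ("Y", "2014-06-01 03:00", "2014-06-01 04:00"),
    ("Y", "2014-06-01 04:00", "2014-06-01 05:00"),
    ("Y", "2014-07-03 08:00", "2014-09-29 17:00"),
]


def merge_ranges(rows):
    """Merge contiguous ranges within each group_name."""
    merged = []
    ordered = sorted(rows)  # PARTITION BY group_name ORDER BY start_ts
    for name, partition in groupby(ordered, key=lambda r: r[0]):
        grp_id = 0
        prev_end = None      # plays the role of LAG(end_ts)
        groups = {}
        for _, start_ts, end_ts in partition:
            grp_start = 0 if start_ts == prev_end else 1   # step 1: flag group starts
            grp_id += grp_start                            # step 2: running total
            groups.setdefault(grp_id, []).append((start_ts, end_ts))
            prev_end = end_ts
        for ranges in groups.values():                     # step 3: group by grp_id
            merged.append((name,
                           min(s for s, _ in ranges),
                           max(e for _, e in ranges)))
    return merged


merged = merge_ranges(rows)
for row in merged:
    print(row)
```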

2) Match_Recognize
SELECT * FROM t
MATCH_RECOGNIZE(
PARTITION BY group_name
ORDER BY start_ts
MEASURES
A.start_ts start_ts,
end_ts end_ts,
next(start_ts) - end_ts gap
PATTERN(A B*)
DEFINE B AS start_ts = prev(end_ts)
);
New this time:
• Added PARTITION BY
• MEASURES: added gap using row outside the match!
• ONE ROW PER MATCH and SKIP PAST LAST ROW are the defaults
One solution replaces two methods: simple!

Which row do we mean?
Column name by itself = “current” row
• Define: row being evaluated
• All rows: each row being output
• One row: last row being output

START_TS  END_TS  DEFINE   MEASURES (ALL ROWS)  MEASURES (ONE ROW)
00:00     01:00   FIRST()  FIRST()              FIRST()
01:00     02:00   Current  Current              Current
02:00     03:00   LAST()   LAST()               LAST()
04:00     05:00            FINAL LAST           FINAL LAST

Which row do we mean?
Expression          DEFINE                        MEASURES (ALL ROWS…)          MEASURES (ONE ROW…)
FIRST(start_ts)     first row of match            first row of match            first row of match
start_ts            current row                   current row                   last row of match
LAST(end_ts)        current row                   current row                   last row of match
FINAL LAST(end_ts)  ORA-62509                     last row of match             last row of match
B.start_ts          most recent B row             most recent B row             last B row
PREV(), NEXT()      physical offset from referenced row (all three cases)
COUNT(*)            from first to current row     from first to current row     all rows in match
COUNT(B.*)          B rows including current row  B rows including current row  all B rows

2) Run_Stats comparison
For 500,000 rows:
Stat                      Pre 12c  Match_R  Pct
Latches                     10165     8066  79%
Elapsed Time                32.16    20.58  64%
CPU used by this session    31.94    19.67  62%

2) Execution Plans
Pre-12c:
Operation            Used-Mem
SELECT STATEMENT
HASH GROUP BY        20M (0)
VIEW
WINDOW BUFFER        32M (0)
VIEW
WINDOW SORT          27M (0)
TABLE ACCESS FULL

MATCH_RECOGNIZE:
Operation                                       Used-Mem
SELECT STATEMENT
VIEW
MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO  27M (0)
TABLE ACCESS FULL

2) Matching within a group
SELECT * FROM (
  SELECT * FROM t
  WHERE group_name = 'X'
)
MATCH_RECOGNIZE (
  …
);
Filter before MATCH_RECOGNIZE to avoid extra work

2) Predicate pushing
Select * from <view> where group_name = 'X'

Operation                                       Name  A-Rows  Buffers
SELECT STATEMENT                                           3        4
VIEW                                                       3        4
MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO             3        4
TABLE ACCESS BY INDEX ROWID BATCHED             T          6        4
INDEX RANGE SCAN                                TI         6        3

3) “Bin fitting”: fixed size
• Requirement
– Order by study_site
– Put in “bins” with size = 65,000 max

STUDY_SITE    CNT    STUDY_SITE   CNT
1001         3407    1026         137
1002         4323    1028        6005
1004         1623    1029          76
1008         1991    1031        4599
1011          885    1032        1989
1012        11597    1034        3427
1014         1989    1036         879
1015         5282    1038        6485
1017         2841    1039           3
1018         5183    1040        1105
1020         6176    1041        6460
1022         2784    1042         968
1023        25865    1044         471
1024         3734    1045        3360

Result:
FIRST_SITE  LAST_SITE  SUM_CNT
1001        1022         48081
1023        1044         62203
1045        1045          3360
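The fixed-size requirement amounts to a running sum that restarts whenever the next row would overflow the bin. Below is a small Python sketch of that rule on the slide's data; it is an illustration, not the SQL solution, and `fill_bins` is a hypothetical name. One assumption is labeled in the code: a single count larger than the limit simply gets a bin of its own.

```python
rows = [  # (study_site, cnt) from the slide
    (1001, 3407), (1002, 4323), (1004, 1623), (1008, 1991), (1011, 885),
    (1012, 11597), (1014, 1989), (1015, 5282), (1017, 2841), (1018, 5183),
    (1020, 6176), (1022, 2784), (1023, 25865), (1024, 3734), (1026, 137),
    (1028, 6005), (1029, 76), (1031, 4599), (1032, 1989), (1034, 3427),
    (1036, 879), (1038, 6485), (1039, 3), (1040, 1105), (1041, 6460),
    (1042, 968), (1044, 471), (1045, 3360),
]


def fill_bins(rows, max_size=65000):
    """Return (first_site, last_site, sum_cnt) per bin, filling in site order."""
    bins = []
    for site, cnt in sorted(rows):              # ORDER BY study_site
        if bins and bins[-1][2] + cnt <= max_size:
            first, _, total = bins[-1]          # row still fits: extend current bin
            bins[-1] = (first, site, total + cnt)
        else:
            # Start a new bin. Assumption of this sketch: a row whose cnt
            # alone exceeds max_size still gets its own bin.
            bins.append((site, site, cnt))
    return bins


bins = fill_bins(rows)
for b in bins:
    print(b)
```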

SELECT s first_site, MAX(e) last_site, MAX(sm) sum_cnt FROM (
  SELECT s, e, cnt, sm FROM t
  MODEL
    DIMENSION BY (row_number() over(order by study_site) rn)
    MEASURES (study_site s, study_site e, cnt, cnt sm)
    RULES (
      sm[rn > 1] =
        CASE WHEN sm[cv() - 1] + cnt[cv()] > 65000 OR cnt[cv()] > 65000
          THEN cnt[cv()]
          ELSE sm[cv() - 1] + cnt[cv()]
        END,
      s[rn > 1] =
        CASE WHEN sm[cv() - 1] + cnt[cv()] > 65000 OR cnt[cv()] > 65000
          THEN s[cv()]
          ELSE s[cv() - 1]
        END
    )
)
GROUP BY s;
• DIMENSION with row_number orders data and processing
• rn can be used like a subscript
• cv() means current row
• cv()-1 means previous row

SELECT * FROM t
MATCH_RECOGNIZE (
ORDER BY study_site
MEASURES
FIRST(study_site) first_site,
LAST(study_site) last_site,
SUM(cnt) sum_cnt
PATTERN (A+)
DEFINE A AS SUM(cnt) <= 65000
);
New this time:
• PATTERN (A+) replaces (A B*): it means 1 or more rows
• Why? In previous examples I used PREV(), which returns NULL on the first row.
One solution replaces 3 methods: simpler!

3) Run_Stats comparison
For one million rows:
Stat          Pre 12c  Match_R  Pct
Latches        357448     4622   1%
Elapsed Time    32.85      2.9   9%

3) Execution Plans
Pre-12c:
Id  Operation          Used-Mem
 0  SELECT STATEMENT
 1  HASH GROUP BY      7534K (0)
 2  VIEW
 3  SQL MODEL ORDERED  105M (0)
 4  WINDOW SORT        27M (0)
 5  TABLE ACCESS FULL

MATCH_RECOGNIZE:
Id  Operation                                       Used-Mem
 0  SELECT STATEMENT
 1  VIEW
 2  MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO  27M (0)
 3  TABLE ACCESS FULL

4) “Bin fitting”: fixed number
Input:
Name  Val
1       1
2       2
3       3
4       4
5       5
6       6
7       7
8       8
9       9
10     10

Values in decreasing order, with running bin totals:
Val  BIN1  BIN2  BIN3
 10    10
  9    10     9
  8    10     9     8
  7    10     9    15
  6    10    15    15
  5    15    15    15
  4    19    15    15
  3    19    18    15
  2    19    18    17
  1    19    18    18

• Requirement
– Distribute values in 3 parts as equally as possible
• “Best fit decreasing”
– Sort values in decreasing order
– Put each value in least full “bin”
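“Best fit decreasing” as described above fits in a short Python sketch. It is an illustration of the rule, not the SQL solution; `best_fit_decreasing` is a hypothetical name, and breaking ties in favor of the lowest-numbered bin is an assumption chosen to reproduce the running totals on the slide.

```python
def best_fit_decreasing(values, n_bins=3):
    """Distribute values over n_bins, always adding to the least full bin."""
    bins = [0] * n_bins
    for value in sorted(values, reverse=True):     # decreasing order
        # Assumption: on a tie, the lowest-numbered bin wins
        least_full = bins.index(min(bins))
        bins[least_full] += value
    return bins


totals = best_fit_decreasing(range(1, 11))
print(totals)
```

The totals sum to 55, the sum of the values 1 to 10, so no bin can be more than one unit away from the ideal split.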

4) Brilliant pre 12c solution
SELECT bin, Max (bin_value) bin_value
FROM (
SELECT * FROM items
MODEL
DIMENSION BY
(Row_Number() OVER
(ORDER BY item_value DESC) rn)
MEASURES (
item_name,
item_value,
Row_Number() OVER
(ORDER BY item_value DESC) bin,
item_value bin_value,
Row_Number() OVER
(ORDER BY item_value DESC) rn_m,
0 min_bin,
Count(*) OVER () - 3 - 1 n_iters
)
RULES ITERATE(100000)
UNTIL (ITERATION_NUMBER >= n_iters[1]) (
min_bin[1] = Min(rn_m) KEEP (DENSE_RANK
FIRST ORDER BY bin_value)[rn<= 3],
bin[ITERATION_NUMBER + 3 + 1] =
min_bin[1],
bin_value[min_bin[1]] =
bin_value[CV()] +
Nvl(item_value[ITERATION_NUMBER+4], 0))
)
WHERE item_name IS NOT NULL
group by bin;

SELECT * from items
MATCH_RECOGNIZE (
ORDER BY item_value desc
MEASURES
sum(bin1.item_value) bin1,
sum(bin2.item_value) bin2,
sum(bin3.item_value) bin3
PATTERN ((bin1|bin2|bin3)+)
DEFINE
bin1 AS count(bin1.*) = 1
OR sum(bin1.item_value)-bin1.item_value
<= least(
sum(bin2.item_value),
sum(bin3.item_value)
),
bin2 AS count(bin2.*) = 1
OR sum(bin2.item_value)-bin2.item_value
<= sum(bin3.item_value)
);
• ()+ = 1 or more of whatever is inside
• '|' = alternatives, “preferred in the order specified”
• Bin1 condition:
– No rows here yet,
– Or this bin least full
• Bin2 condition:
– No rows here yet, or
– This bin less full than 3

4) Run_Stats comparison
For 10,000 rows:
Stat          Pre 12c  Match_R  Pct
Latches          3124       47   2%
Elapsed Time       28     0.02   0%

4) Execution Plans
Pre-12c:
Id  Operation          Used-Mem
 0  SELECT STATEMENT
 1  HASH GROUP BY      817K (0)
 2  VIEW
 3  SQL MODEL ORDERED  1846K (0)
 4  WINDOW SORT        424K (0)
 5  TABLE ACCESS FULL

MATCH_RECOGNIZE:
Id  Operation             Used-Mem
 0  SELECT STATEMENT
 1  VIEW
 2  MATCH RECOGNIZE SORT  330K (0)
 3  TABLE ACCESS FULL

Backtracking
• What happens when there is no match?
• “Greedy” quantifiers (* + {2,}) are not that greedy
– They take all the rows they can, BUT give rows back if necessary, one at a time
• Regular expression engines will test all possible combinations to find a match

Repeating conditions
select 'match' from (
select level n from dual
connect by level <= 100
)
match_recognize(
pattern(a b* c)
define b as n > prev(n)
, c as n = 0
);
Runs in 0.005 secs
select 'match' from (
select level n from dual
connect by level <= 100
)
match_recognize(
pattern(a b* b* b* c)
define b as n > prev(n)
, c as n = 0
);
Runs in 5.4 secs

Rows: 123456789
      A
      AB
      ABB
      ABBB
      ABBBB
      ABBBBB
      ABBBBBB
      ABBBBBBB
      ABBBBBBBC
      ABBBBBBC
      ABBBBBC
      ABBBBC
      ABBBC
      ABBC
      ABC
      AC
Backtracking in action:
1. Find A
2. Find all the Bs you can
3. At the end, look for a C
4. No C? Backtrack through the Bs
5. Still no C? No Match!
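The steps above can be imitated with a toy backtracking matcher that counts how many condition checks it makes. This Python sketch is not Oracle's engine; `count_steps` and its pattern encoding are hypothetical, and the conditions are the ones from the “Repeating conditions” slide, run on 20 rows instead of 100.

```python
def count_steps(pattern, rows):
    """Count condition checks made by a naive backtracking matcher.

    pattern is a list of (variable, quantifier) pairs; quantifier is "" or "*".
    """
    conditions = {
        "a": lambda i: True,                             # A undefined = TRUE
        "b": lambda i: i > 0 and rows[i] > rows[i - 1],  # n > prev(n)
        "c": lambda i: rows[i] == 0,                     # n = 0
    }
    steps = 0

    def match(p, i):
        nonlocal steps
        if p == len(pattern):
            return True
        name, quantifier = pattern[p]
        if quantifier == "":
            steps += 1
            return i < len(rows) and conditions[name](i) and match(p + 1, i + 1)
        j = i
        while j < len(rows):      # greedy: take every row that qualifies
            steps += 1
            if not conditions[name](j):
                break
            j += 1
        while j >= i:             # then give rows back one at a time
            if match(p + 1, j):
                return True
            j -= 1
        return False

    found = False
    for start in range(len(rows)):   # no match from this row? try the next one
        if match(0, start):
            found = True
            break
    return found, steps


rows = list(range(1, 21))            # n = 1..20, so C (n = 0) never matches
simple = count_steps([("a", ""), ("b", "*"), ("c", "")], rows)
triple = count_steps([("a", ""), ("b", "*"), ("b", "*"), ("b", "*"), ("c", "")], rows)
print("a b* c        :", simple)
print("a b* b* b* c  :", triple)
```

Neither pattern finds a match, but the version with three B* has to try every way of splitting the same rows among the three quantifiers, so its step count grows much faster with the number of rows.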

Imprecise Conditions
Test data: the price only goes down, so there is no match.
CREATE TABLE Ticker (
  SYMBOL VARCHAR2(10),
  tstamp DATE,
  price NUMBER
);
insert into ticker
select 'ACME',
  sysdate + level/24/60/60,
  10000-level
from dual
connect by level <= 5000;

SELECT * FROM Ticker
MATCH_RECOGNIZE (
  PARTITION BY symbol
  ORDER BY tstamp
  MEASURES FIRST(tstamp) AS start_tstamp,
           LAST(tstamp) AS end_tstamp
  AFTER MATCH SKIP TO LAST UP
  PATTERN (STRT DOWN+ UP+ DOWN+ UP+)
  DEFINE DOWN AS price < PREV(price),
         UP AS price > PREV(price),
         STRT AS price >= nvl(PREV(PRICE),0)
);
Runs in 0.02 seconds

Same query with STRT left undefined (DEFINE ends at UP AS price > PREV(price)):
Runs in 24 seconds
INMEMORY: 13 seconds

Keep in Mind
• Backtracking
– Precise conditions
– Test data with no matches
• To debug:
– MEASURES classifier() cl, match_number() mn
– ALL ROWS PER MATCH WITH UNMATCHED ROWS
• No DISTINCT, no LISTAGG
• MEASURES columns must have aliases
• “Reluctant quantifier” uses “?”, which is also the JDBC bind variable marker
• “Pattern variables” are range variables, not bind variables

Output Row “shape”
Per Match  PARTITION BY  ORDER BY  MEASURES  Other input
ONE ROW    X             Omitted   X         Omitted
ALL ROWS   X             X         X         X
ORA-00918, anyone?

Questions?
More details at:
stewashton.wordpress.com
