October 13-16, 2016 • Austin, TX
Parallel SQL
Joel Bernstein
Search Engineer, Alfresco
jbernste@apache.org

Parallel Computing with SolrCloud: Presented by Joel Bernstein, Alfresco

Lucene/Solr Revolution 2015

Published in: Technology

  1. October 13-16, 2016 • Austin, TX
  2. Parallel SQL
     Joel Bernstein
     Search Engineer, Alfresco
     jbernste@apache.org
  3. Introduction
     •  Joel Bernstein
     •  Lucene/Solr Committer
     •  Search Engineer at Alfresco
     •  Live and work in NYC
  4. Alfresco
     •  Open source ECM (Enterprise Content Management)
     •  Alfresco is a system of record for documents
     •  Uses Solr for search
     •  1800+ customers
     •  11 million active user accounts
     •  Alfresco Solr: document-level access control, eventually consistent, transactional, multi-master, distributed search and faceting (coming in Alfresco 5.1)
  5. Agenda
     1.  SQL Unleashed (What can it do?)
     2.  SQL Under the Hood (How does it work?)
  6. SQL Unleashed (In Solr 6.0)
  7. Why SQL?
     •  Solr has many awesome features.
     •  But all of these features create complexity.
     •  Which faceting API to use? When to stream? Which parameters to use for optimal performance?
     •  The complexity level increases dramatically when distributed joins come into play.
     •  With SQL we can provide an optimizer to choose the best query plan.
  8. The SQL Interface at a Glance
     •  SQL over Map/Reduce: supports high cardinality aggregations and distributed joins.
     •  SQL over Facets: high performance on moderate cardinality aggregations.
     •  SQL with Solr Search Predicates
     •  SQL is fully integrated with SolrCloud
  9. SQL Syntax: Limited and Unlimited SELECT
     •  select colA, colB from tableB
     •  select colA, colB from tableB limit 100
     •  Unlimited selects return the entire result set. Return fields must be DocValues.
     •  Limited selects can sort by score and retrieve any stored field.
  10. SQL Syntax: ORDER BY
      •  select a, b from tableB order by a desc, b desc
      •  Unlimited selects sort the entire result set
  11. The Predicate: Phrase Searching
      •  select a, b from tableB where c = 'hello world'
      •  Searches for the phrase 'hello world' in field c.
  12. The Predicate: Boolean searching
      •  select a, b from tableB where c = '(hello world)'
      •  Adding parens searches for (hello OR world).
      •  Supports Solr query syntax inside the parens.
  13. The Predicate: Range query
      •  select a, b from tableB where c = '[0 TO 100]'
  14. The Predicate: Arbitrary Boolean clauses
      •  select a, b from tableB where (c = 'hello world' AND d = '[0 TO 100]')
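The predicate slides describe three value shapes: a plain string is a phrase search, a parenthesised value is Solr query syntax, and a bracketed value is a range. The sketch below illustrates that reading as a mapping to Solr query clauses. It is not the translation code Solr uses; `predicate_to_solr` and `and_clauses` are hypothetical helpers, and the rule that a single term is left unquoted is an assumption.

```python
def predicate_to_solr(field: str, value: str) -> str:
    """Map a SQL equality predicate to a Solr query clause (illustrative only)."""
    value = value.strip()
    # Parenthesised values and range expressions are already Solr syntax,
    # so they are passed through unchanged after the field name.
    is_group = value.startswith("(") and value.endswith(")")
    is_range = value.startswith("[") and value.endswith("]")
    if is_group or is_range:
        return f"{field}:{value}"
    # A value containing whitespace is treated as a phrase and quoted.
    if " " in value:
        return f'{field}:"{value}"'
    # Assumption: a single term needs no quoting.
    return f"{field}:{value}"


def and_clauses(*clauses: str) -> str:
    """Join clauses the way an AND in the SQL WHERE clause would."""
    return "(" + " AND ".join(clauses) + ")"


if __name__ == "__main__":
    print(predicate_to_solr("c", "hello world"))    # c:"hello world"
    print(predicate_to_solr("c", "(hello world)"))  # c:(hello world)
    print(predicate_to_solr("c", "[0 TO 100]"))     # c:[0 TO 100]
    print(and_clauses(predicate_to_solr("c", "hello world"),
                      predicate_to_solr("d", "[0 TO 100]")))
```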
  15. SQL Syntax: Select Distinct
      •  select distinct a, b from tableB
      •  Map/Reduce Implementation: Tuples are shuffled to worker nodes where the distinct operation is performed.
      •  JSON Facet Implementation: distinct operation is pushed down into the search engine
      •  Map/Reduce for high cardinality
      •  Facet for high QPS
  16. Shuffle vs Push Down
      •  Shuffling: high cardinality and parallel relational algebra (Distributed Joins)
      •  Pushdown (Facet): blazing fast, high QPS, moderate cardinality
      •  aggregationMode flag is available with the JDBC driver and http interface [map_reduce or facet]
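As a rough illustration of the aggregationMode flag on the http interface, the sketch below builds the URL and form body for a SQL request without sending it. It assumes the /sql handler reads the statement from a `stmt` parameter, as the Solr 6 reference guide describes; the host, the collection name, and the helper `build_sql_request` are hypothetical.

```python
from urllib.parse import parse_qs, urlencode

VALID_MODES = ("map_reduce", "facet")


def build_sql_request(solr_url, collection, stmt, aggregation_mode="map_reduce"):
    """Return (url, form_body) for a POST to the collection's /sql handler."""
    if aggregation_mode not in VALID_MODES:
        raise ValueError(f"aggregationMode must be one of {VALID_MODES}")
    url = f"{solr_url.rstrip('/')}/{collection}/sql"
    # urlencode escapes spaces, parentheses and '*' in the statement.
    body = urlencode({"stmt": stmt, "aggregationMode": aggregation_mode})
    return url, body


if __name__ == "__main__":
    # Hypothetical host and collection; the statement is from the Stats slide.
    url, body = build_sql_request(
        "http://localhost:8983/solr",
        "collection1",
        "select count(*), sum(a) from tableA",
        aggregation_mode="facet",
    )
    print(url)
    print(parse_qs(body))
```

Switching `aggregation_mode` between `facet` and `map_reduce` is the only change needed to move the same statement between the push-down and shuffle implementations.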
  17. Aggregations: Stats
      •  select count(*), sum(a) from tableA
      •  Uses the StatsComponent under the covers
      •  Initial release supports count, sum, avg, min, max
      •  Aggregation logic is always pushed down into the search engine.
  18. Aggregations: GROUP BY
      •  select a, b, count(*), sum(c) from tableB group by a, b having count(*) > 50 order by sum(c) desc
      •  Supports complex having clause: having (count(*) > 50 AND sum(b) < 1000)
      •  Has Map/Reduce implementation (shuffle)
      •  And JSON Facet implementation (push down)
      •  Map/Reduce can handle high cardinality multi-dimension aggregations.
  19. JDBC Driver
      •  Ships with Solrj
      •  Poolable Connection and Statement
      •  SolrCloud Aware Load Balancing
      •  Connection has aggregationMode switch [map_reduce or facet]
  20. SQL Under the Hood
  21. SQL Parsing
      •  Presto SQL Parser handles the parsing
      •  SQL Statements are compiled to TupleStream objects
      •  The TupleStream is the base interface of the Streaming API
      •  The Streaming API is a general purpose parallel computing API for SolrCloud
  22. Parallel Computing Framework
      •  Shuffling
      •  Worker Collections
      •  Streaming API
      •  Streaming Expressions
      •  Parallel SQL
  23. Shuffling (sorting & partitioning)
      •  Shuffling is pushed down into the search engine
      •  Sorting: /export handler "stream sorts" entire result sets.
      •  Partitioning: HashQParserPlugin, hash partitioning filter. Partitions results on arbitrary fields.
      •  Tuples (search results) begin streaming instantly to worker nodes. Shuffling never requires a spill to disk.
      •  All replicas shuffle in parallel for the same query. Allows for massive throughput.
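The shuffle described above combines a sort with a hash partition on the key fields. A minimal in-memory sketch of that idea follows. It is not how the /export handler or HashQParserPlugin is implemented: `zlib.crc32` stands in for whatever hash Solr uses, and `partition` and `shuffle` are hypothetical names.

```python
import zlib
from collections import defaultdict


def partition(tuples, partition_keys, num_workers):
    """Assign each tuple to a worker by hashing its partition-key values."""
    buckets = defaultdict(list)
    for t in tuples:
        key = "|".join(str(t[k]) for k in partition_keys)
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        worker = zlib.crc32(key.encode("utf-8")) % num_workers
        buckets[worker].append(t)
    return dict(buckets)


def shuffle(tuples, partition_keys, num_workers):
    """Sort on the keys, then partition, so each worker gets a sorted stream."""
    ordered = sorted(tuples, key=lambda t: tuple(t[k] for k in partition_keys))
    return partition(ordered, partition_keys, num_workers)


if __name__ == "__main__":
    docs = [{"str_s": s, "field_i": i}
            for i, s in enumerate(["b", "a", "c", "a", "b", "c", "a"])]
    for worker, stream in sorted(shuffle(docs, ["str_s"], 2).items()):
        print(worker, [t["str_s"] for t in stream])
```

Because all tuples with the same key hash to the same worker and arrive in sorted order, each worker can aggregate or join its share independently.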
  24. Shuffling (sorting & partitioning)
      [Diagram: a client, two workers (Worker 1, Worker 2), and two shards with two replicas each (Shard 1 Replica 1, Shard 1 Replica 2, Shard 2 Replica 1, Shard 2 Replica 2)]
      •  Each worker is shuffled ½ the result set
      •  Tuples are sorted and partitioned on keys
  25. Worker Collections
      •  Are Generic SolrCloud Collections
      •  Can hold data, or just perform work
      •  Search results are shuffled to the workers
      •  Configured with the /stream handler
  26. Streaming API
      •  Java Programming API for the parallel computing framework
      •  Real-time Map/Reduce and Parallel Relational Algebra
      •  Abstracts search results as Streams of tuples (TupleStream)
      •  Streams are transformed in parallel by pluggable Decorator streams.
      •  Parallel transformations include: group by, roll up, union, intersect, complement and join
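The decorator idea can be shown with a small Python stand-in: a source stream yields tuples, and a decorator wraps it and transforms what passes through. The real Streaming API is Java, so the classes below are hypothetical and only mirror the open/read/close shape; the `EOF` marker tuple is an assumption of this sketch. The decorator shown performs a distinct operation over a stream that is already sorted on the chosen fields.

```python
class ListStream:
    """Source stream that returns tuples (dicts) from an in-memory list."""

    def __init__(self, tuples):
        self._tuples = list(tuples)
        self._pos = 0

    def open(self):
        self._pos = 0

    def read(self):
        if self._pos >= len(self._tuples):
            return {"EOF": True}  # marker tuple signalling end of stream
        t = self._tuples[self._pos]
        self._pos += 1
        return t

    def close(self):
        pass


class UniqueStream:
    """Decorator that drops consecutive tuples with equal `over` values.

    The wrapped stream must already be sorted on the `over` fields.
    """

    def __init__(self, inner, over):
        self._inner = inner
        self._over = over
        self._last = None

    def open(self):
        self._inner.open()
        self._last = None

    def read(self):
        while True:
            t = self._inner.read()
            if t.get("EOF"):
                return t
            key = tuple(t[f] for f in self._over)
            if key != self._last:
                self._last = key
                return t

    def close(self):
        self._inner.close()


def drain(stream):
    """Open a stream, read until the EOF marker, and return the tuples."""
    stream.open()
    try:
        out = []
        while True:
            t = stream.read()
            if t.get("EOF"):
                return out
            out.append(t)
    finally:
        stream.close()


if __name__ == "__main__":
    rows = [{"a": 1, "b": "x"}, {"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
    print(drain(UniqueStream(ListStream(rows), over=["a", "b"])))
```

Since a decorator exposes the same three methods as the stream it wraps, decorators can be nested to build larger pipelines.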
  27. Streaming Expressions
      •  Contributed by Dennis Gove (Bloomberg)
      •  String Query Language and Serialization format for the Streaming API
      •  Streaming Expressions compile to TupleStreams
      •  TupleStreams serialize to Streaming Expressions
  28. Parallel SQL
      •  Compiles SQL to a TupleStream
      •  The TupleStream is serialized to a Streaming Expression and sent to worker nodes.
      •  Worker nodes translate the Streaming Expression back into TupleStream
      •  Worker nodes open() and read() the TupleStream in parallel. Tuples are returned from each worker
  29. From SQL to Streaming Expression
      SQL:
        select str_s, count(*), sum(field_i), min(field_i), max(field_i), avg(field_i)
        from collection1 where text='XXXX' group by str_s
      Streaming Expression:
        rollup(
          search(collection1,
                 q="(text:XXXX)",
                 qt="/export",
                 fl="str_s, field_i",
                 partitionKeys=str_s,
                 sort="str_s asc",
                 zkHost="127.0.0.1:64149/solr"),
          over=str_s,
          count(*), sum(field_i), min(field_i), max(field_i), avg(field_i))
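To make the rollup expression above concrete, here is a minimal Python sketch of a rollup over a stream that is already sorted on the grouping field, which is why the search sorts by `str_s asc`. It only illustrates the technique and is not Solr's implementation; the function name and the output field names are assumptions.

```python
from itertools import groupby


def rollup(sorted_tuples, over, field):
    """Yield one aggregate tuple per run of equal `over` values.

    The input must be sorted on `over`, so only running totals for the
    current group are held in memory.
    """
    for key, group in groupby(sorted_tuples, key=lambda t: t[over]):
        count, total, lo, hi = 0, 0, None, None
        for t in group:
            v = t[field]
            count += 1
            total += v
            lo = v if lo is None else min(lo, v)
            hi = v if hi is None else max(hi, v)
        yield {
            over: key,
            "count(*)": count,
            f"sum({field})": total,
            f"min({field})": lo,
            f"max({field})": hi,
            f"avg({field})": total / count,
        }


if __name__ == "__main__":
    rows = [
        {"str_s": "a", "field_i": 1},
        {"str_s": "a", "field_i": 3},
        {"str_s": "b", "field_i": 10},
    ]
    for out in rollup(rows, over="str_s", field="field_i"):
        print(out)
```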
  30. Parallel SQL Shuffle (5 workers, 5 shards, aggregationMode=map_reduce)
      [Diagram: a client, the /SQL handler, five workers (Worker 1 to Worker 5), and five shards (Shard 1 to Shard 5) with three replicas each]
  31. Jira Tickets
      •  SOLR-7560: Parallel SQL Support
      •  SOLR-7377: Solr Streaming Expressions
      •  SOLR-7082: Streaming Aggregation for SolrCloud
      •  SOLR-7441: Improve overall robustness of the Streaming stack: Streaming API, Streaming Expressions, Parallel SQL
  32. Getting Involved
      •  SQL is in Trunk
      •  Releasing with Solr 6
      •  Streaming API and Streaming Expressions are located in the Solrj libraries (solrj.io)
      •  Patches welcome
      •  Testers and feedback needed
  33. Questions
      Thanks!
