6. www.stratebi.com
• OLAP (On-Line Analytical Processing)
• Analytical systems that enable interactive queries.
• Requires very low query latency: milliseconds to seconds.
• Usually supports SQL and, sometimes, the MDX query language.
• Enables aggregation and filtering of KPI data across hierarchical multidimensional
structures (OLAP cubes).
• Used as a data source for different goals:
• Detailed data analysis (OLAP views).
• Dashboarding.
• Reporting.
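As a minimal sketch of the aggregation and filtering described above, the following Python snippet rolls a KPI up a time hierarchy (month to year) and slices it by one dimension member. The fact rows, the `aggregate` helper and all values are hypothetical and only illustrate the idea; a real OLAP engine would do this over cubes, not Python lists.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, month, country, sales). All values are made up.
FACTS = [
    (2017, 1, "ES", 100.0),
    (2017, 2, "ES", 150.0),
    (2017, 1, "FR", 80.0),
    (2018, 1, "ES", 120.0),
    (2018, 3, "FR", 60.0),
]


def aggregate(rows, key, where=lambda row: True):
    """Sum the sales KPI grouped by key(row), keeping only rows that pass `where`."""
    totals = defaultdict(float)
    for row in rows:
        if where(row):
            totals[key(row)] += row[3]
    return dict(totals)


# Month level of the time hierarchy (year > month).
by_month = aggregate(FACTS, key=lambda r: (r[0], r[1]))
# Roll-up: move one level up the hierarchy, from month to year.
by_year = aggregate(FACTS, key=lambda r: r[0])
# Slice: fix one dimension member (country = "ES") and aggregate by year.
es_by_year = aggregate(FACTS, key=lambda r: r[0], where=lambda r: r[2] == "ES")
```

Drill-down is the reverse of the roll-up shown here: going from `by_year` back to `by_month`.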
7. www.stratebi.com
• Big Data OLAP
• Big Data: Volume, Variety and Velocity.
• OLAP applications over Big Data sets.
• Main challenges:
• Very low query latency over fact and dimension tables of billions to trillions of rows.
• Support for ANSI SQL and BI Tools integration.
• Real-time data ingestion and processing.
9. www.stratebi.com
• Why Apache Kylin?
• Sub-second queries over fact tables with more than 12 billion rows.
• Best query latency results (in our deployments and benchmarks)
• ANSI SQL and BI Tools integration.
• Integration with Pentaho possible through JDBC, Mondrian and PME
• Also Superset, Tableau, Power BI, Zeppelin, Microstrategy…
• Full support for star and snowflake schemas
• Not all tools support them (e.g. Druid does not)
• Near real-time data ingestion (Kafka) and processing.
• It is an Apache open-source project.
• Currently in version 2.5
10. www.stratebi.com
• Apache Kylin Architecture
• M-OLAP approach:
• Data pre-aggregation.
• Enables only analytical
queries.
• Hadoop based tool
• Full scalability
• Hadoop nodes
• HBase and Kylin on separate
clusters (if needed)
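The M-OLAP idea above (pre-aggregate first, then answer only analytical queries from the aggregates) can be sketched as follows. This is a conceptual illustration in Python, not Kylin's actual build process: the dimension names, the fact rows and the `build_cube` and `query` helpers are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical dimensions and fact rows, for illustration only.
DIMENSIONS = ("year", "country", "product")
FACTS = [
    {"year": 2017, "country": "ES", "product": "A", "sales": 10},
    {"year": 2017, "country": "ES", "product": "B", "sales": 5},
    {"year": 2017, "country": "FR", "product": "A", "sales": 7},
    {"year": 2018, "country": "ES", "product": "A", "sales": 3},
]


def build_cube(facts, dimensions, measure):
    """Precompute SUM(measure) for every subset of dimensions (one cuboid each)."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cuboid = defaultdict(int)
            for row in facts:
                cuboid[tuple(row[d] for d in dims)] += row[measure]
            cube[dims] = dict(cuboid)
    return cube


def query(cube, group_by):
    """Answer a GROUP BY from the matching cuboid, without scanning the facts."""
    dims = tuple(d for d in DIMENSIONS if d in group_by)
    return cube[dims]


CUBE = build_cube(FACTS, DIMENSIONS, "sales")
sales_by_country = query(CUBE, {"country"})
```

With 3 dimensions the sketch stores 2^3 = 8 cuboids. The trade-off is the one the slide implies: query time becomes a lookup, at the cost of build time and storage, and queries that need row-level detail cannot be served from the aggregates.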
13. www.stratebi.com
• Why Apache Kylin and Pentaho BA Server?
• It is becoming more and more necessary to provide dashboarding, reporting and
OLAP viewing over Big Data scenarios.
• Using our STTools Pentaho plugins: STPivot, STReport, STDashboard,…
• Also Pentaho Reporting, Community Dashboard Editor, Saiku (plugin),…
• Both Kylin and Pentaho are leading BI & Big Data open-source tools.
• Pentaho enables integration with the best-known Big Data tools: Hive, Impala, Spark SQL,…
• Integration with Pentaho possible through JDBC, Mondrian and PME
• Mondrian 4.X using existing Mondrian 4.4 (Lagunitas)
• Mondrian 3.X, with considerable effort from our team.
• Using Pentaho BA Server 7.1
14. www.stratebi.com
• Identified issues and solutions: Kylin and Mondrian 3.X (3.14)
• Issue 1: Kylin needs ANSI-92 inner joins but Mondrian 3.X generated old-style joins.
• Solution: We defined a Mondrian dialect and applied a patch that implements the
allowsJoinOn() method.
• Issue 2: Mondrian native cross join and nonempty properties caused invalid SQL
code for Kylin.
• Solution: We disabled these properties for Kylin dialect.
• Issue 3: Kylin needs the fact table to be the first table in the SQL FROM clause.
• Workaround: We modified the Mondrian code to identify fact tables by a name prefix (F or
FT) and thus place them first in the FROM clause.
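To make Issues 1 and 3 concrete, here is a small Python sketch of the two join styles and of the fact-table-first ordering. It is not Mondrian code (Mondrian is written in Java); the function names, table names and columns are hypothetical and only show the shape of the generated SQL.

```python
def old_style_sql(fact, dims, measure):
    """Implicit joins: all tables listed in FROM, join conditions in WHERE."""
    tables = ", ".join([fact] + [dim for dim, _ in dims])
    conditions = " AND ".join(f"{fact}.{key} = {dim}.{key}" for dim, key in dims)
    return f"SELECT SUM({fact}.{measure}) FROM {tables} WHERE {conditions}"


def ansi92_sql(fact, dims, measure):
    """ANSI-92 joins: fact table first, then one INNER JOIN ... ON per dimension."""
    joins = " ".join(
        f"INNER JOIN {dim} ON {fact}.{key} = {dim}.{key}" for dim, key in dims
    )
    return f"SELECT SUM({fact}.{measure}) FROM {fact} {joins}"


def fact_first(tables, prefixes=("FT", "F")):
    """Stable sort that moves tables whose name starts with a fact prefix to the front."""
    return sorted(tables, key=lambda name: not name.upper().startswith(prefixes))


# Hypothetical star schema: one fact table and one dimension joined on DATE_ID.
legacy = old_style_sql("F_SALES", [("D_DATE", "DATE_ID")], "AMOUNT")
ansi = ansi92_sql("F_SALES", [("D_DATE", "DATE_ID")], "AMOUNT")
ordered = fact_first(["D_DATE", "F_SALES", "D_CUSTOMER"])
```

A prefix convention like this only works if no dimension table name starts with the same prefix, which is presumably why the slide calls it a workaround rather than a solution.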
15. www.stratebi.com
• Identified issues and solutions: Kylin and Mondrian 3.X
• Some useful references we used:
• How to implement Kylin dialect for Mondrian
• https://web.archive.org/web/20171010103502/http://dekarlab.de/wp/?p=443
• Pentaho JIRA - MONDRIAN-955
• Mondrian should support the Dialect.allowsJoinOn() option
• Patch
• Pentaho JIRA - MONDRIAN-2364
• Add dialect for Apache Kylin
16. www.stratebi.com
• Identified issues and solutions: Kylin and Pentaho Metadata Editor
• Issue 1: There is no dialect for Kylin in PME.
• Solution: Definition of the Kylin dialect using the Hive 2 SQL dialect.
• Works perfectly without changing anything.
• JDBC connections between Pentaho BA Server and Kylin:
• Initially we used the generic connection through a JDBC driver.
• To simplify the connection, we defined the connection interface for Kylin in Pentaho
BA Server.
• We used Pentaho BA Server 7.1; as of Pentaho 8.1, a built-in connection to Kylin has still
not been included.
17. www.stratebi.com
• Enabling security at schemas, concepts and data levels:
• Mondrian 3.X
• We could not use views to filter data (Kylin approach limitation)
• Solution: We have used Mondrian Dynamic Schema Processor
• We extended the typical Mondrian DSP class using a variable that replaces a piece of
XML from the schema.
• Pentaho Metadata Editor
• PME requires the roles and users tables to be created in the same data source, but Kylin
does not allow it (Kylin approach limitation).
• Solution: We have created JDBCSecuritySqlGenerator
• An extension of the existing PME security class.
• The security is defined in a file we called securitySQLGenerator-properties.xml.
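The Dynamic Schema Processor approach for Mondrian 3.X amounts to rewriting the schema XML per user before Mondrian loads it. The Python sketch below shows only that substitution idea; the real DSP is a Java class, and the placeholder name `${SECURITY_FILTER}`, the schema fragment, the users and the filter conditions are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical schema template with a placeholder variable inside the fact table.
SCHEMA_TEMPLATE = """<Schema name="Sales">
  <Cube name="Sales">
    <Table name="F_SALES">
      ${SECURITY_FILTER}
    </Table>
  </Cube>
</Schema>"""

# Hypothetical per-user row filters; None means the user sees all rows.
USER_FILTERS = {"ana": "F_SALES.COUNTRY = 'ES'", "admin": None}


def process_schema(template, user):
    """Replace the placeholder with an XML fragment that filters rows for `user`."""
    condition = USER_FILTERS[user]  # unknown users raise KeyError instead of seeing everything
    fragment = ""
    if condition is not None:
        fragment = f'<SQL dialect="generic">{escape(condition)}</SQL>'
    return template.replace("${SECURITY_FILTER}", fragment)


# Parsing the result checks that the substituted schema is still well-formed XML.
ana_schema = ET.fromstring(process_schema(SCHEMA_TEMPLATE, "ana"))
admin_schema = ET.fromstring(process_schema(SCHEMA_TEMPLATE, "admin"))
```

Filtering in the schema rather than in database views is what makes this usable with Kylin, where views over the cube are not available for this purpose.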
18. www.stratebi.com
• What have we obtained?
• Dashboarding, reporting and OLAP viewing using our Pentaho STTools over cubes
with more than a billion rows (1.000.000.000)
• Enabling sub-second Roll-up, Drill-down, Slice and Dice and Pivot OLAP operations.
• We have carried out the first deployment of Kylin for a Spain-based company.
• Try our demo with Kylin, Pentaho and STPivot viewer (Marketplace available)
• http://bigdata.stratebi.com/kylin-olap/index.htm
20. www.stratebi.com
• Kylin applied to digital marketing scenario
• Initial Scenario
• OLAP system for data analysis using an in-house reporting tool.
• Based on MySQL (80% queries) + Redshift (20% queries)
• Several million rows per hour in some fact tables
• Goals
• Reduce query latency (some queries take >20s to run)
• Reduce ETL processing time: "Data freshness".
• Implementation of Open-Source BI tools (STTools)
• Self-service OLAP, reporting and dashboarding
22. www.stratebi.com
• Kylin applied to digital marketing scenario
• Goals achieved
• Reduced query latency: User queries were compared for the company's three most
important reports.
• Kylin queries run 4 times faster than in Redshift.
• Most Kylin queries have response times below 1 second.
• Some very complex queries that take about 30 seconds in Redshift run in
just over 400 milliseconds using Kylin
• Full integration with open source BI tools (STTools)
• STPivot, STReport, STDashboard
• Security implemented at schema and data levels (Mondrian and PME).
25. www.stratebi.com
• Why is Vertica an alternative to Kylin for Big Data OLAP?
• Sub-second queries over fact tables with billions of rows.
• In our implementations and benchmark it achieves very good query latency results.
• But it is not as fast as Kylin for extremely large fact tables.
• ANSI SQL and BI Tools integration.
• Integration with Pentaho possible through JDBC, Mondrian and PME
• Also Superset, Tableau, Power BI, Zeppelin, Microstrategy…
• Full support for star and snowflake schemas
• Near real-time data ingestion and processing.
• Micro Focus Vertica is not an open-source project.
• But there is a free community version, sufficient for many typical Big Data scenarios.
26. www.stratebi.com
• Vertica Architecture
• Distributed processing in cluster mode.
• But it does not need a Hadoop cluster to work.
• Although it does support integration with Hadoop (e.g. Spark or Hive)
• Columnar and distributed storage
• Hybrid OLAP (tables, projections,
flattened tables…)
27. www.stratebi.com
• Integration with Pentaho and STTools
• Seamless integration with Pentaho PDI for data warehouse loading
• Including bulk load steps
• We have also integrated Vertica with Pentaho BA Server in several successful use
cases
• Take care when defining the Mondrian OLAP schema to achieve good performance.
• In PME we have faced issues similar to those with Kylin (use of the PostgreSQL dialect)
• Retail Sector use case
• More than 3,000 points of sale = high concurrency
• Data volume determined by sales-line-level detail
• Need for highly customized graphics (we have implemented a lot of CDE dashboards)
29. www.stratebi.com
• Why a Big Data OLAP Benchmark?
• To test the performance of the two most powerful Big Data OLAP tools
• Kylin vs Vertica
• Compare their performance against OLAP implementations in traditional databases
• PostgreSQL: Open source relational database that performs well for OLAP
systems.
30. www.stratebi.com
• Benchmark implementation
• We have used the SSB benchmark
• A star schema version of the well-known TPC-H benchmark
(industry standard)
• Kyligence team has an implementation of the SSB
benchmark for Apache Kylin.
• Including schemas and data generator.
• We have adapted it for use with Vertica and
PostgreSQL.
• It provides a set of 13 analytical queries
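For readers unfamiliar with SSB, the sketch below runs a query modeled on the first SSB query flight against a tiny in-memory SQLite star schema. It is only an illustration of the query shape: the rows are made up, the date dimension is named `dates` here to avoid clashing with the SQL keyword, and the exact query text used in the Kylin, Vertica and PostgreSQL adaptations may differ.

```python
import sqlite3

# Tiny star schema: one fact table (lineorder) and one dimension (dates).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dates (d_datekey INTEGER PRIMARY KEY, d_year INTEGER);
CREATE TABLE lineorder (
    lo_orderdate INTEGER,
    lo_quantity INTEGER,
    lo_discount INTEGER,
    lo_extendedprice INTEGER
);
INSERT INTO dates VALUES (19930101, 1993), (19940101, 1994);
INSERT INTO lineorder VALUES
    (19930101, 10, 2, 1000),
    (19930101, 30, 2, 5000),
    (19930101, 5, 5, 2000),
    (19940101, 10, 2, 4000);
""")

# Fact table first and an explicit INNER JOIN, the form Kylin expects.
Q1_STYLE = """
SELECT SUM(lo_extendedprice * lo_discount) AS revenue
FROM lineorder
INNER JOIN dates ON lo_orderdate = d_datekey
WHERE d_year = 1993
  AND lo_discount BETWEEN 1 AND 3
  AND lo_quantity < 25
"""

# Only the first fact row passes all three filters: 1000 * 2 = 2000.
revenue = conn.execute(Q1_STYLE).fetchone()[0]
```

The other SSB queries follow the same pattern with more dimensions joined and GROUP BY clauses, which is what makes the benchmark a good fit for OLAP engines.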
31. www.stratebi.com
• Test performed
• Number of rows of facts and dimensions tables for each test performed.

| Test | LINEORDER (fact, KPI) | CUSTOMER (dimension) | PART (dimension) | SUPPLIER (dimension) | DATE (dimension) |
|---|---|---|---|---|---|
| 100M | 100.000.000 | 40.000 | 32.000 | 20.000 | 2.556 |
| 500M | 500.000.000 | 200.000 | 48.000 | 100.000 | 2.556 |
| 1.000M | 1.000.000.000 | 400.000 | 56.000 | 200.000 | 2.556 |

• Hardware used

| Tool | Distributed processing | Kind of hardware | No. of hosts | Processor | Cores | RAM |
|---|---|---|---|---|---|---|
| Kylin 2.4 | Yes | Dedicated cloud | 3 | Intel(R) Atom(TM) CPU C2750 @ 2.40GHz | 8 | 32 GB |
| Vertica 9.1 | Yes | Dedicated cloud | 3 | Intel(R) Atom(TM) CPU C2750 @ 2.40GHz | 8 | 32 GB |
| PostgreSQL 9.6 | No | Dedicated cloud | 1 | Intel(R) Atom(TM) CPU C2750 @ 2.40GHz | 8 | 32 GB |
34. www.stratebi.com
• Benchmark Results
• Kylin and Vertica are both suitable for Big Data OLAP applications.
• Apache Kylin has the best query performance.
• But high hardware, software (Hadoop) and know-how requirements.
• 100% open source version without limitations.
• Vertica is the alternative to Kylin for less extreme Big Data scenarios.
• Lower hardware, software and know-how requirements.
• Free community version with some limitations.
• PostgreSQL is not suitable for Big Data OLAP.
36. www.stratebi.com
• Pentaho also integrates with many other Big Data tools
• Lince Big Data Stack
• Our selection of Big Data tools based on experience and tests.
• All of them integrate with Pentaho open source tools.
• Lince BI tools (formerly STTools) are used to analyze the data from Big Data repositories.
• STPivot: OLAP Viewer.
• STReport: Ad-Hoc Reporting.
• STDashboard: Fast Dashboards.
• STCard: Balanced Scorecards.
40. www.stratebi.com
• Pentaho BA Server enables Big Data OLAP in combination with Kylin or Vertica.
• Easy to integrate through JDBC connector with SQL based plugins (CDE dashboards)
• We have worked hard to integrate these tools with Mondrian 3.X and PME 7.1.
• Best performance results with the integration between Pentaho, Kylin and STTools
• Sub-second Roll-up, Drill-down, Slice and Dice and Pivot OLAP operations.
• The performance we have observed with STTools is very good, but we still need to extend our
benchmark to measure it (Kylin with Mondrian or PME)
• Pentaho tools are useful for Big Data ETL and analysis
• However, our experience tells us that many of the Pentaho Big Data connectors and features
are very hard to configure.
• We propose including Kylin and Vertica dialects (Mondrian and PME) in future Pentaho
versions.