4. • Dashboard providing a common view across sales transactions
• Multiple roles
• Top management
• Brand managers
• Channel managers
• Need to organize data in multiple ways
• Establish dynamic hierarchies based on multiple attributes
DYNAMIC VIEW ACROSS SALES
6. • 3 Years historical data
• 7.2 billion transactions representing 4.5 TB
• Wide group of users spread across the organization
• Intuitive User Interface with a great User Experience
• Detailed visualization
• Row level security
• Maximum dashboard load time of 5 seconds
CHALLENGES
7. THE SOLUTION
HDFS Hive Impala
Pentaho Data Integration (PDI)
PDI
HBase
Web
Application
Hadoop
12. • Impala on Cloudera Hadoop can be used as an interactive database
• Hadoop's distributed nature allows implementing use cases that
wouldn't be viable on other technologies
• We went from 7 days of data to 3 years
• Pentaho Data Integration implements and orchestrates the whole
ETL process, making it much easier
• From traditional data sources to summarized data on Hadoop
KEY TAKEAWAYS
14. • The data lake's goal is to make data available in a centralized location
• Requires dealing with
• Wide set of sources
• Disparate technologies
• In this case it is a repetitive batch loading process
DATA INGESTION
17. • Pentaho Data Integration's flexibility is a great match for Hadoop's
semi-structured nature
• Cloudera Hadoop can be easily used to store data and make it
immediately available through a SQL interface
• Patterns and well defined workflows are essential to data
governance
KEY TAKEAWAYS
19. • Government agencies have long collected data, but that doesn't
mean citizens can easily make sense of it
• Challenge
• Create an intuitive UI to represent more than 100 KPIs across 308
municipalities
• Become a standard in terms of transparency
GOVERNMENT CHALLENGE
24. • Pentaho Business Analytics is a comprehensive suite
• Pentaho Server components are very flexible and extensible,
allowing custom UIs to be created, such as:
• Analytics portals
• Embedded views in existing products
KEY TAKEAWAYS
The goal is:
- to share examples of what we have been doing
- to inspire you to use these technologies
Structure was static
Last 7 days to last 3 years
Multiple levels that define the drill-down path
Multiple elements, where each has a criterion establishing the rows to aggregate
Sales are summed on each element
Multiple hierarchies like this can be created and processed overnight
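The overnight hierarchy build described above can be sketched in a few lines. This is a minimal illustration, not the actual implementation: the field names, the `hierarchy` structure, and the `aggregate` helper are all hypothetical, chosen only to show "each element has a criterion; sales matching it are summed".

```python
# Toy sales transactions (illustrative fields).
sales = [
    {"brand": "A", "channel": "retail", "amount": 100.0},
    {"brand": "A", "channel": "online", "amount": 50.0},
    {"brand": "B", "channel": "retail", "amount": 75.0},
]

# A dynamic hierarchy: each element carries a criterion that
# establishes which rows it aggregates.
hierarchy = [
    {"element": "Brand A", "criteria": lambda r: r["brand"] == "A"},
    {"element": "Retail",  "criteria": lambda r: r["channel"] == "retail"},
]

def aggregate(sales, hierarchy):
    """Sum the sales amount for every element of a dynamic hierarchy."""
    return {
        e["element"]: sum(r["amount"] for r in sales if e["criteria"](r))
        for e in hierarchy
    }

totals = aggregate(sales, hierarchy)
print(totals)  # {'Brand A': 150.0, 'Retail': 175.0}
```

Because elements are just criteria over attributes, several such hierarchies can be defined and recomputed in the same batch run.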
3 main components: PDI + Hadoop + Web App
Sqoop from Oracle
Process on Hive (formulas, pre-aggregation) using configuration stored in HBase
Impala stores end result
Data is summarized as much as possible, so each chart can be rendered from only a couple of rows, which are filtered based on security criteria
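The row-level-security note above boils down to filtering the pre-aggregated rows by the requesting user's entitlements before a chart is drawn. A minimal sketch, with hypothetical row fields and a hypothetical `rows_for_user` helper:

```python
# Pre-aggregated summary rows as Impala might return them
# (fields are illustrative, not the real schema).
summary_rows = [
    {"region": "north", "metric": "sales", "value": 1200},
    {"region": "south", "metric": "sales", "value": 900},
]

def rows_for_user(rows, allowed_regions):
    # Row-level security: a user only ever sees rows whose
    # security attribute matches their entitlements.
    return [r for r in rows if r["region"] in allowed_regions]

visible = rows_for_user(summary_rows, {"north"})
print(visible)  # only the 'north' row
```

Since the heavy aggregation happens upstream, this per-request filter touches only a handful of rows, which is what keeps the dashboard within its load-time budget.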
Zoom, Pan and Play
Single workflow/pattern across all data sources
Promote reusability -> the opposite of typical ETL
Create a metadata repository
Describes sources, destinations and the simple processes required to ingest the data; can be populated with automatic profiling
Implement the ingestion process with PDI
Flexible tool with metadata injection capabilities
Open standards allow creating transformations on the fly
Use Hadoop as the data repository
File system based and thus very flexible
Additional layers can be placed on top to access data in multiple ways
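The metadata-driven ingestion pattern above can be sketched as a loop over repository entries. In PDI the entries would be injected into a template transformation via metadata injection; here a hypothetical `plan_ingestion` helper just emits one step per entry, and all names (`oracle.sales`, the `/datalake/...` paths) are made up for illustration:

```python
# Hypothetical metadata repository: one record per source,
# describing where it comes from, where it lands, and how.
metadata = [
    {"source": "oracle.sales", "destination": "/datalake/raw/sales",
     "process": "full_load"},
    {"source": "oracle.customers", "destination": "/datalake/raw/customers",
     "process": "incremental"},
]

def plan_ingestion(metadata):
    # A single workflow/pattern applied to every source: the
    # repository entry, not hand-written ETL, drives each step.
    return [f"{m['process']}: {m['source']} -> {m['destination']}"
            for m in metadata]

steps = plan_ingestion(metadata)
for step in steps:
    print(step)
```

Adding a new source then means adding a metadata record, not writing a new transformation, which is what makes the single pattern reusable across all sources.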
The UI should be easily understood by anyone, and be "cosy" and attractive