This document discusses an open source, metadata-driven data warehousing project at a Dutch water board. It describes the basic architecture, which uses open source tools such as Pentaho and Quipu to build a source data vault and a business data vault and to generate data marts. It also covers lessons learned: it is possible to build an EDW quickly with open source software, but some challenges around automation and performance remain. The goal is to deliver added business value in a cost-effective way.
Data vault seminar May 5-6 Dommel - The factory and the workshop
1. The factory and the workshop Open source metadata driven data warehousing Johannes van den Bosch @johannesvdb
2. Agenda Background Organization Project Basic architecture The factory …and the workshop Topics from yesterday Real time Business key integration Staging out Hierarchies and supertypes Self service BI and writeback Lessons learnt
3. Waterschap De Dommel Dutch water board in the south of the Netherlands Managing water quality and quantity for 900,000 citizens and 150,000 hectares 375 employees of which 2 BI Projects managed on time, money and goals Demand for integrated management information
4. Project Current BI architecture reached limits New greenfield architecture Open source advocate Passionate believer in cost effective solutions for government It’s our money! Convinced management No software cost Internal hours (2x 0.3 FTE) 1 year
5. Open source software ETL Pentaho Data Integration (Kettle) Data warehouse management Quipu Documentation MediaWiki Modeling Power*Architect
6. EDW architecture Reporting Analysis Dashboards Data mart 1 Data mart n Business Data Vault Source Data Vault 1 Source Data Vault 2 Source Data Vault n Supply driven Demand driven Generated and automated Staging 1 Staging 2 Staging n Source 1 Source 2 Source n
20. PoC Decided to try and build the bDV and Data Marts 100% virtual bDV = views on top of sDV Data marts = views on top of bDV Conclusion: it is possible
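The layering from the PoC slide can be sketched as follows. This is a minimal illustration, not the project's actual code: it uses SQLite through Python's standard library only so that it runs anywhere, and all table, column and view names (`employee_h`, `employee_h_s`, `bdv_employee`, `dim_employee`) and the sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Source data vault (physical): one hub and one satellite. Names are hypothetical.
CREATE TABLE employee_h (employee_hk INTEGER PRIMARY KEY, employee_bk TEXT, load_dts TEXT);
CREATE TABLE employee_h_s (employee_hk INTEGER, load_dts TEXT, name TEXT, department TEXT);

INSERT INTO employee_h VALUES (1, 'E001', '2014-01-01'), (2, 'E002', '2014-01-01');
INSERT INTO employee_h_s VALUES
  (1, '2014-01-01', 'Anna', 'Finance'),
  (1, '2014-03-01', 'Anna', 'Water Quality'),
  (2, '2014-01-01', 'Bram', 'IT');

-- Business data vault (virtual): a view on the sDV that keeps the
-- most recent satellite row for each hub key.
CREATE VIEW bdv_employee AS
SELECT h.employee_hk, h.employee_bk, s.name, s.department
FROM employee_h h
JOIN employee_h_s s ON s.employee_hk = h.employee_hk
WHERE s.load_dts = (SELECT MAX(load_dts)
                    FROM employee_h_s
                    WHERE employee_hk = h.employee_hk);

-- Data mart (virtual): a dimension view on top of the bDV view.
CREATE VIEW dim_employee AS
SELECT employee_bk AS employee_id, name, department
FROM bdv_employee;
""")

rows = conn.execute(
    "SELECT employee_id, name, department FROM dim_employee ORDER BY employee_id"
).fetchall()
print(rows)
```

Nothing above the sDV tables is stored: querying `dim_employee` resolves through `bdv_employee` down to the hub and satellite at read time.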
24. Integration: hubs Source Source data vault Business data vault person employee_h employee_h_s person_h person_h_s System x Users Users_h Users_h_s System y
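One possible reading of the hub integration diagram is that `person_h` (System x) and `Users_h` (System y) live in the source data vault, and `employee_h` is the integrated hub in the business data vault. The sketch below follows that reading, which is an assumption. It further assumes that both systems carry the same business key once it is trimmed and upper-cased; the column names and sample keys are hypothetical, and SQLite is used only to keep the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Source data vault hubs, one per source system (columns are hypothetical).
CREATE TABLE person_h (person_bk TEXT, record_source TEXT);
CREATE TABLE users_h (users_bk TEXT, record_source TEXT);

INSERT INTO person_h VALUES ('E001', 'system_x'), ('E002', 'system_x');
INSERT INTO users_h VALUES ('e002 ', 'system_y'), ('E003', 'system_y');

-- Business data vault hub as a view: normalise the business keys and
-- let UNION remove keys that occur in both systems.
CREATE VIEW employee_h AS
SELECT UPPER(TRIM(person_bk)) AS employee_bk FROM person_h
UNION
SELECT UPPER(TRIM(users_bk)) FROM users_h;
""")

integrated_keys = [
    row[0]
    for row in conn.execute("SELECT employee_bk FROM employee_h ORDER BY employee_bk")
]
print(integrated_keys)
```

In practice the key mapping between systems is rarely this clean; when the keys do not line up after simple normalisation, an explicit mapping table would take the place of the `UPPER(TRIM(...))` expressions.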
30. bDV design decisions: partial bDV Source data vault Business data vault T H I H T H T
31. Full bDV vs partial bDV Full Lots of elements to define Easy data marts Partial Less work More T between data vault and data marts Multiple versions of the truth
32. Virtual vs Physical Virtual (views) No physical maintenance Easy to adapt Performance limitations Platform defines performance Lineage (depending on platform) Real time Auditability? Physical Scalability / performance Manual tweaking (indexes, etc.) Surrogate keys easy More intuitive to develop (ETL instead of SQL) More complex transformations (i.e. aggregations)
33. Self service BI and write back Palo for Excel Open source MOLAP Every cell points to location in the cube Writeback to cube possible EDW cube Excel
34. Lessons learnt It is possible to quickly build an EDW with open source software Some really cool developments (i.e. data mart generation) Automation only goes so far Some challenges still need to be addressed …it is business intelligence after all. Automate, if it saves you money It can save you time to focus on the important stuff The end product counts: does it deliver added value? What’s the best EDW architecture? It depends!™
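The trade-off on this slide can be made concrete with a small sketch: the same aggregation is exposed once as a view and once as a loaded table with a surrogate key and an index. The table and view names (`sales_l_s`, `v_customer_totals`, `p_customer_totals`) and the data are hypothetical, and SQLite stands in for whatever platform is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_l_s (customer_bk TEXT, amount REAL);
INSERT INTO sales_l_s VALUES ('C1', 10.0), ('C1', 5.0), ('C2', 7.5);

-- Virtual: the aggregation is recomputed on every query.
CREATE VIEW v_customer_totals AS
SELECT customer_bk, SUM(amount) AS total
FROM sales_l_s
GROUP BY customer_bk;

-- Physical: the same result persisted, with a surrogate key and an index.
CREATE TABLE p_customer_totals (
  customer_sk INTEGER PRIMARY KEY AUTOINCREMENT,
  customer_bk TEXT,
  total REAL
);
INSERT INTO p_customer_totals (customer_bk, total)
SELECT customer_bk, total FROM v_customer_totals ORDER BY customer_bk;
CREATE INDEX ix_p_customer_totals_bk ON p_customer_totals (customer_bk);
""")

query_virtual = "SELECT customer_bk, total FROM v_customer_totals ORDER BY customer_bk"
query_physical = "SELECT customer_bk, total FROM p_customer_totals ORDER BY customer_bk"

virtual_before = conn.execute(query_virtual).fetchall()
physical_before = conn.execute(query_physical).fetchall()

# A new source row arrives after the physical table was loaded.
conn.execute("INSERT INTO sales_l_s VALUES ('C2', 2.5)")

virtual_after = conn.execute(query_virtual).fetchall()    # reflects the new row
physical_after = conn.execute(query_physical).fetchall()  # stale until reloaded
```

The view picks up the new row immediately, which is the "real time" point on the virtual side; the physical table stays as loaded until an ETL run refreshes it, but it can be indexed and keyed independently of the source.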