2. This group extends the TDWI community
online and is designed to foster peer
networking and discussion of key issues
relevant to business intelligence and data
warehousing managers.
TDWI (The Data Warehousing
Institute™) provides education, training,
certification, news, and research for
executives and information technology
(IT) professionals worldwide. Founded in
1995, TDWI is the premier educational
institute for business intelligence and
data warehousing. Our Web site is
www.tdwi.org.
Why this topic?
There’s a lot of confusion and misconception about the
meaning of Agile, especially as it applies to BI
Many in corporate IT still believe that Agile cannot easily be
applied to BI
Posts on this topic in the TDWI forum in LinkedIn would benefit
from being organized and summarized
3. What we’ll cover
Misconceptions about Agile BI
Core techniques of Agile BI
Review of ETL tool landscape and benefits
Decision factors for choosing the ETL environment
Mitigating aspects of ETL tools that make Agile harder
How to implement an Agile BI development environment
Due to the prevailing confusion
and misconceptions, it’s easier to
start with what Agile BI is not
4. Misconceptions about Agile in the BI community
There’s a common misconception that Agile BI applies to practically any
methodology or tool that helps develop BI projects faster or in a more flexible way.
Some examples of misconceptions:
Agile is primarily adding iterations to
typical projects
Agile implies starting to code without
planning or design
Agile involves particular data models,
such as Data Vault
Agile involves rapid prototyping
techniques, as can be achieved by
certain metadata driven tools
Agile involves self-serve reporting, such
as Tableau
Agile involves moving ETL from a
separate code base into the reporting
layer, as made possible by in-memory
processing, such as with QlikView
Agile involves building real-time or low-
latency DW, rather than traditional batch
Agile operates in a hosted cloud
environment, especially PaaS (Platform
as a Service)
5. The culprits for the myths and misconceptions
#1 Vendors claim that their products are agile.
#2 The BI community as a whole does not have a long history or
substantial practice with agile development. Therefore they are more likely
to be swayed by vendor pitches.
6. The culprits for the myths and misconceptions (cont.)
Example source of misconceptions: Forrester Research article “Agile Out of the
Box”, 2010

What’s being said:
“...Agile BI methodology differs from [agile software development] in that it
requires new and different technologies and architectures for support.
Metadata-generated BI applications are one such example...”
The article goes on to claim that these particular tools are needed in order to
achieve “development done faster”, “react[ing] more quickly to ...
requirements”, incremental product delivery, “rapid prototypes versus
specifications”, “reacting versus planning”, “personal interactions ... versus
documentation”, etc.

What’s wrong with it:
This list is just buzzwords associated with agile, without substantial
evidence of why other tools are insufficient.
In the software development world, that’s equivalent to saying that new
frameworks, such as Ruby on Rails, are needed for Agile development. (Few
credible publications or developers would make such a claim.)
The implication that other BI tools can’t be used to achieve Agile BI is
simply not true. (Even general-purpose development platforms can be applied
to BI.)
Rapid prototyping is confused with the role of end-to-end working software.
On the contrary, arguments can be made for why the tools identified could be
detrimental to agile teams. (See the TDWI LinkedIn group discussion “The Role
of ETL tools in Agile BI”.)
In reality, team composition, proficiency with existing technologies, and
management’s acceptance of agile have a bigger impact than any specific type
of BI tool.
7. The reality
Yes, many of the items misclassified as necessary for
Agile still help projects ramp up and complete faster.
Yes, many improve the flexibility of dealing with
changes in source data, business logic and reporting.
Yes, many provide additional visibility into complex
logic and functional changes across team members and
stakeholders.
Data Vault model
Rapid prototyping tools
Metadata driven BI tools
Self-serve reporting
In-memory processing
Hosted cloud (PaaS)
environment
But none of them are required
to have successful Agile BI projects
8. So what are the requirements for implementing Agile BI?
Productive Agile BI teams operate almost identically to Agile methodology used for
software development.
...With just the minimal tweaks to accommodate:
1. Integration of available ETL and reporting tools into the development
environment
2. Changes to regression testing due to the fact that databases have state
3. Challenges of managing large data sets in the deployment process
9. Techniques for implementing Agile in BI
Timebox deliverables – of course
Measure completion with working
software! (Prototypes using non-
production tools are OK. But need to get
end-to-end data flow working ASAP.)
Highly efficient, daily team
synchronization in which entire team
participates.
Monitor completion of features (stories),
not time spent. Calculate team velocity to
improve planning.
Hold sprint retrospectives to learn from
mistakes.
Leverage techniques of Agile app dev:
Manage everything in version control,
including data model and test data sets
Assume refactoring of working code can
occur later to improve performance and
maintainability
Use Test Driven Development (TDD), to
ensure understanding of requirements and
reduce rework
Implement Continuous Integration to
automate build, tests, deployment
Measure project success by delivery of
business value, not delivery of predefined
requirements on time and on budget
Accept that it’s OK to fail, but fail early and
adapt. (Non agile projects don’t recognize
failure until time or budget runs out.)
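The TDD point above can be made concrete for a warehouse transformation. A minimal sketch in Python, using the built-in sqlite3 module as a stand-in for a real warehouse; the staging/dimension table names and the cleansing rule are hypothetical:

```python
import sqlite3

# TDD-style check for a BI transformation: load a tiny staging data set,
# run the transformation under test, assert the expected dimension rows.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE stg_customer (id INTEGER, name TEXT)")
db.executemany("INSERT INTO stg_customer VALUES (?, ?)",
               [(1, " Alice "), (2, None)])

# Transformation under test: trim whitespace, default missing names
db.execute("""
    CREATE TABLE dim_customer AS
    SELECT id, COALESCE(TRIM(name), 'UNKNOWN') AS name
    FROM stg_customer
""")

result = dict(db.execute("SELECT id, name FROM dim_customer"))
assert result == {1: "Alice", 2: "UNKNOWN"}
```

Writing the assertion before the transformation exists forces the team to pin down the requirement, and the test becomes a regression guard on every subsequent build.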
10. What’s the reason for low adoption of Agile in BI?
Agile is widely adopted in application development
...but not in BI
Potential reasons might stem from differences between the two worlds:

Development environment
Application development: Custom app development using standard, general-purpose
languages well suited for automation
Business intelligence: Proprietary vendor architectures and DSLs (domain-specific
languages) not well suited for automation

Team skills
Application development: Have the skills to write automation for continuous
integration
Business intelligence: Rely on vendors to provide these features

Costs
Application development: Low up-front investment by leveraging open source
platforms
Business intelligence: High up-front investment in vendor-specific tools: DW
appliance, data modeling, ETL, OLAP, reporting, etc.

Releases
Application development: Software is stateless and therefore easier to test and
deploy with each build
Business intelligence: Databases have state, with each build needing to start
with a certain data set. High data volumes may take hours to load a changed
data model or roll back changes.
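The "databases have state" difference can be mitigated with an automated baseline reset before each build. A minimal sketch in Python with sqlite3 standing in for the warehouse; the schema and baseline rows are hypothetical:

```python
import sqlite3

def reset_to_baseline(db, baseline_rows):
    """Rebuild the schema and reload the known test data set so every
    build starts from the same database state (hypothetical schema)."""
    db.executescript("""
        DROP TABLE IF EXISTS fact_sales;
        CREATE TABLE fact_sales (sale_id INTEGER, amount REAL);
    """)
    db.executemany("INSERT INTO fact_sales VALUES (?, ?)", baseline_rows)
    db.commit()

db = sqlite3.connect(":memory:")
baseline = [(1, 9.99), (2, 20.00)]

# A test run mutates state...
reset_to_baseline(db, baseline)
db.execute("DELETE FROM fact_sales WHERE sale_id = 1")

# ...and the next build simply resets instead of depending on leftovers.
reset_to_baseline(db, baseline)
count = db.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
assert count == 2
```

With large data volumes the same idea applies, but the reset typically restores a snapshot or replays versioned DDL/DML scripts rather than reloading rows one by one.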
12. ETL tools have evolved over the years
In order of evolution:

Custom Code: One-time solutions, built with a focus on short-term delivery and
minimal up-front cost

Frameworks: Reusable code compiled from a few similar projects. Just change
parameters to reuse for specific loading, logging, change data capture,
database connections, etc.

Code Generators: Intuitive development UI enabling developers to manipulate ETL
metadata. From the metadata, generate code in a general-purpose (such as C or
Java) or domain-specific (such as SQL or MDX) language. Types: one-shot
generators (that require switching to a native dev environment) vs. full
development environments with managed version deployments

Engines: Graphical development accomplishing ETL through parameterization and
configuration, rather than code generation. Avoids complexities with code
management and deployment
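The "code generator" stage above boils down to: table-level metadata in, executable SQL out. A minimal sketch of that idea in Python; the metadata shape, table names, and column rules are all hypothetical, not any vendor's format:

```python
# Hypothetical metadata describing one load: target table, source table,
# and target-column -> source-expression mappings.
load_spec = {
    "target": "dim_product",
    "source": "stg_product",
    "columns": {"product_id": "id", "product_name": "UPPER(name)"},
}

def generate_load_sql(spec):
    # Emit a plain INSERT...SELECT so the generated artifact is ordinary
    # SQL text that can be versioned, diffed, and reviewed like any code.
    select_list = ", ".join(
        f"{expr} AS {col}" for col, expr in spec["columns"].items())
    return (f"INSERT INTO {spec['target']} "
            f"SELECT {select_list} FROM {spec['source']}")

sql = generate_load_sql(load_spec)
assert sql == ("INSERT INTO dim_product "
               "SELECT id AS product_id, UPPER(name) AS product_name "
               "FROM stg_product")
```

Because the output is plain text in a general-purpose language (SQL), it sidesteps the version control and portability problems discussed later for tools that keep logic in proprietary repositories.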
13. We can categorize the major ETL players
The vendors
Traditional vendors: Informatica, SSIS, DataStage
Open source: Talend, Pentaho Kettle
Metadata driven, automated discovery, federated integration:
Kalido, BI Ready, Wherescape, Composite Software
The most common alternative
SQL + shell scripts
Native DB load utilities
14. ETL tools have lots of value
Built-in commonly used features for transformation and job control
Without ETL tools, we’re reinventing the wheel on many BI design patterns that
have been implemented countless times throughout history
Abstracts complex logic into graphical components or a domain-specific language
that leverages best practices and is often more maintainable over the potentially
long project life span
Graphical representation of data model, data flow and job flow provide visibility
into business logic, especially useful for less technical team members
Provides a degree of self-documentation without the need to update the graphical
representation of logic separately from source code
Master Data Management (MDM)
Data cleansing
Change Data Capture (CDC)
Data lineage and data dependency
functionality
Processing of SCD (Slowly Changing
Dimensions)
Parallelization of tasks that can be run
concurrently
Advanced merging functionality
15. But many ETL tools are
not well suited to an
Agile BI environment
16. First, these tools may not be ideal for Agile in general...
Some ETL tools are...
Not well suited for code refactoring, branching, and merging, because
the code is not in text files that can be managed with modern version control,
such as Git
Not well suited for use with automation in Continuous Integration,
because they’re often standalone environments with no provisions for
external automation
Not well suited for TDD (Test Driven Development), unless the vendors
explicitly made provisions for unit test automation
Proprietary and have “black box” features that might make testing more
challenging or decrease portability of test cases
Expensive, with high up-front license cost also putting more capital at
risk – unless open source ETL, of course
17. Second, they may negatively impact productivity of Agile teams
ETL tools may...
Require a proprietary, vendor-specific skill set not present in the organization
Cause work priority to be stove-piped and limited to skill set, rather than overall
business value
Prevent the ability to leverage the full dev team, since they fall under a
separate development environment from the rest of apps
Result in a productivity hit, since some professional developers are more
productive writing code in native languages than using GUI tools, even after
training
Not provide compelling enough reasons for developers to learn any one ETL
tool, since the lack of industry standards decreases skill portability
18. Third, there are other challenges and considerations
There are challenges and limitations with ETL tools even outside of Agile
Require allocation of additional resources to manage version upgrades of the
ETL tool, even if the code base hasn’t been changing
When the type of processing needed is outside of core ETL tool features,
complexity can grow quickly
Usefulness of visual representations for data models, data flows and job flows is
reduced as complexity increases
Some find GUI development less efficient than traditional coding, especially for
complex or unique type of processing
Often the sophisticated features are underutilized, resulting in expensive tools
being used just for job scheduling
19. Fourth, BI is increasingly involving Big Data
Big Data implementations often make ETL tools less compelling
Large volumes make it more efficient to
Manipulate data in place using ELT, rather than have multiple staging areas
Use native methods (MapReduce/Java, SQL, Hive, etc.) that allow for more control
and performance optimization
High velocity of data makes it harder to use ETL tools that have traditionally
been designed around batch-oriented processing.
High variability of data makes ETL tools less attractive, since they expect a
fixed schema and don’t gracefully accommodate changes. Common examples
include unstructured web log data in flat files and logical objects from apps
stored in key-value pair format.
MPP vendors, such as Teradata and Netezza, make a case for doing ELT (rather
than ETL) processing natively and provide built-in features to do so
Currently ETL tools are rarely used with the Hadoop ecosystem for many of the
reasons stated, as well as licensing cost
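The ELT argument above can be sketched in miniature: load the raw data first, then run the transformation in place with SQL pushed down to the database, instead of routing the data through an external ETL engine's staging areas. Python with sqlite3 standing in for an MPP warehouse; table names are hypothetical:

```python
import sqlite3

# "E" and "L" first: raw rows land in the database untransformed
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE raw_sales (region TEXT, amount REAL)")
db.executemany("INSERT INTO raw_sales VALUES (?, ?)",
               [("EMEA", 100.0), ("EMEA", 50.0), ("APAC", 75.0)])

# "T" last: the transformation runs where the data already lives,
# letting the database engine parallelize and optimize it
db.execute("""
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(amount) AS total
    FROM raw_sales GROUP BY region
""")

totals = dict(db.execute("SELECT region, total FROM sales_by_region"))
assert totals == {"EMEA": 150.0, "APAC": 75.0}
```

At real MPP or Hadoop scale the payoff is that the data never leaves the cluster, which is exactly what makes a separate ETL engine's staging areas less compelling.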
20. That said, how do we implement an Agile BI environment?
21. First, use ETL tools when it makes sense
Pick the right ETL tool for the job...
We covered the potential benefits and problems of using such ETL tools
for Agile BI. Look for situations where benefits outweigh the problems.
For example, a good situation to employ ETL tools might be: A use case
requiring sophisticated data cleansing transformations, complex job control
logic, and data volumes easily handled by traditional SMP database
architectures.
Outside of such situations, consider using SQL, DB-specific native code, or
general purpose languages already in use elsewhere in the organization.
Is it OK to start with using an ETL tool as a job scheduler?
Yes, assuming it’s an efficient way to handle much needed job control
logic, including failures, event triggers, and dependencies.
Plus, you get the option to adopt other capabilities of the tool over time
with low project risk.
While traditional ETL tools
can simplify a complex task,
they can also overcomplicate
a simple task.
22. Second, when you do use ETL tools, look
for ways to mitigate these issues identified
So what’s the solution?
Issue: High up-front license cost
Approach: Use open source tools or less expensive licenses, like with SQL Server.
Negotiate aggressively with vendors, in light of lower-cost alternatives.

Issue: Use with Continuous Integration
Approach: See the following slides. Some vendors, like Microsoft, may make
provisions for automated builds within their environment. Otherwise look for
opportunities to simplify, partially automate, and notify the team of build state.

Issue: Use with version control
Approach: Where possible, save ETL logic to XML, create dumps of the repository,
and generate code from metadata. Then manage it in a common version control tool.

Issue: Decreased portability
Approach: Move code to general-purpose development languages, including SQL and
MDX. Consider tools that generate generic code from GUI or metadata.

Issue: Vendor-specific skill set
Approach: Build a cross-functional team by training existing developers and
hiring well-rounded developers willing to learn ETL tools.

Issue: Risk of introducing another development environment
Approach: Start using ETL tools now and “grow” into using the functionality.
Continue coding in what you know: native RDBMS code or even general app dev
languages. Start using ETL as a glorified job scheduler to wrap native code.
When refactoring code, take the opportunity to push more logic into the ETL
tool. Gradually start using other features such as MDM, data quality,
notifications, enterprise service bus, etc.
23. Continuous Integration: Methodology
Each developer should have a sandbox:
1-to-1 app instance to DB instance (CI by Martin Fowler)
Automate: Table deployment, usage stats, schema
verification, data migration verification, DB testing,
migration to prod
Version control all DB assets, ideally using a
distributed tool like Git
Use tool like dbDeploy and link app build, DB version, and forward/reverse DDL & DML
scripts
Generate a test data set with a dimension annotating what each row is testing; it
becomes a company asset that enables TDD of BI
For cases where an application consumes data from the data warehouse:
BI developers should learn software coding practices; Application developers should learn
data modeling, SQL, DB tuning
Consuming apps use 2 phased builds:
Build 1, DB is stubbed out and runs within minutes
Build 2, includes real DB for end-to-end testing, but might run for a while
Bugs found in Build 2, trigger additions to the test data set; Next time same bug is caught
in Build 1
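The two-phase build idea above can be sketched as follows: Build 1 runs the consuming application against a stub that returns canned rows in seconds, while Build 2 swaps in the real warehouse connection for end-to-end testing. The class and method names here are hypothetical:

```python
# Build 1 backend: a stub warehouse with canned rows. The canned data
# set grows every time Build 2 catches a bug, so the same bug is caught
# in the fast build next time.
class StubWarehouse:
    def revenue_by_region(self):
        return [("EMEA", 100.0), ("APAC", 250.0)]

def total_revenue(warehouse):
    # Application logic under test, unaware of which backend it receives;
    # Build 2 would pass a real warehouse client with the same interface.
    return sum(amount for _, amount in warehouse.revenue_by_region())

# Build 1: runs in milliseconds, no database required
build1_result = total_revenue(StubWarehouse())
assert build1_result == 350.0
```

The design choice is that both builds exercise identical application code; only the backend object differs, so a green Build 1 gives fast feedback without waiting for the long-running Build 2.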
[Diagram: a typical BI dev environment with contention, where Dev 1, Dev 2, and
Dev 3 all share one developer schema, contrasted with a sandboxed dev
environment appropriate for agile development, where each developer has their
own schema (Schema Dev 1, Schema Dev 2, Schema Dev 3)]
24. Continuous Integration: Tools & Configuration
How dbDeploy works
dbDeploy is treated as a custom Ant task:
1. Logs and assigns version numbers to changes in SQL files
2. Saves a changelog table of changes since the prior version
3. Generates DDL & DML scripts to apply to the DB in other environments
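The pattern behind dbDeploy can be illustrated in a few lines: a changelog table records which numbered scripts have already run, so a migration pass applies only the pending ones. This is a simplified sketch of the pattern in Python with sqlite3, not dbDeploy's actual implementation; the scripts are hypothetical:

```python
import sqlite3

# Numbered migration scripts, as they would live in version control
scripts = {
    1: "CREATE TABLE dim_date (date_key INTEGER)",
    2: "ALTER TABLE dim_date ADD COLUMN iso_date TEXT",
}

def migrate(db):
    # The changelog table is the tool's memory of what has been applied
    db.execute("CREATE TABLE IF NOT EXISTS changelog (version INTEGER)")
    applied = {v for (v,) in db.execute("SELECT version FROM changelog")}
    for version in sorted(scripts):
        if version not in applied:
            db.execute(scripts[version])
            db.execute("INSERT INTO changelog VALUES (?)", (version,))

db = sqlite3.connect(":memory:")
migrate(db)
migrate(db)  # idempotent: nothing is pending on the second run

versions = [v for (v,) in
            db.execute("SELECT version FROM changelog ORDER BY version")]
assert versions == [1, 2]
```

Because the changelog travels with each database, the same script set can migrate a developer sandbox, the CI database, and production to the same build, which is what links the app build number to a database version.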
Tool: Ant
Type: Build tool
Purpose: Automates the steps to build & deploy software

Tool: Jenkins
Type: Continuous Integration
Purpose: Monitors the source code repository (Git) for check-ins, automatically
launching build-test cycles and publishing results

Tool: Git
Type: Source control / repository
Purpose: Source code repository optimized for branching and merging, making it
efficient for each developer to have their own sandbox environment. Check-ins
trigger CI build-test cycles

Tool: dbDeploy, dbMaintain, etc.
Type: Database refactoring manager
Purpose: Automates the process of establishing which database refactorings need
to be run against a specific database in order to migrate it to a particular
build

Tool: DbUnit, DbFit, SQLUnit
Type: Unit test automation
Purpose: Common tools to aid TDDD (test-driven database development). Manage DB
state between test runs, import/export test data sets, run unit tests and log
exceptions. Regression testing of DDL, DML, and stored procedures
[Diagram: CI flow — the developer environment checks project code in to the
repository (Git); the CI environment checks it out, the build tool deploys &
tests against the test server and tags the build success/fail, and successful
builds are promoted to the prod server]
25. Continuous Refactoring & Releases of Databases
Based on a presentation by Pramod Sadalage

Environment: Dev Sandbox
Characteristics: Highly iterative development
Deployment frequency: Frequent
Risk / impact of bug: Low impact
Testing: Test data set (used for TDD)

Environment: Project Integration Sandbox
Characteristics: Project-level testing
Deployment frequency: Frequent
Risk / impact of bug: Low impact
Testing: Test data set

Environment: Test / QA Sandbox
Characteristics: System integration testing
Deployment frequency: Infrequent
Risk / impact of bug: Medium impact
Testing: Benchmark data

Environment: Production
Characteristics: Operations & support
Deployment frequency: Controlled
Risk / impact of bug: High impact
Testing: Production data
26. Continuous Integration:
Possible Configuration for Microsoft BI Stack
PowerDelivery
Addresses TFS’s weakness in coordinating the promotion of builds
through multiple environments of the delivery pipeline: triggering build
on commit, promoting commit build to test, promoting test build to
prod
Windows PowerShell
Task-based command-line shell & scripting language (built on .NET)
for task automation
Team Foundation Server
Microsoft's application lifecycle management (ALM) solution.
Collaboration platform that supports agile delivery practices
Build machine is configured for continuous integration, so latest
working version is refreshed and available to the entire distributed
team
SQL Server Data Tools
Develop, debug, and execute database unit tests interactively
in Visual Studio.
Puts database testing on an equal footing with application testing.
Can then be run from command line or from a build machine
Integrated with testing, bug tracking, and project management using
TFS