Ce diaporama a bien été signalé.
Nous utilisons votre profil LinkedIn et vos données d’activité pour vous proposer des publicités personnalisées et pertinentes. Vous pouvez changer vos préférences de publicités à tout moment.

Continuous ETL Testing for Pentaho Data Integration (kettle)

2 497 vues

Publié le

The talk gives an overview of the benefits of automated testing, defines different types of ETL tests, and showcases techniques for implementing an ETL test suite.

It was held at the Pentaho Community Meetup 2017 in Mainz.

Publié dans : Données & analyses
  • D0WNL0AD FULL ▶ ▶ ▶ ▶ http://1lite.top/KUYJA ◀ ◀ ◀ ◀
       Répondre 
    Voulez-vous vraiment ?  Oui  Non
    Votre message apparaîtra ici

Continuous ETL Testing for Pentaho Data Integration (kettle)

  1. 1. Twineworks Testing PDI solutions Slawomir Chodnicki BI Consultant slawo@twineworks.com
  2. 2. TwineworksThe sample project 2 . |-- bin # entry point scripts |-- environments # environment configuration |-- etl # ETL solution `-- spec # tests and helpers Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  3. 3. TwineworksTest orchestration $ bin/robot test 3 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  4. 4. TwineworksTest orchestration Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin4
  5. 5. TwineworksTest orchestration Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin5
  6. 6. TwineworksTest orchestration Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin6
  7. 7. Twineworks Jenkins is a continuous integration server. It’s basic role is to run the test suite and build any artifacts upon changes in version control. Example server: http://ci.pentaho.com/
  8. 8. Twineworks
  9. 9. Twineworks
  10. 10. Twineworks
  11. 11. Twineworks Jenkins is a continuous integration server. It’s basic role is to run the test suite and build any artifacts upon changes in version control. Example server: http://ci.pentaho.com/
  12. 12. Twineworks Testable solutions
  13. 13. Twineworks Configuration management • configure all data sources/targets and paths through kettle variables or parameters • local environment - not in version control • test environment - reference environment • production environment - optional 13 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  14. 14. Twineworks Configuration management 14 environments `-- local # dev environment – not in version control |-- environment.sh # shell environment variables |-- my.cnf # database config file `-- .kettle # KETTLE_HOME |-- shared.xml # database connections `-- kettle.properties # kettle variables `-- test # test environment – in version control `-- production # other environments Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  15. 15. TwineworksEnvironments are self-contained • share nothing • reproducible results: – you can run it – your team can run it – ci-server can run it 15 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  16. 16. Twineworks Configuration management $ bin/robot spoon 16 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  17. 17. Twineworks Configuration management 17 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  18. 18. TwineworksInitialize db $ bin/robot db reset clearing database [ OK ] initializing database 2017/10/20 15:32:37 - Kitchen - Start of run. 2017/10/20 15:32:37 - reset_dwh - Start of job execution … … 2017/10/20 15:32:41 - Kitchen - Finished! 2017/10/20 15:32:41 - Kitchen - Processing ended after 4 seconds. [ OK ] 18 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  19. 19. TwineworksSub-systems/Phases • Define sub-systems/phases – define pre-requisites • data expected in certain sources – define outcomes • data written to certain sinks • A sub-system/phase of the ETL process is responsible for a small set of related side- effects to happen 19 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  20. 20. TwineworksSub-systems/Phases 20 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  21. 21. TwineworksSub-systems/Phases 21 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  22. 22. TwineworksEntry points • Define entry points with a full functional contract. • An entry point implements an application feature. 22 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  23. 23. TwineworksEntry points 23
  24. 24. Twineworks Testing ETL solutions
  25. 25. TwineworksKinds of automated tests • Computation tests • Integration tests • Functional tests • Non-functional tests 25 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  26. 26. TwineworksComputation tests • Single unit of ETL under test • Performs a computation (no side-effects) • What is a “unit” in PDI? – Job? – Transformation? – Sub-transformation (mapping)? 26 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  27. 27. TwineworksA simple computation test Test job spec/dwh/validate_params/validate_params_spec.kjb 27 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  28. 28. TwineworksA simple computation test Test job spec/dwh/validate_params/validate_params_spec.kjb The job calls etl/dwh/validate_params.kjb - with DATA_DATE=2016-07-01 and expects it to succeed - with DATA_DATE=1867-12-21 and expects it to fail The job succeeds if all expectations are met. It fails otherwise. 28 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  29. 29. TwineworksTest jobs 29 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  30. 30. Twineworks Test transformation results 30 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  31. 31. Twineworks Test transformation results 31 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  32. 32. TwineworksTest sub-transformations 32 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  33. 33. TwineworksTest sub-transformations 33 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  34. 34. Twineworks Integration tests
  35. 35. TwineworksIntegration tests • ETL responsible for a set of related side- effects under test • Most common case in ETL testing – Test individual phases of a batch process 35 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  36. 36. TwineworksIntegration tests 36 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  37. 37. TwineworksIntegration tests 37 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  38. 38. TwineworksIntegration tests 38 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  39. 39. Twineworks Functional testing
  40. 40. TwineworksFunctional tests • Entry point of ETL solution under test • Assertions reflect invocation contract – Behavior on happy path – Behavior on errors – Behavior on incorrect invocation 40 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  41. 41. TwineworksTest the daily run 41 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  42. 42. TwineworksTest the daily run 42 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  43. 43. TwineworksTest the daily run 43 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  44. 44. Twineworks Non-functional tests
  45. 45. TwineworksNon-functional tests • Performance – how long does workload x take? • Stability – what does it take to break it? – How much memory is too little? – What happens when loading unexpected data? (truncated file, column too long, 50MB XML in string field, badly formatted CSV reads as single field, empty files) 45 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  46. 46. TwineworksNon-functional tests • Security – Verify configuration assumptions automatically • Compliance – We must use version x of library y 46 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  47. 47. TwineworksTest compliance 47 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  48. 48. Twineworks Scripting tests
  49. 49. TwineworksJRuby JRuby is the ruby language on the JVM http://jruby.org Maintained by Redhat. Runs Rails on JBoss  49 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  50. 50. TwineworksRspec Rspec is a testing framework for ruby http://rspec.info/ https://relishapp.com/rspec 50 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  51. 51. TwineworksRspec $ bin/robot test • Includes helper files in spec/support • traverses the spec folder looking for files whose names end in _spec.rb and loads them as tests 51 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  52. 52. Twineworks 52 describe "dwh clear job" do end end Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  53. 53. Twineworks 53 describe "dwh clear job" do describe "when db is not empty" do end end Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  54. 54. Twineworks 54 describe "dwh clear job" do describe "when db is not empty" do before :all do dwh_db.load_fixture "spec/fixtures/steelwheels/steelwheels.sql" @result = run_job "etl/dwh/load/load.kjb", {} end end end Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  55. 55. TwineworksRspec helpers 55 spec/support/spec_helpers.rb def dwh_db ... end Returns a JDBC database object. Connects on demand, and closes automatically when test- suite ends. Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  56. 56. TwineworksRspec helpers 56 spec/support/spec_helpers.rb def dwh_db ... end In addition dwh_db.load_fixture(path) allows loading a sql or json fixture file dwh_db.reset() triggers $ bin/robot db reset Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  57. 57. Twineworks 57 describe "dwh clear job" do describe "when db is not empty" do before :all do dwh_db.load_fixture "spec/fixtures/steelwheels/steelwheels.sql" } end end end Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  58. 58. Twineworks 58 describe "dwh clear job" do describe "when db is not empty" do before :all do dwh_db.load_fixture "spec/fixtures/steelwheels/steelwheels.sql" @result = run_job "etl/dwh/util/clear.kjb" end end end Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  59. 59. TwineworksRspec helpers 59 spec/support/spec_helpers.rb def run_job file, params ... end Runs a kettle job and returns a map { :successful? => true/false, :log => “log text”, :result => [row1, row2, row3, …] } Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  60. 60. Twineworks 60 describe "dwh clear job" do describe "when db is not empty" do before :all do dwh_db.load_fixture "spec/fixtures/steelwheels/steelwheels.sql" @result = run_job "etl/dwh/util/clear.kjb" end end end Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  61. 61. Twineworks 61 describe "dwh clear job" do describe "when db is not empty" do before :all do dwh_db.load_fixture "spec/fixtures/steelwheels/steelwheels.sql" @result = run_job "etl/dwh/util/clear.kjb" end it "completes successfully" do expect(@result[:successful?]).to be true end end end Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  62. 62. Twineworks 62 describe "dwh clear job" do describe "when db is not empty" do before :all do dwh_db.load_fixture "spec/fixtures/steelwheels/steelwheels.sql" @result = run_job "etl/dwh/util/clear.kjb" end it "completes successfully" do expect(@result[:successful?]).to be true end it "clears the db" do expect(dwh_db.query("SHOW TABLES").to_a.length).to eq 0 end end end Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  63. 63. TwineworksTest orchestration $ bin/robot test 63 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  64. 64. TwineworksTest orchestration Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin64
  65. 65. TwineworksTest orchestration Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin65
  66. 66. TwineworksTest orchestration Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin66
  67. 67. TwineworksRunning jobs as rspec tests 67 spec/etl/etl_spec.rb Recursively traverses etl/spec looking for files whose names end in _spec.kjb, and dynamically generates a describe and it block for it. Hence all such job files are part of the test suite. Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  68. 68. Twineworks 68 describe "ETL" do Dir.glob("./**/*_spec.kjb").each do |path| describe "#{path}” do it "completes successfully" do @result = run_job path.to_s, {} expect(@result[:successful?]).to be true end end end end Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  69. 69. TwineworksRspec – test orchestration 69 Rspec runs in two phases Phase 1: collects tests, recording the structure as given by the describe blocks. Phase 2: filters found tests as per command line parameters and executes them Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  70. 70. TwineworksRspec – test orchestration 70 Run only tests containing the word ‘clear’ in their name or enclosing describe blocks: $ bin/robot test --example 'clear' Run only tests tagged ‘long_running’: $ bin/robot test --tag 'long_running' Run only tests in spec/commands $ bin/robot test spec/commands Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  71. 71. TwineworksTest orchestration $ bin/robot test 71 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  72. 72. TwineworksTest orchestration Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin72
  73. 73. TwineworksTest orchestration Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin73
  74. 74. TwineworksTest orchestration Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin74
  75. 75. Twineworks Jenkins is a continuous integration server. It’s basic role is to run the test suite and build any artifacts upon changes in version control. Example server: http://ci.pentaho.com/
  76. 76. Twineworks Thank you!
  77. 77. Twineworks Backup Slides
  78. 78. Twineworks Testing in practice
  79. 79. TwineworksTest what you run • Verify behavior of the entity you run directly 79 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  80. 80. TwineworksTools of the trade • Helpers – utility code/etl of components reused to make tests about the what, not about the how – fixture loaders – assertion helpers – data comparison helpers 80 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  81. 81. TwineworksData Fixtures • Data Fixtures – sets of test data, encoded in a convenient way, easily loaded into data sources and sinks • JSON, CSV, SQL, XML, YAML – Use whatever is easiest to maintain for the team • Generate data fixtures through parameterized scripts if you need to generate datasets with consistent relationships 81 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com
  82. 82. TwineworksFile Fixtures • File Fixtures – sets of test files acted upon during a run • Maintain file fixtures separate from source location expected by ETL • If fixture files are changed as part of the test, copy them to a temporary location before running tests • Create a unique source location per test run, if the file location is shared (like sftp) 82 Twineworks GmbH | Helmholtzstr. 28 | 10587 Berlin | twineworks.com

×