Bubbles – Virtual Data Objects

Description

Bubbles is a data framework for creating data processing and monitoring pipelines.

Transcription

  1. Bubbles – Virtual Data Objects. June 2013. Stefan Urbanek, Data Brewery
  2. Contents ■ Data Objects ■ Operations ■ Context ■ Stores ■ Pipeline
  3. Brewery 1 issues ■ based on streaming data record by record, buffered in Python lists as Python objects ■ stream networks used threads: hard to debug, with a performance penalty (the GIL) ■ no use of native data operations ■ difficult to extend
  4. About: a Python framework for data processing and quality probing (requires Python 3.3)
  5. Objective: focus on the process, not data technology
  6. Data ■ keep data in their original form ■ use native operations where possible ■ performance is provided by the underlying technology ■ have other options available
  7. For categorical data* (*you can do numerical too, but there are plenty of other, better tools for that)
  8. Data Objects
  9. A data object represents structured data. The data do not have to be in their final form, nor do they even have to exist; a promise to provide the data in the future is just fine. Data are virtual.
  10. Diagram: a virtual data object, defined by its fields (id, product, category, amount, unit price) and backed by representations of the virtual data (SQL statement, iterator)
  11. A Data Object ■ is defined by fields ■ has one or more representations (SQL statement, iterator, ...) ■ might be consumable: one-time-use objects such as streamed data
  12. Fields ■ define the structure of a data object ■ carry storage metadata: generalized storage type, concrete storage type ■ carry usage metadata: purpose (the analytical point of view), missing values, ...
  13. Field List (sample metadata) – name / storage type / analytical type (purpose) / sample value: id / integer / typeless / 100; product / string / nominal / Atari 1040ST; category / string / nominal / computer; amount / integer / discrete / 10; unit price / float / measure / 400.0; year / integer / ordinal / 1985; shipped / string / flag / no
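To make the field metadata concrete, here is a minimal, self-contained sketch of how the slide's field list could be modeled in Python. The Field class below is an illustrative stand-in, not the framework's actual API:

    from dataclasses import dataclass

    @dataclass
    class Field:
        # hypothetical stand-in: a field has a name plus storage and
        # usage (analytical) metadata, as shown on slide 13
        name: str
        storage_type: str       # generalized storage type, e.g. "integer"
        analytical_type: str    # purpose, e.g. "nominal", "measure", "flag"

    # the sample field list from slide 13
    fields = [
        Field("id", "integer", "typeless"),
        Field("product", "string", "nominal"),
        Field("category", "string", "nominal"),
        Field("amount", "integer", "discrete"),
        Field("unit price", "float", "measure"),
        Field("year", "integer", "ordinal"),
        Field("shipped", "string", "flag"),
    ]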
  14. Representations example – SQL statement: SELECT * FROM products WHERE price < 100, a statement that can be composed further; iterator: engine.execute(statement), actual rows fetched from the database
  15. Representations ■ represent the actual data in some way: SQL statement, CSV file, API query, iterator, ... ■ are decided at runtime: the list might be dynamic, based on metadata, availability, … ■ are used for data object operations: filtering, composition, transformation, …
  16. Representations: the SQL statement is natural and most efficient for operations; the iterator is the default, all-purpose representation, but might be very expensive
  17. Representations: >>> object.representations() → ["sql_table", "postgres+sql", "sql", "rows"] – the data might have been cached in a table, we might use PostgreSQL dialect-specific features, or fall back to generic SQL; the generic rows iterator serves all other operations
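As a sketch of how a caller might exploit this ordering, the hypothetical helper below picks the first preferred representation an object offers and falls back to the generic rows iterator:

    def pick_representation(available, preferred):
        # available: the list returned by object.representations(),
        # ordered from most to least specific
        for name in preferred:
            if name in available:
                return name
        return "rows"  # the all-purpose fallback

    available = ["sql_table", "postgres+sql", "sql", "rows"]
    assert pick_representation(available, ["postgres+sql", "sql"]) == "postgres+sql"
    assert pick_representation(available, ["mongo"]) == "rows"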
  18. Data Object Role ■ source: provides data through various source representations such as rows() ■ target: consumes data through append(row), append_from(object), ... For example: target.append_from(source), or: for row in source.rows(): print(row). The implementation of append_from might depend on the source.
  19. Append From: target.append_from(source) – iterator source: for row in source.rows(), an INSERT INTO target (...) per row; SQL source on the same engine: a single INSERT INTO target SELECT … FROM source
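A minimal sketch of that choice, under the assumption that SQL-backed objects expose hypothetical engine and table attributes; when both sides share an engine the copy runs natively, otherwise it falls back to the iterator representation:

    def append_from(target, source):
        same_engine = (getattr(source, "engine", None) is not None
                       and source.engine is getattr(target, "engine", None))
        if same_engine:
            # native path: one statement, no data leaves the database
            target.engine.execute(
                "INSERT INTO %s SELECT * FROM %s" % (target.table, source.table))
        else:
            # generic path: stream rows through Python one by one
            for row in source.rows():
                target.append(row)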
  20. Operations
  21. Operation: does something useful with a data object and produces another data object – or something else, also useful
  22. Signature: @operation("sql") def sample(context, object, limit): ... – the signature declares the representation the operation accepts (here "sql")
  23. @operation: @operation("sql") def sample(context, object, limit): ... (unary); @operation("sql", "sql") def new_rows(context, target, source): ... (binary); @operation("sql", "rows", name="new_rows") def new_rows_iter(context, target, source): ... (binary with the same name but a different signature)
  24. List of Objects: @operation("sql[]") def append(context, objects): ... and @operation("rows[]") def append(context, objects): ... – matches one of the common representations of all objects in the list
  25. Any / Default: @operation("*") def do_something(context, object): ... – the default operation, used if no other signature matches
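The decorator on slides 22–25 can be imitated with a small registry keyed by operation name. The sketch below is self-contained and illustrative only; it records the signature (including list signatures like "sql[]" and the catch-all "*") for a dispatcher to consult later:

    OPERATIONS = {}  # operation name -> list of (signature, function)

    def operation(*signature, name=None):
        def register(func):
            op_name = name or func.__name__
            func.signature = signature            # used later by the dispatcher
            OPERATIONS.setdefault(op_name, []).append((signature, func))
            return func
        return register

    @operation("sql")
    def sample(context, obj, limit): ...

    @operation("sql", "rows", name="new_rows")
    def new_rows_iter(context, target, source): ...

    @operation("*")
    def do_something(context, obj): ...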
  26. Context
  27. Context: a collection of operations, each potentially implemented for several representations (SQL, iterator, Mongo)
  28. Operation Call: context = Context(); op = context.operation("sample") obtains a callable reference, and calling op(source, 10) dispatches at runtime to the SQL or iterator implementation of sample
  29. Simplified Call: context.operation("sample")(source, 10) can be shortened to context.o.sample(source, 10)
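Building on the registry sketch above, a hypothetical Context could dispatch on the first (most preferred) representation of each object argument and expose the context.o shortcut via __getattr__; exact-match dispatch here is a deliberate simplification of whatever the real dispatcher does:

    class Context:
        def __init__(self, operations=OPERATIONS):
            self._operations = operations

        def operation(self, name):
            def call(*args, **kwargs):
                # dispatch on the arguments that look like data objects
                objs = [a for a in args if hasattr(a, "representations")]
                reprs = tuple(o.representations()[0] for o in objs)
                for signature, func in self._operations[name]:
                    if signature == reprs or signature == ("*",):
                        return func(self, *args, **kwargs)
                raise LookupError("no signature %r for %r" % (reprs, name))
            return call

        @property
        def o(self):
            context = self
            class Shortcut:
                def __getattr__(self, name):
                    return context.operation(name)
            return Shortcut()

With this sketch, context.operation("sample")(source, 10) and context.o.sample(source, 10) resolve to the same call.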
  30. Dispatch: the operation is chosen based on its signature. Example: we do not have this kind of operation for MongoDB, so we use the default iterator implementation instead
  31. Dispatch: dynamic dispatch of operations, based on the representations of the argument objects
  32. Priority: the order of representations matters and might be decided at runtime – the same representations in a different order select a different implementation
  33. Incapable? Joining details between two SQL objects works when both share the same connection (A, A) – use this; with different connections (A, B), composing a single SQL join fails
  34. Retry! If objects are not composable as expected, an operation might gently fail and request a retry with another signature: raise RetryOperation("rows", "rows") – e.g. an SQL join across connections A and B retries with the iterator implementation
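A sketch of the exception side of this, reusing the illustrative decorator above; the engine attribute is again an assumption, and a real dispatcher would catch RetryOperation and re-dispatch with the requested signature:

    class RetryOperation(Exception):
        def __init__(self, *signature):
            self.signature = signature  # the signature to retry with

    @operation("sql", "sql")
    def join_details(context, master, detail):
        if master.engine is not detail.engine:
            # not composable as one SQL statement: gently fail and
            # ask for the iterator-based implementation instead
            raise RetryOperation("rows", "rows")
        ...  # compose and return a joined SQL statement here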
  35. Retry when... ■ objects cannot be composed, because of different connections or other reasons ■ a representation cannot be used as expected ■ any other reason
  36. Modules: collections of operations grouped by backend – SQL, Iterator, MongoDB (just an example)
  37. Extend Context: context.add_operations_from(obj) – works with any object that has operations as attributes, such as a module
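Given that the illustrative decorator above tags each function with a signature attribute, extending a context from a module could look like this sketch:

    def add_operations_from(context, obj):
        # adopt every attribute that the @operation decorator marked
        for attr_name in dir(obj):
            attr = getattr(obj, attr_name)
            if callable(attr) and hasattr(attr, "signature"):
                context._operations.setdefault(
                    attr.__name__, []).append((attr.signature, attr))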
  38. Stores
  39. Object Store ■ contains objects: tables, files, collections, ... ■ objects are named: get_object(name) ■ might create objects: create(name, replace, ...)
  40. Object Store: store = open_store("sql", "postgres://localhost/data") – open_store is the store factory; available factories: sql, csv (directory), memory, ...
  41. Stores and Objects – copy data from an SQL table to CSV: source = open_store("sql", "postgres://localhost/data"); target = open_store("csv", "./data/"); source_obj = source.get_object("products"); target_obj = target.create("products", fields=source_obj.fields); for row in source_obj.rows(): target_obj.append(row); target_obj.flush() (see the cleaned-up listing below)
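Spread over several lines and with straight quotes, the slide's SQL-to-CSV copy reads as follows; every call here is one shown on slides 39–41:

    source = open_store("sql", "postgres://localhost/data")
    target = open_store("csv", "./data/")

    source_obj = source.get_object("products")
    target_obj = target.create("products", fields=source_obj.fields)

    # copy data from the SQL table to a CSV file, row by row
    for row in source_obj.rows():
        target_obj.append(row)
    target_obj.flush()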
  42. Pipeline
  43. Pipeline: a sequence of operations on a "trunk" (e.g. SQL → SQL → SQL → SQL → Iterator)
  44. Pipeline Operations – extract product colors to CSV: stores = {"source": open_store("sql", "postgres://localhost/data"), "target": open_store("csv", "./data/")}; p = Pipeline(stores=stores); p.source("source", "products"); p.distinct("color"); p.create("target", "product_colors") – for each operation, the first argument is the result of the previous step
  45. Pipeline: p.source(store, object_name, ...) calls store.get_object(...); p.create(store, object_name, ...) calls store.create(...) and then store.append_from(...) – see the full listing below
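Slide 44's example, with its two syntax slips fixed (a missing comma and = in place of : inside the dict) so that it is valid Python; how the pipeline is finally executed is not shown on the slides:

    stores = {
        "source": open_store("sql", "postgres://localhost/data"),
        "target": open_store("csv", "./data/"),
    }

    p = Pipeline(stores=stores)
    p.source("source", "products")        # store.get_object("products")
    p.distinct("color")                   # input is the previous step's result
    p.create("target", "product_colors")  # store.create(...) + append_from(...)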
  46. Operation Library
  47. Filtering ■ row filters: filter_by_value, filter_by_set, filter_by_range ■ field_filter(ctx, obj, keep=[], drop=[], rename={}): keep, drop, or rename fields ■ sample(ctx, obj, value, mode): first N, every Nth, random, … (a sketch of one row filter follows)
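As an illustration of a row filter on the iterator representation, using the sketch decorator from above; treating obj.fields as a list of field names that supports index() is a simplifying assumption:

    @operation("rows")
    def filter_by_value(context, obj, field, value):
        # keep only the rows whose given field equals the value
        index = obj.fields.index(field)
        return (row for row in obj.rows() if row[index] == value)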
  48. Uniqueness ■ distinct(ctx, obj, key): distinct values for a key ■ distinct_rows(ctx, obj, key): distinct whole rows (first occurrence of a row) for a key ■ count_duplicates(ctx, obj, key): count the number of duplicates for a key
  49. Master-detail ■ join_detail(ctx, master, detail, master_key, detail_key): joins a detail table, such as a dimension, on the specified key; the detail key field is dropped from the result. Note: other join-based operations will be implemented later, as they need some usability decisions to be made
  50. Dimension Loading ■ added_keys(ctx, dim, source, dim_key, source_key): which keys in the source are new? ■ added_rows(ctx, dim, source, dim_key, source_key): which rows in the source are new? ■ changed_rows(ctx, target, source, dim_key, source_key, fields, version_field): which rows in the source have changed? (a sketch of added_keys follows)
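A sketch of the first of these on the iterator representation, under the same simplifying assumptions as the filter above:

    @operation("rows", "rows")
    def added_keys(context, dim, source, dim_key, source_key):
        # keys that appear in the source but not yet in the dimension
        dim_index = dim.fields.index(dim_key)
        src_index = source.fields.index(source_key)
        existing = {row[dim_index] for row in dim.rows()}
        incoming = {row[src_index] for row in source.rows()}
        return incoming - existing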
  51. more to come…
  52. Conclusion
  53. To Do ■ consolidate the representations API ■ define a basic set of operations ■ temporaries and garbage collection ■ sequence objects for surrogate keys
  54. Version 0.2 ■ processing graph: connected nodes, like in Brewery ■ more basic backends, at least Mongo ■ a bubbles command-line tool, already in progress
  55. Future ■ a separate operation dispatcher, which will allow custom dispatch policies
  56. Contact: @Stiivi, stefan.urbanek@gmail.com
  57. databrewery.org
