3. Brewery 1 Issues
■ based on streaming data by records
buffering in python lists as python objects
■ stream networks were using threads
hard to debug, performance penalty (GIL)
■ no use of native data operations
■ difficult to extend
9. data object represents structured data
Data do not have to be in its final form,
neither they have to exist. Promise of
providing data in the future is just fine.
Data are virtual.
11. Data Object
■ is defined by fields
■ has one or more representations
■ might be consumable
one-time use objects such as streamed data
SQL statement
iterator
12. Fields
■ define structure of data object
■ storage metadata
generalized storage type, concrete storage type
■ usage metadata
purpose – analytical point of view, missing values, ...
14. SQL statement
iterator
SELECT *
FROM products
WHERE price < 100
engine.execute(statement)
Representations
SQL statement that can
be composed
actual rows fetched
from database
15. Representations
■ represent actual data in some way
SQL statement, CSV file, API query, iterator, ...
■ decided on runtime
list might be dynamic, based on metadata, availability, …
■ used for data object operations
filtering, composition, transformation, …
18. Data Object Role
■ source: provides data
various source representations such as rows()
■ target: consumes data
append(row), append_from(object), ...
target.append_from(source)
for row in source.rows():
print(row)
implementation might
depend on source
19. Append From ...
Iterator SQL
target.append_from(source)
for row in source.rows():
INSERT INTO target (...)
SQLSQL
INSERT INTO target
SELECT … FROM source
same engine
23. @operation
@operation(“sql”)
def sample(context, object, limit):
...
@operation(“sql”, “sql”)
def new_rows(context, target, source):
...
@operation(“sql”, “rows”, name=“new_rows”)
def new_rows_iter(context, target, source):
...
unary
binary
binary with same name but different signature:
24. List of Objects
@operation(“sql[]”)
def append(context, objects):
...
@operation(“rows[]”)
def append(context, objects):
...
matches one of common representations
of all objects in the list
34. Retry!
SQL
SQL
A
B
iterator
iteratorSQL
join details join details
SQL
retry another
signature
raise RetryOperation(“rows”, “rows”)
if objects are not compose-able as
expected, operation might gently fail and
request a retry with another signature:
35. Retry when...
■ not able to compose objects
because of different connections or other reasons
■ not able to use representation
as expected
■ any other reason
48. Uniqueness
■ distinct(ctx, obj, key)
distinct values for key
■ distinct_rows(ctx, obj, key)
distinct whole rows (first occurence of a row) for key
■ count_duplicates(ctx, obj, key)
count number of duplicates for key
49. Master-detail
■ join_detail(ctx, master, detail, master_key, detail_key)
Joins detail table, such as a dimension, on a specified key. Detail key
field will be dropped from the result.
Note: other join-based operations will be implemented
later, as they need some usability decisions to be made
50. Dimension Loading
■ added_keys(ctx, dim, source, dim_key, source_key)
which keys in the source are new?
■ added_rows(ctx, dim, source, dim_key, source_key)
which rows in the source are new?
■ changed_rows(ctx, target, source, dim_key, source_key,
fields, version_field)
which rows in the source have changed?
53. To Do
■ consolidate representations API
■ define basic set of operations
■ temporaries and garbage collection
■ sequence objects for surrogate keys
54. Version 0.2
■ processing graph
connected nodes, like in Brewery
■ more basic backends
at least Mongo
■ bubbles command line tool
already in progress