Building data flows with Celery and SQLAlchemy
PyCon Australia 2013
Roger Barnes
@mindsocket
roger@mindsocket.com.au
http://slideshare.net/mindsocket
Coming up
● Data warehousing
– AKA data integration
● Processing data flows
– SQLAlchemy
– Celery
● Tying it all together
About me
● 15 years doing all things software
● 11 years at a Business Intelligence vendor
● Currently contracting
– This talk based on a real reporting system
Why Data Warehousing?
But we need reports that are
● Timely
● Unambiguous
● Accurate
● Complete
● … and don't impact production systems
What is a Data Warehouse
"… central repository of data
which is created by integrating
data from one or more
disparate sources" - Wikipedia
Extract, Transform, Load
Source: www.imc.com
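A minimal sketch of the three stages using only the standard library (csv and sqlite3). The table names, column names, sample rows and exchange rates are made up for illustration and are not from the talk.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real extract would read a file or another system.
SOURCE_CSV = "txn_id,amount,currency\n1,10.00,AUD\n2,5.50,USD\n"
# Illustrative fixed rates for the transform step (assumed values).
RATES_TO_AUD = {"AUD": 1.0, "USD": 1.5}


def extract(conn, text):
    """Extract: copy the raw rows into a staging table unchanged."""
    rows = list(csv.DictReader(io.StringIO(text)))
    conn.execute("CREATE TABLE sales_extract (txn_id, amount, currency)")
    conn.executemany(
        "INSERT INTO sales_extract VALUES (:txn_id, :amount, :currency)", rows)
    return len(rows)


def transform(conn):
    """Transform: derive a normalised amount for every staged row."""
    conn.execute(
        "CREATE TABLE sales_transform (txn_id INTEGER, amount_aud REAL)")
    staged = conn.execute(
        "SELECT txn_id, amount, currency FROM sales_extract").fetchall()
    for txn_id, amount, currency in staged:
        conn.execute("INSERT INTO sales_transform VALUES (?, ?)",
                     (int(txn_id), float(amount) * RATES_TO_AUD[currency]))


def load(conn):
    """Load: build the reporting table that end users query."""
    conn.execute("CREATE TABLE sales_report AS "
                 "SELECT COUNT(*) AS txns, SUM(amount_aud) AS total_aud "
                 "FROM sales_transform")
    return conn.execute("SELECT txns, total_aud FROM sales_report").fetchone()


conn = sqlite3.connect(":memory:")
extracted = extract(conn, SOURCE_CSV)
transform(conn)
report = load(conn)
print(extracted, report)
```

Keeping each stage in its own table means reports only ever read the final table, never the source system.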
Python can help!
● Rapid prototyping
● Code re-use
● Existing libraries
● Decouple
– data flow management
– data processing
– business logic
Existing solutions
● Not a lot available in the Python space
● People roll their own
● Bubbles (Brewery 2)
– Framework for Python 3
– "Focus on the process, not the data technology"
Ways to move data around
● Flat files
● NoSQL data stores
● RDBMS
SQLAlchemy is...
Python SQL toolkit & Object Relational Mapper
About SQLAlchemy
● Full featured
● Mature, robust, documented, maintained
● Flexible
Enterprise!
DB support
● SQLite
● PostgreSQL
● MySQL
● Oracle
● MS-SQL
● Firebird
● Sybase
● ...
Python support
CPython 2.5+
CPython 3+
Jython 2.5+
PyPy 1.5+
Structure
SQLAlchemy Core
● Abstraction over Python's DBAPI
● SQL language via generative Python expressions
SQLAlchemy Core
● Good for DB performance
– bulk operations
– complex queries
– fine-tuning
– connection/tx management
Create a table
from sqlalchemy import (Column, Date, Integer, MetaData, String, Table,
                        create_engine, select)

engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
vehicles_table = Table('vehicles', metadata,
                       Column('model', String),
                       Column('registration', String),
                       Column('odometer', Integer),
                       Column('last_service', Date))
vehicles_table.create(bind=engine)
Insert data
values = [
    {'model': 'Ford Festiva',
     'registration': 'HAX00R',
     'odometer': 3141},
    {'model': 'Lotus Elise',
     'registration': 'DELEG8',
     'odometer': 31415},
]
# Passing a list of dicts performs a bulk (executemany) insert.
rows = engine.execute(
    vehicles_table.insert(),
    list(values)).rowcount
Query data
query = select(
    [vehicles_table]
).where(
    vehicles_table.c.odometer < 100
)
results = engine.execute(query)
for row in results:
    print(row)
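The snippets above use the API as it was in 2013. `engine.execute()` and the list form `select([...])` were removed in SQLAlchemy 2.0, so a sketch of the same three steps for SQLAlchemy 1.4 or later might look like this. The odometer threshold is changed here (an assumption for illustration) so that the query returns a row.

```python
from sqlalchemy import (Column, Date, Integer, MetaData, String, Table,
                        create_engine, select)

engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
vehicles_table = Table(
    'vehicles', metadata,
    Column('model', String),
    Column('registration', String),
    Column('odometer', Integer),
    Column('last_service', Date),
)

values = [
    {'model': 'Ford Festiva', 'registration': 'HAX00R', 'odometer': 3141},
    {'model': 'Lotus Elise', 'registration': 'DELEG8', 'odometer': 31415},
]

# engine.begin() opens a connection and a transaction, committing on success
# and rolling back if the block raises.
with engine.begin() as conn:
    metadata.create_all(conn)
    # A list of dicts is sent as one executemany-style bulk insert.
    conn.execute(vehicles_table.insert(), values)
    query = select(vehicles_table).where(vehicles_table.c.odometer < 10000)
    low_mileage = [row.model for row in conn.execute(query)]

print(low_mileage)
```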
Encapsulating a unit of work
Example Processor Types
● Extract
– Extract from CSV
– Extract from DB table
– Scrape web page
● Transform
– Copy table from extract layer
– Derive column
– Join tables
Abstract Processor
class BaseProcessor(object):
    def dispatch(self):
        return self._run()

    def _run(self):
        return self.run()

    def run(self):
        raise NotImplementedError
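The three-level split lets a subclass or mixin wrap `_run()` with setup and teardown while concrete processors only implement `run()`. A small self-contained sketch of that idea follows; `TimedMixin` and `CountRows` are hypothetical names, not from the talk.

```python
import time


class BaseProcessor(object):
    def dispatch(self):
        # Public entry point called by whatever schedules the work.
        return self._run()

    def _run(self):
        # Hook for infrastructure (sessions, files, timing) to wrap run().
        return self.run()

    def run(self):
        # Business logic, supplied by concrete processors.
        raise NotImplementedError


class TimedMixin(object):
    """Hypothetical mixin: records how long the wrapped run took."""
    elapsed = None

    def _run(self):
        start = time.perf_counter()
        try:
            # super() continues along the MRO to the next _run().
            return super(TimedMixin, self)._run()
        finally:
            self.elapsed = time.perf_counter() - start


class CountRows(TimedMixin, BaseProcessor):
    """Hypothetical concrete processor containing only business logic."""
    rows = ['a', 'b', 'c']

    def run(self):
        return len(self.rows)


processor = CountRows()
result = processor.dispatch()
print(result, processor.elapsed is not None)
```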
Abstract Database Processor
import contextlib

from sqlalchemy import MetaData


class DatabaseProcessor(BaseProcessor):
    db_class = None
    engine = None
    metadata = None

    @contextlib.contextmanager
    def _with_session(self):
        # db_class must provide get_engine() (not shown on the slides).
        with self.db_class().get_engine() as engine:
            self.engine = engine
            self.metadata = MetaData(bind=engine)
            yield

    def _run(self):
        with self._with_session():
            return self.run()
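The `_with_session` pattern relies on `contextlib.contextmanager`: code before `yield` is setup, code after it is teardown, and `run()` executes in between. Below is a sketch with the standard library's sqlite3 standing in for the engine. `open_connection` and `CountingProcessor` are hypothetical, and the commit/rollback behaviour is an assumption about what such a wrapper would usually do, not something the slide shows.

```python
import contextlib
import sqlite3


@contextlib.contextmanager
def open_connection(path=":memory:"):
    """Hypothetical helper: yield a connection, then commit or roll back."""
    conn = sqlite3.connect(path)
    try:
        yield conn          # the body of the `with` block runs here
        conn.commit()       # reached only if the body did not raise
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()


class CountingProcessor(object):
    """Hypothetical processor that needs a live connection while it runs."""
    conn = None

    def _run(self):
        with open_connection() as conn:
            self.conn = conn
            try:
                return self.run()
            finally:
                self.conn = None   # do not keep a reference past the session

    def run(self):
        self.conn.execute("CREATE TABLE t (x INTEGER)")
        self.conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
        return self.conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]


processor = CountingProcessor()
count = processor._run()
print(count)
```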
CSV Extract Mixin
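import csv


class CSVExtractMixin(object):
    input_file = None

    def _run(self):
        # Keep the session and the CSV file open for the duration of run().
        with self._with_session():
            with open(self.input_file) as csv_file:
                self.reader = csv.DictReader(csv_file)
                return self.run()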
A Concrete Extract
class SalesHistoryExtract(CSVExtractMixin, DatabaseProcessor):
    target_table_name = 'SalesHistoryExtract'
    input_file = SALES_FILENAME

    def run(self):
        target_table = Table(self.target_table_name, self.metadata)
        # The CSV header row supplies the column names; skip blank headers.
        for column in self.reader.fieldnames:
            if column:
                target_table.append_column(Column(column, ...))
        target_table.create()
        insert = target_table.insert()
        new_record_count = self.engine.execute(
            insert, list(self.reader)).rowcount
        return new_record_count
An Abstract Derive Transform
class AbstractDeriveTransform(DatabaseProcessor):
    table_name = None
    key_columns = None
    select_columns = None
    target_columns = None

    def process_row(self, row):
        raise NotImplementedError

    ...
    # Profit!
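The slide leaves the body of the transform out. Purely as an assumption about how such a step could work, the sketch below reads the key and select columns, calls `process_row` for each row, and writes the derived value back using the key. It uses sqlite3 rather than SQLAlchemy, and `derive_column` and the sample table are hypothetical.

```python
import sqlite3


def derive_column(conn, table_name, key_columns, select_columns,
                  target_column, process_row):
    """Add target_column to table_name and fill it row by row.

    Identifiers are interpolated into the SQL, so they must come from
    trusted code, never from user input.
    """
    conn.row_factory = sqlite3.Row
    conn.execute("ALTER TABLE %s ADD COLUMN %s" % (table_name, target_column))
    wanted = ", ".join(key_columns + select_columns)
    rows = conn.execute("SELECT %s FROM %s" % (wanted, table_name)).fetchall()
    where = " AND ".join("%s = ?" % key for key in key_columns)
    update = "UPDATE %s SET %s = ? WHERE %s" % (
        table_name, target_column, where)
    for row in rows:
        derived = process_row(row)[target_column]
        conn.execute(update, [derived] + [row[key] for key in key_columns])
    return len(rows)


# Hypothetical data and business logic for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SalesTransform (txn_id INTEGER, location TEXT, username TEXT)")
conn.executemany("INSERT INTO SalesTransform VALUES (?, ?, ?)",
                 [(1, "Hobart", "roger"), (2, "Sydney", "art")])

updated = derive_column(
    conn, "SalesTransform", ["txn_id"], ["location", "username"], "foo",
    lambda row: {"foo": "%s@%s" % (row["username"], row["location"])})
foos = [r["foo"] for r in conn.execute(
    "SELECT foo FROM SalesTransform ORDER BY txn_id")]
print(updated, foos)
```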
A Concrete Transform
from business_logic import derive_foo


class DeriveFooTransform(AbstractDeriveTransform):
    table_name = 'SalesTransform'
    key_columns = ['txn_id']
    select_columns = ['location', 'username']
    target_columns = [Column('foo', FOO_TYPE)]

    def process_row(self, row):
        foo = derive_foo(row.location, row.username)
        return {'foo': foo}
Introducing Celery
Distributed Task Queue
A Processor Task
import celery


class AbstractProcessorTask(celery.Task):
    abstract = True
    processor_class = None

    def run(self, *args, **kwargs):
        processor = self.processor_class(*args, **kwargs)
        return processor.dispatch()


class DeriveFooTask(AbstractProcessorTask):
    processor_class = DeriveFooTransform


DeriveFooTask().apply_async()  # Run it!
Canvas: Designing Workflows
● Combines a series of tasks
● Groups run in parallel
● Chains run in series
● Can be combined in different ways
>>> new_user_workflow = (create_user.s() | group(
...     import_contacts.s(),
...     send_welcome_email.s()))
>>> new_user_workflow.delay(username='artv',
...                         first='Art',
...                         last='Vandelay',
...                         email='art@vandelay.com')
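To make the semantics concrete without a broker, here is a toy model in plain Python: a chain feeds each result into the next step, and a group applies several steps to the same input in parallel. This is a stand-in built on `concurrent.futures`, not Celery's implementation, and the task bodies are invented; only the names mirror the example above.

```python
from concurrent.futures import ThreadPoolExecutor


def chain(*steps):
    """Run steps in series, passing each result to the next step."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run


def group(*steps):
    """Run steps in parallel on the same input; return results in order."""
    def run(value):
        with ThreadPoolExecutor(max_workers=len(steps)) as pool:
            futures = [pool.submit(step, value) for step in steps]
            return [future.result() for future in futures]
    return run


# Hypothetical task bodies.
def create_user(username):
    return {'username': username}


def import_contacts(user):
    return 'contacts imported for %s' % user['username']


def send_welcome_email(user):
    return 'welcome sent to %s' % user['username']


new_user_workflow = chain(create_user,
                          group(import_contacts, send_welcome_email))
results = new_user_workflow('artv')
print(results)
```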
Sample Data Processing Flow
[Flow diagram; node labels only, arrows not recoverable]
● Extract sales
● Extract customers
● Extract products
● Copy sales to transform
● Copy customers to transform
● Copy products to transform
● Join tables
● Aggregate sales by customer
● Normalise currency
● Aggregate sales by region
● Customer data exception report
Sample Data Processing Flow
extract_flow = group((
    ExtractSalesTask().si(),
    ExtractCustTask().si(),
    ExtractProductTask().si()))

transform_flow = group((
    CopySalesTask().si() | NormaliseCurrencyTask().si(),
    CopyCustTask().si(),
    CopyProductTask().si())) | JoinTask().si()

load_flow = group((
    QualityTask().si(),
    AggregateTask().si('cust_id'),
    AggregateTask().si('region_id')))

all_flow = extract_flow | transform_flow | load_flow
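The flow uses `.si()` (an immutable signature) rather than `.s()`. In a chain, a task built with `.s()` receives the previous task's return value as its first argument, while one built with `.si()` ignores it and runs with only the arguments it was given. The processors return record counts that the next task has no use for, which would explain the choice. A toy model of the difference follows; the `Signature` class and task functions are hypothetical stand-ins, not Celery's classes.

```python
class Signature(object):
    """Toy stand-in for a task signature (not Celery's class)."""

    def __init__(self, func, args=(), immutable=False):
        self.func = func
        self.args = tuple(args)
        self.immutable = immutable

    def __call__(self, parent_result=None, has_parent=False):
        # A mutable signature gets the parent's result prepended;
        # an immutable one runs with exactly the arguments it was given.
        if has_parent and not self.immutable:
            return self.func(parent_result, *self.args)
        return self.func(*self.args)


def run_chain(signatures):
    """Run signatures in series, offering each result to the next one."""
    result, has_parent = None, False
    for signature in signatures:
        result = signature(result, has_parent)
        has_parent = True
    return result


def extract_sales():
    return 42                      # e.g. a record count


def aggregate(key):
    return 'aggregated by %s' % key


def aggregate_count(count, key):
    return 'aggregated %d rows by %s' % (count, key)


# Like .si('cust_id'): the extract's return value is ignored.
immutable_result = run_chain([
    Signature(extract_sales),
    Signature(aggregate, args=('cust_id',), immutable=True)])

# Like .s('cust_id'): the extract's return value arrives as the first argument.
mutable_result = run_chain([
    Signature(extract_sales),
    Signature(aggregate_count, args=('cust_id',))])

print(immutable_result)
print(mutable_result)
```

Note that in Celery itself a group chained into a following task becomes a chord, which needs a result backend to be configured.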
Monitoring – celery events
Monitoring – celery flower
Turning it up to 11
● A requires/depends structure
● Incremental data loads
● Parameterised flows
● Tracking flow history
● Hooking into other libraries
– NLTK
– SciPy/NumPy
– ...
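One way the requires/depends idea could be sketched is to have each step declare what it requires and let a topological sort derive the run order, instead of wiring groups and chains by hand. The step names below loosely mirror the sample flow, but the `REQUIRES` mapping and the use of `graphlib` (Python 3.9+) are assumptions, not the talk's implementation.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it requires (hypothetical declaration).
REQUIRES = {
    'extract_sales': set(),
    'extract_customers': set(),
    'extract_products': set(),
    'copy_sales': {'extract_sales'},
    'copy_customers': {'extract_customers'},
    'copy_products': {'extract_products'},
    'normalise_currency': {'copy_sales'},
    'join_tables': {'normalise_currency', 'copy_customers', 'copy_products'},
    'aggregate_by_customer': {'join_tables'},
    'aggregate_by_region': {'join_tables'},
    'exception_report': {'join_tables'},
}


def batches(requires):
    """Yield lists of steps; steps within one list could run in parallel."""
    sorter = TopologicalSorter(requires)
    sorter.prepare()               # raises CycleError on circular requirements
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        yield ready
        sorter.done(*ready)        # unblocks the steps that depend on these


plan = list(batches(REQUIRES))
for number, steps in enumerate(plan, 1):
    print(number, steps)
```

Each batch corresponds to a group, and the sequence of batches to a chain.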
Summary
● Intro to data warehousing
● Process data with SQLAlchemy
● Task dependencies with Celery canvas
Resources
● SQLAlchemy core: http://bit.ly/10FdYZo
● Celery Canvas: http://bit.ly/MOjazT
● http://databrewery.org
– Bubbles: http://bit.ly/14hNsV0
– Pipeline: http://bit.ly/15RXvWa
● http://schoolofdata.org
Thank You!
Questions?
http://slideshare.net/mindsocket

[BuildWithAI] Introduction to Gemini.pdf
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 

Building data flows with Celery and SQLAlchemy

  • 1. Building data flows with Celery and SQLAlchemy PyCon Australia 2013 Roger Barnes @mindsocket roger@mindsocket.com.au http://slideshare.net/mindsocket
  • 2. Coming up ● Data warehousing – AKA data integration ● Processing data flows – SQLAlchemy – Celery ● Tying it all together
  • 3. About me ● 15 years doing all things software ● 11 years at a Business Intelligence vendor ● Currently contracting – This talk based on a real reporting system
  • 5. Why Data Warehousing? But we need reports that are ● Timely ● Unambiguous ● Accurate ● Complete ● … and don't impact production systems
  • 6. What is a Data Warehouse "… central repository of data which is created by integrating data from one or more disparate sources" - Wikipedia
  • 7.
  • 8.
  • 10. Python can help! ● Rapid prototyping ● Code re-use ● Existing libraries ● Decouple – data flow management – data processing – business logic
  • 11. Existing solutions ● Not a lot available in the Python space ● People roll their own ● Bubbles (Brewery 2) – Framework for Python 3 – "Focus on the process, not the data technology"
  • 12. Ways to move data around ● Flat files ● NoSQL data stores ● RDBMS
  • 13. SQLAlchemy is... Python SQL toolkit & Object Relational Mapper
  • 14. About SQLAlchemy ● Full featured ● Mature, robust, documented, maintained ● Flexible
  • 16. DB support ● SQLite ● PostgreSQL ● MySQL ● Oracle ● MS-SQL ● Firebird ● Sybase ● ...
  • 17. Python support CPython 2.5+ CPython 3+ Jython 2.5+ PyPy 1.5+
  • 19. SQLAlchemy Core ● Abstraction over Python's DBAPI ● SQL language via generative Python expressions
  • 20. SQLAlchemy Core ● Good for DB performance – bulk operations – complex queries – fine-tuning – connection/tx management
  • 21. Create a table
        from sqlalchemy import *
        engine = create_engine('sqlite:///:memory:')
        metadata = MetaData()
        vehicles_table = Table('vehicles', metadata,
            Column('model', String),
            Column('registration', String),
            Column('odometer', Integer),
            Column('last_service', Date),
        )
        vehicles_table.create(bind=engine)
  • 22. Insert data
        values = [
            {'model': 'Ford Festiva', 'registration': 'HAX00R', 'odometer': 3141},
            {'model': 'Lotus Elise', 'registration': 'DELEG8', 'odometer': 31415},
        ]
        rows = engine.execute(vehicles_table.insert(), list(values)).rowcount
  • 23. Query data
        query = select([vehicles_table]).where(vehicles_table.c.odometer < 100)
        results = engine.execute(query)
        for row in results:
            print(row)
  • 25. Example Processor Types ● Extract – Extract from CSV – Extract from DB table – Scrape web page ● Transform – Copy table from extract layer – Derive column – Join tables
  • 26. Abstract Processor
        class BaseProcessor(object):
            def dispatch(self):
                return self._run()
            def _run(self):
                return self.run()
            def run(self):
                raise NotImplementedError
  • 27. Abstract Database Processor
        import contextlib
        class DatabaseProcessor(BaseProcessor):
            db_class = None
            engine = None
            metadata = None
            @contextlib.contextmanager
            def _with_session(self):
                # Bind the engine and metadata for the duration of one run
                with self.db_class().get_engine() as engine:
                    self.engine = engine
                    self.metadata = MetaData(bind=engine)
                    yield
            def _run(self):
                with self._with_session():
                    return self.run()
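  • 28. CSV Extract Mixin
        import csv
        class CSVExtractMixin(object):
            input_file = None
            def _run(self):
                # Keep the CSV file open inside the database session
                # so that run() can read rows and write them to the DB
                with self._with_session(), open(self.input_file) as f:
                    self.reader = csv.DictReader(f)
                    return self.run()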
  • 29. A Concrete Extract
        class SalesHistoryExtract(CSVExtractMixin, DatabaseProcessor):
            target_table_name = 'SalesHistoryExtract'
            input_file = SALES_FILENAME
            def run(self):
                target_table = Table(self.target_table_name, self.metadata)
                # Column names come from the CSV header row
                columns = self.reader.fieldnames
                for column in columns:
                    if column:
                        # Column type is elided on the slide
                        target_table.append_column(Column(column, ...))
                target_table.create()
                insert = target_table.insert()
                new_record_count = self.engine.execute(
                    insert, list(self.reader)).rowcount
                return new_record_count
  • 30. An Abstract Derive Transform
        class AbstractDeriveTransform(DatabaseProcessor):
            table_name = None
            key_columns = None
            select_columns = None
            target_columns = None
            def process_row(self, row):
                raise NotImplementedError
            ... # Profit!
  • 31. A Concrete Transform
        from business_logic import derive_foo
        class DeriveFooTransform(AbstractDeriveTransform):
            table_name = 'SalesTransform'
            key_columns = ['txn_id']
            select_columns = ['location', 'username']
            target_columns = [Column('foo', FOO_TYPE)]
            def process_row(self, row):
                foo = derive_foo(row.location, row.username)
                return {'foo': foo}
  • 33. A Processor Task
        import celery
        class AbstractProcessorTask(celery.Task):
            abstract = True
            processor_class = None
            def run(self, *args, **kwargs):
                processor = self.processor_class(*args, **kwargs)
                return processor.dispatch()
        class DeriveFooTask(AbstractProcessorTask):
            processor_class = DeriveFooTransform
        DeriveFooTask().apply_async()  # Run it!
  • 34. Canvas: Designing Workflows ● Combines a series of tasks ● Groups run in parallel ● Chains run in series ● Can be combined in different ways
        >>> new_user_workflow = (create_user.s() | group(
        ...                      import_contacts.s(),
        ...                      send_welcome_email.s()))
        >>> new_user_workflow.delay(username='artv',
        ...                         first='Art',
        ...                         last='Vandelay',
        ...                         email='art@vandelay.com')
  • 35. Sample Data Processing Flow (diagram nodes) ● Extract sales ● Extract customers ● Extract products ● Copy sales to transform ● Copy customers to transform ● Copy products to transform ● Join tables ● Aggregate sales by customer ● Normalise currency ● Aggregate sales by region ● Customer data exception report
  • 36. Sample Data Processing Flow
        extract_flow = group((
            ExtractSalesTask().si(),
            ExtractCustTask().si(),
            ExtractProductTask().si()))
        transform_flow = group((
            CopySalesTask().si() | NormaliseCurrencyTask().si(),
            CopyCustTask().si(),
            CopyProductTask().si())) | JoinTask().si()
        load_flow = group((
            QualityTask().si(),
            AggregateTask().si('cust_id'),
            AggregateTask().si('region_id')))
        all_flow = extract_flow | transform_flow | load_flow
  • 39. Turning it up to 11 ● A requires/depends structure ● Incremental data loads ● Parameterised flows ● Tracking flow history ● Hooking into other libraries – NLTK – SciPy/NumPy – ...
  • 40. Summary ● Intro to data warehousing ● Process data with SQLAlchemy ● Task dependencies with Celery canvas
  • 41. Resources ● SQLAlchemy core: http://bit.ly/10FdYZo ● Celery Canvas: http://bit.ly/MOjazT ● http://databrewery.org – Bubbles: http://bit.ly/14hNsV0 – Pipeline: http://bit.ly/15RXvWa ● http://schoolofdata.org
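
The processor pattern from the slides (dispatch() calling _run() calling run(), an extract step feeding a derive transform) can be tried end to end with only the standard library. The following is a minimal sketch under stated assumptions, not the code from the talk: sqlite3 stands in for the SQLAlchemy engine, a plain sequential run_flow() helper stands in for a Celery chain, and the table name, the sample CSV, the "region" column and its derivation rule are all hypothetical.

```python
import csv
import io
import sqlite3


class BaseProcessor:
    """Same shape as the slides' BaseProcessor: dispatch() -> _run() -> run()."""

    def dispatch(self):
        return self._run()

    def _run(self):
        return self.run()

    def run(self):
        raise NotImplementedError


class DatabaseProcessor(BaseProcessor):
    """Holds a DB connection; sqlite3 stands in for the SQLAlchemy engine."""

    def __init__(self, conn):
        self.conn = conn


class CSVExtract(DatabaseProcessor):
    """Extract step: copy every CSV row into a newly created table."""

    target_table_name = "sales_extract"  # hypothetical table name

    def __init__(self, conn, csv_file):
        super().__init__(conn)
        self.csv_file = csv_file

    def run(self):
        reader = csv.DictReader(self.csv_file)
        columns = [c for c in reader.fieldnames if c]
        # Header names are interpolated into SQL; acceptable only for trusted input.
        col_list = ", ".join('"%s"' % c for c in columns)
        placeholders = ", ".join("?" for _ in columns)
        self.conn.execute(
            'CREATE TABLE "%s" (%s)' % (self.target_table_name, col_list))
        rows = [[row[c] for c in columns] for row in reader]
        self.conn.executemany(
            'INSERT INTO "%s" (%s) VALUES (%s)'
            % (self.target_table_name, col_list, placeholders), rows)
        return len(rows)


class DeriveRegionTransform(DatabaseProcessor):
    """Transform step: derive one new column from existing columns."""

    table_name = "sales_extract"

    def process_row(self, row):
        # Hypothetical business rule: region is the prefix of "location".
        return row["location"].split("/")[0]

    def run(self):
        self.conn.execute(
            'ALTER TABLE "%s" ADD COLUMN "region" TEXT' % self.table_name)
        rows = self.conn.execute(
            'SELECT txn_id, location FROM "%s"' % self.table_name).fetchall()
        for row in rows:
            self.conn.execute(
                'UPDATE "%s" SET region = ? WHERE txn_id = ?' % self.table_name,
                (self.process_row(row), row["txn_id"]))
        return len(rows)


def run_flow(processors):
    """Run processors in series, like a chain; returns each step's row count."""
    return [p.dispatch() for p in processors]


SALES_CSV = "txn_id,location,amount\n1,AU/Sydney,10.5\n2,NZ/Auckland,7\n3,AU/Hobart,3\n"

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows row["column"] access
counts = run_flow([
    CSVExtract(conn, io.StringIO(SALES_CSV)),
    DeriveRegionTransform(conn),
])
regions = [r["region"] for r in conn.execute(
    'SELECT region FROM "sales_extract" ORDER BY txn_id')]
print(counts, regions)
```

In a real deployment each dispatch() call would sit inside a Celery task, as on the "A Processor Task" slide, so that groups of extracts could run in parallel before the transform steps.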