Using Python for Data Logistics
Ken Farmer
Data Science and Business Analytics Meetup
http://www.meetup.com/Data-Science-Business-Analytics/events/120727322/
2013-06-25
About Data Logistics
My definition: Management of Data in Motion
Which includes: Extract, Transform, Validation,
Change Detection, Loading,
Summarizing, Aggregation
(and some other stuff I don't care about*)
In Context: A part of every big data analytical project
Primary objective: Make analysis efficient & effective
* SOA, Enterprise Service Buses (ESB), Enterprise Application Integration (EAI), etc. But since these don't drive big data
analytics, we're not going to talk about them.
Data Logistics Characteristics
- there will be many flows
Note:
● There may be many sources of
any type of data
● There will be many different
source constraints – operating
systems, networks, etc
● There will be upstream
changes that will not be
communicated – you will just
see them in the data
Typical Large Security Data Warehouse
Data Logistics Characteristics
Side Note - this is why there are many flows
1 Feed: a year of data mining will produce almost nothing
- or -
11 Feeds: lots of low-hanging fruit
So, which will produce the best analysis?
Data Logistics Characteristics
- and each flow can be complex
Parts not shown:
● File Movement
● Logging & Auditing & Alerting
● Process Monitoring
● Scheduling
Considerations not shown:
● Recovery
● Performance with High Volumes
● Management
Data Logistics Characteristics
- and there's no simple alternative
The Great Idea → The Sad Reality
No delta processing:
● Explodes data volumes
● Reduces functionality
No lookups:
● Explodes data volumes
● Reduces reporting query performance
No dimensions:
● Explodes data volumes
● Reduces reporting functionality
● Reduces reporting query performance
No validation:
● Increases maintenance costs
● Increases reporting errors
No standardization:
● Increases reporting costs
● Increases reporting errors
● Increases documentation costs
No management features:
● Decreases reliability
● Increases maintenance costs
Data Logistics Nightmares
So, what's the worst that can happen anyway?
Nightmare #1 – Data Quality
[Chart: ACME Widget Production by Month – Widgets (0–100) by Month (Jan–Dec)]
● Credibility
● Value
● Productivity
Nightmare #2 - Reliability
● Extended outages
● Frequent outages
● Missed SLAs
● Distractions from
new development
Nightmare #3 - Performance
● Missed SLAs
● Concurrency issues
● Lack of productivity
Nightmare #4 - Disorganization
● Productivity
● Outages
● Reliability
● Communication
● Learning Curves
Data Logistics
- Most Common Nightmare Root Causes
How the heck did we get here?
Root Cause #1 - magical thinking
There are no fairies
likewise there are no silver bullets
and your CRUD experience won't help you
Root Cause #2 - non-linear scalability
Gorillas don't scale gracefully
Neither will your feeds
The problem isn't performance –
it's maintenance: dependencies,
cascading errors, and institutional
knowledge.
Root Cause #3
– too much consistency or adaptability
These two forces are at odds, and you need a balance:
You have to have consistency
to help with learning curves and organization.
You have to have adaptability
to get access to all the data sources you'll want.
ETL to the Rescue
- data logistics from the corporate world!
● The corporate world
started working on this 20
years ago
● It's still a hard problem, but
it's less of a nightmare
● Starting to make inroads
to Data Science/Big Data
projects
ETL
- Batch pipelines, not messages or transactions
Data is batched
Feeds are organized like assembly lines or pipelines
Each feed is broken into different programs / steps
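The assembly-line organization above is just a chain of small steps; a minimal sketch (the pipe-delimited layout and step bodies are illustrative, not from the talk):

```python
# Minimal sketch of a feed as a pipeline of discrete, restartable steps.
# The pipe-delimited file format here is hypothetical.

def extract(raw_path):
    """Read raw source rows from a pipe-delimited file."""
    with open(raw_path) as infile:
        return [line.rstrip('\n').split('|') for line in infile]

def transform(rows):
    """Normalize every field: strip whitespace, lowercase."""
    return [[field.strip().lower() for field in row] for row in rows]

def load(rows, out_path):
    """Write the transformed batch for the loader to pick up."""
    with open(out_path, 'w') as outfile:
        for row in rows:
            outfile.write('|'.join(row) + '\n')

def run_feed(raw_path, out_path):
    """Run one batch through the whole pipeline."""
    load(transform(extract(raw_path)), out_path)
```

In practice each step would be a separate program with its own file-based handoff, so a failed step can be rerun in isolation.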
ETL
- Most tools use diagram-driven-development
Which seems great to almost all management,
and pretty cool for a while to some developers
ETL
- Most tools use diagram-driven-development
But then someone always has to overdo it,
and we are reminded that tools are seldom solutions
ETL
- So all is not wonderful
ETL – the last bastion of Computer-Aided Software Engineering (CASE) tools
Feature                       ETL Tool   Custom Code
Unit test harnesses           no         yes
TDD                           no         yes
Version control flexibility   no         yes
Static code analysis          no         yes
Deployment tool flexibility   no         yes
Language flexibility          no         yes
Continuous Integration        no         yes
Virtual environments          no         yes
Diagrams                      yes        yes
So, why don't we use
metadata-driven or
code-generation tools for
everything?
Why not use tools like
FrontPage for all websites?
ETL
- So, Buy (& Customize) vs Build
The ETL Tool Paradox:
● Programmers don't want to work on it
● But the tool can only handle 80% of the problem without programming
Where the Buy option is a great fit:
● 100+ simple feeds
● Lack of programmer culture
● Standard already exists
Most typically – the “corporate data
warehouse” - a single database for an entire
company (usually a bad idea anyway)
Python
- a perfect fit for data logistics
● You can use the same language for
ETL, systems management and
data analysis
● The language is high-level and
maintenance-oriented
● It's easy for users to understand the
code
● It allows you to use all the programming tools
● It's free
● It's a language for enthusiasts
● And it's fun
- http://xkcd.com/353/
Python
- Build List
For each Feed Application
● Program: Extract
● Program: Transform
● Config: File-Image Delta
● Config: Loader
● Config: File Mover
Services, Libraries and Utilities
● Service: metadata, auditing & logging,
dashboard
● Service: data movement
● Library: data validation
● Utility: file-image delta
● Utility: publisher
● Utility: loader
Python
- Typical Module List
Third-Party
● appdirs
● database drivers
● sqlalchemy
● pyyaml
● validictory
● requests
● envoy
● pytest
● virtualenv
● virtualenvwrapper
Standard Library
● os
● csv
● logging
● unittest
● collections
● argparse
● functools
Environmentals
● Version control – git, svn, etc
● Deployment – Fabric, Chef, etc
● Static analysis – pylint
● Testing – pytest, tox, buildbot, etc
● Documentation - sphinx
Bottom line: a mostly vanilla and very free environment will get you very far
Python ETL Components
- Scheduling
● Typically cron
● A daemon if you want more than one run per minute
● Should have suppression capability beyond commenting
out the cron job
● Event-driven > temporally-driven
● Needs a check to prevent more than one instance running
● Level of effort: very little
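The single-instance check is commonly a lock file; a minimal POSIX-only sketch using flock (the lock path and function name are my own, not from the talk):

```python
import fcntl

def acquire_single_instance_lock(lock_path):
    """Return a locked file handle, or None if another instance holds it.

    flock() locks are advisory and are released automatically when the
    process exits, so a crashed run never leaves a stale lock behind.
    """
    handle = open(lock_path, 'w')
    try:
        # LOCK_NB makes the attempt non-blocking: fail fast instead of queuing.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle
```

At startup the feed calls this once, keeps the returned handle for its lifetime, and exits immediately if it gets None back.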
Python ETL Components
- Audit System
● Analyze performance & rule issues over time
● Centralize alerting
● Level of effort: weeks
Python ETL Components
- File Transporter
File movement is extremely failure-prone:
- out of space errors
- permission errors
- credential expiration errors
- network errors
So, use a process external to feed processing to move files – and
simplify their recovery.
Note this is not the same as data mirroring:
- moves files from source to destination
- renames file during movement
- moves/deletes/renames source after move
- So, you may need to write this yourself – rsync is not ideal
Level of Effort: pretty simple, 1-3 weeks to write reusable utility
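The rename-during-movement behavior, which keeps partial files invisible to consumers, can be sketched like this (the '.tmp' and '.delivered' suffixes are my own convention, not from the talk):

```python
import os
import shutil

def transport_file(src, dest_dir):
    """Copy src into dest_dir under a temporary name, then rename it into
    place, so downstream consumers never see a partially-written file.
    The source is renamed afterward to mark it as delivered."""
    base = os.path.basename(src)
    tmp_dest = os.path.join(dest_dir, base + '.tmp')
    final_dest = os.path.join(dest_dir, base)
    shutil.copy2(src, tmp_dest)         # may fail: space, permissions, network
    os.rename(tmp_dest, final_dest)     # atomic within a single filesystem
    os.rename(src, src + '.delivered')  # mark the source as moved
    return final_dest
```

Recovery is then simple: leftover '.tmp' files are garbage to delete, and any source without a '.delivered' twin still needs moving.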
Python ETL Components
- Load Utility
Functionality
● Validates data
● Continuously loads
● Moves files as necessary
● May run delta operation
● Handles recoveries
● Writes to audit tables
Bottom line: pretty simple, 1-3 weeks to write reusable utility
Python ETL Components
- Publish Utility
Functionality
● Extracts all data since the last time it ran
● Can handle max rows
● Moves files as necessary
● Handles recoveries
● Writes to audit tables
● Writes all data to a compressed tarball
Bottom line: pretty simple, 1-3 weeks to write reusable utility
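"Extracts all data since the last time it ran" is typically a high-water mark kept in a small state file; a minimal sketch (the row shape and state-file format are my own assumptions):

```python
import json
import os

def publish_since_last_run(all_rows, state_path):
    """Return rows with an id above the last published high-water mark,
    then advance the mark in a small JSON state file."""
    last_id = 0
    if os.path.exists(state_path):
        with open(state_path) as statefile:
            last_id = json.load(statefile)['last_id']
    new_rows = [row for row in all_rows if row['id'] > last_id]
    if new_rows:
        # Only advance the mark after a successful extract, so a failed
        # run is simply retried from the same point.
        with open(state_path, 'w') as statefile:
            json.dump({'last_id': max(row['id'] for row in new_rows)}, statefile)
    return new_rows
```

A "max rows" cap would just truncate new_rows before advancing the mark.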
Python ETL Components
- Delta Utility
Functionality
● Like diff – but for structured files
● Distinguishes between key fields vs non-key fields
● Can be configured to skip comparisons of certain fields
● Can perform minor transformations
● May be built into Load utility, or a transformation library
Bottom line: pretty simple, 1-3 weeks to write reusable utility
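A keyed delta along these lines might be sketched as follows (the row-as-dict representation is my own assumption):

```python
def file_delta(old_rows, new_rows, key_fields, ignore_fields=()):
    """Compare two row sets on key_fields, like diff for structured files.

    Returns (inserts, deletes, changes): rows only in new, rows only in
    old, and rows whose non-key, non-ignored fields differ.
    """
    def key(row):
        return tuple(row[f] for f in key_fields)

    def comparable(row):
        # Drop key fields and fields configured to be skipped.
        return {f: v for f, v in row.items()
                if f not in key_fields and f not in ignore_fields}

    old_by_key = {key(r): r for r in old_rows}
    new_by_key = {key(r): r for r in new_rows}

    inserts = [r for k, r in new_by_key.items() if k not in old_by_key]
    deletes = [r for k, r in old_by_key.items() if k not in new_by_key]
    changes = [r for k, r in new_by_key.items()
               if k in old_by_key and comparable(r) != comparable(old_by_key[k])]
    return inserts, deletes, changes
```

The "minor transformations" mentioned above would slot into comparable(), e.g. lowercasing a field before comparison.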
Python Program
- Simple Transform
def transform_gender(input_gender):
    """ Transforms a gender code to the standard format.
    :param input_gender: in either VARCOPS or SITHOUSE formats
    :returns: standard gender code
    """
    if input_gender.lower() in ['m', 'male', '1', 'transgender_to_male']:
        output_gender = 'male'
    elif input_gender.lower() in ['f', 'female', '2', 'transgender_to_female']:
        output_gender = 'female'
    elif input_gender.lower() in ['transsexual', 'intersex']:
        output_gender = 'transgender'
    else:
        output_gender = 'unknown'
    return output_gender
Observation:
Simple transforms and rules can easily
be read by non-programmers.
Observation:
Transforms can be kept in a module
and easily documented.
Observation:
Even simple transforms can have a lot
of subtleties, and are likely to be
referenced or changed by users.
Python Program
- Complex Transformation
def explode_ip_range_list(ip_range_list):
    """ Transforms an IP range list to a list of individual IP addresses.
    :param ip_range_list: comma- or space-delimited IP ranges or IPs.
        Ranges are separated with a dash, or use CIDR notation.
        Individual IP addresses can be represented with a dotted quad,
        integer (unsigned), hex or CIDR notation.
        ex: "10.10/16, 192.168.1.0 - 192.168.1.255, 192.168.2.3,
             192.168.3.5 - 192.168.5.10, 192.168.5, 0.0.0.0/1"
    """
    output_ip_list = []
    for ip in whitelist.ip_expansion(ip_range_list):
        output_ip_list.append(ip)
    return output_ip_list
Ok, this is a cheat – the complexity is in the library
Observation:
Complex transforms that would be a
nightmare in a tool can be easy in
Python – especially, as in this case,
when there's a great module to use.
Observation:
Unit-testing frameworks are incredibly
valuable for complex transforms.
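As an illustration, the transform_gender rules from the earlier slide pin down nicely under a few tests (bare asserts here; a pytest module would collect the same test functions automatically):

```python
# transform_gender is condensed from the earlier slide; the tests pin
# down its subtleties (case-insensitivity, the unknown fallback).

def transform_gender(input_gender):
    code = input_gender.lower()
    if code in ('m', 'male', '1', 'transgender_to_male'):
        return 'male'
    elif code in ('f', 'female', '2', 'transgender_to_female'):
        return 'female'
    elif code in ('transsexual', 'intersex'):
        return 'transgender'
    return 'unknown'

def test_codes_are_case_insensitive():
    assert transform_gender('M') == 'male'
    assert transform_gender('Female') == 'female'

def test_unrecognized_codes_map_to_unknown():
    assert transform_gender('x') == 'unknown'
    assert transform_gender('') == 'unknown'

test_codes_are_case_insensitive()
test_unrecognized_codes_map_to_unknown()
```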
Using Python Data Flow Example
The Bottom Line
Thank You – Any Questions?
The Good:
● Python for attracting & retaining developers
● Python for handling complexity
● Python for costs
● Python for adaptability
● Python for modern development environment
The Not Good:
● Lack of good practices adds risk
● Lack of a rigid framework requires discipline
The Tangential:
● Hadoop – who said anything about hadoop?
Contenu connexe

Similaire à Python for Data Logistics

Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Brian Brazil
 
XPDays Ukraine: Legacy
XPDays Ukraine: LegacyXPDays Ukraine: Legacy
XPDays Ukraine: LegacyVictor_Cr
 
Are we there Yet?? (The long journey of Migrating from close source to opens...
Are we there Yet?? (The long journey of Migrating from close source to opens...Are we there Yet?? (The long journey of Migrating from close source to opens...
Are we there Yet?? (The long journey of Migrating from close source to opens...Marco Tusa
 
Advanced web application architecture - Talk
Advanced web application architecture - TalkAdvanced web application architecture - Talk
Advanced web application architecture - TalkMatthias Noback
 
Super Sizing Youtube with Python
Super Sizing Youtube with PythonSuper Sizing Youtube with Python
Super Sizing Youtube with Pythondidip
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the Worldjhugg
 
Dev and Ops Collaboration and Awareness at Etsy and Flickr
Dev and Ops Collaboration and Awareness at Etsy and FlickrDev and Ops Collaboration and Awareness at Etsy and Flickr
Dev and Ops Collaboration and Awareness at Etsy and FlickrJohn Allspaw
 
The working architecture of NodeJS applications, Виктор Турский
The working architecture of NodeJS applications, Виктор ТурскийThe working architecture of NodeJS applications, Виктор Турский
The working architecture of NodeJS applications, Виктор ТурскийSigma Software
 
The working architecture of node js applications open tech week javascript ...
The working architecture of node js applications   open tech week javascript ...The working architecture of node js applications   open tech week javascript ...
The working architecture of node js applications open tech week javascript ...Viktor Turskyi
 
I pushed in production :). Have a nice weekend
I pushed in production :). Have a nice weekendI pushed in production :). Have a nice weekend
I pushed in production :). Have a nice weekendNicolas Carlier
 
Continuous integration with business intelligence and analytics
Continuous integration with business intelligence and analyticsContinuous integration with business intelligence and analytics
Continuous integration with business intelligence and analyticsAlex Meadows
 
Salesforce Flows Architecture Best Practices
Salesforce Flows Architecture Best PracticesSalesforce Flows Architecture Best Practices
Salesforce Flows Architecture Best Practicespanayaofficial
 
How to Manage the Risk of your Polyglot Environments
How to Manage the Risk of your Polyglot EnvironmentsHow to Manage the Risk of your Polyglot Environments
How to Manage the Risk of your Polyglot EnvironmentsDevOps.com
 
#OSSPARIS19 - How to improve database observability - CHARLES JUDITH, Criteo
#OSSPARIS19 - How to improve database observability - CHARLES JUDITH, Criteo#OSSPARIS19 - How to improve database observability - CHARLES JUDITH, Criteo
#OSSPARIS19 - How to improve database observability - CHARLES JUDITH, CriteoParis Open Source Summit
 
Gatling - Bordeaux JUG
Gatling - Bordeaux JUGGatling - Bordeaux JUG
Gatling - Bordeaux JUGslandelle
 
Not my problem - Delegating responsibility to infrastructure
Not my problem - Delegating responsibility to infrastructureNot my problem - Delegating responsibility to infrastructure
Not my problem - Delegating responsibility to infrastructureYshay Yaacobi
 
Challenges and Best Practices of Database Continuous Delivery
Challenges and Best Practices of Database Continuous DeliveryChallenges and Best Practices of Database Continuous Delivery
Challenges and Best Practices of Database Continuous DeliveryDBmaestro - Database DevOps
 

Similaire à Python for Data Logistics (20)

Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
 
Training - What is Performance ?
Training  - What is Performance ?Training  - What is Performance ?
Training - What is Performance ?
 
XPDays Ukraine: Legacy
XPDays Ukraine: LegacyXPDays Ukraine: Legacy
XPDays Ukraine: Legacy
 
Are we there Yet?? (The long journey of Migrating from close source to opens...
Are we there Yet?? (The long journey of Migrating from close source to opens...Are we there Yet?? (The long journey of Migrating from close source to opens...
Are we there Yet?? (The long journey of Migrating from close source to opens...
 
Advanced web application architecture - Talk
Advanced web application architecture - TalkAdvanced web application architecture - Talk
Advanced web application architecture - Talk
 
Os Solomon
Os SolomonOs Solomon
Os Solomon
 
Super Sizing Youtube with Python
Super Sizing Youtube with PythonSuper Sizing Youtube with Python
Super Sizing Youtube with Python
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the World
 
Dev and Ops Collaboration and Awareness at Etsy and Flickr
Dev and Ops Collaboration and Awareness at Etsy and FlickrDev and Ops Collaboration and Awareness at Etsy and Flickr
Dev and Ops Collaboration and Awareness at Etsy and Flickr
 
The working architecture of NodeJS applications, Виктор Турский
The working architecture of NodeJS applications, Виктор ТурскийThe working architecture of NodeJS applications, Виктор Турский
The working architecture of NodeJS applications, Виктор Турский
 
The working architecture of node js applications open tech week javascript ...
The working architecture of node js applications   open tech week javascript ...The working architecture of node js applications   open tech week javascript ...
The working architecture of node js applications open tech week javascript ...
 
I pushed in production :). Have a nice weekend
I pushed in production :). Have a nice weekendI pushed in production :). Have a nice weekend
I pushed in production :). Have a nice weekend
 
Continuous integration with business intelligence and analytics
Continuous integration with business intelligence and analyticsContinuous integration with business intelligence and analytics
Continuous integration with business intelligence and analytics
 
sat_presentation
sat_presentationsat_presentation
sat_presentation
 
Salesforce Flows Architecture Best Practices
Salesforce Flows Architecture Best PracticesSalesforce Flows Architecture Best Practices
Salesforce Flows Architecture Best Practices
 
How to Manage the Risk of your Polyglot Environments
How to Manage the Risk of your Polyglot EnvironmentsHow to Manage the Risk of your Polyglot Environments
How to Manage the Risk of your Polyglot Environments
 
#OSSPARIS19 - How to improve database observability - CHARLES JUDITH, Criteo
#OSSPARIS19 - How to improve database observability - CHARLES JUDITH, Criteo#OSSPARIS19 - How to improve database observability - CHARLES JUDITH, Criteo
#OSSPARIS19 - How to improve database observability - CHARLES JUDITH, Criteo
 
Gatling - Bordeaux JUG
Gatling - Bordeaux JUGGatling - Bordeaux JUG
Gatling - Bordeaux JUG
 
Not my problem - Delegating responsibility to infrastructure
Not my problem - Delegating responsibility to infrastructureNot my problem - Delegating responsibility to infrastructure
Not my problem - Delegating responsibility to infrastructure
 
Challenges and Best Practices of Database Continuous Delivery
Challenges and Best Practices of Database Continuous DeliveryChallenges and Best Practices of Database Continuous Delivery
Challenges and Best Practices of Database Continuous Delivery
 

Dernier

Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 

Dernier (20)

Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 

Python for Data Logistics

  • 1. Using Python for Data Logistics Ken Farmer Data Science and Business Analytics Meetup http://www.meetup.com/Data-Science-Business-Analytics/events/120727322/ 2013-06-25
  • 2. About Data Logistics My definition: Management of Data in Motion Which includes: Extract, Transform, Validation, Change Detection, Loading, Summarizing, Aggration (and some other stuff I don't care about*) In Context: A part of every big data analytical project Primary objective: Make analysis efficient & effective * SOA, Enterprise Service Buses (ESB), Enterprise Application Integration (EAI) etc. But since this these don't drive big data analytics we're not going to talk about them.
  • 3. Data Logistics Characteristics - there will be many flows Note: ● There may be many sources of any type of data ● There will be many different source constraints – operating systems, networks, etc ● There will be upstream changes that will not be communicated – you will just see them in the data Typical Large Security Data Warehouse
  • 4. Data Logistics Characteristics Side Note - this is why there are many flows Lots of low-hanging fruitA year of data mining will produce almost nothing - or - 1 Feed 11 Feeds So, which will produce the best analysis?
  • 5. Data Logistics Characteristics - and each flow can be complex Parts not shown: ● File Movement ● Logging & Auditing & Alerting ● Process Monitoring ● Scheduling Considerations not shown: ● Recovery ● Performance with High Volumes ● Management
  • 6. Data Logistics Characteristics - and there's no simple alternative The Great Idea The Sad Reality No delta processing ● Explodes data volumes ● Reduces functionality No lookups ● Explodes data volumes ● Reduces reporting query performance No dimensions ● Explodes data volumes ● Reduces reporting functionality ● Reduces reporting query performance No validation ● Increases maintenance costs ● Increases reporting errors No standardization ● Increases reporting costs ● Increases reporting errors ● Increased documentation costs No management features ● Decreases reliability ● Increases maintenance costs
  • 7. Data Logistics Nightmares So, what's the worst that can happen anyway?
  • 8. Nightmare #1 – Data Quality Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 0 10 20 30 40 50 60 70 80 90 100 ACME Widget Production by Month Month Widgets ● Credibility ● Value ● Productivity
  • 9. Nightmare #2 - Reliability ● Extended outages ● Frequent outages ● Missed SLAs ● Distractions from new development
  • 10. Nightmare #3 - Performance ● Missed SLAs ● Concurrency issues ● Lack of productivity
  • 11. Nightmare #4 - Disorganization ● Productivity ● Outages ● Reliability ● Communication ● Learning Curves
  • 12. Data Logistics - Most Common Nightmare Root Causes How the heck did we get here?
  • 13. Root Cause #1 - magical thinking There are no fairies likewise there are no silver bullets and your CRUD experience won't help you
  • 14. Root Cause #2 - non-linear scalability Gorillas don't scale gracefully Neither will your feeds he problem isn't performance - It's maintenance. Dependencies, cascading errors, and institutional Knowledge.
  • 15. Root Cause #3 – too much consistency or adaptability These two conflicting forces are at odds You need a balance You have to have consistency To help with learning curves And organization. You have to have adaptability To get access to all the data Sources you'll want.
  • 16. ETL to the Rescue - data logistics from the corporate world! ● The corporate world started working on this 20 years ago ● It's still a hard problem, but it's less of a nightmare ● Starting to make inroads to Data Science/Big Data projects
  • 17. ETL - Batch Pipelines not messages or transactions Data is batched Feeds are organized like assembly or pipe lines Each feed is broken into different programs / steps
  • 18. ETL - Most tools use diagram-driven-development Which seems great to almost all management And seems pretty cool for a while to some developers
  • 19. ETL - Most tools use diagram-driven-development But then someone always has to over do it And we are reminded that tools are seldom solutions
  • 20. ETL - So all is not wonderful ETL – the last bastion of Computer-Aided Software Engineering (CASE) tools Feature ETL Tool Custom Code Unit test harnesses no yes TDD no yes Version control flexibility no yes Static code analysis no yes Deployment tool flexibility no yes Language flexibility no yes Continuous Integration no yes Virtual environments no yes Diagrams yes yes So, why don't we use metadata-driven or code- generation tools for everything? Why not use tools like Frontpage for all websites?
  • 21. ETL - So, Buy (& Customize) vs Build The ETL Tool Paradox: ● Programmers don't want to work on it ● But can only handle 80% of the problem without programming Where the Buy option is a great fit: ● 100+ simple feeds ● Lack of programmer culture ● Standard already exists Most typically – the “corporate data warehouse” - a single database for an entire company (usually a bad idea anyway)
  • 22. Python - a perfect fit for data logistics ● You can use the same language for ETL, systems management and data analysis ● The language is high-level and maintenance-oriented ● It's easy for users to understand the code ● It allows you to use all the programming tools ● It's free ● It's a language for enthusiasts ● And it's fun - http://xkcd.com/353/
  • 23. Python - Build List For each Feed Application ● Program: Extract ● Program: Transform ● Config: File-Image Delta ● Config: Loader ● Config: File Mover Services, Libraries and Utilities ● Service: metadata, auditing & logging, dashboard ● Service: data movement ● Library: data validation ● Utility: file-image delta ● Utility: publisher ● Utility: loader
  • 24. Python - Typical Module List Third-Party ● appdirs ● database drivers ● sqlalchemy ● pyyaml ● validictory ● requests ● envoy ● pytest ● virtualenv ● virtualenvwrapper Standard Library ● os ● csv ● logging ● unittest ● collections ● argparse ● functools Environmentals ● Version control – git, svn, etc ● Deployment – Fabric, Chef, etc ● Static analysis – pylint ● Testing – pytest, tox, buildbot, etc ● Documentation - sphinx Bottom line: a mostly vanilla and very free environment will get you very far
  • 25. Python ETL Components - Scheduling ● Typically cron ● Daemon if you want more than one run per minute ● Should have suppression capability beyond commenting out the cron job ● Event-driven > temporally-driven ● Needs a check for more than one instance running ● Level of effort: very little
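The "more than one instance running" check can be done with an OS-level file lock, which has the nice property that a crashed run never leaves a stale lock behind. A minimal POSIX sketch, with a hypothetical lock path:

```python
import fcntl
import sys

def acquire_single_instance_lock(lock_path="/tmp/myfeed.lock"):
    """Exit if another instance of this feed is already running.

    Keep a reference to the returned file object -- the lock is
    released automatically when the process exits.
    """
    lock_file = open(lock_path, "w")
    try:
        # Non-blocking exclusive lock: fails fast if another process holds it.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        sys.exit("another instance is already running")
    return lock_file
```

Called at the top of a cron-launched job, this makes overlapping runs a non-event rather than a data-corruption incident.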
  • 26. Python ETL Components - Audit System ● Analyze performance & rule issues over time ● Centralize alerting ● Level of effort: weeks
  • 27. Python ETL Components - File Transporter File movement is extremely failure-prone: - out of space errors - permission errors - credential expiration errors - network errors So, use a process external to feed processing to move files – and simplify their recovery. Note this is not the same as data mirroring: - moves files from source to destination - renames file during movement - moves/deletes/renames source after move - So, you may need to write this yourself – rsync is not ideal Level of Effort: pretty simple, 1-3 weeks to write reusable utility
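The move/rename pattern described above can be sketched as: copy under a temporary name, rename atomically at the destination (so downstream processes never see partial files), then archive the source. The paths and the ".tmp" suffix are illustrative assumptions, not the real utility's conventions.

```python
import os
import shutil

def move_file(src, dest_dir, archive_dir):
    """Move src to dest_dir; archive the original after a successful copy."""
    base = os.path.basename(src)
    tmp_dest = os.path.join(dest_dir, base + ".tmp")
    final_dest = os.path.join(dest_dir, base)
    shutil.copy2(src, tmp_dest)        # partial files only ever exist as *.tmp
    os.rename(tmp_dest, final_dest)    # atomic on the same filesystem
    os.rename(src, os.path.join(archive_dir, base))  # archive the source
    return final_dest
```

Recovery is then simple: leftover *.tmp files at the destination can be deleted and re-copied, and anything still in the source directory simply hasn't been moved yet.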
  • 28. Python ETL Components - Load Utility Functionality ● Validates data ● Continuously loads ● Moves files as necessary ● May run delta operation ● Handles recoveries ● Writes to audit tables Bottom line: pretty simple, 1-3 weeks to write reusable utility
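A hedged sketch of the loader's core loop: validate each record, load the good ones, count rejects, and write one audit record per file. The required-field "schema" and the callback names are hypothetical stand-ins for the real utility's validation and audit machinery.

```python
import csv

REQUIRED_FIELDS = ("event_time", "src_ip", "action")   # illustrative schema

def load_feed(csv_path, insert_row, write_audit):
    """Validate and load one CSV file, then audit the result."""
    loaded = rejected = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if all(row.get(field) for field in REQUIRED_FIELDS):
                insert_row(row)        # e.g. a db insert or bulk-load buffer
                loaded += 1
            else:
                rejected += 1          # a real utility would quarantine these
    write_audit({"file": csv_path, "loaded": loaded, "rejected": rejected})
    return loaded, rejected
```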
  • 29. Python ETL Components - Publish Utility Functionality ● Extracts all data since the last time it ran ● Can handle max rows ● Moves files as necessary ● Handles recoveries ● Writes to audit tables ● Writes all data to a compressed tarball Bottom line: pretty simple, 1-3 weeks to write reusable utility
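"Extracts all data since the last time it ran" usually means a bookmark: store the high-water mark from the previous extract and select past it. A sketch against SQLite with hypothetical table and column names (the real utility also handles tarballs, file moves, recovery, and audit rows):

```python
import csv
import gzip

def publish(conn, bookmark, out_path, max_rows=100000):
    """Extract rows newer than bookmark to a compressed file.

    Returns the new bookmark and the row count.
    """
    cur = conn.execute(
        "SELECT id, event_time, payload FROM events "
        "WHERE event_time > ? ORDER BY event_time LIMIT ?",
        (bookmark, max_rows))
    rows = cur.fetchall()
    with gzip.open(out_path, "wt", newline="") as f:
        csv.writer(f).writerows(rows)
    # Persist this for the next run; unchanged if nothing was extracted.
    new_bookmark = rows[-1][1] if rows else bookmark
    return new_bookmark, len(rows)
```

The max-rows cap plus the returned bookmark is what makes recovery easy: a failed run just re-publishes from the last committed bookmark.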
  • 30. Python ETL Components - Delta Utility Functionality ● Like diff – but for structured files ● Distinguishes between key fields vs non-key fields ● Can be configured to skip comparisons of certain fields ● Can perform minor transformations ● May be built into Load utility, or a transformation library Bottom line: pretty simple, 1-3 weeks to write reusable utility
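The key-field vs non-key-field distinction above is the heart of a delta utility. A minimal sketch over in-memory dict rows (a real utility would stream sorted files; the field names in the example are illustrative):

```python
def delta(old_rows, new_rows, key_fields, ignore_fields=()):
    """Classify new_rows against old_rows as inserts, deletes, or changes."""
    def key(row):
        return tuple(row[f] for f in key_fields)

    def body(row):
        # Only non-key, non-ignored fields participate in the comparison.
        return {f: v for f, v in row.items()
                if f not in key_fields and f not in ignore_fields}

    old = {key(r): r for r in old_rows}
    new = {key(r): r for r in new_rows}
    inserts = [new[k] for k in new.keys() - old.keys()]
    deletes = [old[k] for k in old.keys() - new.keys()]
    changes = [new[k] for k in new.keys() & old.keys()
               if body(new[k]) != body(old[k])]
    return inserts, deletes, changes
```

The ignore_fields argument is how "skip comparisons of certain fields" falls out almost for free, e.g. for load timestamps that change on every extract.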
  • 31. Python Program - Simple Transform

def transform_gender(input_gender):
    """Transforms a gender code to the standard format.

    :param input_gender: in either VARCOPS or SITHOUSE formats
    :returns: standard gender code
    """
    if input_gender.lower() in ['m', 'male', '1', 'transgender_to_male']:
        output_gender = 'male'
    elif input_gender.lower() in ['f', 'female', '2', 'transgender_to_female']:
        output_gender = 'female'
    elif input_gender.lower() in ['transsexual', 'intersex']:
        output_gender = 'transgender'
    else:
        output_gender = 'unknown'
    return output_gender

Observation: simple transforms and rules can be easily read by non-programmers.
Observation: transforms can be kept in a module and easily documented.
Observation: even simple transforms can have a lot of subtleties, and are likely to be referenced or changed by users.
  • 32. Python Program - Complex Transformation

def explode_ip_range_list(ip_range_list):
    """Transforms an ip range list to a list of individual ip addresses.

    :param ip_range_list: comma or space delimited ip ranges or ips.
        Ranges are separated with a dash, or use CIDR notation.
        Individual IP addresses can be represented with a dotted quad,
        integer (unsigned), hex or CIDR notation.
        ex: "10.10/16, 192.168.1.0 - 192.168.1.255, 192.168.2.3,
            192.168.3.5 - 192.168.5.10, 192.168.5, 0.0.0.0/1"
    """
    output_ip_list = []
    for ip in whitelist.ip_expansion(ip_range_list):
        output_ip_list.append(ip)
    return output_ip_list

Ok, this is a cheat – the complexity is in the library.

Observation: complex transforms that would be a nightmare in a tool can be easy in Python – especially when, as in this case, there's a great module to use.
Observation: unit-testing frameworks are incredibly valuable for complex transforms.
  • 33. Using Python Data Flow Example
  • 34. The Bottom Line Thank You – Any Questions? The Good: ● Python for attracting & retaining developers ● Python for handling complexity ● Python for costs ● Python for adaptability ● Python for modern development environment The Not Good: ● Lack of good practices adds risk ● Lack of a rigid framework requires discipline The Tangential: ● Hadoop – who said anything about hadoop?

Editor's notes

  1. About me: - 20 years working on data logistics for big data projects for a variety of clients - 10+ years working with Python on data logistics - live in Manitou Springs - currently work for IBM as a data architect responsible for their security data warehouse - presented on this topic often
  2. I didn't say “Big Data Project” - a big social networking site with 1 PB of content may not be doing as much analysis – may not require as many feeds Many would say this is the hardest part of data science Many would say this can consume 90% of a data science budget
  3. As I'll get to in the next slide – you will probably have ***many*** feeds This shows an ideal security data warehouse set of feeds 24 FEEDS – but it could really be > 50
  4. Firewall only - stuck with looking for patterns - might identify scans - might identify recon - will miss all distributed attacks Firewall+ - can tell if a scan came from a whitelist - can see if activity involves known bad guys - can see if activity involves high-value, Or vulnerable assets
  5. Acknowledgements to Mike Koenig, and Drum 8. “An Upsetting Theme” by Kevin MacLeod. Licensed under Creative Commons “Attribution 3.0″ http://creativecommons.org/licenses/by/3.0/ and used here by permission, and with appreciation and thanks. Herbert Morrison’s on-the-scene recordings of the Disaster are Public Domain. Thanks to http://www.americanrhetoric.com for access.
  6. Above example – problem won't disappear for 11 months. Users will be reminded of the problem until it does. This is unlike a transactional system, in which evidence of problems is hidden. Quality problems are one of the top reasons for analytical system failure. Examples: - A country threatened to go to the UN if my company didn't retract an apology for its wrong analysis based on my data. Pretty intense.
  7. Source systems won't tell you of changes they've made. Many businesses have complex feeds to maintain.
  8. http://creativecommons.org/licenses/by/2.0/deed.en Examples: - A system I'm familiar with is spending 4x what we're spending on hardware & support and loads 1/8000 our speed.
  9. http://creativecommons.org/licenses/by-nc-sa/2.0/deed.en http://www.flickr.com/photos/slworking/5328601506/ You could eventually paint yourself into a corner – in which the maintenance of your feeds is nearly impossible to keep up with. Examples: - I know of some systems that take 6 months to build feeds. Others that can do the exact same feed in 1 month.
  10. foo
  11. - ETL Tools aren't silver bullets - XML isn't a silver bullet - Your experience building transactional systems won't help you This is not your world, it's your father's world. It's the world of mainframe batch systems from the 60s & 70s: - Few streams - Web services are too slow for the big feeds - No fat object layers - No record-by-record transactions + batch processing + bulk loading + merging of files
  12. Gorillas don't scale – King Kong couldn't exist because the square-cube law would require his bones to be disproportionately larger in cross-section at that size. Likewise, the work to build and maintain 50 feeds is more than 50x the work to do 1: overhead services become more important – and take up more time; feeds have interdependencies. Plus, they don't age terribly well – as you discover that upstream systems make changes, say annually, without telling you.
  13. You need consistency to keep maintenance costs low. Too much inconsistency and you'll have an unmaintainable nightmare. But you need adaptability to work around source system requirements. Too much consistency here and you'll be unable to add new data. Ex: - you may have to use a client library in some other language - you may have to use RSS, SSL, RMI, etc - you may have an extract on the other side of a firewall
  14. These two worlds just don't talk much. Especially since most ETL solutions have been closed source – it's a domain that's invisible to open source projects. Plus, ETL just isn't SEXY. Now that Big Data projects are happening in Corporate environments, and open source ETL is getting coverage – it's getting more visibility.
  15. From http://professional.robertbui.com/2009/10/kettle-cuts-80-off-data-extraction-transformation-and-loading/ And most solutions involve diagraming your feed, and the solution either: - generates code - runs metadata through an engine
  16. From http://professional.robertbui.com/2009/10/kettle-cuts-80-off-data-extraction-transformation-and-loading/
  17. CASE tools were pretty much abandoned by the mid-90s. But not for ETL – since its main adherents were those that didn't program much anyway. So, they've lingered. And so has the myth that ETL is too hard to write by hand. In the late 90s the Meta Group released a study that showed that COBOL programmers were more productive than the users of any ETL software.
  18. My apologies to the Ruby guys who are all sick of this cartoon by now