1. Using Python for Data Logistics
Ken Farmer
Data Science and Business Analytics Meetup
http://www.meetup.com/Data-Science-Business-Analytics/events/120727322/
2013-06-25
2. About Data Logistics
My definition: Management of Data in Motion
Which includes: Extract, Transform, Validation, Change Detection, Loading, Summarizing, Aggregation
(and some other stuff I don't care about*)
In Context: A part of every big data analytical project
Primary objective: Make analysis efficient & effective
* SOA, Enterprise Service Buses (ESB), Enterprise Application Integration (EAI), etc. But since these don't drive big data analytics, we're not going to talk about them.
3. Data Logistics Characteristics
- there will be many flows
Note:
● There may be many sources of any type of data
● There will be many different source constraints – operating systems, networks, etc.
● There will be upstream changes that will not be communicated – you will just see them in the data
[Diagram: Typical Large Security Data Warehouse]
4. Data Logistics Characteristics
Side Note - this is why there are many flows
1 Feed: a year of data mining will produce almost nothing
- or -
11 Feeds: lots of low-hanging fruit
So, which will produce the best analysis?
5. Data Logistics Characteristics
- and each flow can be complex
Parts not shown:
● File Movement
● Logging & Auditing & Alerting
● Process Monitoring
● Scheduling
Considerations not shown:
● Recovery
● Performance with High Volumes
● Management
6. Data Logistics Characteristics
- and there's no simple alternative
The Great Idea → The Sad Reality
● No delta processing → explodes data volumes; reduces functionality
● No lookups → explodes data volumes; reduces reporting query performance
● No dimensions → explodes data volumes; reduces reporting functionality; reduces reporting query performance
● No validation → increases maintenance costs; increases reporting errors
● No standardization → increases reporting costs; increases reporting errors; increases documentation costs
● No management features → decreases reliability; increases maintenance costs
8. Nightmare #1 – Data Quality
[Chart: ACME Widget Production by Month – Widgets (0–100) by Month, Jan–Dec]
● Credibility
● Value
● Productivity
9. Nightmare #2 - Reliability
● Extended outages
● Frequent outages
● Missed SLAs
● Distractions from new development
13. Root Cause #1 - magical thinking
There are no fairies. Likewise, there are no silver bullets. And your CRUD experience won't help you.
14. Root Cause #2 - non-linear scalability
Gorillas don't scale gracefully
Neither will your feeds
The problem isn't performance – it's maintenance: dependencies, cascading errors, and institutional knowledge.
15. Root Cause #3
– too much consistency or adaptability
These two forces are at odds – you need a balance.
You have to have consistency, to help with learning curves and organization.
You have to have adaptability, to get access to all the data sources you'll want.
16. ETL to the Rescue
- data logistics from the corporate world!
● The corporate world started working on this 20 years ago
● It's still a hard problem, but it's less of a nightmare
● Starting to make inroads into Data Science/Big Data projects
17. ETL
- Batch pipelines, not messages or transactions
Data is batched
Feeds are organized like assembly lines or pipelines
Each feed is broken into different programs / steps
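A minimal sketch (mine, not from the deck) of what one such pipeline step might look like: read a delimited batch file, transform each record, and write a new file for the next step to consume. The field name is purely illustrative.

import csv
import sys

def transform_step(input_path, output_path):
    """ One pipeline step: read a delimited batch file, transform each
        record, and write the result for the next step to consume.
    """
    with open(input_path, newline='') as infile, \
         open(output_path, 'w', newline='') as outfile:
        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # illustrative transform on a hypothetical field
            row['gender'] = row['gender'].strip().lower()
            writer.writerow(row)

if __name__ == '__main__':
    transform_step(sys.argv[1], sys.argv[2])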
18. ETL
- Most tools use diagram-driven-development
Which seems great to almost all management, and seems pretty cool – for a while – to some developers.
19. ETL
- Most tools use diagram-driven-development
But then someone always has to overdo it, and we are reminded that tools are seldom solutions.
20. ETL
- So all is not wonderful
ETL – the last bastion of Computer-Aided Software Engineering (CASE) tools
Feature                       ETL Tool   Custom Code
Unit test harnesses           no         yes
TDD                           no         yes
Version control flexibility   no         yes
Static code analysis          no         yes
Deployment tool flexibility   no         yes
Language flexibility          no         yes
Continuous Integration        no         yes
Virtual environments          no         yes
Diagrams                      yes        yes
So, why don't we use metadata-driven or code-generation tools for everything? Why not use tools like FrontPage for all websites?
21. ETL
- So, Buy (& Customize) vs Build
The ETL Tool Paradox:
● Programmers don't want to work on it
● But the tools can only handle 80% of the problem without programming
Where the Buy option is a great fit:
● 100+ simple feeds
● Lack of programmer culture
● Standard already exists
Most typically: the “corporate data warehouse” – a single database for an entire company (usually a bad idea anyway)
22. Python
- a perfect fit for data logistics
● You can use the same language for ETL, systems management and data analysis
● The language is high-level and maintenance-oriented
● It's easy for users to understand the code
● It allows you to use all the programming tools
● It's free
● It's a language for enthusiasts
● And it's fun
- http://xkcd.com/353/
23. Python
- Build List
For each Feed Application
● Program: Extract
● Program: Transform
● Config: File-Image Delta
● Config: Loader
● Config: File Mover
Services, Libraries and Utilities
● Service: metadata, auditing & logging, dashboard
● Service: data movement
● Library: data validation
● Utility: file-image delta
● Utility: publisher
● Utility: loader
24. Python
- Typical Module List
Third-Party
● appdirs
● database drivers
● sqlalchemy
● pyyaml
● validictory
● requests
● envoy
● pytest
● virtualenv
● virtualenvwrapper
Standard Library
● os
● csv
● logging
● unittest
● collections
● argparse
● functools
Environmentals
● Version control – git, svn, etc
● Deployment – Fabric, Chef, etc
● Static analysis – pylint
● Testing – pytest, tox, buildbot, etc
● Documentation - sphinx
Bottom line: a mostly vanilla and very free environment will get you very far
25. Python ETL Components
- Scheduling
● Typically cron
● Daemon if you want more than one run per minute
● Should have a suppression capability beyond commenting out the cron job
● Event-driven beats temporally-driven
● Needs a check to prevent more than one instance running (see the sketch below)
● Level of effort: very little
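A minimal sketch of that single-instance check, using an advisory lock on a lock file (the path is an assumption, not the author's):

import fcntl
import os
import sys

LOCKFILE = '/tmp/my_feed.lock'    # hypothetical path

def acquire_single_instance_lock():
    """ Exit if another copy of this feed job is already running.
        The OS drops the advisory lock automatically if the process
        dies, so a crashed run never blocks the next one.
    """
    fd = open(LOCKFILE, 'w')
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except IOError:
        sys.exit('another instance is already running')
    fd.write(str(os.getpid()))
    fd.flush()
    return fd    # keep the handle alive so the lock stays held

if __name__ == '__main__':
    lock = acquire_single_instance_lock()
    # ... run the feed ...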
26. Python ETL Components
- Audit System
● Analyze performance & rule issues over time
● Centralize alerting
● Level of effort: weeks
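A rough sketch of the core idea – one audit row per feed step, written to a central store. The schema and the sqlite backing store are illustration-only assumptions, not the author's design:

import sqlite3
import time

conn = sqlite3.connect('audit.db')    # hypothetical central audit store
conn.execute("""CREATE TABLE IF NOT EXISTS audit_events
                (feed TEXT, step TEXT, status TEXT,
                 row_count INTEGER, started_at REAL, ended_at REAL)""")

def audit(feed, step, status, row_count, started_at):
    """ Record one feed step so performance and rule issues can be
        analyzed over time and alerting can be centralized.
    """
    conn.execute('INSERT INTO audit_events VALUES (?, ?, ?, ?, ?, ?)',
                 (feed, step, status, row_count, started_at, time.time()))
    conn.commit()

# usage:
# start = time.time()
# ... run the transform ...
# audit('firewall_feed', 'transform', 'ok', 125000, start)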
27. Python ETL Components
- File Transporter
File movement is extremely failure-prone:
- out of space errors
- permission errors
- credential expiration errors
- network errors
So, use a process external to feed processing to move files – and simplify their recovery.
Note this is not the same as data mirroring; a transporter:
- moves files from source to destination
- renames files during movement
- moves/deletes/renames the source after the move
So, you may need to write this yourself – rsync is not ideal.
Level of Effort: pretty simple, 1-3 weeks to write reusable utility
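A minimal sketch of the copy-then-rename pattern such a transporter is built around (all names are illustrative). The rename is what keeps downstream consumers from ever seeing a partial file:

import os
import shutil

def transport(src, dest_dir, final_name):
    """ Move a file safely: copy under a temporary name, then rename.
        Downstream consumers never see a partially-written file, and a
        failed copy can simply be re-run.
    """
    tmp_dest = os.path.join(dest_dir, final_name + '.tmp')
    final_dest = os.path.join(dest_dir, final_name)
    shutil.copy2(src, tmp_dest)          # fails here: space, permissions, network
    os.rename(tmp_dest, final_dest)      # atomic on the same filesystem
    os.rename(src, src + '.processed')   # mark the source as consumed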
28. Python ETL Components
- Load Utility
Functionality
● Validates data
● Continuously loads
● Moves files as necessary
● May run delta operation
● Handles recoveries
● Writes to audit tables
Bottom line: pretty simple, 1-3 weeks to write reusable utility
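A toy sketch of one pass of such a loader, assuming csv input and a sqlite target; the table and column names are invented, and a real utility would use the database's native bulk loader:

import csv
import glob
import os
import sqlite3

conn = sqlite3.connect('warehouse.db')    # stand-in for the real target
conn.execute('CREATE TABLE IF NOT EXISTS facts (id TEXT, value TEXT)')

def load_pass(incoming_dir, archive_dir):
    """ One pass of a continuous loader: load each waiting file, then
        move it aside so a recovery run can't load it twice.
    """
    for path in sorted(glob.glob(os.path.join(incoming_dir, '*.csv'))):
        with open(path, newline='') as f:
            rows = [(r['id'], r['value']) for r in csv.DictReader(f)]
        conn.executemany('INSERT INTO facts VALUES (?, ?)', rows)
        conn.commit()
        os.rename(path, os.path.join(archive_dir, os.path.basename(path)))
        # a real loader would validate first and write to the audit tables here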
29. Python ETL Components
- Publish Utility
Functionality
● Extracts all data since the last time it ran
● Can handle max rows
● Moves files as necessary
● Handles recoveries
● Writes to audit tables
● Writes all data to a compressed tarball
Bottom line: pretty simple, 1-3 weeks to write reusable utility
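A sketch of the bookmark pattern behind “extracts all data since the last time it ran”: advance the bookmark only after a successful extract, so a failed run is retried rather than skipped. The bookmark file and the gzipped-csv output are simplifications of mine (the deck says tarball):

import csv
import gzip
import json
import os

BOOKMARK = 'publish_bookmark.json'    # hypothetical bookmark file

def publish(extract_since, out_path):
    """ Extract everything added since the last run, write it out
        compressed, and only then advance the bookmark.
    """
    last_id = 0
    if os.path.exists(BOOKMARK):
        with open(BOOKMARK) as f:
            last_id = json.load(f)['last_id']

    rows, new_last_id = extract_since(last_id)    # caller supplies the query

    with gzip.open(out_path, 'wt', newline='') as f:
        csv.writer(f).writerows(rows)

    with open(BOOKMARK, 'w') as f:
        json.dump({'last_id': new_last_id}, f)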
30. Python ETL Components
- Delta Utility
Functionality
● Like diff – but for structured files
● Distinguishes between key fields vs non-key fields
● Can be configured to skip comparisons of certain fields
● Can perform minor transformations
● May be built into Load utility, or a transformation library
Bottom line: pretty simple, 1-3 weeks to write reusable utility
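A condensed sketch of the key-aware comparison at the heart of such a utility – classify each record as an insert, delete, or change, comparing only the fields that matter. The interface (lists of dict-records) is an assumption:

def delta(old_rows, new_rows, key_fields, ignore_fields=()):
    """ Like diff, but for structured records: returns the inserts,
        deletes, and changes between two sets of dict-records.
    """
    def key(row):
        return tuple(row[f] for f in key_fields)

    def comparable(row):
        return {f: v for f, v in row.items()
                if f not in key_fields and f not in ignore_fields}

    old = {key(r): r for r in old_rows}
    new = {key(r): r for r in new_rows}

    inserts = [new[k] for k in new.keys() - old.keys()]
    deletes = [old[k] for k in old.keys() - new.keys()]
    changes = [new[k] for k in new.keys() & old.keys()
               if comparable(new[k]) != comparable(old[k])]
    return inserts, deletes, changes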
31. Python Program
- Simple Transform
def transform_gender(input_gender):
    """ Transforms a gender code to the standard format.
        :param input_gender: in either VARCOPS or SITHOUSE formats
        :returns: standard gender code
    """
    if input_gender.lower() in ['m', 'male', '1', 'transgender_to_male']:
        output_gender = 'male'
    elif input_gender.lower() in ['f', 'female', '2', 'transgender_to_female']:
        output_gender = 'female'
    elif input_gender.lower() in ['transsexual', 'intersex']:
        output_gender = 'transgender'
    else:
        output_gender = 'unknown'
    return output_gender
Observation: Simple transforms & rules can be easily read by non-programmers.
Observation: Transforms can be kept in a module and easily documented.
Observation: Even simple transforms can have a lot of subtleties, and are likely to be referenced or changed by users.
32. Python Program
- Complex Transformation
import whitelist    # the library that does the heavy lifting

def explode_ip_range_list(ip_range_list):
    """ Transforms an ip range list to a list of individual ip addresses.
        :param ip_range_list: comma or space delimited ip ranges or ips.
            Ranges are separated with a dash, or use CIDR notation.
            Individual IP addresses can be represented with a dotted quad,
            integer (unsigned), hex or CIDR notation.
            ex: "10.10/16, 192.168.1.0 - 192.168.1.255, 192.168.2.3,
                 192.168.3.5 - 192.168.5.10, 192.168.5, 0.0.0.0/1"
    """
    output_ip_list = []
    for ip in whitelist.ip_expansion(ip_range_list):
        output_ip_list.append(ip)
    return output_ip_list
Ok, this is a cheat – the complexity is in the library
Observation: Complex transforms that would be a nightmare in a tool can be easy in Python – especially, as in this case, when there's a great module to use.
Observation: Unit-testing frameworks are incredibly valuable for complex transforms.
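To illustrate that last observation (this example is mine, not from the deck): a pytest test for the transform_gender function from the earlier slide, assuming it lives in a hypothetical transforms module:

import pytest

from transforms import transform_gender    # hypothetical module layout

@pytest.mark.parametrize('raw, expected', [
    ('M', 'male'),
    ('female', 'female'),
    ('2', 'female'),
    ('intersex', 'transgender'),
    ('', 'unknown'),
])
def test_transform_gender(raw, expected):
    assert transform_gender(raw) == expected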
34. The Bottom Line
Thank You – Any Questions?
The Good:
● Python for attracting & retaining developers
● Python for handling complexity
● Python for costs
● Python for adaptability
● Python for modern development environment
The Not Good:
● Lack of good practices adds risk
● Lack of a rigid framework requires discipline
The Tangential:
● Hadoop – who said anything about Hadoop?
Speaker notes
About me: 20 years working on data logistics for big data projects for a variety of clients. 10+ years working with Python on data logistics. Live in Manitou Springs. Currently work for IBM as a data architect responsible for their security data warehouse. Have presented on this topic often.
I didn't say “Big Data Project” – a big social networking site with 1 PB of content may not be doing as much analysis, and may not require as many feeds. Many would say this is the hardest part of data science. Many would say this can consume 90% of a data science budget.
As I'll get to in the next slide – you will probably have ***many*** feeds. This shows an ideal security data warehouse set of feeds: 24 feeds – but it could really be > 50.
Firewall only: stuck with looking for patterns – might identify scans, might identify recon, will miss all distributed attacks. Firewall+: can tell if a scan came from a whitelist, can see if activity involves known bad guys, can see if activity involves high-value or vulnerable assets.
Acknowledgements to Mike Koenig, and Drum 8. “An Upsetting Theme” by Kevin MacLeod, licensed under Creative Commons Attribution 3.0 (http://creativecommons.org/licenses/by/3.0/) and used here by permission, and with appreciation and thanks. Herbert Morrison's on-the-scene recordings of the Disaster are public domain. Thanks to http://www.americanrhetoric.com for access.
Above example – the problem won't disappear for 11 months; users will be reminded of it until it does. This is unlike a transactional system, in which evidence of problems is hidden. Quality problems are one of the top reasons for analytical system failure. Example: a country threatened to go to the UN if my company didn't retract an apology for its wrong analysis based on my data. Pretty intense.
Source systems won't tell you of changes they've made. Many complex feeds to maintain.
http://creativecommons.org/licenses/by/2.0/deed.en Example: a system I'm familiar with is spending 4x what we're spending on hardware & support, and loads at 1/8000 of our speed.
http://creativecommons.org/licenses/by-nc-sa/2.0/deed.en http://www.flickr.com/photos/slworking/5328601506/ You could eventually paint yourself into a corner – in which the maintenance of your feeds is nearly impossible to keep up with. Example: I know of some systems that take 6 months to build feeds; others can do the exact same feed in 1 month.
- ETL tools aren't silver bullets
- XML isn't a silver bullet
- Your experience building transactional systems won't help you
This is not your world, it's your father's world – the world of mainframe batch systems from the 60s & 70s:
- Few streams
- Web services are too slow for the big feeds
- No fat object layers
- No record-by-record transactions
+ Batch processing
+ Bulk loading
+ Merging of files
Gorillas don't scale – King Kong couldn't exist, because the square-cube law would require his bones to be disproportionately larger in cross-section at that size. Likewise, the work to build and maintain 50 feeds is more than 50x the work to do 1: overhead services become more important – and take up more time; feeds have interdependencies. Plus, they don't age terribly well – as you discover that upstream systems make changes, say annually, without telling you.
You need consistency to keep maintenance costs low. Too much inconsistency and you'll have an unmaintainable nightmare. But you need adaptability to work around source system requirements. Too much consistency here and you'll be unable to add new data. Ex: - you may have to use a client library in some other language - you may have to use RSS, SSL, RMI, etc - you may have an extract on the other side of a firewall
These two worlds just don't talk much. Especially since most ETL solutions have been closed source – it's a domain that's invisible to open source projects. Plus, ETL just isn't SEXY. Now that Big Data projects are happening in Corporate environments, and open source ETL is getting coverage – it's getting more visibility.
From http://professional.robertbui.com/2009/10/kettle-cuts-80-off-data-extraction-transformation-and-loading/ Most solutions involve diagramming your feed, and the solution either generates code or runs metadata through an engine.
From http://professional.robertbui.com/2009/10/kettle-cuts-80-off-data-extraction-transformation-and-loading/
CASE tools were pretty much abandoned by the mid-90s But not for ETL – since its main adherents were those that didn't program much anyway So, they've lingered. And so has the myth that ETL is too hard to write by hand. In the late 90s the Meta Group released a study that showed that COBOL programmers were more productive than the users of any ETL software.
My apologies to the Ruby guys who are all sick of this cartoon by now