1. Using Python for Data Logistics
Ken Farmer
Data Science and Business Analytics Meetup
http://www.meetup.com/Data-Science-Business-Analytics/events/120727322/
2013-06-25
2. About Data Logistics
My definition: Management of Data in Motion
Which includes: Extract, Transform, Validation, Change Detection, Loading, Summarizing, Aggregation
(and some other stuff I don't care about*)
In Context: A part of every big data analytical project
Primary objective: Make analysis efficient & effective
* SOA, Enterprise Service Buses (ESB), Enterprise Application Integration (EAI), etc. But since these don't drive big data analytics, we're not going to talk about them.
3. Data Logistics Characteristics
- there will be many flows
Note:
● There may be many sources of any type of data
● There will be many different source constraints – operating systems, networks, etc.
● There will be upstream changes that will not be communicated – you will just see them in the data
[Diagram: Typical Large Security Data Warehouse]
4. Data Logistics Characteristics
Side Note - this is why there are many flows
1 Feed: a year of data mining will produce almost nothing
- or -
11 Feeds: lots of low-hanging fruit
So, which will produce the best analysis?
5. Data Logistics Characteristics
- and each flow can be complex
Parts not shown:
● File Movement
● Logging & Auditing & Alerting
● Process Monitoring
● Scheduling
Considerations not shown:
● Recovery
● Performance with High Volumes
● Management
6. Data Logistics Characteristics
- and there's no simple alternative
The Great Idea → The Sad Reality
● No delta processing → explodes data volumes; reduces functionality
● No lookups → explodes data volumes; reduces reporting query performance
● No dimensions → explodes data volumes; reduces reporting functionality; reduces reporting query performance
● No validation → increases maintenance costs; increases reporting errors
● No standardization → increases reporting costs; increases reporting errors; increases documentation costs
● No management features → decreases reliability; increases maintenance costs
8. Nightmare #1 – Data Quality
[Chart: ACME Widget Production by Month – Widgets (0–100) by Month, Jan–Dec]
● Credibility
● Value
● Productivity
9. Nightmare #2 - Reliability
● Extended outages
● Frequent outages
● Missed SLAs
● Distractions from new development
13. Root Cause #1 - magical thinking
There are no fairies. Likewise, there are no silver bullets. And your CRUD experience won't help you.
14. Root Cause #2 - non-linear scalability
Gorillas don't scale gracefully
Neither will your feeds
The problem isn't performance – it's maintenance: dependencies, cascading errors, and institutional knowledge.
15. Root Cause #3
– too much consistency or adaptability
These two forces are at odds – you need a balance.
You have to have consistency, to help with learning curves and organization.
You have to have adaptability, to get access to all the data sources you'll want.
16. ETL to the Rescue
- data logistics from the corporate world!
● The corporate world started working on this 20 years ago
● It's still a hard problem, but it's less of a nightmare
● Starting to make inroads into Data Science/Big Data projects
17. ETL
- Batch pipelines, not messages or transactions
Data is batched
Feeds are organized like assembly lines or pipelines
Each feed is broken into different programs / steps
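A minimal sketch (mine, not from the deck) of what one such pipeline step might look like: read a delimited batch file, transform each record, and write a new file for the next step to consume. The field name is purely illustrative.

import csv
import sys

def transform_step(input_path, output_path):
    """ One pipeline step: read a delimited batch file, transform each
        record, and write the result for the next step to consume.
    """
    with open(input_path, newline='') as infile, \
         open(output_path, 'w', newline='') as outfile:
        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # illustrative transform on a hypothetical field
            row['gender'] = row['gender'].strip().lower()
            writer.writerow(row)

if __name__ == '__main__':
    transform_step(sys.argv[1], sys.argv[2])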
18. ETL
- Most tools use diagram-driven-development
Which seems great to almost all management, and seems pretty cool – for a while – to some developers.
19. ETL
- Most tools use diagram-driven-development
But then someone always has to overdo it, and we are reminded that tools are seldom solutions.
20. ETL
- So all is not wonderful
ETL – the last bastion of Computer-Aided Software Engineering (CASE) tools
Feature                       ETL Tool   Custom Code
Unit test harnesses           no         yes
TDD                           no         yes
Version control flexibility   no         yes
Static code analysis          no         yes
Deployment tool flexibility   no         yes
Language flexibility          no         yes
Continuous Integration        no         yes
Virtual environments          no         yes
Diagrams                      yes        yes
So, why don't we use metadata-driven or code-generation tools for everything? Why not use tools like FrontPage for all websites?
21. ETL
- So, Buy (& Customize) vs Build
The ETL Tool Paradox:
● Programmers don't want to work on it
● But the tools can only handle 80% of the problem without programming
Where the Buy option is a great fit:
● 100+ simple feeds
● Lack of programmer culture
● Standard already exists
Most typically: the “corporate data warehouse” – a single database for an entire company (usually a bad idea anyway)
22. Python
- a perfect fit for data logistics
● You can use the same language for ETL, systems management and data analysis
● The language is high-level and maintenance-oriented
● It's easy for users to understand the code
● It allows you to use all the programming tools
● It's free
● It's a language for enthusiasts
● And it's fun
- http://xkcd.com/353/
23. Python
- Build List
For each Feed Application
● Program: Extract
● Program: Transform
● Config: File-Image Delta
● Config: Loader
● Config: File Mover
Services, Libraries and Utilities
● Service: metadata, auditing & logging, dashboard
● Service: data movement
● Library: data validation
● Utility: file-image delta
● Utility: publisher
● Utility: loader
24. Python
- Typical Module List
Third-Party
● appdirs
● database drivers
● sqlalchemy
● pyyaml
● validictory
● requests
● envoy
● pytest
● virtualenv
● virtualenvwrapper
Standard Library
● os
● csv
● logging
● unittest
● collections
● argparse
● functools
Environmentals
● Version control – git, svn, etc
● Deployment – Fabric, Chef, etc
● Static analysis – pylint
● Testing – pytest, tox, buildbot, etc
● Documentation - sphinx
Bottom line: a mostly vanilla and very free environment will get you very far
25. Python ETL Components
- Scheduling
● Typically cron
● Daemon if you want more than one run per minute
● Should have a suppression capability beyond commenting out the cron job
● Event-driven beats temporally-driven
● Needs a check to prevent more than one instance running (see the sketch below)
● Level of effort: very little
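A minimal sketch of that single-instance check, using an advisory lock on a lock file (the path is an assumption, not the author's):

import fcntl
import os
import sys

LOCKFILE = '/tmp/my_feed.lock'    # hypothetical path

def acquire_single_instance_lock():
    """ Exit if another copy of this feed job is already running.
        The OS drops the advisory lock automatically if the process
        dies, so a crashed run never blocks the next one.
    """
    fd = open(LOCKFILE, 'w')
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except IOError:
        sys.exit('another instance is already running')
    fd.write(str(os.getpid()))
    fd.flush()
    return fd    # keep the handle alive so the lock stays held

if __name__ == '__main__':
    lock = acquire_single_instance_lock()
    # ... run the feed ...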
26. Python ETL Components
- Audit System
● Analyze performance & rule issues over time
● Centralize alerting
● Level of effort: weeks
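A rough sketch of the core idea – one audit row per feed step, written to a central store. The schema and the sqlite backing store are illustration-only assumptions, not the author's design:

import sqlite3
import time

conn = sqlite3.connect('audit.db')    # hypothetical central audit store
conn.execute("""CREATE TABLE IF NOT EXISTS audit_events
                (feed TEXT, step TEXT, status TEXT,
                 row_count INTEGER, started_at REAL, ended_at REAL)""")

def audit(feed, step, status, row_count, started_at):
    """ Record one feed step so performance and rule issues can be
        analyzed over time and alerting can be centralized.
    """
    conn.execute('INSERT INTO audit_events VALUES (?, ?, ?, ?, ?, ?)',
                 (feed, step, status, row_count, started_at, time.time()))
    conn.commit()

# usage:
# start = time.time()
# ... run the transform ...
# audit('firewall_feed', 'transform', 'ok', 125000, start)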
27. Python ETL Components
- File Transporter
File movement is extremely failure-prone:
- out of space errors
- permission errors
- credential expiration errors
- network errors
So, use a process external to feed processing to move files – and simplify their recovery.
Note this is not the same as data mirroring; a transporter:
- moves files from source to destination
- renames files during movement
- moves/deletes/renames the source after the move
So, you may need to write this yourself – rsync is not ideal.
Level of Effort: pretty simple, 1-3 weeks to write reusable utility
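A minimal sketch of the copy-then-rename pattern such a transporter is built around (all names are illustrative). The rename is what keeps downstream consumers from ever seeing a partial file:

import os
import shutil

def transport(src, dest_dir, final_name):
    """ Move a file safely: copy under a temporary name, then rename.
        Downstream consumers never see a partially-written file, and a
        failed copy can simply be re-run.
    """
    tmp_dest = os.path.join(dest_dir, final_name + '.tmp')
    final_dest = os.path.join(dest_dir, final_name)
    shutil.copy2(src, tmp_dest)          # fails here: space, permissions, network
    os.rename(tmp_dest, final_dest)      # atomic on the same filesystem
    os.rename(src, src + '.processed')   # mark the source as consumed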
28. Python ETL Components
- Load Utility
Functionality
● Validates data
● Continuously loads
● Moves files as necessary
● May run delta operation
● Handles recoveries
● Writes to audit tables
Bottom line: pretty simple, 1-3 weeks to write reusable utility
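A toy sketch of one pass of such a loader, assuming csv input and a sqlite target; the table and column names are invented, and a real utility would use the database's native bulk loader:

import csv
import glob
import os
import sqlite3

conn = sqlite3.connect('warehouse.db')    # stand-in for the real target
conn.execute('CREATE TABLE IF NOT EXISTS facts (id TEXT, value TEXT)')

def load_pass(incoming_dir, archive_dir):
    """ One pass of a continuous loader: load each waiting file, then
        move it aside so a recovery run can't load it twice.
    """
    for path in sorted(glob.glob(os.path.join(incoming_dir, '*.csv'))):
        with open(path, newline='') as f:
            rows = [(r['id'], r['value']) for r in csv.DictReader(f)]
        conn.executemany('INSERT INTO facts VALUES (?, ?)', rows)
        conn.commit()
        os.rename(path, os.path.join(archive_dir, os.path.basename(path)))
        # a real loader would validate first and write to the audit tables here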
29. Python ETL Components
- Publish Utility
Functionality
● Extracts all data since the last time it ran
● Can handle max rows
● Moves files as necessary
● Handles recoveries
● Writes to audit tables
● Writes all data to a compressed tarball
Bottom line: pretty simple, 1-3 weeks to write reusable utility
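A sketch of the bookmark pattern behind “extracts all data since the last time it ran”: advance the bookmark only after a successful extract, so a failed run is retried rather than skipped. The bookmark file and the gzipped-csv output are simplifications of mine (the deck says tarball):

import csv
import gzip
import json
import os

BOOKMARK = 'publish_bookmark.json'    # hypothetical bookmark file

def publish(extract_since, out_path):
    """ Extract everything added since the last run, write it out
        compressed, and only then advance the bookmark.
    """
    last_id = 0
    if os.path.exists(BOOKMARK):
        with open(BOOKMARK) as f:
            last_id = json.load(f)['last_id']

    rows, new_last_id = extract_since(last_id)    # caller supplies the query

    with gzip.open(out_path, 'wt', newline='') as f:
        csv.writer(f).writerows(rows)

    with open(BOOKMARK, 'w') as f:
        json.dump({'last_id': new_last_id}, f)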
30. Python ETL Components
- Delta Utility
Functionality
● Like diff – but for structured files
● Distinguishes between key fields vs non-key fields
● Can be configured to skip comparisons of certain fields
● Can perform minor transformations
● May be built into Load utility, or a transformation library
Bottom line: pretty simple, 1-3 weeks to write reusable utility
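A condensed sketch of the key-aware comparison at the heart of such a utility – classify each record as an insert, delete, or change, comparing only the fields that matter. The interface (lists of dict-records) is an assumption:

def delta(old_rows, new_rows, key_fields, ignore_fields=()):
    """ Like diff, but for structured records: returns the inserts,
        deletes, and changes between two sets of dict-records.
    """
    def key(row):
        return tuple(row[f] for f in key_fields)

    def comparable(row):
        return {f: v for f, v in row.items()
                if f not in key_fields and f not in ignore_fields}

    old = {key(r): r for r in old_rows}
    new = {key(r): r for r in new_rows}

    inserts = [new[k] for k in new.keys() - old.keys()]
    deletes = [old[k] for k in old.keys() - new.keys()]
    changes = [new[k] for k in new.keys() & old.keys()
               if comparable(new[k]) != comparable(old[k])]
    return inserts, deletes, changes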
31. Python Program
- Simple Transform
def transform_gender(input_gender):
    """ Transforms a gender code to the standard format.
        :param input_gender: in either VARCOPS or SITHOUSE formats
        :returns: standard gender code
    """
    if input_gender.lower() in ['m', 'male', '1', 'transgender_to_male']:
        output_gender = 'male'
    elif input_gender.lower() in ['f', 'female', '2', 'transgender_to_female']:
        output_gender = 'female'
    elif input_gender.lower() in ['transsexual', 'intersex']:
        output_gender = 'transgender'
    else:
        output_gender = 'unknown'
    return output_gender
Observation: Simple transforms & rules can be easily read by non-programmers.
Observation: Transforms can be kept in a module and easily documented.
Observation: Even simple transforms can have a lot of subtleties, and are likely to be referenced or changed by users.
32. Python Program
- Complex Transformation
import whitelist    # the library that does the heavy lifting

def explode_ip_range_list(ip_range_list):
    """ Transforms an ip range list to a list of individual ip addresses.
        :param ip_range_list: comma or space delimited ip ranges or ips.
            Ranges are separated with a dash, or use CIDR notation.
            Individual IP addresses can be represented with a dotted quad,
            integer (unsigned), hex or CIDR notation.
            ex: "10.10/16, 192.168.1.0 - 192.168.1.255, 192.168.2.3,
                 192.168.3.5 - 192.168.5.10, 192.168.5, 0.0.0.0/1"
    """
    output_ip_list = []
    for ip in whitelist.ip_expansion(ip_range_list):
        output_ip_list.append(ip)
    return output_ip_list
Ok, this is a cheat – the complexity is in the library
Observation: Complex transforms that would be a nightmare in a tool can be easy in Python – especially, as in this case, when there's a great module to use.
Observation: Unit-testing frameworks are incredibly valuable for complex transforms.
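To illustrate that last observation (this example is mine, not from the deck): a pytest test for the transform_gender function from the earlier slide, assuming it lives in a hypothetical transforms module:

import pytest

from transforms import transform_gender    # hypothetical module layout

@pytest.mark.parametrize('raw, expected', [
    ('M', 'male'),
    ('female', 'female'),
    ('2', 'female'),
    ('intersex', 'transgender'),
    ('', 'unknown'),
])
def test_transform_gender(raw, expected):
    assert transform_gender(raw) == expected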
34. The Bottom Line
Thank You – Any Questions?
The Good:
● Python for attracting & retaining developers
● Python for handling complexity
● Python for costs
● Python for adaptability
● Python for modern development environment
The Not Good:
● Lack of good practices adds risk
● Lack of a rigid framework requires discipline
The Tangential:
● Hadoop – who said anything about Hadoop?
Speaker notes
About me: 20 years working on data logistics for big data projects for a variety of clients. 10+ years working with Python on data logistics. Live in Manitou Springs. Currently work for IBM as a data architect responsible for their security data warehouse. Have presented on this topic often.
I didn't say “Big Data Project” – a big social networking site with 1 PB of content may not be doing as much analysis, and may not require as many feeds. Many would say this is the hardest part of data science. Many would say this can consume 90% of a data science budget.
As I'll get to in the next slide – you will probably have ***many*** feeds. This shows an ideal security data warehouse set of feeds: 24 feeds – but it could really be > 50.
Firewall only: stuck with looking for patterns – might identify scans, might identify recon, will miss all distributed attacks. Firewall+: can tell if a scan came from a whitelist, can see if activity involves known bad guys, can see if activity involves high-value or vulnerable assets.
Acknowledgements to Mike Koenig, and Drum 8. “An Upsetting Theme” by Kevin MacLeod, licensed under Creative Commons Attribution 3.0 (http://creativecommons.org/licenses/by/3.0/) and used here by permission, and with appreciation and thanks. Herbert Morrison's on-the-scene recordings of the Disaster are public domain. Thanks to http://www.americanrhetoric.com for access.
Above example – the problem won't disappear for 11 months; users will be reminded of it until it does. This is unlike a transactional system, in which evidence of problems is hidden. Quality problems are one of the top reasons for analytical system failure. Example: a country threatened to go to the UN if my company didn't retract an apology for its wrong analysis based on my data. Pretty intense.
Source systems won't tell you of changes they've made. Many complex feeds to maintain.
http://creativecommons.org/licenses/by/2.0/deed.en Example: a system I'm familiar with is spending 4x what we're spending on hardware & support, and loads at 1/8000 of our speed.
http://creativecommons.org/licenses/by-nc-sa/2.0/deed.en http://www.flickr.com/photos/slworking/5328601506/ You could eventually paint yourself into a corner – in which the maintenance of your feeds is nearly impossible to keep up with. Example: I know of some systems that take 6 months to build feeds; others can do the exact same feed in 1 month.
- ETL tools aren't silver bullets
- XML isn't a silver bullet
- Your experience building transactional systems won't help you
This is not your world, it's your father's world – the world of mainframe batch systems from the 60s & 70s:
- Few streams
- Web services are too slow for the big feeds
- No fat object layers
- No record-by-record transactions
+ Batch processing
+ Bulk loading
+ Merging of files
Gorillas don't scale – King Kong couldn't exist, because the square-cube law would require his bones to be disproportionately larger in cross-section at that size. Likewise, the work to build and maintain 50 feeds is more than 50x the work to do 1: overhead services become more important – and take up more time; feeds have interdependencies. Plus, they don't age terribly well – as you discover that upstream systems make changes, say annually, without telling you.
You need consistency to keep maintenance costs low. Too much inconsistency and you'll have an unmaintainable nightmare. But you need adaptability to work around source system requirements. Too much consistency here and you'll be unable to add new data. Ex: - you may have to use a client library in some other language - you may have to use RSS, SSL, RMI, etc - you may have an extract on the other side of a firewall
These two worlds just don't talk much. Especially since most ETL solutions have been closed source – it's a domain that's invisible to open source projects. Plus, ETL just isn't SEXY. Now that Big Data projects are happening in Corporate environments, and open source ETL is getting coverage – it's getting more visibility.
From http://professional.robertbui.com/2009/10/kettle-cuts-80-off-data-extraction-transformation-and-loading/ Most solutions involve diagramming your feed, and the solution either generates code or runs metadata through an engine.
From http://professional.robertbui.com/2009/10/kettle-cuts-80-off-data-extraction-transformation-and-loading/
CASE tools were pretty much abandoned by the mid-90s But not for ETL – since its main adherents were those that didn't program much anyway So, they've lingered. And so has the myth that ETL is too hard to write by hand. In the late 90s the Meta Group released a study that showed that COBOL programmers were more productive than the users of any ETL software.
My apologies to the Ruby guys who are all sick of this cartoon by now