Contenu connexe Similaire à Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Retailers (20) Plus de Cloudera, Inc. (20) Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Retailers1. 1
© Cloudera, Inc. All rights reserved.
© PRGX Global, Inc. All Rights Reserved
Turning Petabytes of Data into
Millions in Cost Recapture for the
World’s Biggest Retailers
Case Study with PRGX Global
Jonathon Whitton| Director Data Services
2. 2
© Cloudera, Inc. All rights reserved.
© PRGX Global, Inc. All Rights Reserved
DRIVE CUSTOMER INSIGHTS
IMPROVE PRODUCT &
SERVICES EFFICIENCY LOWER BUSINESS RISKS
Data is Transforming Business
MODERNIZE ARCHITECTURE
3. 3
© Cloudera, Inc. All rights reserved.
© PRGX Global, Inc. All Rights Reserved
OPERATIONS
DATA
MANAGEMENT
BATCH REAL-TIME
PROCESS, ANALYZE, SERVE
UNIFIED SERVICES
RESOURCE MANAGEMENT SECURITY
FILESYSTEM RELATIONAL NoSQL
STORE
INTEGRATE
BATCH STREAM SQL SEARCH SDK
Cloudera Enterprise
Making Hadoop Fast, Easy, and Secure for the Modernized Architecture
Hadoop is a new kind
of data platform.
• One place for unlimited data
• Unified data access
Cloudera makes it:
• Fast for business
• Easy to manage
• Secure without compromise
4. 4
© Cloudera, Inc. All rights reserved.
© PRGX Global, Inc. All Rights Reserved
Talend: A History of Delivering Innovation
2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
(Revenue Growth)
Data
Integration
Master Data
Management
Data
Quality
Big Data
Application
Integration
Hadoop 2.0
Spark &
Cloud
• Our mission: enable the
data driven enterprise
• The most advanced Big
Data integration platform
• 10X faster development
• One solution for batch &
streaming
• Deploy machine learning in
minutes
Data
Preparation
1st to market for
Spark
1st on YARN &
Hadoop 2.0
1st to deliver a DI and
AI platform
1st ETL open source
tool
5. 5
© Cloudera, Inc. All rights reserved.
© PRGX Global, Inc. All Rights Reserved
CLOUDERA ENTERPRISE
Talend and Cloudera
Enterprise Data Warehouse
Unstructured Data
Data Sources
Structured Data
Archive
Load
Applications
Reporting
BI System
Modeling
Model
OPERATIONS
DATAMANAGEMENT
UNIFIED SERVICES
PROCESS,ANALYZE, SERVE
STORE
INTEGRATE
APPLICATION
INTEGRATION
CLOUD
INTEGRATION
DATA
INTEGRATION
BIG DATA
INTEGRATION
MASTER DATA
MANAGEMENT
DATA
PREPARATION
Data Fabric
Connect
Serve
Serve
Ingest
Ingest
6. 6
© Cloudera, Inc. All rights reserved.
© PRGX Global, Inc. All Rights Reserved
Polling Question
• What is your top requirement for modernizing your big data infrastructure?
• Capture and secure all data
• Operational simplicity
• Variety of tools to interact with data
• Integration and portability across multiple platforms
• Timely availability of data
• Other
7. 7
© Cloudera, Inc. All rights reserved.
© PRGX Global, Inc. All Rights Reserved
About me
Jonathon Whitton, Director Data Services at PRGX
Started working at PRGX in 2000
BA from Duke University
MBA from Kennesaw State University
LinkedIn.com/in/whitton
Twitter: @JonathonWhitton
8. 8
© Cloudera, Inc. All rights reserved.
© PRGX Global, Inc. All Rights Reserved
Global leader in accounts payable recovery audit
Serve 75% of the Top 20 Global Retailers, most for more than 10 years
Nearly half of recovery audit revenue from outside the U.S., in more than 30 countries
and across 5 continents
~1400 employees
9. 9
© Cloudera, Inc. All rights reserved.
© PRGX Global, Inc. All Rights Reserved
What do we do for a living? …. We are in search of errors!
Aggregate & Manipulate
Client Data
• Receive over 2 million client
files annually
• 2.3 petabytes of data “live”
for auditing on average
• Data includes purchasing,
payment, receiving, deals,
point of sale, and emails
Mine Client Data for
Overpayments
• Utilize proprietary data
mining tools and techniques
• Fully document claim
Recover Overpayments
from Vendors
• Handle majority of vendor
communications
• Verify that deductions taken
or payments received
10. 10
© Cloudera, Inc. All rights reserved.
© PRGX Global, Inc. All Rights Reserved
We are a data-driven business dealing with huge data sizes…
MONTHLY ANNUALLY200,000 2,400,000
# FILES FOR STRUCTURED DATA
MONTHLY 40TB
ANNUALLY 480TB
MONTHLY 200TB
ANNUALLY 2,375TB
INBOUND FOR STRUCTURED DATA
MONTHLY ANNUALLY12,500,000 150,000,000
# EMAILS UNSTRUCTURED DATA
11. 11
© Cloudera, Inc. All rights reserved.
© PRGX Global, Inc. All Rights Reserved
… and huge data variety
CLIENT DATA ARRIVES
CHALLENGES METHOD RECEIVED
DELIVER TO AUDIT
TYPES OF FILES % BY FILES
EDI
XML
Flat file csv
Flat file delimited
database backups
spreadsheets
Pdfs
Tiff
Jpeg
Png
Prns
Emails
Microfiche
Proprietary formats
(FOR STRUCTURED DATA)
28% EBCDIC Flat
40% ASCII Flat
25% DB back-ups
7% Proprietary
Daily
Weekly
Bi-weekly
Monthly
Quarterly
Semi-annually
Annually
Unexpected
Under- estimated
New schema
Wrong schema
Tape
Hard drive
sFTP
Email
Other media – flash drives,
DVDs
DAILY WEEKLY BI-WEEKLY MONTHLY
QUARTERLY SEMI ANNUALLY ANNUALLY
12. 12
© Cloudera, Inc. All rights reserved.
© PRGX Global, Inc. All Rights Reserved
Our Legacy solution was designed in the 1990s…
It had all the problems
that legacy systems
have
• High lead times
• Re-runs take a lot of
operations
• Highly labor intensive
operations
Tweaked and upgraded
over the years
but
architecturally
remained the same
Transformation process
Significant
re-engineering
zero downtime
Our solution had to be
flexible
and
lightning fast!
HiPer
High
Performance
Computing
13. 13
© Cloudera, Inc. All rights reserved.
© PRGX Global, Inc. All Rights Reserved
Our evaluation was on potential “architectural stacks”
Functional Requirements
Information Security
Requirements and Privacy
Protection
Performance Requirements
Technical Requirements
Vendor Information
Filter 1:
High level assessment focused on vendor’s overall company
strength, ability for the solution to scale, technology
feasibility in PRGX
19 potential vendors Short list of 6 vendors 5 POC vendors
Filter 2:
Based on willingness of vendor to work with us on a POC
and their flexibility
14. 14
© Cloudera, Inc. All rights reserved.
© PRGX Global, Inc. All Rights Reserved
Timeline of HiPer program
Q3 2013 –
Explore
technology that
can do Data
Transformation
much faster
Q4 2013 –
Ready with
problem
statement
Q1 2014 – Level
1 screening of
technology
stack options
Q2 2014 –Proof
of Concept
Q3 2014 –
Selection of
Hadoop
(Cloudera),
Talend stack
Q4 2014 – Pilot
Accounts;
Planning,
prioritization
for 2015
Q1 2015 –
Production
begins for top
15 accounts;
unstructured
analytics Proof
of concept
By Q2 2016 –
Remove
reliance on
AS400; pilot
accounts for
unstructured
analytics
By Q4 2016 –
60% accounts in
Production;
majority of
accounts for
unstructured
analytics
15. 15
© Cloudera, Inc. All rights reserved.
© PRGX Global, Inc. All Rights Reserved
Our solution for structured data processing
Data Acquisition and Load Data Transformation Auditing
Infrastructure management
Data Long Term Storage Hadoop Cluster
Data Set 1
Data Set 2
Data Set 3
Data Set 4
Data Set 5
Audit Tool Set 1
Audit Tool Set 2
Audit Tool Set 3
1. Talend and related
automation
Data Privacy, Security and Data Lineage
1. HDFS – for long term data storage
2. Batch processing – Apache Hive and Apache Spark
3. Investigative queries (QC) – Apache Impala
4. Output to RDBMS – Apache Scoop and Talend
1. Cloudera Manager
1. Cloudera Navigator
2. Kerberos
1. Continue and invest in Legacy
technology (MSSQL)
2. Explore use of Hadoop
oriented BI tools
16. 16
© Cloudera, Inc. All rights reserved.
© PRGX Global, Inc. All Rights Reserved
Our solution for unstructured data processing
Data Acquisition and Load Data Transformation Auditing
Infrastructure management
Data
(pst, nsf, …)
Email Auditing Tool
Data Privacy, Security and Data Lineage
1. Cloudera Manager
1. Kerberos
2. Mutiple Solr indexes / Hbase Tables
EML Files
EML files
EML files
Long term
storage - HBASE
Search Index -
SOLR
1. In-house automation logic in .Net
to convert files to eml files
2. Load to Hbase via Apache Thrift
3. Apache Tika to Read Email Data
(Headers, Body, Attachments).
1. Hbase – email storage and updates
to individual records
2. NGData Keystore indexer (Lily) to
update Solr in near real time
3. SOLR – index to query emails
4. SOLRNET - .Net Api to SOLR querying
5. Apache Thrift - Update to HBASE for
audit findings
1. Refreshed .Net
application to read from
SOLR/update to HBASE
17. 17
© Cloudera, Inc. All rights reserved.
© PRGX Global, Inc. All Rights Reserved
How does Cloudera Hadoop impact our business?
Load and Prepare Data Audit
Load and
Prepare Data
Audit
X weeks
X/2 weeks
More time for Auditing
Limited time to process and audit data
HiPer’s Impact
Delivery Cycle
Structured data processing
Improved performance in data delivery for structured
and unstructured data about 9-10 times the original
timelines via Hive
Starting to use Spark
18. 18
© Cloudera, Inc. All rights reserved.
© PRGX Global, Inc. All Rights Reserved
How does Cloudera Hadoop impact our business?
Faster analysis in Hadoop
Point of Sales (POS) data was aggregated
list of stores
number of transactions
sum of quantity
Would have run for hours in our AS400
environment
2014 would not have been readily available before
Hadoop’s lower cost of storage made it affordable
for us to keep more data online
No need to go back to our archive
No need to request a restore
2015 data
13 months: 20141201 to 20151231
Row count: 2,048,666,122
Returned: 1385 rows returned (1 per store)
Returned in: 90 seconds
2014 data
13 months: 20131201 to 20141231
Row count: 2,159,626,445
Returned: 1365 rows returned (1 per store)
Returned in: 90 seconds
Field count in source POS data was 14
19. 19
© Cloudera, Inc. All rights reserved.
© PRGX Global, Inc. All Rights Reserved
How does Talend impact our development?
Hive QL hand coding
Talend Big Data 40% faster
Move on to the next item
this much faster
Development Cycle
Development Cycle
Reduced developing processes in Hadoop by at least
40%.
With over 200 distinct process to get through this
year, the time savings is very meaningful in achieving
our goals
No more hand coding Hive QL
Talend has the logic de-coupled from the code that is
executing on the Hadoop cluster
“Upgrade” to Spark via dropdown and then
addressing limited component changes that are
clearly identified
“Upgrade” to the next big thing
20. 20
© Cloudera, Inc. All rights reserved.
© PRGX Global, Inc. All Rights Reserved
Cloudera Hadoop and Talend impact the future of our business
Structured data processing
Support for “machine-learning” techniques in the email and buyer pulls significantly
enhancing productivity for audits
Expanded focus
Key enabler to mine large volumes of data from our customers
Ability to re-run and experiment on large sets of data
Framework for industry standard Master Data
Infrastructure
Backup and archival solution for the future
Begin standardization and reusability of processes; eliminate person dependency
21. 21
© Cloudera, Inc. All rights reserved.
© PRGX Global, Inc. All Rights Reserved
Key success factors
Create the business case…for multiple years, with quick wins
– no IT programs succeed without business backing
Plan the program – for multiple years
Architect it right
• Architect keeping your current process and tools in mind
• Architect it for different uses
• Architect it for time!
Sell your program…you need champions everywhere
– from the CEO/Board to your business, to marketing to even HR
Find the right partner; develop the partnership
22. 22
© Cloudera, Inc. All rights reserved.
© PRGX Global, Inc. All Rights Reserved
Polling Question
• Which Hadoop tools do you foresee using the most to help with your big data
challenges?
• Data ingestion tools
• Search
• Impala
• Kudu
• Spark
24. 24
© Cloudera, Inc. All rights reserved.
© PRGX Global, Inc. All Rights Reserved
Call to Action
Test Drive Talend Software Define a Big Data ProjectLearn about Hadoop with
Cloudera
25. 25
© Cloudera, Inc. All rights reserved.
© PRGX Global, Inc. All Rights Reserved
Why Cloudera and Talend?
Scalability
Performance
Flexibility
Stability
Support