Capital onehadoopintro

Capital One
Hadoop Intro:
History
ETL/Analytics Practices in LinkedIn/Netflix/Yahoo
Next Gen ETL 2014+
Scaling Layers
Hadoop Distributions
Analytics
1/7/2014

Hadoop/HBase


Original requirements
− GFS: storing internet HTML pages on disk for later analytics
− BigTable (2002): book pages had metadata. Requirement: return book pages to the user, no joins (no memory in 2002, different now)





Latency determines requirements (analytics/Netflix later)
Semi-real time. Schema for book pages. Where to store the metadata? In BigTable
My role: not going to give you slides with pictures; everything presented has code behind it, with documentation

Bigdata >>50% failure rate





After POCs, very few projects enter production
Why? Work habits for distributed computing. You have to write distributed computing components; J2EE idioms don't work.
Projects fail because of performance/administration in production
e.g. performance is not an issue when supporting the top 100 Abinitio queries in Hadoop; 130k will be an issue, or perhaps 10%

Measuring performance in POCs: doing it wrong means they can't build components


Wrong (diagram labels: Server/Thread, NN, DN1/RS, DN2/RS)

Performance Measurement, leader
election, countdown latch, test
failure/handoff w/chaos monkey


(Diagram labels: Zookeeper+Jetty, Zookeeper, Server, DN1/RS)
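A minimal sketch of the countdown-latch idea named in the slide title: workers block on a start gate so that load begins at the same instant, and a second latch tells the coordinator when everyone has finished. The `CountDownLatch` class, `run_load_test`, and `send_request` are hypothetical names for illustration. Threads in one process stand in for separate servers only so the sketch runs anywhere; in a real cluster test each worker would be its own server and the latch would live in ZooKeeper.

```python
import threading
import time


class CountDownLatch:
    """Minimal latch: await_zero() blocks until count_down() was called `count` times."""

    def __init__(self, count):
        self._count = count
        self._cond = threading.Condition()

    def count_down(self):
        with self._cond:
            self._count -= 1
            if self._count <= 0:
                self._cond.notify_all()

    def await_zero(self, timeout=None):
        with self._cond:
            return self._cond.wait_for(lambda: self._count <= 0, timeout)


def run_load_test(num_workers, requests_per_worker, send_request):
    """Release all workers together; return (total_requests, elapsed_seconds)."""
    start_gate = CountDownLatch(1)            # opened once by the coordinator
    done_gate = CountDownLatch(num_workers)   # one count per finished worker
    completed = [0] * num_workers             # each worker writes only its own slot

    def worker(idx):
        start_gate.await_zero()               # wait until everyone is ready
        for _ in range(requests_per_worker):
            send_request()
            completed[idx] += 1
        done_gate.count_down()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    started = time.perf_counter()
    start_gate.count_down()                   # open the gate
    done_gate.await_zero()                    # wait for all workers to finish
    elapsed = time.perf_counter() - started
    for t in threads:
        t.join()
    return sum(completed), elapsed
```

Throughput is then `total / elapsed`; the point of the latch is that the clock only covers the period when every worker is actually generating load.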

Hive at LinkedIn (bottom left). All 3
similar

LinkedIn Simple Abstractions


Teradata with Hadoop



Multiple clusters: Prod/Dev/Research (POC?)
Hive: ad hoc small ETL (lower left-hand corner)
Pig/DataFu + enhancements for ETL production
Multiple data stages in green box (POC: Abinitio data staging, REST API for staging).



Workflow POC; Oozie+Pig+Hive. Add Web UI



Data Staging POC: CDK as example

POC Coding Style






High-level directory with Maven subprojects; a simple archetype is OK
Define data repositories with Avro schemas; start with a simple file repository with files copied from the Abinitio file system. No need to spend time reverse engineering; just copy
Add pig and hive directories to cdk-examples
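As an illustration of defining a data repository with an Avro schema, here is a minimal sketch that builds a record schema as plain JSON (an Avro `.avsc` file is JSON). The record name, namespace, and every field name are hypothetical placeholders, not taken from any Capital One or Abinitio layout.

```python
import json

# Hypothetical record layout for one staged file; all names are illustrative.
TRANSACTION_SCHEMA = {
    "type": "record",
    "name": "Transaction",
    "namespace": "com.example.staging",
    "fields": [
        {"name": "account_id", "type": "string"},
        {"name": "amount_cents", "type": "long"},
        {"name": "event_time_ms", "type": "long"},
        # A union with "null" makes the field optional.
        {"name": "memo", "type": ["null", "string"], "default": None},
    ],
}


def schema_to_avsc(schema):
    """Serialize the schema dict to the JSON text that goes in an .avsc file."""
    return json.dumps(schema, indent=2)


def field_names(schema):
    """List field names in declaration order."""
    return [field["name"] for field in schema["fields"]]
```

Keeping the schema next to the copied files lets Pig, Hive, and the staging code agree on one definition without reverse engineering the source system.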

POC Simple extensions








Define a webserver in the CDK and create a REST API. Jersey/.../DI if you want more advanced coding styles
The webserver graphs Hive/Pig/ETL performance using JVM metrics and by sending dummy queries in.
Start Nagios/Ganglia monitoring and Puppet deployment of the CDK as learning for larger scale
Integrate the CDK into Bigtop as practice for a Capital One distribution

Simple Netflix Abstractions



http://www.slideshare.net/adrianco/netflix-architectu
Automated software develop-and-deploy process on APIs: Perforce/Ivy/Jenkins. For the Hadoop POC: GitHub, Jenkins, deploy to a demo webpage. No code sitting in an Eclipse project

Netflix Automated App Dev/Deploy


A REST specification makes web UIs easier. C1 ETL REST interface

Netflix Instance config


Do the same for Capital One as an exercise to help with deployment: with Apache Bigtop, define 1) an NN instance and 2) a DN/RS instance, then customize the scripts per instance

Netflix Security


Default: turn off iptables/SELinux. Define Capital One POC testing? Start with auditing requirements on the test cluster (with Aravind)

Netflix Metrics


Send dummy queries through to measure
latency
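A minimal sketch of measuring latency with dummy queries: time each call, then report nearest-rank percentiles rather than an average, since tail latency is usually what breaks an SLA. `run_query` is a hypothetical callable standing in for whatever client sends the dummy query; nothing here is specific to Netflix's tooling.

```python
import math
import time


def measure_latency(run_query, num_queries):
    """Call run_query num_queries times; return per-call latencies in seconds."""
    samples = []
    for _ in range(num_queries):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return samples


def percentile(samples, pct):
    """Nearest-rank percentile of samples, for pct in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

For example, `percentile(measure_latency(send_dummy_query, 1000), 99)` gives the latency that 99% of the dummy queries stayed under, where `send_dummy_query` is whatever function issues the probe.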

Netflix Scaling Layer: do simpler first, JDBC managed connection pool, Pig/Hive
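The "simpler first" option, a managed connection pool, can be sketched as a fixed set of connections handed out through a blocking queue, so the number of clients hitting the cluster at once is capped. `ConnectionPool` and the `connect` factory are hypothetical; a real layer would wrap a JDBC or Hive/Pig client connection instead of the placeholder object used here.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: at most `size` connections are ever in use at once."""

    def __init__(self, connect, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())   # open every connection up front

    @contextmanager
    def connection(self, timeout=None):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)        # always return it, even after an error
```

Callers write `with pool.connection() as conn: ...`; a caller that arrives when the pool is exhausted waits instead of opening yet another connection to the cluster.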

Yahoo Block Diagram, Pig, Hive,
Spark, Storm

LinkedIn/Yahoo/Netflix References






Reference: LinkedIn: Muhammed Islam, http://www.slideshare.net/mislam77/hive-at-linkedin
Yahoo: Chris Drome, for outside business users. Very similar to the slide before.
Netflix: Jeff Magnusson. Hive used for ad hoc queries and lightweight ETL (on web also)

ETL - Pig


Original Pig paper:
http://infolab.stanford.edu/~usriv/papers/pig-latin.pdf
− ETL language based on relational algebra (reorder/set) vs. SQL queries. Each step is M/R ETL
− No transactional consistency or indexes (other projects have this)
− Nested data model vs. flat SQL E/R model. Why? Faster scan performance, replaces joins, e.g. MongoDB
Requires developing UDFs; LinkedIn: DataFu






Netflix: Lipstick for debugging Pig DAGs. Will need
some debugging tool. Better than Spill

ETL - Pig


M/R ETL Points
− Data is distributed on several nodes; merge sort the results at the end. Be careful sending data across the network. Doesn't scale with more users: network limitation
− Google: custom network switch, 1k+ ports. Custom TCP stack, modified OS
− Careful: streams scale, so do ETL with streams. Real-time performance. Send results to a separate server. Do not embed writes into stream POCs
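The first point, sorting on each node and merge sorting the results at the end, can be sketched as follows. Plain Python lists stand in for per-node shards; `sort_shard` and `distributed_sort` are hypothetical names, and a real M/R job would move each sorted run across the network, which is the cost the slide warns about.

```python
import heapq


def sort_shard(records):
    """Work done locally on one node: sort that node's records."""
    return sorted(records)


def distributed_sort(shards):
    """Final step: k-way merge of the already-sorted per-node runs."""
    sorted_runs = [sort_shard(shard) for shard in shards]
    return list(heapq.merge(*sorted_runs))
```

The merge only ever compares the current head of each run, so the final step is cheap in CPU; the expensive part is shipping every run to the machine that merges them.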

Pig Usage






Yahoo (http://www.linkedin.com/pub/chrisdrome/2/a2/346): thousands of ETL jobs daily, Hive for small user base external to Yahoo
Netflix (http://www.linkedin.com/in/jmagnuss): thousands of jobs, at analyst level. Open sourced Lipstick, a Pig UI debugging tool
LinkedIn (http://www.slideshare.net/hadoopusergroup/pig-at-linkedin): thousands of jobs, open sourced DataFu UDFs

PIG POCS(~2009)


Possible Pig POCs:
− Top XX queries: manually code up Abinitio queries. Is this already completed (2012)? Which queries?
− Add a JDBC connection type scaling layer to PigServer.java
− Out of scope for 4/30/14: POC Tez on Pig: https://issues.apache.org/jira/browse/PIG-3446. Apache's Pig optimizer (MR->MR->MR goes to MRRR) by writing the optimizer in the YARN AM.

POC quality


Turn the POCs into Bigtop integration tests and
get open source approval. Commit changes to
verify quality and accountability

Hive 0.11







More difficult to configure; add a MySQL metastore
Moving to HCatalog so metadata is accessible by other Hadoop components
Access using WebHCat, in progress
Hive Stinger using Tez, additional in-memory optimization
No time spent on this yet; starting 1/2014 with Hortonworks. Last day 4/30/2014

Hive 0.11

Hive 0.11 POCs:
User guide for Abinitio programmers using Hive/Pig
Test multitenancy features with Pig/HDFS
Test JDK 1.7 features. Hadoop 2.x works with 1.7
HiveMetastore/MySQL/HCatalog/WebHCat
Test cluster performance using benchpress
Next gen: 0.12-0.13; Spark/Shark HiveQL compatibility

Next Gen ETL Frameworks for
2014+


Faster reads/scans w/o using HBase. 3 developments (wibidata):
− Spark/Shark
− Dremel: Impala/Apache Drill
− Hive/Tez

Dremel paper review: Interactive Analysis of Web-Scale Datasets
− Don't use M/R for speed, 100x faster
− Column schema: nested column-oriented storage, not rows; faster for some types of queries!!!
− Partition key (not in paper)
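A small sketch of why column-oriented storage is faster for some queries: an aggregate over one field only has to touch that field's values, instead of reading every whole row. This uses flat records and hypothetical field names; it does not implement Dremel's nested encoding (repetition and definition levels), only the row-versus-column layout idea.

```python
# Hypothetical click records stored row by row.
ROWS = [
    {"user": "a", "country": "US", "clicks": 3},
    {"user": "b", "country": "DE", "clicks": 5},
    {"user": "c", "country": "US", "clicks": 1},
]


def to_columns(rows):
    """Pivot row records into one list per field (assumes identical fields per row)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_from_rows(rows, field):
    """Row layout: every full record is visited to read a single field."""
    return sum(row[field] for row in rows)


def sum_from_columns(columns, field):
    """Column layout: only the one contiguous column is scanned."""
    return sum(columns[field])
```

Both functions return the same answer; the difference on disk is that the column version never reads `user` or `country` at all.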

Dremel Schema/Column Perf, similar to Kiji w/o HBase? Sqoop objects

Next Gen ETL




Shark/Spark; distributed memory RDD, analysis
and ETL
Hive/Tez

Next Gen ETL POCs (combine
mem)




Goal: develop the skills for getting higher HDFS read performance.
Staged data schema/representation effects on performance. Dremel nested columns:
− Data with Avro schemas and partition strategies: partition by timestamp, partition by custom rowkey, partition by schema definitions
− Measure the effect of data schema on M/R and non-M/R implementations. Conversion or staging process for data
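Two of the partition strategies listed above, sketched as functions that map a record to an HDFS-style directory. The `year=/month=/day=` layout follows the common Hive partition-directory convention; the base path, bucket count, and the choice of CRC32 for hashing the rowkey are assumptions made for illustration.

```python
import zlib
from datetime import datetime, timezone


def partition_by_timestamp(base_dir, epoch_ms):
    """Directory for a record, derived from its event time (UTC, daily grain)."""
    ts = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return f"{base_dir}/year={ts.year:04d}/month={ts.month:02d}/day={ts.day:02d}"


def partition_by_rowkey(base_dir, rowkey, num_buckets):
    """Directory for a record, derived from a stable hash of its rowkey."""
    bucket = zlib.crc32(rowkey.encode("utf-8")) % num_buckets
    return f"{base_dir}/bucket={bucket:03d}"
```

Timestamp partitions let a query for one day skip every other directory, while rowkey buckets spread hot keys evenly; measuring both against the same queries is the kind of comparison this POC calls for.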

Next Gen ETL


Addition of new components into Hadoop
− CDH will come with Spark/Shark
− CDH comes with Impala
− HDP status unknown for now (clear EOM)

Hadoop Distributions


Create a Capital One distribution


Why? Production is 3-4x the amount of work compared to dev
− Make sure it is ready for production before development is completed
Refactoring of scripts, bin and sbin, to allow admins and users access to admin/user scripts
Customize and add components (scaling layer)
Puppet/Chef scripts for cluster deployment
Real-time monitoring (not provided in CDH/HDP), hotspot detection for long-running jobs
Being ready for cluster deployment allows integration of functional requirements like security into functional Groovy iTests.

Possible Hadoop Distro POCs


Beginner POCs:



Goal: smooth handoff from dev to production
− Build Apache Bigtop (will need reference doc)
− Add components you are currently using that are not in the distro (e.g. MongoDB + HBase for schema)
− Add integration tests
− Add Puppet recipes
− Learn how to apply patches, how to customize for simple modifications, production stability

POC framework



Goal: contribute open source code
Start with the documentation and s/w
processes first
− DocBook
− Jenkins server: http://apachebigtop.pbworks.com/w/file/49310946/Apache%20Bigtop%20%20Jenkins.docx

POC Framework/Roadmap


Track the Jiras!!!
− Multitenancy needs a test plan.
− Development environment using Vagrant instead of EC2. Cheaper, easier to administer
− https://issues.apache.org/jira/browse/BIGTOP-1171
Create a Capital One Hadoop* user guide
− Create a functional spec for missing components
Include test cases for security, multiuser access, minimum performance to meet SLAs

Scaling


Astyanax on Cassandra (Netflix)
− Small companies don't have 300 users accessing HDFS. Manage the clients.
Some examples. Scaling involves multiple components above the cluster h/w and Hadoop daemons. This is NOT running CDH or HDP using Ambari or Cloudera Manager
Gives SLA and ad hoc high-priority jobs
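One way a client-managing layer could give ad hoc high-priority jobs precedence over batch work is a priority queue in front of the cluster. This is a sketch under assumptions: the three priority classes and the `JobQueue` name are hypothetical, and a production layer would also need admission control and per-user limits.

```python
import heapq
import itertools

# Hypothetical priority classes; a lower number is dispatched first.
ADHOC_HIGH, SLA_BATCH, BACKGROUND = 0, 1, 2


class JobQueue:
    """Dispatch jobs by priority class, FIFO within the same class."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()   # tie-breaker preserves arrival order

    def submit(self, priority, job):
        heapq.heappush(self._heap, (priority, next(self._order), job))

    def next_job(self):
        """Return the next job to run, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

Because every client submits through the queue rather than straight to HDFS, the layer decides what runs next instead of the cluster absorbing all 300 users at once.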

Capital One will need a custom
component




Either for security or scaling or … even to separate batch analytics queries from ad hoc queries
Break down into 2 bigger steps:
− Cluster testing tool for scaling/security
− Develop a multiuser client layer using the above, and measure performance and modified use cases

Building a scaling layer


Need a tool for testing. Need to know how to use ZooKeeper at a minimum.
− Impossible to figure out via web searches
− Leader election and countdown latch
− Most people do their POCs incorrectly.
Worst mistake is multiple threads on a single server
Second worst mistake is using HBase PerformanceEvaluation.java as a reference. PE.java is not cluster aware
Test cluster throughput for cluster scaling
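The leader-election logic can be sketched without a cluster. In the usual ZooKeeper recipe each server creates an ephemeral sequential node, the lowest sequence number is the leader, and every other server watches the node just below its own. The `ElectionRegistry` class below is an in-memory stand-in written for illustration; it is not the ZooKeeper client API, and it models session loss as an explicit `leave` call.

```python
import itertools


class ElectionRegistry:
    """In-memory model of an election path holding ephemeral sequential nodes."""

    def __init__(self):
        self._seq = itertools.count()
        self._members = {}                 # sequence number -> server name

    def join(self, server):
        """Register a server; returns its sequence number."""
        seq = next(self._seq)
        self._members[seq] = server
        return seq

    def leave(self, seq):
        """Models an ephemeral node vanishing when the server's session ends."""
        self._members.pop(seq, None)

    def leader(self):
        """The member with the lowest sequence number leads."""
        if not self._members:
            return None
        return self._members[min(self._members)]

    def watched_predecessor(self, seq):
        """The member just below seq, which that server would watch."""
        lower = [s for s in self._members if s < seq]
        return self._members[max(lower)] if lower else None
```

Watching only the predecessor, rather than the leader, means a leader failure wakes one server instead of all of them; that handoff is the behavior a failure test should exercise.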

Analytics





Review and Demo (weblog targeting)
Concepts to agree on first: modeling and targeting
http://www.slideshare.net/DougChang1/demographics-andweblogtargeting-10757778

Analytics, (wibidata), schema,
model, targeting, use db vs hbase

Analytics


Model iteration performance is key: O(n^2) in the number of users
− Random Forest: 6-8h on a MacBook
Sponsorship from EMC: free 1k-node cluster + Gemfire for faster model building
Hadoop: HDFS + M/R for certain specific use cases
− Batch analysis, log analysis. Click log analysis from large disk files
− ETL, M/R ETL only. Much, much slower than any commercial system

Analytics 2014+


Visualizations
− Tableau/Datameer POC? Data+Queries?
Deep Learning case studies:
− Google Now >> Apple Siri. Deep learning models replaced Gaussian mixture models (GMMs)
Background refresher: speech recognition
− Deep learning as a replacement for GMMs in the acoustic model, http://www.stanford.edu/class/cs224s/2006/
− Can do POCs here for innovation. Requires outside consultant assistance

Deliverables avail today


Start the Capital One distribution
− Build instructions
− Functional Specification: Capital One Hadoop Distro POC
Planned, need approval before starting
− Data Staging
Functional Specification: Capital One Data Staging POC
Functional Specification: Data Staging API
ETL Performance POC
− Functional Specification: Top 100 queries from Abinitio

Capital One Block Diagram
(Diagram labels: REST: Batch ETL, M/R; REST: AdHoc M/R; Real Time ETL, no M/R; Streams/Storm, Real Time Analytics; HCatalog/Schema; Scaling Layer; HDFS)

POCs




Data ingestion: POC with Apache Kafka; test fixture needed. Current abilities may not be there
Hadoop ETL:
− Schema definition
Write/read query performance of top 10/100 Abinitio queries. How close is current ETL to Abinitio? Assume this answer exists.

POCs




Hadoop Dev->Production: build a Capital One distribution with Apache Bigtop; replicate the CDH configuration with HDFS/Pig/Hive/Oozie/Flume/Spark. Leave out Impala, not currently in Bigtop
Scaling: POC intermediate layer.
