2. • Architect and Dev Mgr at ubix.ai … data science platform
• Infrastructure & real-time services -> automating data science at scale
• 3rd Seattle Spark Meetup!
• xPatterns Big Data Platform (Spark, Mesos, Tachyon, Cassandra …)
• Strata, Spark, C* summits & local meetups
About Me
3. • Ubix Data Eng & Science Platform Architecture
• High dimensional sparse feature spaces
• OKA (OverKill Analytics) and Composite Modelling
• (Kaggle) Outbrain Click Prediction: demo in DSL Workbench
• pymap deep dive: distributed scikit-learn through Spark
• python injection into DSL: pySpark scala JVM interop
• Q&A
Agenda
4. Data Eng & Science Platform: “Engine”
Unified big data technology stack (spark, cassandra, hadoop, kafka, es..)
Cloud agnostic architecture
Universal predictive interface (MLlib, ML Pipeline, VW, scikit-learn, R, H2O … TF)
Extensibility and integration via a fluent and expressive API (DSL)
Enterprise grade: scalability, performance, high availability, geo-replication,
resilience, security, manageability, interoperability, testability
7. • high dimensional feature engineering often demands sparse representation
• spark and scipy support vs ubix DSL: compress-sparse, merge-sparse, expand-sparse,
filter-sparse, load sparse (libsvm format)
• sparse format: native input to mllib, spark.ml, scikit-learn algos
• exceptions: spark 1.6 mllib’s kmeans, gmm, RF (breeze linear algebra or … slow)
• feature (2-way) encoding + vocabulary extraction (error analysis, importance)
• Dimensionality Reduction via Feature Selection (ChiSquare) and Hashing (text)
High dimensional sparse feature spaces
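The sparse representation, the libsvm format and the hashing idea from this slide can be illustrated with a minimal standard-library sketch. It is not the ubix DSL and not the Spark or scikit-learn API: the helper names (`parse_libsvm_line`, `hash_tokens`, `to_dense`), the dict-based sparse row and the use of `zlib.crc32` as the hash function are all assumptions made for illustration.

```python
import zlib
from typing import Dict, List, Tuple

# Assumed representation: a sparse row stores only its non-zero entries.
SparseRow = Dict[int, float]


def parse_libsvm_line(line: str) -> Tuple[float, SparseRow]:
    """Parse '<label> <index>:<value> ...' into (label, {index: value}).

    Indices are kept exactly as written in the line (libsvm files are
    usually 1-based; no shifting is done here).
    """
    parts = line.split()
    label = float(parts[0])
    features: SparseRow = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


def hash_tokens(tokens: List[str], num_buckets: int) -> SparseRow:
    """Hashing trick: map each token to one of num_buckets columns and count it."""
    row: SparseRow = {}
    for token in tokens:
        # crc32 is stable across runs, unlike the built-in hash() for strings.
        bucket = zlib.crc32(token.encode("utf-8")) % num_buckets
        row[bucket] = row.get(bucket, 0.0) + 1.0
    return row


def to_dense(row: SparseRow, size: int) -> List[float]:
    """Expand a 0-based sparse row into a dense list (sensible only for small sizes)."""
    dense = [0.0] * size
    for index, value in row.items():
        dense[index] = value
    return dense


label, features = parse_libsvm_line("1 3:0.5 10:2")
hashed = hash_tokens("the quick brown fox the".split(), num_buckets=16)
dense = to_dense({1: 2.0}, size=3)
```

Hashing caps the feature space at `num_buckets` columns without keeping a vocabulary, at the cost of collisions, so it does not by itself give the 2-way encoding needed for error analysis and feature importance.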
• OKA: “design philosophy for predictive models favors volume over precision, utility over
elegance, and CPU over IQ. … brute force attack on data science, compromise fine-tuning”
• Alternative to Dimensionality reduction - train on full sparse feature space!
• Composite Modeling = managing part models as one ensemble
• distributed scikit-learn/TF/VW models -> prediction table output for averaging, voting
• unsupervised learning output -> input supervised learning (clustering + ensembling)
• dimensionality reduction or building semantically different models within clusters
• OKA + Comp: larger feature spaces (lower variance in parts -> higher bias in part models)
OKA (OverKill Analytics) & Composite Modelling
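As a rough illustration of combining part models through a prediction table, the sketch below averages scores and takes a majority vote per row. The table layout (one row per example, one column per part model), the model names and the 0.5 threshold are assumptions for the example, not the platform's actual schema.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical prediction table row: part-model name -> predicted click probability.
PredictionRow = Dict[str, float]


def average_scores(row: PredictionRow) -> float:
    """Unweighted mean of the part models' predicted probabilities."""
    return sum(row.values()) / len(row)


def majority_vote(row: PredictionRow, threshold: float = 0.5) -> int:
    """Each part model votes 1 if its score reaches the threshold; ties go to 0."""
    votes = Counter(1 if score >= threshold else 0 for score in row.values())
    return 1 if votes[1] > votes[0] else 0


# Made-up scores from three hypothetical part models.
prediction_table: List[PredictionRow] = [
    {"sklearn_lr": 0.9, "vw": 0.6, "tf_dnn": 0.3},
    {"sklearn_lr": 0.2, "vw": 0.4, "tf_dnn": 0.7},
]

averaged = [average_scores(row) for row in prediction_table]
voted = [majority_vote(row) for row in prediction_table]
```

Averaging keeps a continuous score for ranking, while voting yields a hard label. Weighted variants would follow the same shape with a per-model weight.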
10. • Outbrain: content discovery platform … 250 billion personalized recommendations/month
• Kaggle: predict which recommended content each user will click?
• sample of users’ page views and clicks (14 days) .. sets of content recommendations
served to a specific user in a specific context +
• document metadata: mentioned entities (person, organization, location), a taxonomy of
categories, the topics mentioned, and the publisher.
• 2 Billion page views, 16,900,000 clicks of 700 Million unique users, across 560 sites
Outbrain Click Prediction
11. • primitives for model management (model + metadata)
• optimizations for clustering + composite modeling techniques
• compute partition size/count to avoid OOM (simple with static allocation of resources
(Mesos/Coarse Grained or YARN))
• wrapped pySpark (jvmContext) through gateway servercontext (JavaGateway)
• python-scala interop through cached temp tables (registerTempTable)
pymap - distributed python
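The partition sizing step can be sketched as simple arithmetic once resources are statically allocated. The function below is a hypothetical helper, not pymap's actual logic: the parameter names and the `memory_fraction` default (the share of executor memory budgeted for the Python-side copy of a partition) are assumptions.

```python
import math


def partition_count(total_bytes: int,
                    executor_memory_bytes: int,
                    cores_per_executor: int,
                    memory_fraction: float = 0.25,
                    min_partitions: int = 1) -> int:
    """Number of partitions such that each concurrently running task fits its memory share."""
    # Executor memory is shared by all tasks running at once (one per core),
    # and only a fraction of it is budgeted for the partition data itself.
    per_task_budget = executor_memory_bytes * memory_fraction / cores_per_executor
    return max(min_partitions, math.ceil(total_bytes / per_task_budget))


GIB = 2 ** 30
# 64 GiB of data, 16 GiB executors with 4 cores each: 1 GiB budget per task.
n_partitions = partition_count(64 * GIB, 16 * GIB, cores_per_executor=4)
```

With static allocation the executor memory and core count are known up front, which is what makes this estimate simple. Under dynamic allocation the inputs would have to be assumed or measured.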