Gimel Data Platform Overview
Agenda
• Introduction
• PayPal & Big Data Space
• Analytics Platform & Gimel
• Why Gimel
• Challenges in Analytics
• Walk through simple use case
• Gimel – Implementation Details
• Gimel – Open Source & future
• Q & A
About Us
Romit Mehta
• Product manager, data processing products at PayPal
• 20 years in data and analytics across networking, semiconductors, telecom, security and fintech industries
• Data warehouse developer, BI program manager, data product manager
romehta@paypal.com
https://www.linkedin.com/in/romit-mehta/
Deepak Mohanakumar Chandramouli
• Big data platform engineer at PayPal
• 13 years in data engineering, 5 years building scalable big data solutions
• Developed several Spark-based solutions across NoSQL, key-value, messaging, document-based and relational systems
dmohanakumarchan@paypal.com
https://www.linkedin.com/in/deepakmc/
PayPal – Key Metrics
PayPal Customers, Transactions and Growth
From: https://www.paypal.com/us/webapps/mpp/about
PayPal Big Data Platform
• 13 prod clusters, 12 non-prod clusters
• GPU co-located with Hadoop
• 150+ PB of data
• 40,000+ YARN jobs/day
• One of the largest Aerospike, Teradata, Hortonworks and Oracle installations
• Compute supported: MR, Pig, Hive, Spark, Beam
PayPal Analytics Ecosystem and Gimel Platform (Unified Data Processing Platform)
[Architecture diagram: Gimel Data Platform]
• Users: developer, data scientist, analyst, operator
• User experience and access: Gimel SDK, notebooks, R Studio, BI tools
• Gimel Data Platform: PCatalog, Data API
• Compute framework and APIs, with logging, monitoring, alerting, security and application lifecycle management
• Infrastructure services leveraged for elasticity and redundancy: multi-DC, public cloud, predictive resource allocation
Why Gimel?
Use case - Flights Cancelled
Use case challenges
[Pipeline diagram]
• Data points: flight events, airports, airlines, carrier, geography & geo tags
• Data sources: Kafka (stream ingest), Teradata (extract/load), external data (load), HDFS/Hive (real-time/processed data)
• Data prep / availability: Parquet, ORC or text?
• Process → Analysis → Publish
• Throughout: productionalize, logging, monitoring, alerting, auditing, data quality
[Code screenshot: Spark read from HBase]
Data Access Code is Cumbersome and Fragile
[Code screenshots: Spark reads from HBase, Elasticsearch, Aerospike and Druid]
Data Access Code is Cumbersome and Fragile
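The code screenshots do not survive this export; as a stand-in, here is a minimal sketch of what a direct Spark-on-HBase read involves (assuming the SHC connector; table and column names are hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

val spark = SparkSession.builder().appName("hbase-read-sketch").getOrCreate()

// Hand-maintained catalog JSON tying DataFrame columns to HBase column
// families; this must track the physical table layout by hand.
val catalog =
  s"""{
     |  "table": {"namespace": "default", "name": "flights"},
     |  "rowkey": "key",
     |  "columns": {
     |    "flight_id": {"cf": "rowkey", "col": "key", "type": "string"},
     |    "carrier":   {"cf": "d", "col": "carrier", "type": "string"},
     |    "status":    {"cf": "d", "col": "status", "type": "string"}
     |  }
     |}""".stripMargin

// Store-specific format string and options; this code breaks whenever the
// connector, the HBase version, or the table layout changes.
val df = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()
df.show(10)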
Dataset Challenges
• Data access tied to compute and data store versions
• Hard to find available datasets
• Storage-specific dataset creation results in duplication and increased latency
• No audit trail for dataset access
• No standards for on-boarding datasets for others to discover
• No statistics on dataset usage and access trends
High-friction Data Application Lifecycle
Every change restarts the full cycle: Learn → Code → Optimize → Build → Deploy → Run
• Onboarding big data apps
• Compute engine changed
• Compute version upgraded
• Storage API changed
• Storage connector upgraded
• Storage hosts migrated
• Storage changed
• *********************
Gimel
Gimel | Flights Cancelled
Search PCatalog
Sign in to the PCatalog portal and search for your datasets
Find your datasets
Gimel DataSet | Overview
Gimel DataSet | Schema Spec
Relational DataSet
Kafka DataSet
Gimel DataSet | System Spec
Relational DataSet
Kafka DataSet
Gimel DataSet | Object Spec
Kafka DataSet
Relational DataSet
Gimel DataSet | Availability
Find your datasets | Recap
Gimel | Flights Cancelled
Analyze & Productionalize App
Access datasets: Navigate to Jupyter notebooks & analyze data
Set up the Application
Data API
Gimel | Flights App | Summary
Use case challenges – simplified with Gimel & Notebooks (Data API, PCatalog, tools)
[Pipeline diagram, as before]
• Data points: flight events, airports, airlines, carrier, geography & geo tags
• Data sources: Kafka (ingest), Teradata (extract/load), external data (load), HDFS/Hive
• Data prep / availability: Parquet, ORC or text?
• Process → Analysis → Publish
• Throughout: productionalize, logging, monitoring, alerting, auditing, data QC
Data Access Simplified with Gimel Data API
[Spark reads from HBase, Elasticsearch, Aerospike and Druid, all via the Data API ✔]
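In code form, a minimal sketch (the options map and dataset names follow the deck's later examples):

// One Data API entry point; the same read call works whether the dataset
// maps to HBase, Elasticsearch, Aerospike or Druid, because the store
// specifics are resolved from the catalog entry rather than the app code.
val dataSet: gimel.DataSet = DataSet(sparkSession)
val df = dataSet.read("pcatalog.HBASE_dataset", options)
df.show(10)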
SQL Support in Gimel Data Platform
[The same four reads expressed as SQL through the Data API ✔]
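The same reads and writes can also be driven as plain SQL; a sketch in Scala (GimelQueryProcessor is the entry point the Thrift Server and Livy integrations call later in this deck):

// Execute Gimel SQL from any Spark application; dataset names as in the deck.
val sql =
  """insert into pcatalog.HIVE_dataset
    |select kafka_ds.* from pcatalog.KAFKA_dataset kafka_ds""".stripMargin
GimelQueryProcessor.executeBatch(sql, sparkSession)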
Data Application Lifecycle with Data API
Onboarding still runs the full cycle; after that, every change only needs Run:
• Onboarding big data apps: Learn → Code → Optimize → Build → Deploy → Run
• Compute engine changed: Run
• Compute version upgraded: Run
• Storage API changed: Run
• Storage connector upgraded: Run
• Storage hosts migrated: Run
• Storage changed: Run
• *********************: Run
Gimel – Deep Dive
PayPal Analytics Ecosystem
[Architecture diagram]
• LIVY grid / job server: batch jobs submitted via the Livy API (NAS-backed)
• Interactive compute: interactive sessions, Sparkling Water, Spark Thrift Server, History Server, metrics
• Logging: log → indexing → search
• Metadata services: xDiscovery scans stores to discover datasets and maintain the catalog
• PCatalog UI: explore and configure datasets
A peek into Streaming SQL
Launches … Spark Streaming App
-- Streaming Window Seconds
set gimel.kafka.throttle.streaming.window.seconds=10;
-- Throttling
set gimel.kafka.throttle.streaming.maxRatePerPartition=1500;
-- ZK checkpoint root path
set gimel.kafka.consumer.checkpoint.root=/checkpoints/appname;
-- Checkpoint enabling flag - implicitly checkpoints after each mini-batch in streaming
set gimel.kafka.reader.checkpoint.save.enabled=true;
-- Jupyter Magic for streaming SQL on Notebooks | Interactive Usecases
-- Livy REPL - Same magic for streaming SQL works | Streaming Usecases
%%gimel-stream
-- Assume Pre-Split HBASE Table as an example
insert into pcatalog.HBASE_dataset
select
cust_id,
kafka_ds.*
from pcatalog.KAFKA_dataset kafka_ds;
Batch SQL
Launches … Spark Batch App
-- Establish 10 concurrent connections per Topic-Partition
set gimel.kafka.throttle.batch.parallelsPerPartition=10;
-- Fetch at most 10M messages from each partition
set gimel.kafka.throttle.batch.maxRecordsPerPartition=10000000;
-- Jupyter Magic on Notebooks | Interactive Usecases
-- Livy REPL - Same magic works | Batch Usecases
%%gimel
insert into pcatalog.HIVE_dataset
partition(yyyy,mm,dd,hh,mi)
select kafka_ds.*,gimel_load_id
,substr(commit_timestamp,1,4) as yyyy
,substr(commit_timestamp,6,2) as mm
,substr(commit_timestamp,9,2) as dd
,substr(commit_timestamp,12,2) as hh
,case when cast(substr(commit_timestamp,15,2) as INT) <= 30 then "00" else "30" end as mi
from pcatalog.KAFKA_dataset kafka_ds;
The following are Jupyter/Livy magic terms:
• %%gimel: calls gimel.executeBatch(sql)
• %%gimel-stream: calls gimel.executeStream(sql)
Anatomy of API

How a Gimel SQL statement executes, end to end:

1. The catalog resolves the dataset:

CatalogProvider.getDataSetProperties("dataSetName")  // backed by Metadata Services

2. Factories map each storage type to a connector:

gimel.dataset.factory {
  KafkaDataSet
  ElasticSearchDataSet
  DruidDataSet
  HiveDataSet
  AerospikeDataSet
  HbaseDataSet
  CassandraDataSet
  JDBCDataSet
}

gimel.datastream.factory {
  KafkaDataStream
}

val storageDataSet = getFromFactory(type = "Hive")
val storageDataStream = getFromStreamFactory(type = "kafka")

Each core connector (example: Kafka) is a combination of open source connectors (such as DataStax, SHC, ES-Spark) and in-house implementations.

3. The unified read/write calls delegate to the storage-specific connector:

dataSet.read("dataSetName", options)
dataSet.write(dataToWrite, "dataSetName", options)
dataStream.read("dataSetName", options)
// dispatching to, for example:
kafkaDataSet.read("dataSetName", options)
hiveDataSet.write(dataToWrite, "dataSetName", options)
storageDataStream.read("dataSetName", options)

4. A SQL statement such as:

-- Establish 10 concurrent connections per Topic-Partition
set gimel.kafka.throttle.batch.parallelsPerPartition=10;
-- Fetch at most 10M messages from each partition
set gimel.kafka.throttle.batch.maxRecordsPerPartition=10000000;

%%gimel
insert into pcatalog.HIVE_dataset
partition(yyyy,mm,dd,hh,mi)
select kafka_ds.*, gimel_load_id
,substr(commit_timestamp,1,4) as yyyy
,substr(commit_timestamp,6,2) as mm
,substr(commit_timestamp,9,2) as dd
,substr(commit_timestamp,12,2) as hh
,case when cast(substr(commit_timestamp,15,2) as INT) <= 30 then "00" else "30" end as mi
from pcatalog.KAFKA_dataset kafka_ds
join default.geo_lkp lkp
on kafka_ds.zip = lkp.zip
where lkp.region = 'MIDWEST'

is resolved internally as:

val dataSet: gimel.DataSet = DataSet(sparkSession)
val df1 = dataSet.read("pcatalog.KAFKA_dataset", options)
df1.createGlobalTempView("tmp_abc123")
val resolvedSelectSQL = selectSQL.replace("pcatalog.KAFKA_dataset", "tmp_abc123")
val readDf: DataFrame = sparkSession.sql(resolvedSelectSQL)
dataSet.write("pcatalog.HIVE_dataset", readDf, options)
Catalog Provider – USER | HIVE | PCATALOG | Your Own Catalog

set gimel.catalog.provider=PCATALOG
CatalogProvider.getDataSetProperties("dataSetName")  -- resolved via PCatalog Metadata Services

set gimel.catalog.provider=USER
CatalogProvider.getDataSetProperties("dataSetName")  -- resolved from user-supplied properties

set gimel.catalog.provider=HIVE
CatalogProvider.getDataSetProperties("dataSetName")  -- resolved from Hive table properties
sql> set dataSetProperties={
"key.deserializer":"org.apache.kafka.common.serialization.StringDeserializer",
"auto.offset.reset":"earliest",
"gimel.kafka.checkpoint.zookeeper.host":"zookeeper:2181",
"gimel.storage.type":"kafka",
"gimel.kafka.whitelist.topics":"kafka_topic",
"datasetName":"test_table1",
"value.deserializer":"org.apache.kafka.common.serialization.ByteArrayDeserializer",
"value.serializer":"org.apache.kafka.common.serialization.ByteArraySerializer",
"gimel.kafka.checkpoint.zookeeper.path":"/pcatalog/kafka_consumer/checkpoint",
"gimel.kafka.avro.schema.source":"CSR",
"gimel.kafka.zookeeper.connection.timeout.ms":"10000",
"gimel.kafka.avro.schema.source.url":"http://schema_registry:8081",
"key.serializer":"org.apache.kafka.common.serialization.StringSerializer",
"gimel.kafka.avro.schema.source.wrapper.key":"schema_registry_key",
"gimel.kafka.bootstrap.servers":"localhost:9092"
}
sql> select * from pcatalog.test_table1;
spark.sql("set gimel.catalog.provider=USER")

val dataSetOptions = DataSetProperties(
  "KAFKA",
  Array(Field("payload", "string", true)),
  Array(),
  Map(
    "datasetName" -> "test_table1",
    "auto.offset.reset" -> "earliest",
    "gimel.kafka.bootstrap.servers" -> "localhost:9092",
    "gimel.kafka.avro.schema.source" -> "CSR",
    "gimel.kafka.avro.schema.source.url" -> "http://schema_registry:8081",
    "gimel.kafka.avro.schema.source.wrapper.key" -> "schema_registry_key",
    "gimel.kafka.checkpoint.zookeeper.host" -> "zookeeper:2181",
    "gimel.kafka.checkpoint.zookeeper.path" -> "/pcatalog/kafka_consumer/checkpoint",
    "gimel.kafka.whitelist.topics" -> "kafka_topic",
    "gimel.kafka.zookeeper.connection.timeout.ms" -> "10000",
    "gimel.storage.type" -> "kafka",
    "key.serializer" -> "org.apache.kafka.common.serialization.StringSerializer",
    "value.serializer" -> "org.apache.kafka.common.serialization.ByteArraySerializer"
  )
)

dataSet.read("test_table1", Map("dataSetProperties" -> dataSetOptions))
CREATE EXTERNAL TABLE `pcatalog.test_table1`
(payload string)
LOCATION 'hdfs://tmp/'
TBLPROPERTIES (
  "datasetName"="dummy",
  "auto.offset.reset"="earliest",
  "gimel.kafka.bootstrap.servers"="localhost:9092",
  "gimel.kafka.avro.schema.source"="CSR",
  "gimel.kafka.avro.schema.source.url"="http://schema_registry:8081",
  "gimel.kafka.avro.schema.source.wrapper.key"="schema_registry_key",
  "gimel.kafka.checkpoint.zookeeper.host"="zookeeper:2181",
  "gimel.kafka.checkpoint.zookeeper.path"="/pcatalog/kafka_consumer/checkpoint",
  "gimel.kafka.whitelist.topics"="kafka_topic",
  "gimel.kafka.zookeeper.connection.timeout.ms"="10000",
  "gimel.storage.type"="kafka",
  "key.serializer"="org.apache.kafka.common.serialization.StringSerializer",
  "value.serializer"="org.apache.kafka.common.serialization.ByteArraySerializer"
);

spark-sql> select * from pcatalog.test_table1;
scala> dataSet.read("test_table1", Map("dataSetProperties" -> dataSetOptions))
Your own catalog:

set gimel.catalog.provider=YOUR_CATALOG
CatalogProvider.getDataSetProperties("dataSetName")
{
  // Implement this!
}
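A minimal sketch of what such a plug-in could look like; the object name, signature and service URL are assumptions modeled on the CatalogProvider calls above, not the actual Gimel interface:

// Hypothetical custom catalog provider resolving dataset names against an
// in-house metadata service instead of PCatalog or Hive.
object MyCompanyCatalogProvider {

  def getDataSetProperties(dataSetName: String): Map[String, String] =
    fetchFromMetadataService(dataSetName)

  // Stubbed response; a real implementation would call the service
  // (e.g. GET https://metadata.example.com/datasets/<name>, a placeholder URL)
  // and translate its payload into Gimel dataset properties.
  private def fetchFromMetadataService(name: String): Map[String, String] =
    Map(
      "gimel.storage.type" -> "kafka",
      "gimel.kafka.bootstrap.servers" -> "localhost:9092",
      "gimel.kafka.whitelist.topics" -> name
    )
}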
Integration with ecosystems

Spark Thrift Server
org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala

// result = sqlContext.sql(statement)  <-- original SQL execution
// Integration of Gimel in Spark
result = GimelQueryProcessor.executeBatch(statement, sqlContext.sparkSession)
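With that one-line hook in place, any JDBC/Hive client pointed at the Thrift Server can run Gimel SQL. A hedged sketch (host, port and credentials are placeholders; the Hive JDBC driver must be on the classpath):

import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:hive2://sts-host:10000", "user", "")
val stmt = conn.createStatement()
// pcatalog.* names are resolved through the Gimel catalog before execution.
val rs = stmt.executeQuery("select * from pcatalog.KAFKA_dataset limit 10")
while (rs.next()) println(rs.getString(1))
conn.close()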
class SparkSqlInterpreter(conf: SparkConf) extends SparkInterpreter(conf) {
  private val SCALA_MAGIC = "%%[sS][cC][aA][lL][aA] (.*)".r
  private val PCATALOG_BATCH_MAGIC = "%%[gG][iI][mM][eE][lL](.*)".r
  private val PCATALOG_STREAM_MAGIC = "%%[gG][iI][mM][eE][lL]-[sS][tT][rR][eE][aA][mM] (.*)".r
  // ........
  case PCATALOG_BATCH_MAGIC(gimelCode) => GimelQueryProcessor.executeBatch(gimelCode, sparkSession)
  case PCATALOG_STREAM_MAGIC(gimelCode) => GimelQueryProcessor.executeStream(gimelCode, sparkSession)
  case _ =>
  // ........
  // .....

com/cloudera/livy/repl/SparkSqlInterpreter.scala
Livy REPL
sparkmagic/sparkmagic/kernels/sparkkernel/kernel.js

define(['base/js/namespace'], function(IPython){
  var onload = function() {
    IPython.CodeCell.config_defaults.highlight_modes['magic_text/x-sql'] = {'reg':[/^%%gimel/]};
  }
  return { onload: onload }
})

Jupyter Notebooks
Data Stores Supported
[Slide shows logos of the supported data stores]
Gimel – Open Source & Future
What’s Next
• Query optimization
• Open source
PCatalog:
• Metadata
services
• Discovery
services
• Catalog UI
• Livy features
committed back to
open source
• Python support
Jupyter features
committed back to
open source
©2018 PayPal Inc. Confidential and proprietary.
• Open source Gimel
(http://try.gimel.io)
Gimel - Open Sourced
Gimel: http://gimel.io
Codebase available: https://github.com/paypal/gimel
Slack: https://gimel-dev.slack.com
Google Groups: https://groups.google.com/d/forum/gimel-dev
Acknowledgements
Gimel and PayPal Notebooks team:
Andrew Alves
Anisha Nainani
Ayushi Agarwal
Baskaran Gopalan
Dheeraj Rampally
Deepak Chandramouli
Laxmikant Patil
Meisam Fathi Salmi
Prabhu Kasinathan
Praveen Kanamarlapudi
Romit Mehta
Thilak Balasubramanian
Weijun Qian
Q&A
Gimel Codelabs: http://try.gimel.io
Slack: https://gimel-dev.slack.com
Google Groups: https://groups.google.com/d/forum/gimel-dev
Appendix
References Used
Images referred: https://www.google.com/search?q=big+data+stack+images&source=lnms&tbm=isch&sa=X&ved=0ahUKEwip1Jz3voPaAhUoxFQKHV33AsgQ_AUICigB&biw=1440&bih=799
Spark Thrift Server - Integration
spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala

// result = sqlContext.sql(statement)  <-- original SQL execution
// Integration of Gimel in Spark
result = GimelQueryProcessor.executeBatch(statement, sqlContext.sparkSession)
Livy - Integration
class SparkSqlInterpreter(conf: SparkConf) extends SparkInterpreter(conf) {
  private val SCALA_MAGIC = "%%[sS][cC][aA][lL][aA] (.*)".r
  private val PCATALOG_BATCH_MAGIC = "%%[gG][iI][mM][eE][lL](.*)".r
  private val PCATALOG_STREAM_MAGIC = "%%[gG][iI][mM][eE][lL]-[sS][tT][rR][eE][aA][mM] (.*)".r
  // ........
  override def execute(code: String, outputPath: String): Interpreter.ExecuteResponse = {
    require(sparkContext != null && sqlContext != null && sparkSession != null)
    code match {
      case SCALA_MAGIC(scalaCode) =>
        super.execute(scalaCode, null)
      case PCATALOG_BATCH_MAGIC(gimelCode) =>
        Try {
          GimelQueryProcessor.executeBatch(gimelCode, sparkSession)
        } match {
          case Success(x) => Interpreter.ExecuteSuccess(TEXT_PLAIN -> x)
          case _ => Interpreter.ExecuteError("Failed", " ")
        }
      case PCATALOG_STREAM_MAGIC(gimelCode) =>
        Try {
          GimelQueryProcessor.executeStream(gimelCode, sparkSession)
        } match {
          case Success(x) => Interpreter.ExecuteSuccess(TEXT_PLAIN -> x)
          case _ => Interpreter.ExecuteError("Failed", " ")
        }
      case _ =>
      // ........
      // .....

repl/src/main/scala/com/cloudera/livy/repl/SparkSqlInterpreter.scala
PayPal Notebooks (Jupyter) - Integration
def _scala_pcatalog_command(self, sql_context_variable_name):
    if sql_context_variable_name == u'spark':
        command = u'val output= {{import java.io.{{ByteArrayOutputStream, StringReader}};val outCapture = new ByteArrayOutputStream;Console.withOut(outCapture){{gimel.GimelQueryProcessor.executeBatch("""{}""",sparkSession)}}}}'.format(self.query)
    else:
        command = u'val output= {{import java.io.{{ByteArrayOutputStream, StringReader}};val outCapture = new ByteArrayOutputStream;Console.withOut(outCapture){{gimel.GimelQueryProcessor.executeBatch("""{}""",{})}}}}'.format(self.query, sql_context_variable_name)
    if self.samplemethod == u'sample':
        command = u'{}.sample(false, {})'.format(command, self.samplefraction)
    if self.maxrows >= 0:
        command = u'{}.take({})'.format(command, self.maxrows)
    else:
        command = u'{}.collect'.format(command)
    return Command(u'{}.foreach(println)'.format(command + ';\noutput'))
sparkmagic/sparkmagic/livyclientlib/sqlquery.py
sparkmagic/sparkmagic/kernels/sparkkernel/kernel.js
define(['base/js/namespace'], function(IPython){
  var onload = function() {
    IPython.CodeCell.config_defaults.highlight_modes['magic_text/x-sql'] = {'reg':[/^%%sql/]};
    IPython.CodeCell.config_defaults.highlight_modes['magic_text/x-python'] = {'reg':[/^%%local/]};
    IPython.CodeCell.config_defaults.highlight_modes['magic_text/x-sql'] = {'reg':[/^%%gimel/]};
  }
  return { onload: onload }
})
Connectors | High level
Storage: version; API implementation
• Kafka: 0.10.2; batch & stream connectors, implemented from scratch
• Elasticsearch: 5.4.6; ES-Hadoop Spark connector (https://www.elastic.co/guide/en/elasticsearch/hadoop/5.4/spark.html), plus additional implementations added in Gimel to support daily/monthly partitioned indexes in ES
• Aerospike: 3.1x; read via the Aerospike Spark Connector "Aerospark" (https://github.com/sasha-polev/aerospark) directly into a DataFrame; write via the Aerospike native Java client Put API, establishing one client connection per DataFrame partition to write that partition to Aerospike
• HBase: 1.2; Hortonworks HBase Connector for Spark (SHC, https://github.com/hortonworks-spark/shc)
• Cassandra: 2.x; DataStax connector (https://github.com/datastax/spark-cassandra-connector)
• Hive: 1.2; leverages Spark APIs under the hood
• Druid: 0.82; leverages Tranquility under the hood (https://github.com/druid-io/tranquility)
• Teradata / relational: leverages the JDBC storage handler; supports batch reads/loads, FAST Load & FAST Export
• Alluxio: cross-cluster access via reads using the Spark conf spark.yarn.access.namenodes
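For flavor, a sketch of the Aerospike write pattern described above: one client connection per DataFrame partition, Put API per row (given a DataFrame df; the host, namespace, set and column names are hypothetical):

import com.aerospike.client.{AerospikeClient, Bin, Key}
import org.apache.spark.sql.Row

df.foreachPartition { rows: Iterator[Row] =>
  // One connection per partition, as the table above describes.
  val client = new AerospikeClient("aerospike-host", 3000)
  try {
    rows.foreach { row =>
      val key = new Key("test_ns", "flights", row.getAs[String]("flight_id"))
      client.put(null, key, new Bin("status", row.getAs[String]("status")))
    }
  } finally {
    client.close()
  }
}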
Dataset Registration Process Flow
[Process flow diagram]
Onboarding: a requestor onboards a dataset to the Data Platform, fills in metadata and submits an approval request; once the approver approves (or the request is auto-approved), the Create Dataset REST API creates the dataset metadata in PCatalog and the catalog entry on the storage system.
Access: a user/developer submits a job; compute (Data API) gets the dataset metadata from PCatalog and accesses the data on the underlying storage.
Gimel Data Catalog Features
Discovery
• Auto-discover datasets across all data stores

Explorer
• View available datasets
• View schema
• View system and object attributes

Query and BI integration
• Integration with Jupyter notebooks
• Integration with BI tools

Dashboard and Alerts
• Operational metrics: stats, refresh time, trends
• Approvals and audits
• Admin alerts: capacity issues, data access violations, data classification violations
• User alerts: refresh delays, profile anomalies
Contenu connexe

Tendances

Democratizing data science Using spark, hive and druid
Democratizing data science Using spark, hive and druidDemocratizing data science Using spark, hive and druid
Democratizing data science Using spark, hive and druidDataWorks Summit
 
Oracle Stream Analytics - Developer Introduction
Oracle Stream Analytics - Developer IntroductionOracle Stream Analytics - Developer Introduction
Oracle Stream Analytics - Developer IntroductionJeffrey T. Pollock
 
Highly configurable and extensible data processing framework at PubMatic
Highly configurable and extensible data processing framework at PubMaticHighly configurable and extensible data processing framework at PubMatic
Highly configurable and extensible data processing framework at PubMaticDataWorks Summit
 
IBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query IntroductionIBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query IntroductionTorsten Steinbach
 
Microservices Patterns with GoldenGate
Microservices Patterns with GoldenGateMicroservices Patterns with GoldenGate
Microservices Patterns with GoldenGateJeffrey T. Pollock
 
Big Data Management: What's New, What's Different, and What You Need To Know
Big Data Management: What's New, What's Different, and What You Need To KnowBig Data Management: What's New, What's Different, and What You Need To Know
Big Data Management: What's New, What's Different, and What You Need To KnowSnapLogic
 
Securing and governing a multi-tenant data lake within the financial industry
Securing and governing a multi-tenant data lake within the financial industrySecuring and governing a multi-tenant data lake within the financial industry
Securing and governing a multi-tenant data lake within the financial industryDataWorks Summit
 
Intelligent Integration OOW2017 - Jeff Pollock
Intelligent Integration OOW2017 - Jeff PollockIntelligent Integration OOW2017 - Jeff Pollock
Intelligent Integration OOW2017 - Jeff PollockJeffrey T. Pollock
 
Preparing Your Data for Cloud Analytics & AI/ML
Preparing Your Data for Cloud Analytics & AI/ML Preparing Your Data for Cloud Analytics & AI/ML
Preparing Your Data for Cloud Analytics & AI/ML Amazon Web Services
 
2017 OpenWorld Keynote for Data Integration
2017 OpenWorld Keynote for Data Integration2017 OpenWorld Keynote for Data Integration
2017 OpenWorld Keynote for Data IntegrationJeffrey T. Pollock
 
Migrate and Modernize Hadoop-Based Security Policies for Databricks
Migrate and Modernize Hadoop-Based Security Policies for DatabricksMigrate and Modernize Hadoop-Based Security Policies for Databricks
Migrate and Modernize Hadoop-Based Security Policies for DatabricksDatabricks
 
From BI Developer to Data Engineer with Oracle Analytics Cloud, Data Lake
From BI Developer to Data Engineer with Oracle Analytics Cloud, Data LakeFrom BI Developer to Data Engineer with Oracle Analytics Cloud, Data Lake
From BI Developer to Data Engineer with Oracle Analytics Cloud, Data LakeRittman Analytics
 
Journey to Creating a 360 View of the Customer: Implementing Big Data Strateg...
Journey to Creating a 360 View of the Customer: Implementing Big Data Strateg...Journey to Creating a 360 View of the Customer: Implementing Big Data Strateg...
Journey to Creating a 360 View of the Customer: Implementing Big Data Strateg...Databricks
 
The Rise of Big Data Governance: Insight on this Emerging Trend from Active O...
The Rise of Big Data Governance: Insight on this Emerging Trend from Active O...The Rise of Big Data Governance: Insight on this Emerging Trend from Active O...
The Rise of Big Data Governance: Insight on this Emerging Trend from Active O...DataWorks Summit
 
Tapping into the Big Data Reservoir (CON7934)
Tapping into the Big Data Reservoir (CON7934)Tapping into the Big Data Reservoir (CON7934)
Tapping into the Big Data Reservoir (CON7934)Jeffrey T. Pollock
 
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Databricks
 
Gimel and PayPal Notebooks @ TDWI Leadership Summit Orlando
Gimel and PayPal Notebooks @ TDWI Leadership Summit OrlandoGimel and PayPal Notebooks @ TDWI Leadership Summit Orlando
Gimel and PayPal Notebooks @ TDWI Leadership Summit OrlandoRomit Mehta
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsKhalid Salama
 
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...Databricks
 
The convergence of reporting and interactive BI on Hadoop
The convergence of reporting and interactive BI on HadoopThe convergence of reporting and interactive BI on Hadoop
The convergence of reporting and interactive BI on HadoopDataWorks Summit
 

Tendances (20)

Democratizing data science Using spark, hive and druid
Democratizing data science Using spark, hive and druidDemocratizing data science Using spark, hive and druid
Democratizing data science Using spark, hive and druid
 
Oracle Stream Analytics - Developer Introduction
Oracle Stream Analytics - Developer IntroductionOracle Stream Analytics - Developer Introduction
Oracle Stream Analytics - Developer Introduction
 
Highly configurable and extensible data processing framework at PubMatic
Highly configurable and extensible data processing framework at PubMaticHighly configurable and extensible data processing framework at PubMatic
Highly configurable and extensible data processing framework at PubMatic
 
IBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query IntroductionIBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query Introduction
 
Microservices Patterns with GoldenGate
Microservices Patterns with GoldenGateMicroservices Patterns with GoldenGate
Microservices Patterns with GoldenGate
 
Big Data Management: What's New, What's Different, and What You Need To Know
Big Data Management: What's New, What's Different, and What You Need To KnowBig Data Management: What's New, What's Different, and What You Need To Know
Big Data Management: What's New, What's Different, and What You Need To Know
 
Securing and governing a multi-tenant data lake within the financial industry
Securing and governing a multi-tenant data lake within the financial industrySecuring and governing a multi-tenant data lake within the financial industry
Securing and governing a multi-tenant data lake within the financial industry
 
Intelligent Integration OOW2017 - Jeff Pollock
Intelligent Integration OOW2017 - Jeff PollockIntelligent Integration OOW2017 - Jeff Pollock
Intelligent Integration OOW2017 - Jeff Pollock
 
Preparing Your Data for Cloud Analytics & AI/ML
Preparing Your Data for Cloud Analytics & AI/ML Preparing Your Data for Cloud Analytics & AI/ML
Preparing Your Data for Cloud Analytics & AI/ML
 
2017 OpenWorld Keynote for Data Integration
2017 OpenWorld Keynote for Data Integration2017 OpenWorld Keynote for Data Integration
2017 OpenWorld Keynote for Data Integration
 
Migrate and Modernize Hadoop-Based Security Policies for Databricks
Migrate and Modernize Hadoop-Based Security Policies for DatabricksMigrate and Modernize Hadoop-Based Security Policies for Databricks
Migrate and Modernize Hadoop-Based Security Policies for Databricks
 
From BI Developer to Data Engineer with Oracle Analytics Cloud, Data Lake
From BI Developer to Data Engineer with Oracle Analytics Cloud, Data LakeFrom BI Developer to Data Engineer with Oracle Analytics Cloud, Data Lake
From BI Developer to Data Engineer with Oracle Analytics Cloud, Data Lake
 
Journey to Creating a 360 View of the Customer: Implementing Big Data Strateg...
Journey to Creating a 360 View of the Customer: Implementing Big Data Strateg...Journey to Creating a 360 View of the Customer: Implementing Big Data Strateg...
Journey to Creating a 360 View of the Customer: Implementing Big Data Strateg...
 
The Rise of Big Data Governance: Insight on this Emerging Trend from Active O...
The Rise of Big Data Governance: Insight on this Emerging Trend from Active O...The Rise of Big Data Governance: Insight on this Emerging Trend from Active O...
The Rise of Big Data Governance: Insight on this Emerging Trend from Active O...
 
Tapping into the Big Data Reservoir (CON7934)
Tapping into the Big Data Reservoir (CON7934)Tapping into the Big Data Reservoir (CON7934)
Tapping into the Big Data Reservoir (CON7934)
 
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
 
Gimel and PayPal Notebooks @ TDWI Leadership Summit Orlando
Gimel and PayPal Notebooks @ TDWI Leadership Summit OrlandoGimel and PayPal Notebooks @ TDWI Leadership Summit Orlando
Gimel and PayPal Notebooks @ TDWI Leadership Summit Orlando
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
 
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
 
The convergence of reporting and interactive BI on Hadoop
The convergence of reporting and interactive BI on HadoopThe convergence of reporting and interactive BI on Hadoop
The convergence of reporting and interactive BI on Hadoop
 

Similaire à QCon 2018 | Gimel | PayPal's Analytic Platform

Gimel at Dataworks Summit San Jose 2018
Gimel at Dataworks Summit San Jose 2018Gimel at Dataworks Summit San Jose 2018
Gimel at Dataworks Summit San Jose 2018Romit Mehta
 
PayPal Notebooks at Jupytercon 2018
PayPal Notebooks at Jupytercon 2018PayPal Notebooks at Jupytercon 2018
PayPal Notebooks at Jupytercon 2018Romit Mehta
 
Intelligent data summit: Self-Service Big Data and AI/ML: Reality or Myth?
Intelligent data summit: Self-Service Big Data and AI/ML: Reality or Myth?Intelligent data summit: Self-Service Big Data and AI/ML: Reality or Myth?
Intelligent data summit: Self-Service Big Data and AI/ML: Reality or Myth?SnapLogic
 
Kamanja: Driving Business Value through Real-Time Decisioning Solutions
Kamanja: Driving Business Value through Real-Time Decisioning SolutionsKamanja: Driving Business Value through Real-Time Decisioning Solutions
Kamanja: Driving Business Value through Real-Time Decisioning SolutionsGreg Makowski
 
DEM07 Best Practices for Monitoring Amazon ECS Containers Launched with Fargate
DEM07 Best Practices for Monitoring Amazon ECS Containers Launched with FargateDEM07 Best Practices for Monitoring Amazon ECS Containers Launched with Fargate
DEM07 Best Practices for Monitoring Amazon ECS Containers Launched with FargateAmazon Web Services
 
Data orchestration | 2020 | Alluxio | Gimel
Data orchestration | 2020 | Alluxio | GimelData orchestration | 2020 | Alluxio | Gimel
Data orchestration | 2020 | Alluxio | GimelDeepak Chandramouli
 
Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra...
Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra...Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra...
Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra...TigerGraph
 
雲上打造資料湖 (Data Lake):智能化駕馭商機 (Level 300)
雲上打造資料湖 (Data Lake):智能化駕馭商機 (Level 300)雲上打造資料湖 (Data Lake):智能化駕馭商機 (Level 300)
雲上打造資料湖 (Data Lake):智能化駕馭商機 (Level 300)Amazon Web Services
 
How to run Real Time processing on Big Data / Ron Zavner (GigaSpaces)
How to run Real Time processing on Big Data / Ron Zavner (GigaSpaces)How to run Real Time processing on Big Data / Ron Zavner (GigaSpaces)
How to run Real Time processing on Big Data / Ron Zavner (GigaSpaces)Ontico
 
Real-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughton
Real-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughtonReal-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughton
Real-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughtonSynerzip
 
Pivotal Big Data Suite: A Technical Overview
Pivotal Big Data Suite: A Technical OverviewPivotal Big Data Suite: A Technical Overview
Pivotal Big Data Suite: A Technical OverviewVMware Tanzu
 
How a distributed graph analytics platform uses Apache Kafka for data ingesti...
How a distributed graph analytics platform uses Apache Kafka for data ingesti...How a distributed graph analytics platform uses Apache Kafka for data ingesti...
How a distributed graph analytics platform uses Apache Kafka for data ingesti...HostedbyConfluent
 
Pivotal Digital Transformation Forum: Journey to Become a Data-Driven Enterprise
Pivotal Digital Transformation Forum: Journey to Become a Data-Driven EnterprisePivotal Digital Transformation Forum: Journey to Become a Data-Driven Enterprise
Pivotal Digital Transformation Forum: Journey to Become a Data-Driven EnterpriseVMware Tanzu
 
Motadata - Unified Product Suite for IT Operations and Big Data Analytics
Motadata - Unified Product Suite for IT Operations and Big Data AnalyticsMotadata - Unified Product Suite for IT Operations and Big Data Analytics
Motadata - Unified Product Suite for IT Operations and Big Data Analyticsnovsela
 
Big Data LDN 2018: LESSONS LEARNED FROM DEPLOYING REAL-WORLD AI SYSTEMS
Big Data LDN 2018: LESSONS LEARNED FROM DEPLOYING REAL-WORLD AI SYSTEMSBig Data LDN 2018: LESSONS LEARNED FROM DEPLOYING REAL-WORLD AI SYSTEMS
Big Data LDN 2018: LESSONS LEARNED FROM DEPLOYING REAL-WORLD AI SYSTEMSMatt Stubbs
 
Saama Presents Is your Big Data Solution Ready for Streaming
Saama Presents Is your Big Data Solution Ready for StreamingSaama Presents Is your Big Data Solution Ready for Streaming
Saama Presents Is your Big Data Solution Ready for StreamingSaama
 
Introduction to Amazon EC2 F1 Instances
Introduction to Amazon EC2 F1 Instances Introduction to Amazon EC2 F1 Instances
Introduction to Amazon EC2 F1 Instances Amazon Web Services
 
Automating Big Data with the Automic Hadoop Agent
Automating Big Data with the Automic Hadoop AgentAutomating Big Data with the Automic Hadoop Agent
Automating Big Data with the Automic Hadoop AgentCA | Automic Software
 
How Trek10 Uses Datadog's Distributed Tracing to Improve AWS Lambda Projects ...
How Trek10 Uses Datadog's Distributed Tracing to Improve AWS Lambda Projects ...How Trek10 Uses Datadog's Distributed Tracing to Improve AWS Lambda Projects ...
How Trek10 Uses Datadog's Distributed Tracing to Improve AWS Lambda Projects ...Amazon Web Services
 

Similaire à QCon 2018 | Gimel | PayPal's Analytic Platform (20)

Gimel at Dataworks Summit San Jose 2018
Gimel at Dataworks Summit San Jose 2018Gimel at Dataworks Summit San Jose 2018
Gimel at Dataworks Summit San Jose 2018
 
PayPal Notebooks at Jupytercon 2018
PayPal Notebooks at Jupytercon 2018PayPal Notebooks at Jupytercon 2018
PayPal Notebooks at Jupytercon 2018
 
Intelligent data summit: Self-Service Big Data and AI/ML: Reality or Myth?
Intelligent data summit: Self-Service Big Data and AI/ML: Reality or Myth?Intelligent data summit: Self-Service Big Data and AI/ML: Reality or Myth?
Intelligent data summit: Self-Service Big Data and AI/ML: Reality or Myth?
 
Kamanja: Driving Business Value through Real-Time Decisioning Solutions
Kamanja: Driving Business Value through Real-Time Decisioning SolutionsKamanja: Driving Business Value through Real-Time Decisioning Solutions
Kamanja: Driving Business Value through Real-Time Decisioning Solutions
 
DEM07 Best Practices for Monitoring Amazon ECS Containers Launched with Fargate
DEM07 Best Practices for Monitoring Amazon ECS Containers Launched with FargateDEM07 Best Practices for Monitoring Amazon ECS Containers Launched with Fargate
DEM07 Best Practices for Monitoring Amazon ECS Containers Launched with Fargate
 
Data orchestration | 2020 | Alluxio | Gimel
Data orchestration | 2020 | Alluxio | GimelData orchestration | 2020 | Alluxio | Gimel
Data orchestration | 2020 | Alluxio | Gimel
 
Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra...
Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra...Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra...
Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra...
 
雲上打造資料湖 (Data Lake):智能化駕馭商機 (Level 300)
雲上打造資料湖 (Data Lake):智能化駕馭商機 (Level 300)雲上打造資料湖 (Data Lake):智能化駕馭商機 (Level 300)
雲上打造資料湖 (Data Lake):智能化駕馭商機 (Level 300)
 
How to run Real Time processing on Big Data / Ron Zavner (GigaSpaces)
How to run Real Time processing on Big Data / Ron Zavner (GigaSpaces)How to run Real Time processing on Big Data / Ron Zavner (GigaSpaces)
How to run Real Time processing on Big Data / Ron Zavner (GigaSpaces)
 
Real-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughton
Real-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughtonReal-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughton
Real-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughton
 
Top 5 Lessons Learned in Deploying AI in the Real World
Top 5 Lessons Learned in Deploying AI in the Real WorldTop 5 Lessons Learned in Deploying AI in the Real World
Top 5 Lessons Learned in Deploying AI in the Real World
 
Pivotal Big Data Suite: A Technical Overview
Pivotal Big Data Suite: A Technical OverviewPivotal Big Data Suite: A Technical Overview
Pivotal Big Data Suite: A Technical Overview
 
How a distributed graph analytics platform uses Apache Kafka for data ingesti...
How a distributed graph analytics platform uses Apache Kafka for data ingesti...How a distributed graph analytics platform uses Apache Kafka for data ingesti...
How a distributed graph analytics platform uses Apache Kafka for data ingesti...
 
Pivotal Digital Transformation Forum: Journey to Become a Data-Driven Enterprise
Pivotal Digital Transformation Forum: Journey to Become a Data-Driven EnterprisePivotal Digital Transformation Forum: Journey to Become a Data-Driven Enterprise
Pivotal Digital Transformation Forum: Journey to Become a Data-Driven Enterprise
 
Motadata - Unified Product Suite for IT Operations and Big Data Analytics
Motadata - Unified Product Suite for IT Operations and Big Data AnalyticsMotadata - Unified Product Suite for IT Operations and Big Data Analytics
Motadata - Unified Product Suite for IT Operations and Big Data Analytics
 
Big Data LDN 2018: LESSONS LEARNED FROM DEPLOYING REAL-WORLD AI SYSTEMS
Big Data LDN 2018: LESSONS LEARNED FROM DEPLOYING REAL-WORLD AI SYSTEMSBig Data LDN 2018: LESSONS LEARNED FROM DEPLOYING REAL-WORLD AI SYSTEMS
Big Data LDN 2018: LESSONS LEARNED FROM DEPLOYING REAL-WORLD AI SYSTEMS
 
Saama Presents Is your Big Data Solution Ready for Streaming
Saama Presents Is your Big Data Solution Ready for StreamingSaama Presents Is your Big Data Solution Ready for Streaming
Saama Presents Is your Big Data Solution Ready for Streaming
 
Introduction to Amazon EC2 F1 Instances
Introduction to Amazon EC2 F1 Instances Introduction to Amazon EC2 F1 Instances
Introduction to Amazon EC2 F1 Instances
 
Automating Big Data with the Automic Hadoop Agent
Automating Big Data with the Automic Hadoop AgentAutomating Big Data with the Automic Hadoop Agent
Automating Big Data with the Automic Hadoop Agent
 
How Trek10 Uses Datadog's Distributed Tracing to Improve AWS Lambda Projects ...
How Trek10 Uses Datadog's Distributed Tracing to Improve AWS Lambda Projects ...How Trek10 Uses Datadog's Distributed Tracing to Improve AWS Lambda Projects ...
How Trek10 Uses Datadog's Distributed Tracing to Improve AWS Lambda Projects ...
 

Dernier

Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxUdaiappa Ramachandran
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 

Dernier (20)

Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptx
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 

QCon 2018 | Gimel | PayPal's Analytic Platform

  • 2. Agenda ©2018 PayPal Inc. Confidential and proprietary. 2 • Introduction • PayPal & Big Data Space • Analytics Platform & Gimel • Why Gimel • Challenges in Analytics • Walk through simple use case • Gimel – Implementation Details • Gimel – Open Source & future • Q & A
  • 3. About Us • Product manager, data processing products at PayPal • 20 years in data and analytics across networking, semi-conductors, telecom, security and fintech industries • Data warehouse developer, BI program manager, Data product manager romehta@paypal.com https://www.linkedin.com/in/romit-mehta/ ©2018 PayPal Inc. Confidential and proprietary. 3 Romit Mehta • Big data platform engineer at PayPal • 13 years in data engineering, 5 years in scalable solutions with big data • Developed several Spark-based solutions across NoSQL, Key-Value, Messaging, Document based & relational systems dmohanakumarchan@paypal.com https://www.linkedin.com/in/deepakmc/ Deepak Mohanakumar Chandramouli
  • 4. PayPal – Key Metrics 4©2018 PayPal Inc. Confidential and proprietary.
  • 5. PayPal Customers, Transactions and Growth 5 From: https://www.paypal.com/us/webapps/mpp/about
  • 6. PayPal Big Data Platform 6 13 prod clusters, 12 non- prod clusters GPU co-located with Hadoop 150+ PB Data 40,000+ YARN jobs/day One of the largest Aerospike, Teradata, Hortonworks and Oracle installations Compute supported: MR, Pig, Hive, Spark, Beam
  • 7. PayPal Analytics Ecosystem and Gimel Platform (Unified Data Processing Platform) 7©2018 PayPal Inc. Confidential and proprietary.
  • 8. 8 Developer Data scientist Analyst Operator Gimel SDK Notebooks PCatalog Data API Infrastructure services leveraged for elasticity and redundancy Multi-DC Public cloudPredictive resource allocation Logging Monitoring Alerting Security Application Lifecycle Management Compute Frameworkand APIs GimelData Platform User Experience andAccess R Studio BI tools
  • 10. Use case - Flights Cancelled
  • 11. 11 Kafka Teradata External HDFS / Hive Data Prep / Availability ProcessStream Ingest LoadExtract/Load Parquet/ORC/Text? Productionalize, Logging, Monitoring, Alerting, Auditing, Data Quality Data SourcesData Points Flights Events Airports Airlines Carrier Geography & Geo Tags Analysis Publish Use case challenges … ©2018 PayPal Inc. Confidential and proprietary. Real-time/ processed data
  • 12. ©2018 PayPal Inc. Confidential and proprietary. 12 Spark Read From Hbase Data Access Code is Cumbersome and Fragile
  • 13. ©2018 PayPal Inc. Confidential and proprietary. 13 Spark Read From Hbase Spark Read From Elastic Search Spark Read From AeroSpike Spark Read From Druid Data Access Code is Cumbersome and Fragile
  • 14. ©2018 PayPal Inc. Confidential and proprietary. 14 Datasets Challenges Data access tied to compute and data store versions Hard to find available data sets Storage-specific dataset creation results in duplication and increased latency No audit trail for dataset access No standards for on-boarding data sets for others to discover No statistics on data set usage and access trends Datasets
  • 15. ©2018 PayPal Inc. Confidential and proprietary. 15 High-friction Data Application Lifecycle Learn Code Optimize Build Deploy RunOnboarding Big Data Apps Learn Code Optimize Build Deploy RunCompute Engine Changed Learn Code Optimize Build Deploy RunCompute Version Upgraded Learn Code Optimize Build Deploy RunStorage API Changed Learn Code Optimize Build Deploy RunStorage Connector Upgraded Learn Code Optimize Build Deploy RunStorage Hosts Migrated Learn Code Optimize Build Deploy RunStorage Changed Learn Code Optimize Build Deploy Run*********************
  • 17. Gimel | Flights Cancelled Search PCatalog 17
  • 18. Sign in PCatalog portal and search for your datasets
  • 20. Gimel DataSet | Overview
  • 21. Gimel DataSet | Schema Spec Relational DataSet Kafka DataSet
  • 22. Gimel DataSet | System Spec Relational DataSet Kafka DataSet
  • 23. Gimel DataSet | Object Spec Kafka DataSet Relational DataSet
  • 24. Gimel DataSet | Availability
  • 26. Gimel | Flights Cancelled Analyze & Productionalize App 26
  • 27. Access datasets: Navigate to Jupyter notebooks & analyze data
  • 30. Gimel | Flights App | Summary 30
  • 31. 31 API, PCatalog, Tools With Gimel & Notebooks ©2018 PayPal Inc. Confidential and proprietary. Kafka Teradata External HDFS/ Hive Data Prep / Availability ProcessIngest LoadExtract/Load Parquet/ORC/Text? Productionalize, Logging, Monitoring, Alerting, Auditing, Data QC Data SourcesData Points Flights Events Airports Airlines Carrier Geography & Geo Tags Analysis Publish Use case challenges - Simplified with Gimel
  • 32. ©2018 PayPal Inc. Confidential and proprietary. Spark Read From Hbase Spark Read From Elastic Search Spark Read From AeroSpike Spark Read From Druid With Data API ✔ Data Access Simplified with Gimel Data API 32
  • 33. ©2018 PayPal Inc. Confidential and proprietary. Spark Read From Hbase Spark Read From Elastic Search Spark Read From AeroSpike Spark Read From Druid With Data API ✔ SQL Support in Gimel Data Platform 33
  • 34. ©2018 PayPal Inc. Confidential and proprietary. 34 Data Application Lifecycle with Data API Learn Code Optimize Build Deploy RunOnboarding Big Data Apps RunCompute Engine Changed Compute Version Upgraded Storage API Changed Storage Connector Upgraded Storage Hosts Migrated Storage Changed ********************* Run Run Run Run Run Run
  • 35. Gimel – Deep Dive 35
  • 36. Job LIVY GRID Job Server Batch Livy API NAS Batch In InIn Interactive Sparkling Water Interactive Interactive Metrics History Server Thrift Server In InIn Interactive Interactive Log Log Indexing Search xDiscovery Maintain Catalog Scan Discover Metadata Services PCatalog UI Explore Configure Log Indexing Search PayPal Analytics Ecosystem ©2018 PayPal Inc. Confidential and proprietary.
  • 37. ©2018 PayPal Inc. Confidential and proprietary. 37 A peek into Streaming SQL Launches … Spark Streaming App -- Streaming Window Seconds set gimel.kafka.throttle.streaming.window.seconds=10; -- Throttling set gimel.kafka.throttle.streaming.maxRatePerPartition=1500; -- ZK checkpoint root path set gimel.kafka.consumer.checkpoint.root=/checkpoints/appname; -- Checkpoint enabling flag - implicitly checkpoints after each mini-batch in streaming set gimel.kafka.reader.checkpoint.save.enabled=true; -- Jupyter Magic for streaming SQL on Notebooks | Interactive Usecases -- Livy REPL - Same magic for streaming SQL works | Streaming Usecases %%gimel-stream -- Assume Pre-Split HBASE Table as an example insert into pcatalog.HBASE_dataset select cust_id, kafka_ds.* from pcatalog.KAFKA_dataset kafka_ds; Batch SQL Launches … Spark Batch App -- Establish 10 concurrent connections per Topic-Partition set gimel.kafka.throttle.batch.parallelsPerPartition=10; -- Fetch at max - 10 M messages from each partition set gimel.kafka.throttle.batch.maxRecordsPerPartition=10,000,000; -- Jupyter Magic on Notebooks | Interactive Usecases -- Livy REPL - Same magic works | Batch Usecases %%gimel insert into pcatalog.HIVE_dataset partition(yyyy,mm,dd,hh,mi) select kafka_ds.*,gimel_load_id ,substr(commit_timestamp,1,4) as yyyy ,substr(commit_timestamp,6,2) as mm ,substr(commit_timestamp,9,2) as dd ,substr(commit_timestamp,12,2) as hh ,case when cast(substr(commit_timestamp,15,2) as INT) <= 30 then "00" else "30" end as mi from pcatalog.KAFKA_dataset kafka_ds; Following are Jupyter/Livy Magic terms • %%gimel : calls gimel.executeBatch(sql) • %%gimel-stream : calls gimel.executeStream(sql)
  • 38. Anatomy of API.

    Factories and core calls:

      gimel.dataset.factory {
        KafkaDataSet, ElasticSearchDataSet, DruidDataSet, HiveDataSet,
        AerospikeDataSet, HbaseDataSet, CassandraDataSet, JDBCDataSet
      }
      gimel.datastream.factory {
        KafkaDataStream
      }

      dataSet.read("dataSetName", options)
      dataSet.write(dataToWrite, "dataSetName", options)
      dataStream.read("dataSetName", options)

    Under the hood, each call resolves metadata and dispatches to a storage connector:

      CatalogProvider.getDataSetProperties("dataSetName")   // Metadata Services
      val storageDataSet = getFromFactory(type = "Hive")
      val storageDataStream = getFromStreamFactory(type = "kafka")
      kafkaDataSet.read("dataSetName", options)
      hiveDataSet.write(dataToWrite, "dataSetName", options)
      storageDataStream.read("dataSetName", options)

    The core connector implementation (example: Kafka) is a combination of open-source connectors (such as DataStax, SHC, ES-Spark) and in-house implementations.

    SQL such as the following:

      select kafka_ds.*, gimel_load_id
        ,substr(commit_timestamp,1,4)  as yyyy
        ,substr(commit_timestamp,6,2)  as mm
        ,substr(commit_timestamp,9,2)  as dd
        ,substr(commit_timestamp,12,2) as hh
      from pcatalog.KAFKA_dataset kafka_ds
      join default.geo_lkp lkp on kafka_ds.zip = lkp.zip
      where lkp.region = 'MIDWEST'

    is resolved against the Data API roughly like this:

      val dataSet: gimel.DataSet = DataSet(sparkSession)
      val df1 = dataSet.read("pcatalog.KAFKA_dataset", options)
      df1.createGlobalTempView("tmp_abc123")
      val resolvedSelectSQL = selectSQL.replace("pcatalog.KAFKA_dataset", "tmp_abc123")
      val readDf: DataFrame = sparkSession.sql(resolvedSelectSQL)
      dataSet.write("pcatalog.HIVE_dataset", readDf, options)

    The same throttle settings seen on slide 37 still apply (e.g. gimel.kafka.throttle.batch.parallelsPerPartition=10, gimel.kafka.throttle.batch.maxRecordsPerPartition=10000000), as does the %%gimel insert-into-Hive-partition pattern.
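    Putting the pieces together, a minimal end-to-end sketch under the same assumptions as before (open-source package path com.paypal.gimel; illustrative dataset names; simplified option handling):

      import org.apache.spark.sql.SparkSession

      object KafkaToHiveSketch {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().appName("kafka-to-hive-sketch").enableHiveSupport().getOrCreate()
          val dataSet = com.paypal.gimel.DataSet(spark)
          val options = Map.empty[String, Any]
          // Read: catalog lookup -> KafkaDataSet connector -> DataFrame
          val kafkaDf = dataSet.read("pcatalog.KAFKA_dataset", options)
          // Transform with ordinary Spark SQL via a temp view
          kafkaDf.createOrReplaceTempView("kafka_events")
          val enriched = spark.sql(
            "select e.*, lkp.region from kafka_events e join default.geo_lkp lkp on e.zip = lkp.zip")
          // Write: catalog lookup -> HiveDataSet connector
          dataSet.write("pcatalog.HIVE_dataset", enriched, options)
        }
      }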
  • 39. Catalog Provider: USER | HIVE | PCATALOG | Your Own Catalog. In every mode, CatalogProvider.getDataSetProperties("dataSetName") is the single lookup the Data API makes; only the source of the metadata changes.

    PCATALOG (Metadata Services):

      set gimel.catalog.provider=PCATALOG
      sql> select * from pcatalog.test_table1

    USER (properties supplied inline with the query or code):

      set gimel.catalog.provider=USER

      sql> set dataSetProperties={
        "datasetName":"test_table1",
        "gimel.storage.type":"kafka",
        "gimel.kafka.whitelist.topics":"kafka_topic",
        "gimel.kafka.bootstrap.servers":"localhost:9092",
        "auto.offset.reset":"earliest",
        "key.serializer":"org.apache.kafka.common.serialization.StringSerializer",
        "key.deserializer":"org.apache.kafka.common.serialization.StringDeserializer",
        "value.serializer":"org.apache.kafka.common.serialization.ByteArraySerializer",
        "value.deserializer":"org.apache.kafka.common.serialization.ByteArrayDeserializer",
        "gimel.kafka.avro.schema.source":"CSR",
        "gimel.kafka.avro.schema.source.url":"http://schema_registry:8081",
        "gimel.kafka.avro.schema.source.wrapper.key":"schema_registry_key",
        "gimel.kafka.checkpoint.zookeeper.host":"zookeeper:2181",
        "gimel.kafka.checkpoint.zookeeper.path":"/pcatalog/kafka_consumer/checkpoint",
        "gimel.kafka.zookeeper.connection.timeout.ms":"10000"
      }
      sql> select * from pcatalog.test_table1

      // or, in Scala
      spark.sql("set gimel.catalog.provider=USER")
      val dataSetOptions = DataSetProperties(
        "KAFKA",
        Array(Field("payload", "string", true)),
        Array(),
        Map(
          "datasetName" -> "test_table1",
          ... // same properties as above
        )
      )
      dataSet.read("test_table1", Map("dataSetProperties" -> dataSetOptions))

    HIVE (properties carried on an external table definition):

      set gimel.catalog.provider=HIVE

      CREATE EXTERNAL TABLE `pcatalog.test_table1` (payload string)
      LOCATION 'hdfs://tmp/'
      TBLPROPERTIES (
        "datasetName"="dummy",
        "gimel.storage.type"="kafka",
        ... // same properties as above
      );

      spark-sql> select * from pcatalog.test_table1
      scala> dataSet.read("test_table1", Map("dataSetProperties" -> dataSetOptions))

    YOUR OWN CATALOG:

      set gimel.catalog.provider=YOUR_CATALOG
      CatalogProvider.getDataSetProperties("dataSetName") { // Implement this ! }
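    A hypothetical sketch of what plugging in your own catalog could look like; the case classes and method signature below are inferred from the calls on this slide and are illustrative assumptions, not the actual Gimel SPI:

      // All names below are illustrative assumptions, not the real Gimel interfaces.
      case class Field(name: String, dataType: String, nullable: Boolean)
      case class DataSetProperties(
          storageType: String,
          fields: Array[Field],
          partitionFields: Array[Field],
          props: Map[String, String])

      object MyCompanyCatalogProvider {
        // Stand-in for an in-house metadata registry (stubbed with an in-memory map)
        private val registry: Map[String, DataSetProperties] = Map(
          "test_table1" -> DataSetProperties(
            storageType = "KAFKA",
            fields = Array(Field("payload", "string", true)),
            partitionFields = Array(),
            props = Map(
              "gimel.kafka.whitelist.topics" -> "kafka_topic",
              "gimel.kafka.bootstrap.servers" -> "localhost:9092")))

        def getDataSetProperties(dataSetName: String): DataSetProperties =
          registry.getOrElse(dataSetName, sys.error(s"Unknown dataset: $dataSetName"))
      }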
  • 40. Integration with ecosystems.

    Spark Thrift Server (org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala):

      // result = sqlContext.sql(statement)   <-- original SQL execution
      // Integration of Gimel in Spark:
      result = GimelQueryProcessor.executeBatch(statement, sqlContext.sparkSession)

    Livy REPL (com/cloudera/livy/repl/SparkSqlInterpreter.scala):

      class SparkSqlInterpreter(conf: SparkConf) extends SparkInterpreter(conf) {
        private val SCALA_MAGIC = "%%[sS][cC][aA][lL][aA] (.*)".r
        private val PCATALOG_BATCH_MAGIC = "%%[gG][iI][mM][eE][lL](.*)".r
        private val PCATALOG_STREAM_MAGIC = "%%[gG][iI][mM][eE][lL]-[sS][tT][rR][eE][aA][mM](.*)".r
        // ...
        case PCATALOG_BATCH_MAGIC(gimelCode) =>
          GimelQueryProcessor.executeBatch(gimelCode, sparkSession)
        case PCATALOG_STREAM_MAGIC(gimelCode) =>
          GimelQueryProcessor.executeStream(gimelCode, sparkSession)
        case _ => // ...
      }

    Jupyter Notebooks (sparkmagic/sparkmagic/kernels/sparkkernel/kernel.js):

      define(['base/js/namespace'], function(IPython){
        var onload = function() {
          IPython.CodeCell.config_defaults.highlight_modes['magic_text/x-sql'] = {'reg':[/^%%gimel/]};
        }
        return { onload: onload }
      })
  • 41. Systems / Data Stores Supported
  • 42. Gimel – Open Source & Future
  • 43. What's Next
    • Query optimization
    • Open source Gimel (http://try.gimel.io)
    • Open source PCatalog: metadata services, discovery services, catalog UI
    • Livy features committed back to open source
    • Python support
    • Jupyter features committed back to open source
  • 44. Gimel - Open Sourced. Gimel: http://gimel.io | Codebase: https://github.com/paypal/gimel | Slack: https://gimel-dev.slack.com | Google Groups: https://groups.google.com/d/forum/gimel-dev
  • 46. Acknowledgements. Gimel and PayPal Notebooks team: Andrew Alves, Anisha Nainani, Ayushi Agarwal, Baskaran Gopalan, Dheeraj Rampally, Deepak Chandramouli, Laxmikant Patil, Meisam Fathi Salmi, Prabhu Kasinathan, Praveen Kanamarlapudi, Romit Mehta, Thilak Balasubramanian, Weijun Qian
  • 47. Q&A (10:55 AM). Gimel Codelabs: http://try.gimel.io | Slack: https://gimel-dev.slack.com | Google Groups: https://groups.google.com/d/forum/gimel-dev
  • 48. Appendix
  • 49. References Used. Images referred: https://www.google.com/search?q=big+data+stack+images&source=lnms&tbm=isch&sa=X&ved=0ahUKEwip1Jz3voPaAhUoxFQKHV33AsgQ_AUICigB&biw=1440&bih=799
  • 50. Spark Thrift Server - Integration (spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala):

      // result = sqlContext.sql(statement)   <-- original SQL execution
      // Integration of Gimel in Spark:
      result = GimelQueryProcessor.executeBatch(statement, sqlContext.sparkSession)
  • 51. Livy - Integration (repl/src/main/scala/com/cloudera/livy/repl/SparkSqlInterpreter.scala):

      class SparkSqlInterpreter(conf: SparkConf) extends SparkInterpreter(conf) {
        private val SCALA_MAGIC = "%%[sS][cC][aA][lL][aA] (.*)".r
        private val PCATALOG_BATCH_MAGIC = "%%[gG][iI][mM][eE][lL](.*)".r
        private val PCATALOG_STREAM_MAGIC = "%%[gG][iI][mM][eE][lL]-[sS][tT][rR][eE][aA][mM](.*)".r
        // ...
        override def execute(code: String, outputPath: String): Interpreter.ExecuteResponse = {
          require(sparkContext != null && sqlContext != null && sparkSession != null)
          code match {
            case SCALA_MAGIC(scalaCode) => super.execute(scalaCode, null)
            case PCATALOG_BATCH_MAGIC(gimelCode) =>
              Try {
                GimelQueryProcessor.executeBatch(gimelCode, sparkSession)
              } match {
                case Success(x) => Interpreter.ExecuteSuccess(TEXT_PLAIN -> x)
                case _ => Interpreter.ExecuteError("Failed", " ")
              }
            case PCATALOG_STREAM_MAGIC(gimelCode) =>
              Try {
                GimelQueryProcessor.executeStream(gimelCode, sparkSession)
              } match {
                case Success(x) => Interpreter.ExecuteSuccess(TEXT_PLAIN -> x)
                case _ => Interpreter.ExecuteError("Failed", " ")
              }
            case _ => // ...
          }
        }
      }
  • 52. PayPal Notebooks (Jupyter) - Integration.

    sparkmagic/sparkmagic/livyclientlib/sqlquery.py:

      def _scala_pcatalog_command(self, sql_context_variable_name):
          if sql_context_variable_name == u'spark':
              command = u'val output= {{import java.io.{{ByteArrayOutputStream, StringReader}};' \
                        u'val outCapture = new ByteArrayOutputStream;' \
                        u'Console.withOut(outCapture){{gimel.GimelQueryProcessor.executeBatch("""{}""",sparkSession)}}}}'.format(self.query)
          else:
              command = u'val output= {{import java.io.{{ByteArrayOutputStream, StringReader}};' \
                        u'val outCapture = new ByteArrayOutputStream;' \
                        u'Console.withOut(outCapture){{gimel.GimelQueryProcessor.executeBatch("""{}""",{})}}}}'.format(self.query, sql_context_variable_name)
          if self.samplemethod == u'sample':
              command = u'{}.sample(false, {})'.format(command, self.samplefraction)
          if self.maxrows >= 0:
              command = u'{}.take({})'.format(command, self.maxrows)
          else:
              command = u'{}.collect'.format(command)
          return Command(u'{}.foreach(println)'.format(command + ';\noutput'))

    sparkmagic/sparkmagic/kernels/sparkkernel/kernel.js:

      define(['base/js/namespace'], function(IPython){
        var onload = function() {
          IPython.CodeCell.config_defaults.highlight_modes['magic_text/x-sql'] = {'reg':[/^%%sql/]};
          IPython.CodeCell.config_defaults.highlight_modes['magic_text/x-python'] = {'reg':[/^%%local/]};
          IPython.CodeCell.config_defaults.highlight_modes['magic_text/x-sql'] = {'reg':[/^%%gimel/]};
        }
        return { onload: onload }
      })
  • 53. Connectors | High level

    Storage               | Version | API implementation
    Kafka                 | 0.10.2  | Batch & stream connectors, implemented from scratch
    Elastic Search        | 5.4.6   | ES-Hadoop Spark connector (https://www.elastic.co/guide/en/elasticsearch/hadoop/5.4/spark.html); additional implementations added in Gimel to support daily/monthly partitioned indexes in ES
    Aerospike             | 3.1x    | Read: Aerospike Spark connector (Aerospark) reads directly into a DataFrame (https://github.com/sasha-polev/aerospark). Write: Aerospike native Java client Put API; for each DataFrame partition a client connection is established to write that partition to Aerospike
    HBASE                 | 1.2     | Hortonworks HBase connector for Spark (SHC) (https://github.com/hortonworks-spark/shc)
    Cassandra             | 2.x     | DataStax connector (https://github.com/datastax/spark-cassandra-connector)
    HIVE                  | 1.2     | Leverages Spark APIs under the hood
    Druid                 | 0.82    | Leverages Tranquility under the hood (https://github.com/druid-io/tranquility)
    Teradata / Relational |         | JDBC storage handler; supports batch reads/loads, FAST Load & FAST Export
    Alluxio               |         | Cross-cluster access via reads using Spark conf spark.yarn.access.namenodes
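    As a rough illustration of how one facade can front this many connectors, a simplified factory sketch; the trait and class names echo slide 38's gimel.dataset.factory listing, but the signatures and bodies are assumptions for illustration, not Gimel internals:

      import org.apache.spark.sql.{DataFrame, SparkSession}

      // Simplified stand-ins for the connectors listed on slide 38.
      trait GimelDataSet {
        def read(dataset: String, options: Map[String, Any]): DataFrame
      }

      class KafkaDataSet(spark: SparkSession) extends GimelDataSet {
        // Requires the spark-sql-kafka package on the classpath
        def read(dataset: String, options: Map[String, Any]): DataFrame =
          spark.read.format("kafka")
            .option("kafka.bootstrap.servers", options("gimel.kafka.bootstrap.servers").toString)
            .option("subscribe", options("gimel.kafka.whitelist.topics").toString)
            .load()
      }

      class HiveDataSet(spark: SparkSession) extends GimelDataSet {
        def read(dataset: String, options: Map[String, Any]): DataFrame =
          spark.table(dataset.stripPrefix("pcatalog."))
      }

      // Factory: pick the connector from the dataset's storage type (from catalog metadata).
      def getFromFactory(storageType: String, spark: SparkSession): GimelDataSet =
        storageType.toUpperCase match {
          case "KAFKA" => new KafkaDataSet(spark)
          case "HIVE"  => new HiveDataSet(spark)
          case other   => sys.error(s"No connector for storage type: $other")
        }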
  • 54. Dataset Registration Process Flow. [Process diagram] Onboarding: a requestor fills in metadata and submits an approval request; once the approver approves (or auto-approval applies), the Create Dataset REST API creates the dataset metadata in PCatalog and the corresponding catalog entry on the storage system. Access: a user/developer submits a job, and compute (via the Data API) fetches the dataset metadata from PCatalog to access the data.
  • 55. Gimel Data Catalog Features
    • Discovery: auto-discover datasets across all data stores
    • Explorer: view available datasets, schemas, and system and object attributes
    • Query and BI integration: integration with Jupyter notebooks and BI tools
    • Dashboard and Alerts: operational metrics (stats, refresh time, trends); approvals and audits; admin alerts (capacity issues, data access violations, data classification violations); user alerts (refresh delays, profile anomalies)