SlideShare une entreprise Scribd logo
1  sur  40
Télécharger pour lire hors ligne
Scalable and Incremental
Data Profiling with Spark
Amelia Arbisser
Adam Silberstein
Trifacta: Self-service data preparation
What data analysts hope to achieve in data projects
Analysis & Consumption
MACHINE LEARNING VISUALIZATION STATISTICS
Trifacta: Self-service data preparation
What data analysts hope to achieve in data projects
80% of time spent cleaning and preparing the data
to be analyzed
Analysis & Consumption
MACHINE LEARNING VISUALIZATION STATISTICS
Trifacta: Self-service data preparation
We speed up and take the pain out of preparation
Transforming Data in Trifacta
Transforming Data in Trifacta
Predictive Interaction and Immediate Feedback
Predictive Interaction
Immediate Feedback
Transforming Data in Trifacta
Transforming Data in Trifacta
Where Profiling Fits In
We validate preparation via profiling
Spark Profiling at Trifacta
➔ Profiling results of transformation at scale
➔ Validation through profiles
➔ Challenges
➔Scale
➔Automatic job generation
➔ Our solution
➔Spark profiling job server
➔JSON spec
➔Pay-as-you-go
The Case for Profiling
Transforming Data in Trifacta
Job Results
➔ Even clean raw data is not
informative.
➔Especially when it is too
large to inspect manually.
➔Need a summary representation.
➔Statistical and visual
➔ Generate profile
programmatically.
Visual Result Validation
➔ Similar to the profiling we saw above
the data grid.
➔ Applied at scale, to the full result.
➔ Reveals mismatched, missing
values.
Visual Result Validation
➔ Similar to the profiling we saw above
the data grid.
➔ Applied at scale, to the full result.
➔ Reveals mismatched, missing
values.
Visual Result Validation
Challenges
The Goal: Profiling for Every Job
➔ We’ve talked about the value of profiling, but...
➔Profiling is not a tablestakes feature, at least not on day 1.
➔Don’t want our users to disable it!
➔Profiling is potentially more expensive than transformation.
Profiling at Scale
➔ Large data volume
➔Long running times
➔Memory constraints
Performance vs. Accuracy
➔Approximation brings performance gains while still
meeting profiler summarization requirements.
➔Off-the-shelf libraries (count-min-sketch, T-digest) great
for approximating counts and non-distributive stats.
➔Not a silver bullet though...sometimes approximations are
confusing.
Flexible Job Generation
➔Profile for all users, all use cases: not
a one-off
➔Diverse schemas
➔Any number of columns
➔ Calculate statistics for all data types
➔Numeric vs. categorical
➔Container types (maps, arrays,...)
➔User-defined
Flexible Job Generation
➔Profile for all users, all use cases: not
a one-off
➔Diverse schemas
➔Any number of columns
➔ Calculate statistics for all data types
➔Numeric vs. categorical
➔Container types (maps, arrays,...)
➔User-defined
Transformation vs. Profiling
➔ Transformation is primarily row-based.
➔ Profiling is column-based.
➔ Different execution concerns.
Solution: Spark in the
Trifacta System
Why Spark
➔ Effective at OLAP
➔ Flexible enough for UDFs
➔ Not tied to distributions
➔ Mass adoption
➔ Low latency
Spark Profile Jobs Server
UI
Webapp
Batch
Server
Profiler
Service
Client
Server
Cluster
HDFS
➔ We built a Spark job server
➔ Automatically profiles output of
transform jobs
➔ Outputs profiling results to
HDFS
➔ Renders graphs in Trifacta
product Batch
ServerBatSBatS
Workers
Profiler Service ➔ Which columns to profile and
their types
➔ Uses Trifacta type system, eg:
➔Dates
➔Geography
➔User-defined
➔Requested profiling statistics
➔Histograms
➔Outliers
➔Empty, invalid, valid
➔ Extensible
➔Pairwise statistics
➔User-defined functions
Profiler
Service
JSON
Specification
Wrangled
Data
{
“input”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”,
“schema”: {
“order”: [“column1”, “column2”, “column3”],
“types”: {
“column1”: [“Datetime”, {“regexes”: , “groupLocs”: {...}}],
“column2”: [...],
“column3”: [...]
}
},
“commands”: [
{ “column”: “*”,
“output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”,
“profiler-type”: “histogram”,
“params”: {...} },
{ “column”: “column1”,
“output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”,
“profiler-type”: “type-check”,
“params”: {...} } ]
}
Profiling DSL
{
“input”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”,
“schema”: {
“order”: [“column1”, “column2”, “column3”],
“types”: {
“column1”: [“Datetime”, {“regexes”: , “groupLocs”: {...}}],
“column2”: [...],
“column3”: [...]
}
},
“commands”: [
{ “column”: “*”,
“output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”,
“profiler-type”: “histogram”,
“params”: {...} },
{ “column”: “column1”,
“output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”,
“profiler-type”: “type-check”,
“params”: {...} } ]
}
Profiling DSL
{
“input”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”,
“schema”: {
“order”: [“column1”, “column2”, “column3”],
“types”: {
“column1”: [“Datetime”, {“regexes”: , “groupLocs”: {...}}],
“column2”: [...],
“column3”: [...]
}
},
“commands”: [
{ “column”: “*”,
“output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”,
“profiler-type”: “histogram”,
“params”: {...} },
{ “column”: “column1”,
“output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”,
“profiler-type”: “type-check”,
“params”: {...} } ]
}
Profiling DSL
{
“input”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”,
“schema”: {
“order”: [“column1”, “column2”, “column3”],
“types”: {
“column1”: [“Datetime”, {“regexes”: , “groupLocs”: {...}}],
“column2”: [...],
“column3”: [...]
}
},
“commands”: [
{ “column”: “*”,
“output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”,
“profiler-type”: “histogram”,
“params”: {...} },
{ “column”: “column1”,
“output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”,
“profiler-type”: “type-check”,
“params”: {...} } ]
}
Profiling DSL
Performance Improvements
➔ A few troublesome use cases:
➔High cardinality
➔Non-distributive stats
➔ Huge gains for large numbers of
columns
Few
Columns
Many Columns
High
Cardinality
2X 4X
Low
Cardinality
10X 20X
Spark profiling speed-up vs. MapReduce
Pay-as-you-go Profiling
➔ Users do not always need
profiles for all columns
➔ More complex and
expensive statistics
➔ Drilldown
➔ Iterative exploration
Conclusion
Profiling informs data preparation.
Questions?

Contenu connexe

Tendances

Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Databricks
 
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...Databricks
 
Activate Data Governance Using the Data Catalog
Activate Data Governance Using the Data CatalogActivate Data Governance Using the Data Catalog
Activate Data Governance Using the Data CatalogDATAVERSITY
 
Request to Fulfill Presentation (IT4IT)
Request to Fulfill Presentation (IT4IT)Request to Fulfill Presentation (IT4IT)
Request to Fulfill Presentation (IT4IT)Rob Akershoek
 
Best Practices in Metadata Management
Best Practices in Metadata ManagementBest Practices in Metadata Management
Best Practices in Metadata ManagementDATAVERSITY
 
Enterprise Architecture vs. Data Architecture
Enterprise Architecture vs. Data ArchitectureEnterprise Architecture vs. Data Architecture
Enterprise Architecture vs. Data ArchitectureDATAVERSITY
 
Slides: Success Stories for Data-to-Cloud
Slides: Success Stories for Data-to-CloudSlides: Success Stories for Data-to-Cloud
Slides: Success Stories for Data-to-CloudDATAVERSITY
 
Data Catalog as a Business Enabler
Data Catalog as a Business EnablerData Catalog as a Business Enabler
Data Catalog as a Business EnablerSrinivasan Sankar
 
Data Insights and Analytics Webinar: CDO vs. CAO - What’s the Difference?
Data Insights and Analytics Webinar: CDO vs. CAO - What’s the Difference?Data Insights and Analytics Webinar: CDO vs. CAO - What’s the Difference?
Data Insights and Analytics Webinar: CDO vs. CAO - What’s the Difference?DATAVERSITY
 
Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?DATAVERSITY
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & DeltaDatabricks
 
Data Quality as a prerequisite for you business success: when should I start ...
Data Quality as a prerequisite for you business success: when should I start ...Data Quality as a prerequisite for you business success: when should I start ...
Data Quality as a prerequisite for you business success: when should I start ...Anastasija Nikiforova
 
Data Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital TransformationData Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital TransformationDATAVERSITY
 
Intro to Data Vault 2.0 on Snowflake
Intro to Data Vault 2.0 on SnowflakeIntro to Data Vault 2.0 on Snowflake
Intro to Data Vault 2.0 on SnowflakeKent Graziano
 
SQL Server 2019 ctp2.2
SQL Server 2019 ctp2.2SQL Server 2019 ctp2.2
SQL Server 2019 ctp2.2Gianluca Hotz
 
Understanding ITIL CMDB
Understanding ITIL CMDBUnderstanding ITIL CMDB
Understanding ITIL CMDBManageEngine
 
Best Practices in Metadata Management
Best Practices in Metadata ManagementBest Practices in Metadata Management
Best Practices in Metadata ManagementDATAVERSITY
 
Using ITIL 4 and IT4IT together
Using ITIL 4 and IT4IT togetherUsing ITIL 4 and IT4IT together
Using ITIL 4 and IT4IT togetherRob Akershoek
 
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...HostedbyConfluent
 

Tendances (20)

Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
 
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
 
Activate Data Governance Using the Data Catalog
Activate Data Governance Using the Data CatalogActivate Data Governance Using the Data Catalog
Activate Data Governance Using the Data Catalog
 
Request to Fulfill Presentation (IT4IT)
Request to Fulfill Presentation (IT4IT)Request to Fulfill Presentation (IT4IT)
Request to Fulfill Presentation (IT4IT)
 
Best Practices in Metadata Management
Best Practices in Metadata ManagementBest Practices in Metadata Management
Best Practices in Metadata Management
 
Enterprise Architecture vs. Data Architecture
Enterprise Architecture vs. Data ArchitectureEnterprise Architecture vs. Data Architecture
Enterprise Architecture vs. Data Architecture
 
Slides: Success Stories for Data-to-Cloud
Slides: Success Stories for Data-to-CloudSlides: Success Stories for Data-to-Cloud
Slides: Success Stories for Data-to-Cloud
 
Data Catalog as a Business Enabler
Data Catalog as a Business EnablerData Catalog as a Business Enabler
Data Catalog as a Business Enabler
 
Data Insights and Analytics Webinar: CDO vs. CAO - What’s the Difference?
Data Insights and Analytics Webinar: CDO vs. CAO - What’s the Difference?Data Insights and Analytics Webinar: CDO vs. CAO - What’s the Difference?
Data Insights and Analytics Webinar: CDO vs. CAO - What’s the Difference?
 
Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
 
Data Quality as a prerequisite for you business success: when should I start ...
Data Quality as a prerequisite for you business success: when should I start ...Data Quality as a prerequisite for you business success: when should I start ...
Data Quality as a prerequisite for you business success: when should I start ...
 
Modern Data Architecture
Modern Data ArchitectureModern Data Architecture
Modern Data Architecture
 
Data Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital TransformationData Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital Transformation
 
Intro to Data Vault 2.0 on Snowflake
Intro to Data Vault 2.0 on SnowflakeIntro to Data Vault 2.0 on Snowflake
Intro to Data Vault 2.0 on Snowflake
 
SQL Server 2019 ctp2.2
SQL Server 2019 ctp2.2SQL Server 2019 ctp2.2
SQL Server 2019 ctp2.2
 
Understanding ITIL CMDB
Understanding ITIL CMDBUnderstanding ITIL CMDB
Understanding ITIL CMDB
 
Best Practices in Metadata Management
Best Practices in Metadata ManagementBest Practices in Metadata Management
Best Practices in Metadata Management
 
Using ITIL 4 and IT4IT together
Using ITIL 4 and IT4IT togetherUsing ITIL 4 and IT4IT together
Using ITIL 4 and IT4IT together
 
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...
 

Similaire à Spark Scalable Incremental Data Profiling Trifacta

Building Audience Analytics Platform
Building Audience Analytics PlatformBuilding Audience Analytics Platform
Building Audience Analytics PlatformInMobi Technology
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaChetan Khatri
 
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...Noriaki Tatsumi
 
Scalability and Graph Analytics with Neo4j - Stefan Kolmar, Neo4j
Scalability and Graph Analytics with Neo4j - Stefan Kolmar, Neo4jScalability and Graph Analytics with Neo4j - Stefan Kolmar, Neo4j
Scalability and Graph Analytics with Neo4j - Stefan Kolmar, Neo4jNeo4j
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...Chetan Khatri
 
[WSO2Con Asia 2018] Patterns for Building Streaming Apps
[WSO2Con Asia 2018] Patterns for Building Streaming Apps[WSO2Con Asia 2018] Patterns for Building Streaming Apps
[WSO2Con Asia 2018] Patterns for Building Streaming AppsWSO2
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with sparkModern Data Stack France
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
Making sense of your data jug
Making sense of your data   jugMaking sense of your data   jug
Making sense of your data jugGerald Muecke
 
Honey I Shrunk the Database
Honey I Shrunk the DatabaseHoney I Shrunk the Database
Honey I Shrunk the DatabaseVanessa Hurst
 
Real time streaming analytics
Real time streaming analyticsReal time streaming analytics
Real time streaming analyticsAnirudh
 
Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...
Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...
Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...marcja
 
Visually Transform Data in Azure Data Factory or Azure Synapse Analytics (PAS...
Visually Transform Data in Azure Data Factory or Azure Synapse Analytics (PAS...Visually Transform Data in Azure Data Factory or Azure Synapse Analytics (PAS...
Visually Transform Data in Azure Data Factory or Azure Synapse Analytics (PAS...Cathrine Wilhelmsen
 
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...WSO2
 
Apache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextApache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextPrateek Maheshwari
 
Powering a Graph Data System with Scylla + JanusGraph
Powering a Graph Data System with Scylla + JanusGraphPowering a Graph Data System with Scylla + JanusGraph
Powering a Graph Data System with Scylla + JanusGraphScyllaDB
 
[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital Enterprise
[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital Enterprise[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital Enterprise
[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital EnterpriseWSO2
 
NDC Minnesota - Analyzing StackExchange data with Azure Data Lake
NDC Minnesota - Analyzing StackExchange data with Azure Data LakeNDC Minnesota - Analyzing StackExchange data with Azure Data Lake
NDC Minnesota - Analyzing StackExchange data with Azure Data LakeTom Kerkhove
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Spark Summit
 
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDBMongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDBMongoDB
 

Similaire à Spark Scalable Incremental Data Profiling Trifacta (20)

Building Audience Analytics Platform
Building Audience Analytics PlatformBuilding Audience Analytics Platform
Building Audience Analytics Platform
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
 
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
 
Scalability and Graph Analytics with Neo4j - Stefan Kolmar, Neo4j
Scalability and Graph Analytics with Neo4j - Stefan Kolmar, Neo4jScalability and Graph Analytics with Neo4j - Stefan Kolmar, Neo4j
Scalability and Graph Analytics with Neo4j - Stefan Kolmar, Neo4j
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
 
[WSO2Con Asia 2018] Patterns for Building Streaming Apps
[WSO2Con Asia 2018] Patterns for Building Streaming Apps[WSO2Con Asia 2018] Patterns for Building Streaming Apps
[WSO2Con Asia 2018] Patterns for Building Streaming Apps
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with spark
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Making sense of your data jug
Making sense of your data   jugMaking sense of your data   jug
Making sense of your data jug
 
Honey I Shrunk the Database
Honey I Shrunk the DatabaseHoney I Shrunk the Database
Honey I Shrunk the Database
 
Real time streaming analytics
Real time streaming analyticsReal time streaming analytics
Real time streaming analytics
 
Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...
Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...
Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...
 
Visually Transform Data in Azure Data Factory or Azure Synapse Analytics (PAS...
Visually Transform Data in Azure Data Factory or Azure Synapse Analytics (PAS...Visually Transform Data in Azure Data Factory or Azure Synapse Analytics (PAS...
Visually Transform Data in Azure Data Factory or Azure Synapse Analytics (PAS...
 
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...
 
Apache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextApache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's Next
 
Powering a Graph Data System with Scylla + JanusGraph
Powering a Graph Data System with Scylla + JanusGraphPowering a Graph Data System with Scylla + JanusGraph
Powering a Graph Data System with Scylla + JanusGraph
 
[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital Enterprise
[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital Enterprise[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital Enterprise
[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital Enterprise
 
NDC Minnesota - Analyzing StackExchange data with Azure Data Lake
NDC Minnesota - Analyzing StackExchange data with Azure Data LakeNDC Minnesota - Analyzing StackExchange data with Azure Data Lake
NDC Minnesota - Analyzing StackExchange data with Azure Data Lake
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
 
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDBMongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
 

Plus de Jen Aman

Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaJen Aman
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéJen Aman
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesJen Aman
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesJen Aman
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsJen Aman
 
Spatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkSpatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkJen Aman
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Jen Aman
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat DetectionJen Aman
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkJen Aman
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersJen Aman
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkJen Aman
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityJen Aman
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityJen Aman
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache SparkJen Aman
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesJen Aman
 
Livy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache SparkLivy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache SparkJen Aman
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonJen Aman
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousJen Aman
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLJen Aman
 
Spark on Mesos
Spark on MesosSpark on Mesos
Spark on MesosJen Aman
 

Plus de Jen Aman (20)

Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher Ré
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time Decisions
 
Spatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkSpatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using Spark
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using Spark
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out Databases
 
Livy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache SparkLivy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache Spark
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And Python
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
 
Spark on Mesos
Spark on MesosSpark on Mesos
Spark on Mesos
 

Dernier

Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 

Dernier (20)

꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 

Spark Scalable Incremental Data Profiling Trifacta

  • 1. Scalable and Incremental Data Profiling with Spark Amelia Arbisser Adam Silberstein
  • 2. Trifacta: Self-service data preparation What data analysts hope to achieve in data projects Analysis & Consumption MACHINE LEARNING VISUALIZATION STATISTICS
  • 3. Trifacta: Self-service data preparation What data analysts hope to achieve in data projects 80% of time spent cleaning and preparing the data to be analyzed Analysis & Consumption MACHINE LEARNING VISUALIZATION STATISTICS
  • 4. Trifacta: Self-service data preparation We speed up and take the pain out of preparation
  • 7. Predictive Interaction and Immediate Feedback
  • 12.
  • 13.
  • 14. Where Profiling Fits In We validate preparation via profiling
  • 15. Spark Profiling at Trifacta ➔ Profiling results of transformation at scale ➔ Validation through profiles ➔ Challenges ➔Scale ➔Automatic job generation ➔ Our solution ➔Spark profiling job server ➔JSON spec ➔Pay-as-you-go
  • 16. The Case for Profiling
  • 18. Job Results ➔ Even clean raw data is not informative. ➔Especially when it is too large to inspect manually. ➔Need a summary representation. ➔Statistical and visual ➔ Generate profile programmatically.
  • 19. Visual Result Validation ➔ Similar to the profiling we saw above the data grid. ➔ Applied at scale, to the full result. ➔ Reveals mismatched, missing values.
  • 20. Visual Result Validation ➔ Similar to the profiling we saw above the data grid. ➔ Applied at scale, to the full result. ➔ Reveals mismatched, missing values.
  • 23. The Goal: Profiling for Every Job ➔ We’ve talked about the value of profiling, but... ➔Profiling is not a tablestakes feature, at least not on day 1. ➔Don’t want our users to disable it! ➔Profiling is potentially more expensive than transformation.
  • 24. Profiling at Scale ➔ Large data volume ➔Long running times ➔Memory constraints
  • 25. Performance vs. Accuracy ➔Approximation brings performance gains while still meeting profiler summarization requirements. ➔Off-the-shelf libraries (count-min-sketch, T-digest) great for approximating counts and non-distributive stats. ➔Not a silver bullet though...sometimes approximations are confusing.
  • 26. Flexible Job Generation ➔Profile for all users, all use cases: not a one-off ➔Diverse schemas ➔Any number of columns ➔ Calculate statistics for all data types ➔Numeric vs. categorical ➔Container types (maps, arrays,...) ➔User-defined
  • 27. Flexible Job Generation ➔Profile for all users, all use cases: not a one-off ➔Diverse schemas ➔Any number of columns ➔ Calculate statistics for all data types ➔Numeric vs. categorical ➔Container types (maps, arrays,...) ➔User-defined
  • 28. Transformation vs. Profiling ➔ Transformation is primarily row-based. ➔ Profiling is column-based. ➔ Different execution concerns.
  • 29. Solution: Spark in the Trifacta System
  • 30. Why Spark ➔ Effective at OLAP ➔ Flexible enough for UDFs ➔ Not tied to distributions ➔ Mass adoption ➔ Low latency
  • 31. Spark Profile Jobs Server UI Webapp Batch Server Profiler Service Client Server Cluster HDFS ➔ We built a Spark job server ➔ Automatically profiles output of transform jobs ➔ Outputs profiling results to HDFS ➔ Renders graphs in Trifacta product Batch ServerBatSBatS Workers
  • 32. Profiler Service ➔ Which columns to profile and their types ➔ Uses Trifacta type system, eg: ➔Dates ➔Geography ➔User-defined ➔Requested profiling statistics ➔Histograms ➔Outliers ➔Empty, invalid, valid ➔ Extensible ➔Pairwise statistics ➔User-defined functions Profiler Service JSON Specification Wrangled Data
  • 33. { “input”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”, “schema”: { “order”: [“column1”, “column2”, “column3”], “types”: { “column1”: [“Datetime”, {“regexes”: , “groupLocs”: {...}}], “column2”: [...], “column3”: [...] } }, “commands”: [ { “column”: “*”, “output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”, “profiler-type”: “histogram”, “params”: {...} }, { “column”: “column1”, “output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”, “profiler-type”: “type-check”, “params”: {...} } ] } Profiling DSL
  • 34. { “input”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”, “schema”: { “order”: [“column1”, “column2”, “column3”], “types”: { “column1”: [“Datetime”, {“regexes”: , “groupLocs”: {...}}], “column2”: [...], “column3”: [...] } }, “commands”: [ { “column”: “*”, “output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”, “profiler-type”: “histogram”, “params”: {...} }, { “column”: “column1”, “output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”, “profiler-type”: “type-check”, “params”: {...} } ] } Profiling DSL
  • 35. { “input”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”, “schema”: { “order”: [“column1”, “column2”, “column3”], “types”: { “column1”: [“Datetime”, {“regexes”: , “groupLocs”: {...}}], “column2”: [...], “column3”: [...] } }, “commands”: [ { “column”: “*”, “output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”, “profiler-type”: “histogram”, “params”: {...} }, { “column”: “column1”, “output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”, “profiler-type”: “type-check”, “params”: {...} } ] } Profiling DSL
  • 36. { “input”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”, “schema”: { “order”: [“column1”, “column2”, “column3”], “types”: { “column1”: [“Datetime”, {“regexes”: , “groupLocs”: {...}}], “column2”: [...], “column3”: [...] } }, “commands”: [ { “column”: “*”, “output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”, “profiler-type”: “histogram”, “params”: {...} }, { “column”: “column1”, “output”: “hdfs://hadoop.trifacta-dev.net:8020:/trifacta…”, “profiler-type”: “type-check”, “params”: {...} } ] } Profiling DSL
  • 37. Performance Improvements ➔ A few troublesome use cases: ➔High cardinality ➔Non-distributive stats ➔ Huge gains for large numbers of columns Few Columns Many Columns High Cardinality 2X 4X Low Cardinality 10X 20X Spark profiling speed-up vs. MapReduce
  • 38. Pay-as-you-go Profiling ➔ Users do not always need profiles for all columns ➔ More complex and expensive statistics ➔ Drilldown ➔ Iterative exploration