If you have your own Columnar format, stop now and use Parquet 😛


Lightning talk presented at HPTS 2015 (http://hpts.ws/). Apache Parquet is the de facto standard columnar storage format for big data. Open source and proprietary SQL engines already integrate with it, because their users don’t want to load and duplicate their data in every tool. Users want an open, interoperable, efficient format so they can experiment with the many options they have. The format is defined by the open source community, integrating feedback from many teams working on query engines (including but not limited to Impala, Drill, Hawq, SparkSQL, Presto and Hive) or on infrastructure at scale (Twitter, Netflix, Stripe, Criteo, ...). Building on its initial success, the Parquet community is defining new features for the next iteration of the format, for example an improved metadata layout, a more complete type system, and mergeable statistics used for planning.


© 2015 Dremio Corporation
If you have your own Columnar format,
stop now and use Parquet
😛
Julien Le Dem
Principal Architect, Dremio
VP Apache Parquet

About Dremio
Jacques Nadeau
Founder & CTO
• Apache Drill PMC Chair
• Recognized SQL & NoSQL expert
• Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT)
Tomer Shiran
Founder & CEO
• Apache Drill Founder
• MapR (VP Product); Microsoft; IBM Research
• Carnegie Mellon, Technion
Julien Le Dem
Architect
• Apache Parquet Founder
• Apache Pig PMC Member
• Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect)
Top Silicon Valley VCs
• Enabling self-service data discovery, exploration and analysis on modern data
• Founded in June 2015
• Building on open source technologies including Drill, Parquet, Spark

Background of Parquet
• Twitter’s data
  • Lots of data: instrumentation, user graph, derived data, ...
  • Complex: deeply nested structures
• Analytics infrastructure:
  • Hadoop clusters of several thousand nodes
  • Log collection to HDFS in Thrift
• Parquet
  • Columnar: space and query efficient
  • Inspired by the Google Dremel paper
  • Supports complex data
  • Interoperable
Image: Caillebotte, The Parquet Planers

Parquet timeline
• Fall 2012: Twitter & Cloudera’s Impala team merge efforts to develop columnar formats.
• March 2013: OSS announcement; Criteo signs on for Hive integration.
• July 2013: 1.0 release. 18 contributors from more than 5 organizations.
• August 2013: Drill chooses Parquet as its primary storage format.
• May 2014: Apache Incubator. 40+ contributors, 18 with 1000+ LOC. 26 incremental releases.
• Apr 2015: Parquet graduates from the Apache Incubator.

What does Parquet do?
Interoperability
Space efficiency
Query efficiency
@EmrgencyKittens

Columnar storage
Nested schema: a, b, c
Logical table representation:
a   b   c
a1  b1  c1
a2  b2  c2
a3  b3  c3
a4  b4  c4
a5  b5  c5
Row layout: a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5
Column layout: a1 a2 a3 a4 a5 b1 b2 b3 b4 b5 c1 c2 c3 c4 c5
On disk: encoded chunk | encoded chunk | encoded chunk
Encodings: Dictionary, RLE, Delta, Prefix
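
The difference between the two layouts, and two of the encodings named on this slide, can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the dictionary and run-length encoders are simplified textbook versions, not Parquet's actual on-disk encodings.

```python
from itertools import groupby

# Logical table from the slide: columns a, b, c with five rows.
rows = [(f"a{i}", f"b{i}", f"c{i}") for i in range(1, 6)]

def row_layout(rows):
    """Store records one after another: a1 b1 c1 a2 b2 c2 ..."""
    return [value for row in rows for value in row]

def column_layout(rows):
    """Store each column contiguously: a1 a2 ... a5 b1 b2 ... c5."""
    return [value for column in zip(*rows) for value in column]

def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary, index_of, indices = [], {}, []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices

def run_length_encode(values):
    """Collapse consecutive repeats into (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

# A low-cardinality column (hypothetical data) compresses well once its
# values sit next to each other, which is what the column layout provides.
countries = ["us", "us", "us", "fr", "fr", "us"]
dictionary, indices = dictionary_encode(countries)
runs = run_length_encode(indices)
```

Because values of the same column share a type and often repeat, storing them contiguously is what makes encodings like these effective.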

Nested representation
Schema:
Document
  DocId
  Links
    Backward
    Forward
  Name
    Language
      Code
      Country
    Url
Columns:
docid
links.backward
links.forward
name.language.code
name.language.country
name.url
Borrowed from the Google Dremel paper
https://blog.twitter.com/2013/dremel-made-simple-with-parquet
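
As a rough illustration of how the nested schema maps to the column list on this slide, the sketch below walks the schema tree and emits one dotted path per leaf field. The dict-based schema representation and the `leaf_columns` helper are assumptions made for this example. It deliberately omits the repetition and definition levels that the Dremel approach uses to reassemble records.

```python
# The Document schema from the slide; None marks a leaf (primitive) field.
schema = {
    "DocId": None,
    "Links": {"Backward": None, "Forward": None},
    "Name": {
        "Language": {"Code": None, "Country": None},
        "Url": None,
    },
}

def leaf_columns(node, prefix=()):
    """Return one dotted, lower-cased path per leaf field, in schema order."""
    paths = []
    for name, child in node.items():
        path = prefix + (name.lower(),)
        if child is None:
            paths.append(".".join(path))
        else:
            # Groups are not stored themselves; only their leaves become columns.
            paths.extend(leaf_columns(child, path))
    return paths

columns = leaf_columns(schema)
```

Only the leaves become columns; the intermediate groups (Links, Name, Language) exist only in the schema.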

Statistics
Vertical partitioning (projection push down)
+ Horizontal partitioning (predicate push down)
= Read only the data you need!
[Figure: three copies of a five-row table with columns a, b, c, one for each of the cases above]
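
The combination of the two push downs can be sketched with a toy in-memory model. Everything here is an assumption for illustration: the row-group dictionaries, the `make_row_group` and `scan` helpers, and the sample data are hypothetical and are not a Parquet API. The sketch keeps min/max statistics per column chunk, skips row groups whose statistics rule out the predicate, and touches only the requested columns.

```python
def make_row_group(rows, columns):
    """Split rows into per-column chunks and record min/max statistics."""
    chunks = {name: [row[i] for row in rows] for i, name in enumerate(columns)}
    stats = {name: (min(values), max(values)) for name, values in chunks.items()}
    return {"chunks": chunks, "stats": stats}

def scan(row_groups, projection, column, lower, upper):
    """Return projected values for rows where lower <= row[column] <= upper."""
    result = []
    groups_read = 0
    for group in row_groups:
        lo, hi = group["stats"][column]
        if hi < lower or lo > upper:
            continue  # predicate push down: statistics exclude this row group
        groups_read += 1
        # Projection push down: only the requested column chunks are touched.
        selected = [group["chunks"][name] for name in projection]
        for i, value in enumerate(group["chunks"][column]):
            if lower <= value <= upper:
                result.append(tuple(chunk[i] for chunk in selected))
    return result, groups_read

columns = ["a", "b", "c"]
table = [(i, f"b{i}", f"c{i}") for i in range(1, 11)]
# Two row groups of five rows each (horizontal partitioning).
row_groups = [make_row_group(table[i:i + 5], columns) for i in range(0, 10, 5)]

matches, groups_read = scan(row_groups, ["b"], "a", 7, 8)
```

In this example the first row group covers a = 1..5, so its statistics exclude the predicate and it is never read; column c is never touched at all.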

Interoperability
Library level integration:
• Object model: Avro, Thrift, Protocol Buffer, Pig Tuple, Hive SerDe, ...
• Converters: parquet-avro, parquet-thrift, parquet-proto, parquet-pig, parquet-hive
• Assembly/striping (model agnostic)
• Column encodings
Query engine integration:
• Impala, Presto, Drill, …
• Encodings (C)
Parquet file format (language agnostic)

Query engines, frameworks and libraries integrated with Parquet (non-exhaustive)
Query engines: Hive, Impala, HAWQ, IBM Big SQL, Drill, Tajo, Pig, Presto, SparkSQL
Frameworks: Spark, MapReduce, Cascading, Crunch, Scalding, Kite
Data Models: Avro, Thrift, ProtocolBuffers, POJOs

Loose coupling
• Users don’t want to load their data into every tool.
• Many tools are available and show up every day.
• The cost of trying a new tool should be minimal.
[Diagram: Storage (HDFS/S3/…) holding a query-efficient format (Parquet), shared by:]
• Interactive queries (Drill, Impala, Presto, …)
• Automated dashboard
• Machine learning
• Graph processing (Giraph, …)
• Batch computation (Pig, Cascading, Scalding, Spark, …)

Get involved
Twitter:
- @ApacheParquet
Mailing list:
- dev@parquet.apache.org
GitHub repo:
- https://github.com/apache/parquet-mr
Parquet sync ups:
- Regular meetings on Google Hangouts


JMeter webinar - integration with InfluxDB and Grafana

RTTS

Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application. In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics. Length: 30 minutes Session Overview ------------------------------------------- During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana: - What out-of-the-box solutions are available for real-time monitoring JMeter tests? - What are the benefits of integrating InfluxDB and Grafana into the load testing stack? - Which features are provided by Grafana? - Demonstration of InfluxDB and Grafana using a practice web application To view the webinar recording, go to: https://www.rttsweb.com/jmeter-integration-webinar

"Impact of front-end architecture on development cost", Viktor Turskyi

Fwdays

I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.

Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl

Peter Udo Diehl

I'm excited to share my latest predictions on how AI, robotics, and other technological advancements will reshape industries in the coming years. The slides explore the exponential growth of computational power, the future of AI and robotics, and their profound impact on various sectors. Why this matters: The success of new products and investments hinges on precise timing and foresight into emerging categories. This deck equips founders, VCs, and industry leaders with insights to align future products with upcoming tech developments. These insights enhance the ability to forecast industry trends, improve market timing, and predict competitor actions. Highlights: ▪ Exponential Growth in Compute: How $1000 will soon buy the computational power of a human brain ▪ Scaling of AI Models: The journey towards beyond human-scale models and intelligent edge computing ▪ Transformative Technologies: From advanced robotics and brain interfaces to automated healthcare and beyond ▪ Future of Work: How automation will redefine jobs and economic structures by 2040 With so many predictions presented here, some will inevitably be wrong or mistimed, especially with potential external disruptions. For instance, a conflict in Taiwan could severely impact global semiconductor production, affecting compute costs and related advancements. Nonetheless, these slides are intended to guide intuition on future technological trends.

FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf

FIDO Alliance

How world-class product teams are winning in the AI era by CEO and Founder, P...

Product School

If you have your own Columnar format, stop now and use Parquet 😛

2. About Dremio
• Enabling self-service data discovery, exploration and analysis on modern data
• Founded in June 2015
• Building on open source technologies including Drill, Parquet, Spark
• Top Silicon Valley VCs

Jacques Nadeau, Founder & CTO
• Apache Drill PMC Chair
• Recognized SQL & NoSQL expert
• Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT)

Tomer Shiran, Founder & CEO
• Apache Drill Founder
• MapR (VP Product); Microsoft; IBM Research
• Carnegie Mellon, Technion

Julien Le Dem, Architect
• Apache Parquet Founder
• Apache Pig PMC Member
• Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect)

3. Background of Parquet
• Twitter’s data
  • Lots of data: Instrumentation, User graph, Derived data, ...
  • Complex: deeply nested structures
• Analytics infrastructure:
  • Hadoop clusters with several thousand nodes
  • Log collection to HDFS in Thrift
• Parquet
  • Columnar: space and query efficient
  • Inspired by the Google Dremel paper
  • Supports complex data
  • Interoperable
[Image: Caillebotte, The Parquet Planers]

4. Parquet timeline
• Fall 2012: Twitter & Cloudera’s Impala team merge efforts to develop columnar formats.
• March 2013: OSS announcement; Criteo signs on for Hive integration.
• July 2013: 1.0 release. 18 contributors from more than 5 organizations.
• August 2013: Drill chose Parquet as its primary storage format.
• May 2014: Apache Incubator. 40+ contributors, 18 with 1000+ LOC. 26 incremental releases.
• Apr 2015: Parquet graduates from the Apache Incubator.

6. Columnar storage
[Figure: a logical table with columns a, b, c and rows a1 b1 c1 through a5 b5 c5, shown as a logical table representation, a row layout, and a column layout, alongside a nested schema. On disk, each column is stored as an encoded chunk.]
• Encodings: Dictionary, RLE, Delta, Prefix
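
The difference between the two layouts, and why a column layout encodes well, can be sketched in a few lines of Python. This is an illustrative toy and not Parquet's actual implementation: the function names and the `countries` sample column are assumptions made for the example, and Parquet's real encodings (for instance its RLE and bit-packing hybrid) are more involved.

```python
from itertools import groupby


def to_row_layout(rows):
    """Store values record by record: a1 b1 c1 a2 b2 c2 ..."""
    return [value for row in rows for value in row]


def to_column_layout(rows):
    """Store values column by column: a1 a2 ... then b1 b2 ... then c1 c2 ..."""
    return [list(column) for column in zip(*rows)]


def dictionary_encode(column):
    """Replace each value with an index into a list of distinct values."""
    dictionary = []
    index_of = {}
    indices = []
    for value in column:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


# The a/b/c table from the slide, five rows.
rows = [(f"a{i}", f"b{i}", f"c{i}") for i in range(1, 6)]
row_layout = to_row_layout(rows)
column_layout = to_column_layout(rows)

# Hypothetical low-cardinality column: values of one type sit next to each
# other in a column layout, so dictionary and run-length encoding pay off.
countries = ["us", "us", "us", "fr", "fr", "us"]
dictionary, indices = dictionary_encode(countries)
runs = run_length_encode(indices)
```

Here `indices` holds small integers instead of strings, and `runs` stores one pair per run of repeated values rather than one entry per row.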

7. Nested representation
Schema (borrowed from the Google Dremel paper):
• Document
  • DocId
  • Links
    • Backward
    • Forward
  • Name
    • Language
      • Code
      • Country
    • Url
Columns:
• docid
• links.backward
• links.forward
• name.language.code
• name.language.country
• name.url
https://blog.twitter.com/2013/dremel-made-simple-with-parquet
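
As a rough illustration of how the nested schema maps to the flat list of columns, the sketch below walks the schema tree and emits one dotted path per leaf field. The dict-based schema representation and the `leaf_columns` helper are assumptions made for this example. It covers only the naming of columns; the Dremel approach also stores repetition and definition levels with each value so that records can be reassembled, which is not shown here.

```python
# The Document schema from the slide as a nested dict.
# None marks a leaf (primitive) field; a dict marks a group of fields.
schema = {
    "DocId": None,
    "Links": {"Backward": None, "Forward": None},
    "Name": {
        "Language": {"Code": None, "Country": None},
        "Url": None,
    },
}


def leaf_columns(node, prefix=()):
    """Return one dotted, lower-cased path per leaf field, depth first."""
    columns = []
    for name, child in node.items():
        path = prefix + (name.lower(),)
        if child is None:
            columns.append(".".join(path))
        else:
            columns.extend(leaf_columns(child, path))
    return columns


columns = leaf_columns(schema)
for column in columns:
    print(column)
```

Each leaf becomes its own column, so a query that touches only `name.url` does not need to read the data stored for `links.forward`.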

8. Statistics
• Vertical partitioning (projection push down)
• Horizontal partitioning (predicate push down)
• Vertical + horizontal = read only the data you need!
[Figure: three copies of the a/b/c table (rows a1 b1 c1 through a5 b5 c5) illustrating column selection, row selection, and the two combined.]
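
A minimal sketch of how the two kinds of push down combine, assuming a hypothetical in-memory stand-in for a file: a list of row groups, each with per-column values and min/max statistics computed at write time. The names `row_groups`, `column_stats` and `scan` are invented for this example and are not part of any Parquet library.

```python
# Hypothetical stand-in for a columnar file: three row groups, three columns.
row_groups = [
    {"a": [1, 2, 3], "b": ["x", "y", "z"], "c": [10.0, 11.0, 12.0]},
    {"a": [4, 5, 6], "b": ["u", "v", "w"], "c": [13.0, 14.0, 15.0]},
    {"a": [7, 8, 9], "b": ["r", "s", "t"], "c": [16.0, 17.0, 18.0]},
]


def column_stats(group):
    """Compute (min, max) per column, as a writer would when closing a row group."""
    return {name: (min(values), max(values)) for name, values in group.items()}


# Statistics are computed once, up front, and consulted before reading data.
stats = [column_stats(group) for group in row_groups]


def scan(row_groups, stats, columns, filter_column, lower_bound):
    """Return projected rows where filter_column >= lower_bound.

    Also returns the number of row groups that were actually read.
    """
    result = []
    groups_read = 0
    for group, group_stats in zip(row_groups, stats):
        _, max_value = group_stats[filter_column]
        if max_value < lower_bound:
            continue  # predicate push down: no row in this group can match
        groups_read += 1
        # Projection push down: touch only the requested columns
        # plus the column needed to evaluate the filter.
        needed = {name: group[name] for name in set(columns) | {filter_column}}
        for i, value in enumerate(needed[filter_column]):
            if value >= lower_bound:
                result.append(tuple(needed[name][i] for name in columns))
    return result, groups_read


matching_rows, groups_read = scan(row_groups, stats, ["a", "c"], "a", 5)
```

With the filter `a >= 5`, the first row group is skipped on its statistics alone, and column `b` is never touched in the groups that are read.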

9. Interoperability
[Diagram: layers of the Parquet stack, from object models down to the file format]
• Library level integration
  • Object model: Avro, Thrift, Protocol Buffer, Pig Tuple, Hive SerDe, ...
  • Converters: parquet-avro, parquet-thrift, parquet-proto, parquet-pig, parquet-hive, ...
  • Assembly/striping (model agnostic)
  • Column encodings
• Query engine integration: Impala, Presto, Drill, …
  • Encodings (C)
• Parquet file format (language agnostic)

10. Query engines, frameworks and libraries integrated with Parquet (non exhaustive)
• Query engines: Hive, Impala, HAWQ, IBM Big SQL, Drill, Tajo, Pig, Presto, SparkSQL
• Frameworks: Spark, MapReduce, Cascading, Crunch, Scalding, Kite
• Data Models: Avro, Thrift, ProtocolBuffers, POJOs

11. Loose coupling
• Users don’t want to load their data into every tool.
• Many tools are available and show up every day.
• The cost of trying a new tool should be minimal.
[Diagram: Storage (HDFS/S3/…) holds data in a query-efficient format (Parquet), which is shared by the tools on top of it:]
• Interactive queries (Drill, Impala, Presto, …)
• Batch computation (Pig, Cascading, Scalding, Spark, …)
• Graph Processing (Giraph, …)
• Automated dashboard
• Machine learning

12. Get involved
• Twitter: @ApacheParquet
• Mailing list: dev@parquet.apache.org
• Github repo: https://github.com/apache/parquet-mr
• Parquet sync ups: regular meetings on Google Hangout
