Continuous Optimization for Distributed BigData Analysis


Kai Sasaki

Talk at HighLoad++ 2018, Moscow


Continuous Optimization for Distributed
BigData Analysis
Kai Sasaki (Treasure Data)

Bio
Kai Sasaki
- Software Engineer at Treasure Data
- Hadoop, Presto
- Apache Hivemall
- Books
Design and Concept
https://pixabay.com/en/desktop-tidy-clean-mockup-white-2325627/

Agenda
- Who is Treasure Data
- What is distributed data analysis?
- What kind of challenges do we have?
- Our approach
  - Columnar Storage
  - Partitioning
  - Repartitioning

Treasure Data
• Founded in December 2011
• Mountain View, CA
• DMP, CDP, IoT, Cloud
• We joined Arm in October 2018

Arm x Treasure Data
• Pelion: Device-to-Data Platform
Challenges
based on Our Experience
https://pixabay.com/en/adventure-height-climbing-mountain-1807524/

Distributed Data Analysis?
• Large Scale Data
• High Throughput
• High Availability & Reliability
• Data Consistency

Distributed Processing Engines
• Hadoop
• Presto
• Spark

Typical Architecture
• Master-Worker model
https://www.tutorialspoint.com/apache_presto/apache_presto_architecture.htm

Distributed Plan
select
  t1.class,
  t2.features,
  count(1)
from iris t1
join iris t2
  on t1.class = t2.class
group by 1, 2;
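
As a rough illustration of how a master-worker engine might distribute a query like this one, the Python sketch below hash-partitions rows on the join key, lets each hypothetical worker join and count its own bucket, and merges the partial results. The function names and the in-memory "workers" are assumptions for illustration only; this is not Presto's planner or API.

```python
from collections import Counter

def shuffle_by_key(rows, num_workers):
    """Hash-partition (class, features) rows so equal classes meet on one worker."""
    buckets = [[] for _ in range(num_workers)]
    for cls, features in rows:
        buckets[hash(cls) % num_workers].append((cls, features))
    return buckets

def local_join_and_count(rows):
    """Self-join on class, then count per (t1.class, t2.features)."""
    counts = Counter()
    for c1, _ in rows:
        for c2, f2 in rows:
            if c1 == c2:
                counts[(c1, f2)] += 1
    return counts

def distributed_query(rows, num_workers=3):
    """Run the join and aggregation per bucket, then merge the partial counts."""
    result = Counter()
    for bucket in shuffle_by_key(rows, num_workers):
        result.update(local_join_and_count(bucket))
    return result
```

Because the shuffle sends every row of a given class to the same bucket, the merged result matches what a single node would compute, while each worker only sees part of the data.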

Challenges
• Network Bandwidth
• Throughput
• Transactional Processing
• Data Consistency
• System Reliability
• Service Availability

Our Approach
• Columnar Storage
• MessagePack-based columnar format
• Time Index Pushdown
• Optimization of Partitioning Layout

Columnar Storage
• Common design for OLAP workloads
• Saves I/O bandwidth
• Efficient compression and encoding
• e.g. Parquet, ORC
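
The two benefits listed above can be sketched in a few lines of Python. The helper names are hypothetical and the sketch assumes every row has the same fields; real formats such as Parquet and ORC are far more elaborate.

```python
def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def run_length_encode(values):
    """Collapse runs of equal values into [value, run_length] pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return encoded
```

A query that aggregates a single column only needs to read that column's list, and a column of similar values compresses well with simple encodings such as run-length encoding.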

MessagePack
• JSON-like binary serialization format
• Faster and smaller than JSON
• 100+ implementations
• https://msgpack.org
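
To make the size difference concrete, here is a minimal hand-written encoder for a small subset of the MessagePack format (nil, booleans, integers 0 to 127, short strings, small lists and maps). It is a sketch for illustration, not a replacement for a real MessagePack library, and the function name `pack` is an assumption.

```python
import json

def pack(obj) -> bytes:
    """Encode a tiny subset of MessagePack; raise on anything else."""
    if obj is None:
        return b"\xc0"                                  # nil
    if obj is True:
        return b"\xc3"                                  # true
    if obj is False:
        return b"\xc2"                                  # false
    if isinstance(obj, int) and 0 <= obj <= 0x7F:
        return bytes([obj])                             # positive fixint
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        if len(data) < 32:
            return bytes([0xA0 | len(data)]) + data     # fixstr
    if isinstance(obj, list) and len(obj) < 16:
        body = b"".join(pack(x) for x in obj)
        return bytes([0x90 | len(obj)]) + body          # fixarray
    if isinstance(obj, dict) and len(obj) < 16:
        body = b"".join(pack(k) + pack(v) for k, v in obj.items())
        return bytes([0x80 | len(obj)]) + body          # fixmap
    raise ValueError(f"unsupported by this sketch: {obj!r}")

doc = {"compact": True, "schema": 0}
packed_size = len(pack(doc))
json_size = len(json.dumps(doc, separators=(",", ":")))
```

For this document, `packed_size` is 18 bytes while the compact JSON form takes 27 bytes, which is the example shown on msgpack.org.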

MessagePack x Columnar File
• Type-embedded file format
• Schema-on-Read
• -> Saves network bandwidth and storage space

Time Index Pushdown
• Skips reads by time range
• Fits typical analytical use cases
• Saves network bandwidth
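
The idea of skipping reads by time range can be sketched as follows. The `Partition` record and `prune_by_time` helper are hypothetical names, and the sketch assumes each partition file records the minimum and maximum timestamp it contains.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Partition:
    path: str
    min_time: int  # inclusive, Unix seconds
    max_time: int  # inclusive, Unix seconds

def prune_by_time(partitions, start, end):
    """Keep only partitions whose [min_time, max_time] overlaps [start, end)."""
    return [p for p in partitions if p.max_time >= start and p.min_time < end]
```

Partitions that cannot contain matching rows are never fetched, so a query over one day of data does not pay the network cost of the whole table.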

Time Index Pushdown
• Indexed by PostgreSQL
• Transactional Update
• Data Consistency
• GiST index enables efficient multi-column indexing

Partition Size?
• Partition file size significantly affects performance
• 1,000,000 records / file
• 256MB / file
• But the best size depends on the workload
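
As a small worked example of these targets, the sketch below estimates how many partition files a table would need. It assumes both figures act as upper bounds per file and that 256MB means 256 MiB; the helper name and both assumptions are illustrative, not taken from the talk.

```python
import math

MAX_RECORDS_PER_FILE = 1_000_000
MAX_BYTES_PER_FILE = 256 * 1024 * 1024  # assumption: 256MB read as 256 MiB

def estimate_file_count(total_records, total_bytes):
    """Return the smallest file count that respects both per-file limits."""
    by_records = math.ceil(total_records / MAX_RECORDS_PER_FILE)
    by_bytes = math.ceil(total_bytes / MAX_BYTES_PER_FILE)
    return max(by_records, by_bytes)
```

Whichever limit is hit first decides the count: many small records are bounded by the record limit, while a few wide records are bounded by the byte limit.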

Auto Optimization
• Partitioning layout should fit the actual workload
  • File size
  • Time range
  • Partitioning key

Repartitioning
• Small distributed partition files
  • High IO overhead
• Few large partition files
  • High memory pressure
TRADE-OFF PROBLEM

Repartitioning
• Partitioning key determines the throughput
• e.g. Customer segmentation by
  • User ID
  • Purchase item
  • Living address
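
Why the key matters can be sketched with a simple hash partitioner. The helper names below are hypothetical, and CRC32 is used only because it gives a stable bucket number across processes; the talk does not say which hash function is used in practice.

```python
import zlib
from collections import defaultdict

def bucket_for(key: str, num_buckets: int) -> int:
    """Assign a key to a bucket deterministically."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets

def partition_by_key(records, key_field, num_buckets):
    """Group records into buckets by the hash of the chosen key field."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[bucket_for(str(rec[key_field]), num_buckets)].append(rec)
    return buckets

def lookup(buckets, key_field, key, num_buckets):
    """Scan only the single bucket that can contain the key."""
    candidates = buckets.get(bucket_for(str(key), num_buckets), [])
    return [r for r in candidates if r[key_field] == key]
```

A query filtering on the partitioning key touches one bucket instead of all of them; a query filtering on any other field still has to scan every bucket, which is why the key should match the workload.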

User Defined Partitioning
• Custom partitioning scheme defined by our users (or by ourselves)

User Defined Partitioning
• Granularity
• Partitioning Key Selection

Stella Connector
• Repartitioning & UDP are designed as a Presto connector
• Makes use of Presto's high scalability and reliability for such heavy workloads

Stella Connector
CREATE TABLE remerged
WITH (max_file_size = '256MB', max_time_range = '48h') AS
SELECT *
FROM partition.sources
WHERE table_schema = 'tpch_s1'
  AND table_name = 'lineitem'
  AND TD_TIME_RANGE(time, '1998-10-11', '1998-10-20')
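
One way to picture what the two table properties could control is a greedy merge plan over existing partition files. The Python sketch below is an assumption about the general technique, not the Stella connector's actual algorithm; `FileMeta` and `plan_merges` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileMeta:
    path: str
    size: int      # bytes
    min_time: int  # Unix seconds
    max_time: int  # Unix seconds

def plan_merges(files, max_file_size, max_time_range):
    """Greedily group time-sorted files; start a new group whenever adding
    a file would exceed max_file_size bytes or max_time_range seconds."""
    groups, current = [], []
    size, start, end = 0, 0, 0
    for f in sorted(files, key=lambda f: f.min_time):
        if current:
            too_big = size + f.size > max_file_size
            too_wide = max(end, f.max_time) - start > max_time_range
            if too_big or too_wide:
                groups.append(current)
                current = []
        if not current:
            size, start, end = 0, f.min_time, f.max_time
        current.append(f)
        size += f.size
        end = max(end, f.max_time)
    if current:
        groups.append(current)
    return groups
```

Each resulting group would be rewritten as one output file. A file that already exceeds a limit on its own is left alone in its group, since this sketch only merges and never splits.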

Stella Connector
• Scalable
• Reliable
• Easy to embed into a workflow
• Automatic Storage Optimization!

Recap
- Treasure Data Overview
- Architecture of Distributed Data Analysis
- Challenges
- Our Approach
  - Columnar Storage
  - Partitioning
  - Repartitioning

Continuous Optimization for Distributed BigData Analysis

1. Continuous Optimization for Distributed BigData Analysis
Kai Sasaki (Treasure Data)

2. Bio
Kai Sasaki
- Software Engineer at Treasure Data
- Hadoop, Presto
- Apache Hivemall
- Books

3. Design and Concept
https://pixabay.com/en/desktop-tidy-clean-mockup-white-2325627/

4. Agenda
- Who is Treasure Data
- What is distributed data analysis?
- What kind of challenges do we have?
- Our approach
- Columnar Storage
- Partitioning
- Repartitioning

5. Treasure Data

6. Treasure Data
• Founded in Dec, 2011
• Mountain View, CA
• DMP, CDP, IoT, Cloud
• We joined Arm in Oct, 2018

7. Treasure Data
• Open Source Lovers

8. Enterprise Data Analysis

9. Arm x Treasure Data
• Pelion: Device-to-Device Platform

10. Challenges based on Our Experience
https://pixabay.com/en/adventure-height-climbing-mountain-1807524/

11. Distributed Data Analysis?
• Large Scale Data
• High Throughput
• High Availability & Reliability
• Data Consistency

12. Distributed Processing Engines
• Hadoop
• Presto
• Spark

13. Typical Architecture
• Master-Worker model
https://www.tutorialspoint.com/apache_presto/apache_presto_architecture.htm

14. Distributed Plan
select
  t1.class,
  t2.features,
  count(1)
from iris t1
join iris t2
  on t1.class = t2.class
group by 1, 2;

15. Challenges
• Network Bandwidth
• Throughput
• Transactional Processing
• Data Consistency
• System Reliability
• Service Availability

16. Our Approach
• Columnar Storage
• MessagePack based columnar format
• Time Index Pushdown
• Optimization of Partitioning Layout

17. Columnar Storage
• General design for OLAP workloads
• Saves IO bandwidth
• Efficient compression and encoding
• e.g. Parquet, ORC
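
The IO saving and the compression benefit of a columnar layout can be illustrated with a toy comparison. This is only a sketch: the table contents and the helper names `to_columns` and `run_length_encode` are made up for illustration and are not taken from the talk or from Parquet/ORC.

```python
from itertools import groupby

# Hypothetical table in row-oriented form (illustrative data only).
rows = [
    {"time": 1, "class": "setosa", "width": 0.2},
    {"time": 2, "class": "setosa", "width": 0.2},
    {"time": 3, "class": "setosa", "width": 0.4},
    {"time": 4, "class": "virginica", "width": 2.5},
]

def to_columns(rows):
    """Pivot row records into one list per column."""
    return {name: [r[name] for r in rows] for name in rows[0]}

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

columns = to_columns(rows)

# A query that groups by `class` needs only that one column.
values_read_row_layout = len(rows) * len(rows[0])   # every field of every row
values_read_column_layout = len(columns["class"])   # just the `class` column

# Similar values sit next to each other in a column, so simple encodings work well.
encoded_class = run_length_encode(columns["class"])
```

In this toy case the row layout touches 12 values while the columnar layout touches 4, and the `class` column shrinks to two run-length pairs.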

18. MessagePack
• JSON-like binary serialization format
• Faster and smaller
• 100+ implementations
• https://msgpack.org

19. MessagePack x Columnar File
• Type embedded file format
• Schema-on-Read
• -> Saving network bandwidth and storage space efficiently
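
To make "type embedded" and "smaller" concrete, here is a hand-written encoder for a tiny subset of the MessagePack format (booleans, integers 0 to 127, short strings, small maps). It is an illustrative sketch, not one of the real implementations listed on msgpack.org and not Treasure Data's file format; the function name `pack` is an assumption of this example.

```python
import json

def pack(obj):
    """Encode a small subset of MessagePack; each value starts with a type tag byte."""
    if obj is True:
        return b"\xc3"                              # true
    if obj is False:
        return b"\xc2"                              # false
    if isinstance(obj, int) and 0 <= obj <= 0x7F:
        return bytes([obj])                         # positive fixint
    if isinstance(obj, str):
        raw = obj.encode("utf-8")
        if len(raw) <= 31:
            return bytes([0xA0 | len(raw)]) + raw   # fixstr: tag carries the length
    if isinstance(obj, dict) and len(obj) <= 15:
        out = bytes([0x80 | len(obj)])              # fixmap: tag carries the entry count
        for key, value in obj.items():
            out += pack(key) + pack(value)
        return out
    raise ValueError(f"unsupported in this sketch: {obj!r}")

record = {"compact": True, "schema": 0}
packed = pack(record)
as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")
```

Here `packed` is 18 bytes against 27 bytes for the compact JSON text. Because every value carries its own type tag, a reader can recover types without a separately stored schema, which is the idea behind schema-on-read.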

20. MessagePack x Columnar File

21. Time Index Pushdown
• Read skipping by time range
• Fits typical analytical use cases
• Saves network bandwidth

22. Time Index Pushdown
• Indexed by PostgreSQL
• Transactional Update
• Data Consistency
• GiST index provides an efficient multi-column index
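
Read skipping by time range can be sketched as an overlap test against per-file metadata. In this example a plain Python list stands in for the index that the slides say is kept in PostgreSQL; the `Partition` class, the `prune` function, the file names, and the half-open query range are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Partition:
    path: str        # hypothetical file location
    min_time: int    # smallest record timestamp in the file (inclusive)
    max_time: int    # largest record timestamp in the file (inclusive)

def prune(partitions, start, end):
    """Keep only partitions whose time range overlaps the query range [start, end)."""
    return [p for p in partitions if p.max_time >= start and p.min_time < end]

index = [
    Partition("part-000", 0, 99),
    Partition("part-001", 100, 199),
    Partition("part-002", 200, 299),
]

# Only files that can contain matching records are fetched over the network.
selected = prune(index, start=150, end=200)
```

With this data only `part-001` is selected, so two of the three files are never read.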

23. Time-Range Partitioning

24. Time Index Pushdown

25. Partition Size?
• The partition file size affects performance significantly
• 1,000,000 records / file
• 256MB / file
• But it depends on the workload

26. Auto Optimization
• The partitioning layout should fit the actual workload
• File size
• Time range
• Partitioning key

27. Repartitioning
• Many small distributed partition files -> high IO overhead
• Few large partition files -> high memory pressure
• Trade-off problem
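
One way to picture the trade-off is a merge plan that packs small files into outputs capped at a target size. The greedy, in-order grouping below is an assumption chosen for simplicity and is not the algorithm described in the talk; `plan_merges` is a hypothetical name, and the 256 MB default only mirrors the figure quoted on the partition size slide.

```python
def plan_merges(file_sizes_mb, max_file_size_mb=256):
    """Group consecutive files so each merged output stays within the size limit.

    A single file that already exceeds the limit forms its own group.
    """
    groups, current, current_size = [], [], 0
    for size in file_sizes_mb:
        # Close the current group when adding this file would exceed the limit.
        if current and current_size + size > max_file_size_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

small_files = [40, 60, 100, 80, 30, 200, 10]   # illustrative sizes in MB
merged = plan_merges(small_files)
```

Seven small files become three merged files here, which reduces per-file IO overhead while the size cap bounds the memory needed to process any one file.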

28. Repartitioning
• Partitioning key decides the throughput
• e.g. customer segmentation by user ID, purchased item, or home address

29. User Defined Partitioning
• Custom partitioning schema defined by our users (or by us)

30. User Defined Partitioning

31. Colocated Join
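
The idea behind a colocated join can be sketched as follows: when both tables are bucketed on the same key with the same function, rows that can match always land in the same bucket number, so each bucket can be joined locally without shuffling data between workers. The function names, the use of `zlib.crc32`, and the sample tables are assumptions for illustration; this is not Presto's implementation.

```python
import zlib
from collections import defaultdict

def bucket_of(key, num_buckets):
    """Deterministic bucket assignment (crc32 is an arbitrary choice for this sketch)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets

def partition_by(rows, key, num_buckets):
    """Split rows into buckets according to the partitioning key."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[key], num_buckets)].append(row)
    return buckets

def colocated_join(left, right, key, num_buckets):
    """Join bucket by bucket; matching keys always share a bucket number."""
    left_buckets = partition_by(left, key, num_buckets)
    right_buckets = partition_by(right, key, num_buckets)
    joined = []
    for b, left_rows in left_buckets.items():
        # Build a lookup from only the right-side rows in the same bucket.
        lookup = defaultdict(list)
        for r in right_buckets.get(b, []):
            lookup[r[key]].append(r)
        for l in left_rows:
            for r in lookup.get(l[key], []):
                joined.append({**l, **r})
    return joined

users = [
    {"user_id": 1, "segment": "a"},
    {"user_id": 2, "segment": "b"},
    {"user_id": 3, "segment": "a"},
]
purchases = [
    {"user_id": 1, "item": "book"},
    {"user_id": 3, "item": "pen"},
    {"user_id": 3, "item": "ink"},
]
result = colocated_join(users, purchases, "user_id", num_buckets=4)
```

Each loop iteration over a bucket touches only that bucket's rows, which is what lets separate workers process buckets independently.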

32. User Defined Partitioning

33. User Defined Partitioning
• Granularity
• Partitioning Key Selection

34. Stella Connector
• Repartitioning & UDP are designed as a Presto connector
• Makes use of Presto's high scalability and reliability for such heavy workloads

35. Stella Connector
CREATE TABLE remerged
WITH (max_file_size = '256MB', max_time_range = '48h')
AS SELECT *
FROM partition.sources
WHERE table_schema = 'tpch_s1'
  AND table_name = 'lineitem'
  AND TD_TIME_RANGE(time, '1998-10-11', '1998-10-20')

36. Stella Connector
• Scalable
• Reliable
• Easy to embed into a workflow
• Automatic Storage Optimization!

37. Recap
- Treasure Data Overview
- Architecture of Distributed Data Analysis
- Challenges
- Our Approach
- Columnar Storage
- Partitioning
- Repartitioning

38. Thanks!
