High-level use-case description of one department of a hospital, and a comparison of two solutions: 1) a big data solution using Cloudera Impala; and 2) a traditional RDBMS solution using Oracle DB.
2. Presentation
• Our Objectives
• Requirements and context
• Project scope
• Hadoop Solution
– Big Data Solution Overview
– Hive Table Schema
– Compression Performance
– Data Architecture in Hadoop
– Hadoop/Impala Prototype Demo
• Oracle Solution
• Hadoop vs Oracle comparison
• What are expensive queries?
3. Our Objectives
• Lead an end-of-study project in an industrial context
– Requirements elicitation
– Implement a “proof-of-concept” prototype
• Experiment with big data technologies
– Compare with RDBMS
4. Requirements and context
• Department of Medical Diagnostics
(medical test results DB, e.g. blood, urine, ...)
– Dr. Shaun Eintracht
• “Ad hoc” queries
• ETL queries
– Dr. Elizabeth Mac Namara
• “Business intelligence” requirements
• Real-time dashboard
• Department of Endocrinology
– Dr. Mark Trifiro
• Data mining
5. Project scope
• First iteration = improve ad-hoc queries
– Slow analytical queries and ETL (MS Access)
– Risk of “crashing” the production DB
– Some queries impossible to process
11. Data Architecture in Hadoop
• All big tables are pre-joined
– With specimen (1)
– Without specimen (2)
• Partitioned using two schemes
– Year-month (3)
– Year and Test (4)
• 4 different versions of the same data:
– stay_order_results_yearmonth
– stay_order_results_year_and_test
– stay_order_results_specimen_yearmonth
– stay_order_results_specimen_year_and_test
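The four variants above can be sketched as Hive DDL — a minimal sketch only, assuming Parquet storage and illustrative column names (the actual schema is not shown in the deck); the partition columns are what distinguish scheme (3) from scheme (4):

```sql
-- Sketch: column names are illustrative, not the actual hospital schema.
-- Scheme (3): one partition per year-month
CREATE TABLE stay_order_results_yearmonth (
  stay_id      BIGINT,
  order_id     BIGINT,
  test_code    STRING,
  result_value STRING
)
PARTITIONED BY (year_month STRING)   -- e.g. '2013-07'
STORED AS PARQUET;

-- Scheme (4): one partition per (year, test) pair
CREATE TABLE stay_order_results_year_and_test (
  stay_id      BIGINT,
  order_id     BIGINT,
  result_value STRING
)
PARTITIONED BY (year INT, test_code STRING)
STORED AS PARQUET;
```

In Hive, partition columns live outside the regular column list, so each partitioning scheme requires its own physical copy of the data — which is why the same pre-joined data ends up in four tables.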
13. Oracle Solution
• Same tables as source DB
– A big pre-joined table is not a good solution
• Techniques explored:
– Partitioning
• Partitions automatically created
– Compression
• Inefficient for joins
– Clustering
– Join multiple partitioned tables
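The "partitions automatically created" point matches Oracle's interval partitioning, which could look like the following — a sketch under assumed table and column names, not the prototype's actual DDL:

```sql
-- Sketch: Oracle interval partitioning; names are illustrative.
-- Oracle creates a new monthly partition automatically on first insert.
CREATE TABLE order_results (
  order_id     NUMBER,
  test_code    VARCHAR2(16),
  result_value VARCHAR2(64),
  result_date  DATE
)
PARTITION BY RANGE (result_date)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(PARTITION p0 VALUES LESS THAN (DATE '2010-01-01'));
```

Only the initial partition `p0` is declared; subsequent partitions appear as data arrives, so the DBA never has to pre-create them.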
14. Oracle Solution (continued)
• Avoid too many indexes on the big tables:
– They take a lot of memory
– They are slow to create
– The optimizer may not use them if a query touches more than 5% of the rows
15. Comparison: Hadoop Solution
• Pros
– Crunches massive amounts of data
– Scalability
– Free software
• Cons
– Needs a better UI and tuning
– Maintenance cost
– Requires ETL time to merge data into one table
– Big joins should be avoided
16. Comparison: Oracle Solution
• Pros
– Just need to create a slave DB (just?)
– Faster random lookups
– Easier to find expertise
• Cons
– Scales only up to a certain point
– Synchronization with the master DB:
• Rebuilding indexes would take hours
17. What are expensive queries?
• If possible, avoid these constructs on large result sets
– SELECT DISTINCT
– ORDER BY
– GROUP BY
– JOIN big table with another big table
• JOIN big table with multiple small tables should be OK
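The contrast above can be illustrated with two queries — illustrative only; the table names (`orders`, `results`, `test_codes`) are assumptions, except `stay_order_results_yearmonth`, which is one of the pre-joined tables from the Hadoop architecture slide:

```sql
-- Expensive: two big fact tables joined, then DISTINCT and a sort
-- over the whole result set.
SELECT DISTINCT o.patient_id, r.test_code
FROM orders o
JOIN results r ON r.order_id = o.order_id   -- big table x big table
ORDER BY r.test_code;

-- Cheaper: read the pre-joined table, prune to one partition, and
-- join only against a small lookup table.
SELECT r.stay_id, t.test_name
FROM stay_order_results_yearmonth r
JOIN test_codes t ON t.test_code = r.test_code  -- small dimension table
WHERE r.year_month = '2013-07';                 -- partition pruning
```

The second shape is exactly why the big tables were pre-joined and partitioned: the expensive big-to-big join is paid once at ETL time instead of on every query.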
18. Conclusion
• Recommendation to use a “classic” RDBMS
– The database fits on a single node
– Existing in-house expertise
– Acceptable performance with appropriate tuning
– Stop using MS Access
• Disadvantage: limited scalability
Editor's notes
Choosing Shaun's use case: smaller scale, immediate need, lets us test the technology.
The database contains the patients' specimen test analysis data together with the results. Running analytical queries against the production database is very slow and can interfere with its normal operation.
WE WILL NOT COVER: requirements elicitation.
25% faster with Snappy compression (5.5X compression ratio); Impala 80% faster than Oracle.
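The Snappy figure above corresponds to a documented Impala query option; enabling it when loading the pre-joined tables could look like this sketch (the target table name comes from the architecture slide; the SELECT is elided):

```sql
-- Sketch: compress Parquet data written by Impala with Snappy.
SET COMPRESSION_CODEC=snappy;

INSERT OVERWRITE TABLE stay_order_results_yearmonth
PARTITION (year_month)
SELECT ... ;  -- pre-join query elided; partition key must come last
```

The codec applies only to data written while the option is set, so it is typically issued once per ETL session before the load statements.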