White Paper
Storage for Big Data
and Analytics Challenges
Abstract
Big Data and analytics workloads represent a new frontier for organizations. Data is being
collected from sources that did not exist 10 years ago: mobile phone data, machine-generated
data, and website interaction data are all being collected and analyzed. At the same time,
IT budgets are under downward pressure while Big Data footprints keep growing,
posing a huge storage challenge.
This paper describes the demands that Big Data applications place on storage
systems and shows how choosing the correct storage infrastructure can streamline and consolidate
Big Data and analytics applications without breaking the bank.
Introduction: Big Data Stresses Storage Infrastructure
Big Data applications have caused an explosion in the amount of data that an
organization needs to keep online and analyze. This has caused the cost of storage as a
percent of the overall IT budget to explode. A study recently conducted among Big Data and
analytics-driven organizations identified the top five use cases driving this data explosion:
	 1. Enhanced customer service and support
	 2. Digital security, intrusion detection, fraud detection and prevention
	 3. Operational analysis
	 4. Big Data exploration
	 5. Data warehouse augmentation
Enhanced customer service is always on the minds of organizations, both large and small.
“How can I help improve the relationship with my customer? I know my customer will choose
to buy from me, and buy more often, if I cultivate and enhance the relationship.” Typically
this is preference data gathered from various sources, online navigation, and search criteria.
Collected through the lifecycle of the customer, this data is mined and written alongside
customer records.
Digital security and intrusion detection is very important to customers. This data is
collected and analyzed in real time, and is typically machine generated. The analytic results
must be returned immediately for this activity to be relevant. This requires fast storage with
lots of capacity, as machine-generated and sensor data can consume large capacities.
Operational analysis involves collecting data, often sensor-based and machine-generated, and
using it to identify areas of operational improvement and conduct fault isolation and
break-fix analysis. Manufacturing firms collect up-to-the-second data about
robotic activities on their shop floor and want to know not just the status, but how they can
improve the process. Like intrusion detection, this data is generated and analyzed in real
time, and the results must be stored and sent back up the chain to be actionable. Unlike
intrusion detection, all data is interesting and shows machine and process trends, which
can be of use later.
Big Data exploration: How do you know what your Big Data is until you discover what you are
collecting, what you are not, and what is missing? Normally this is done by collecting
more and more data.
Data Warehouse augmentation: “How do I take my existing analytics data, typically in a
Warehouse or Mart form, and augment the data feeds from outside sources to improve
accuracy, reduce execution time, and give me the answers I need without reinventing the
wheel?” Data warehouse adoption is becoming widespread, even for smaller organizations,
as transactional data analysis is becoming a requirement for any organization at any level.
All of these use cases require more storage and more compute power. Big Data is now
considered production data, so availability, recoverability and performance are just as
important as for the transactional systems within the organization. And as stated before,
the trend is for IT budgets to get smaller, not bigger. These diametrically opposed forces
are creating a change in the storage industry: how can you do more with less without
compromising system and application reliability, efficiency, or performance?
The demands that Big Data and analytic workloads place on enterprise storage can be
summed up as follows:
	 • Must have excellent performance
	 • Must have extremely high density
	 • Must have excellent uptime, high reliability
	 • Must be easy to use, easy to manage, easy to provision
	 • Must have attractively low total cost of ownership
This is where a brand-new storage architecture such as INFINIDAT’s InfiniBox™ can help.
High Performance
Large data sets, heavyweight analytics applications, and time-sensitive demands for results
make high performance of the underlying storage a key criterion. In InfiniBox, maximum
performance is achieved with no tuning or optimization. InfiniBox uses standard off-the-shelf
hardware components (CPU/memory/HDDs/SSDs) wrapped in sophisticated
storage software to extract the maximum performance from the 480 Near-Line SAS drives used in
the InfiniBox architecture. One of the key elements developed in the core system code
is the ability to analyze real application profiles and surgically define cache pre-fetch and
de-stage algorithms. The system design specifically targets real-life profiles and provides
optimum performance under those conditions. This capability is at the core of the InfiniBox
architecture.
Large Data Sets
Large data sets pose a unique and daunting challenge to enterprise storage arrays by
presenting an I/O profile that is unpredictable and often overwhelms the storage frame. This
results in high latencies, which increase the run time of analytic workloads. Some analytic
activities are very latency sensitive, and in many cases will affect the end-user population that
the application supports. Many of these workloads will overwhelm storage platforms with
limited cache sizes. InfiniBox contains up to 2.3TB of high-speed L1 DIMM cache and 23TB
of L2 SSD cache used to improve cache hits and reduce latency.
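The effect of a multi-tier cache on average latency can be sketched with a simple weighted model. The tier latencies and hit rates below are hypothetical illustration values, not measured InfiniBox figures:

```python
def effective_latency_us(l1_hit, l2_hit, l1_us=0.1, l2_us=100.0, disk_us=5000.0):
    """Average read service time given per-tier hit rates.

    l1_hit: fraction of reads served from DRAM cache
    l2_hit: fraction of the *remaining* reads served from SSD cache
    Tier latencies are illustrative, not vendor specifications.
    """
    miss1 = 1.0 - l1_hit
    return (l1_hit * l1_us
            + miss1 * l2_hit * l2_us
            + miss1 * (1.0 - l2_hit) * disk_us)

# Larger caches raise hit rates, which collapses the average latency:
cold = effective_latency_us(l1_hit=0.50, l2_hit=0.50)   # small cache
warm = effective_latency_us(l1_hit=0.90, l2_hit=0.80)   # large cache
```

With these illustrative numbers, the larger cache cuts average latency by more than an order of magnitude, which is why cache-starved platforms struggle with unpredictable Big Data profiles.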
Storage Requirements for Big Data and Analytics:
	 • Consistent Performance
	 • High Capacity
	 • Utmost Uptime and Reliability
	 • Easy to Manage and Provision
	 • Attractive Cost of Ownership
Big Data and Analytics Use Cases:
	 • Exploration
	 • Enhanced Customer Service
	 • Security and Fraud Protection
	 • Operational Analysis
	 • DWH Augmentation
Many of these characteristics can be seen occurring simultaneously, while others are driven
by specific activities such as backups or data load/ETL. InfiniBox thrives on supporting a wide
range of I/O types, all at the same time. The data architecture virtualizes the storage for each
volume by populating each of the 480 spindles in the InfiniBox frame, all in parallel, using a
sophisticated dispersed data-parity layout.
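A dispersed layout of this kind can be sketched with deterministic pseudo-random placement. The stripe width and hashing scheme below are toy assumptions for illustration, not the actual InfiniBox geometry; the point is that every volume's chunks end up spread across all 480 drives:

```python
import hashlib
import random

NUM_DRIVES = 480   # drives per frame, per the text
STRIPE = 16        # stripe width: illustrative, not the real geometry

def stripe_members(volume_id: str, chunk: int) -> list:
    """Deterministically pick which drives hold one chunk's stripe.

    A toy stand-in for a dispersed layout: hashing the (volume, chunk)
    pair seeds a shuffle, so chunks land uniformly across all drives.
    """
    digest = hashlib.sha256(f"{volume_id}:{chunk}".encode()).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    return rng.sample(range(NUM_DRIVES), STRIPE)

# Over many chunks, a single volume touches every spindle in the frame,
# so every volume's I/O is served by all drives in parallel:
touched = set()
for chunk in range(2000):
    touched.update(stripe_members("vol-42", chunk))
print(len(touched))  # → 480
```

The design consequence is that no single volume can create a hot spindle: load from any workload mix is spread frame-wide.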
In addition, InfiniBox utilizes advanced capabilities to improve writes. Using a unique and
patented multi-modal log write mechanism, INFINIDAT significantly improves the efficiency
of write I/O de-staged from cache. This is especially important for the Data Acquisition and ETL
phases of analytic workloads.
Block Size Management Matters
Many analytic workloads can change the I/O profile on the fly. But in general, the vast
majority of Big Data and Analytic applications use large block I/O, loading data in from
storage, reducing, sorting, comparing, then writing out aggregate data. Large blocks have
historically given traditional storage platforms problems because most storage environments
are not designed with large-block support in mind.
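Why block size matters can be shown with a simple throughput model: every I/O pays a fixed per-operation overhead plus transfer time, so small blocks waste most of their budget on overhead. The overhead and bandwidth figures are hypothetical, chosen only to illustrate the shape of the curve:

```python
def throughput_mb_s(block_kb, overhead_ms=0.5, media_mb_s=2000.0):
    """Modeled throughput: each I/O pays a fixed software/latency
    overhead plus media transfer time. Numbers are illustrative,
    not vendor specifications."""
    xfer_ms = (block_kb / 1024.0) / media_mb_s * 1000.0
    ios_per_s = 1000.0 / (overhead_ms + xfer_ms)
    return ios_per_s * block_kb / 1024.0

small = throughput_mb_s(4)      # 4 KB blocks: overhead dominates
large = throughput_mb_s(1024)   # 1 MB blocks: approaches media bandwidth
```

Under this model the 1 MB blocks deliver over a hundred times the throughput of 4 KB blocks, which is why platforms tuned only for small-block transactional I/O struggle with analytic workloads.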
We have seen many analytic environments exhibit I/O profiles containing all of the following
characteristics:

Diverse I/O Profiles and Patterns Used in Big Data

	 Data Acquisition              sequential write
	 Profiling/Hygiene (cleanup)   sequential read/sequential write
	 Loading                       sequential read/sequential write
	 Query/Reporting               random read/sequential read
	 Building Interim Data Mart    random read/sequential write
	 Analytic Modeling             sequential read/random read
High Density
INFINIDAT can configure a system with over 2.7PB of usable storage in a
single 19-inch rack. The InfiniBox storage system is a modern, fully symmetric grid, all-
active controller (node) system with an advanced multi-layer caching architecture. The data
architecture encompasses a double parity (wide-stripe) data distribution model. This model
uses a unique combination of random data distribution and parity protection. This ensures
maximum data availability while minimizing data footprint. Each and every volume created on
a single InfiniBox frame will store small pieces of data on each of the 480 drives in the frame.
InfiniBox usable storage per frame is the highest in the storage industry.
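The capacity efficiency of a wide-stripe double-parity layout follows from simple arithmetic: two parity members cost a smaller fraction of a wide stripe than of a narrow one. The drive size and stripe widths below are hypothetical, used only to show the trade-off:

```python
def usable_tb(drives, drive_tb, stripe_width, parity=2):
    """Usable capacity under an (N-parity)+parity wide-stripe layout.
    Drive count from the text; drive size and stripe widths are
    hypothetical illustration values."""
    data_fraction = (stripe_width - parity) / stripe_width
    return drives * drive_tb * data_fraction

# Two parity members out of a 16-wide stripe cost 12.5% of raw capacity,
# versus 33% for a narrow 4+2 stripe -- hence the high usable density:
wide = usable_tb(480, 6.0, stripe_width=16)
narrow = usable_tb(480, 6.0, stripe_width=6)
```

Wide striping is how double-parity protection and high usable density coexist in the same frame.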
High Availability and Reliability
Keeping analytics available is critical for every storage system. The InfiniBox architecture
provides a robust, highly available storage environment with 99.99999% uptime. That
equates to approximately 3 seconds of downtime a year! Our customers report no loss of data,
even upon multiple disk failures. InfiniBox offers end-to-end business continuity features,
including asynchronous remote mirroring and snapshots. Using snapshots, recovery of
a database can be reduced to the amount of time it takes to map the volumes to hosts,
minutes instead of hours using a more traditional backup and recovery process.
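The downtime implied by an availability figure is a one-line calculation, shown here for the seven-nines number quoted above:

```python
def downtime_seconds_per_year(availability):
    """Annual downtime implied by an availability fraction,
    using an average year of 365.25 days."""
    return (1.0 - availability) * 365.25 * 24 * 3600

print(round(downtime_seconds_per_year(0.9999999), 2))  # → 3.16
```

Seven nines therefore works out to roughly 3 seconds of downtime per year.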
Easy to Use, Automated Provisioning
and Management
The InfiniBox architecture, along with the elegant simplicity of its web-based GUI and built-in
Command Line Interface, allows for easy, fast deployment and management of the storage
system. The amount of time saved in performing traditional storage administration tasks is
huge. Also, because of the InfiniBox open architecture and its support for a RESTful
API, platforms such as OpenStack and Docker can perform storage administration tasks
at the application level, without the need to use the InfiniBox GUI.
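The shape of such application-level provisioning can be sketched as a REST call an orchestration layer might issue. The endpoint URL and payload field names below are hypothetical placeholders, not the actual InfiniBox API; consult the real API documentation for the correct names:

```python
import json

# Hypothetical endpoint, for illustration only.
API_URL = "https://infinibox.example.com/api/rest/volumes"

def make_volume_request(name, pool_id, size_gb):
    """Build the (url, body) for a hypothetical volume-create call that
    an orchestration layer (e.g. an OpenStack driver) might issue
    instead of a human using the GUI. Field names are placeholders."""
    body = json.dumps({
        "name": name,
        "pool_id": pool_id,
        "size": size_gb * 2**30,   # size in bytes
    })
    return API_URL, body

url, body = make_volume_request("hadoop-dn-01", pool_id=7, size_gb=512)
```

In practice the orchestration platform sends this request with its stored credentials, so provisioning becomes a scripted step in application deployment rather than a manual storage-admin task.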
InfiniBox provides a management system that can isolate storage pools and volumes to
specific users. Multi-tenancy features are supported so that application users, such as those
deployed in a private cloud environment, can see and manage the storage that has been
granted to that user community.
www.infinidat.com | info@infinidat.com
WP-BIGDATA-160705-US | © INFINIDAT 2016
Very Low Total Cost of Ownership
High performance, extreme availability, highest data density and ease of use all point to an
unmatched TCO. This is important for environments where there is a need to consolidate
mission-critical databases into smaller and smaller physical footprints.
Hadoop and InfiniBox
One of the many Big Data use cases is support for Hadoop clusters. Today, the default
deployment model for Hadoop clusters is to build a cluster from many inexpensive nodes.
Each node has equal amounts of compute, memory, and dedicated storage. Most customers
launching their first Hadoop environment begin by finding the right
mixture of resources for this node-level design. Then they string a number of
these nodes together to create a large compute environment for their favorite MapReduce
application. The problem most customers run into is that they run out of storage long before
they run short of compute cycles in the cluster. The only option available for customers using
dedicated storage is to keep adding more nodes to the cluster. This is fine, except that it
also adds compute power that may not be needed, and in many cases this is
where the dedicated storage model breaks down.
By adopting InfiniBox block-based SAN storage instead of dedicated hard drives per node,
you are no longer limited by how much storage each node is capable of providing. You can
dynamically add more LUNs, or add more space per LUN, for each cluster node. There is no
need to add compute power until it is actually required. The level of Hadoop data redundancy
can be reduced, as the data will be fully protected by the array's 99.99999% uptime.
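The sizing argument can be made concrete: a direct-attached cluster must be the larger of its storage-bound and compute-bound sizes, and HDFS replication inflates the storage side. All the figures below are hypothetical illustration values:

```python
import math

def nodes_needed(data_tb, cores_needed, tb_per_node, cores_per_node, replication):
    """Nodes a Hadoop cluster needs: the max of its storage-bound and
    compute-bound sizes. All inputs are illustrative, not sizing advice."""
    storage_bound = math.ceil(data_tb * replication / tb_per_node)
    compute_bound = math.ceil(cores_needed / cores_per_node)
    return max(storage_bound, compute_bound), storage_bound, compute_bound

# DAS with 3x HDFS replication: storage forces far more nodes than compute needs.
das = nodes_needed(data_tb=500, cores_needed=400,
                   tb_per_node=24, cores_per_node=32, replication=3)

# Shared SAN with array-level protection and reduced replication:
# per-node capacity is no longer the constraint, so the node count
# follows compute demand instead.
san = nodes_needed(data_tb=500, cores_needed=400,
                   tb_per_node=240, cores_per_node=32, replication=1)
```

With these assumed numbers the DAS cluster needs 63 nodes to hold the data but only 13 for compute; decoupling storage lets the cluster shrink to its compute-bound size.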
Conclusions
InfiniBox bridges the gap between high performance and high capacity for Big Data
applications. InfiniBox allows an organization implementing Big Data and Analytics projects to
truly attain its business goals: cost reduction, continual and deep capacity scaling, and simple
and effective management — and without any compromises in performance or reliability.
All of this to effectively and efficiently support Big Data applications at a disruptive price point.

Contenu connexe

Tendances

Hds ucp sap hana infographic v6[1]
Hds ucp  sap hana infographic v6[1]Hds ucp  sap hana infographic v6[1]
Hds ucp sap hana infographic v6[1]
Barbara Götz
 
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data Warehousing
Eyad Manna
 
Dw hk-white paper
Dw hk-white paperDw hk-white paper
Dw hk-white paper
july12jana
 

Tendances (20)

Tools for data warehousing
Tools  for data warehousingTools  for data warehousing
Tools for data warehousing
 
Data warehousing - Dr. Radhika Kotecha
Data warehousing - Dr. Radhika KotechaData warehousing - Dr. Radhika Kotecha
Data warehousing - Dr. Radhika Kotecha
 
Data warehousing
Data warehousingData warehousing
Data warehousing
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 
White paper making an-operational_data_store_(ods)_the_center_of_your_data_...
White paper   making an-operational_data_store_(ods)_the_center_of_your_data_...White paper   making an-operational_data_store_(ods)_the_center_of_your_data_...
White paper making an-operational_data_store_(ods)_the_center_of_your_data_...
 
Smarter Management for Your Data Growth
Smarter Management for Your Data GrowthSmarter Management for Your Data Growth
Smarter Management for Your Data Growth
 
Project Presentation on Data WareHouse
Project Presentation on Data WareHouseProject Presentation on Data WareHouse
Project Presentation on Data WareHouse
 
Ppt
PptPpt
Ppt
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data Warehousing
 
Hds ucp sap hana infographic v6[1]
Hds ucp  sap hana infographic v6[1]Hds ucp  sap hana infographic v6[1]
Hds ucp sap hana infographic v6[1]
 
Neelesh it assignment
Neelesh it assignmentNeelesh it assignment
Neelesh it assignment
 
Benefits of a data warehouse presentation by Being topper
Benefits of a data warehouse presentation by Being topperBenefits of a data warehouse presentation by Being topper
Benefits of a data warehouse presentation by Being topper
 
DATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MININGDATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MINING
 
Data Warehousing
Data WarehousingData Warehousing
Data Warehousing
 
Data mining and data warehousing
Data mining and data warehousingData mining and data warehousing
Data mining and data warehousing
 
Data Wearhouse (Dw) concepts
Data Wearhouse (Dw)  conceptsData Wearhouse (Dw)  concepts
Data Wearhouse (Dw) concepts
 
Data warehousing Demo PPTS | Over View | Introduction
Data warehousing Demo PPTS | Over View | Introduction Data warehousing Demo PPTS | Over View | Introduction
Data warehousing Demo PPTS | Over View | Introduction
 
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data Warehousing
 
Dw hk-white paper
Dw hk-white paperDw hk-white paper
Dw hk-white paper
 

Similaire à Enterprise Storage Solutions for Overcoming Big Data and Analytics Challenges

Solve the Top 6 Enterprise Storage Issues White Paper
Solve the Top 6 Enterprise Storage Issues White PaperSolve the Top 6 Enterprise Storage Issues White Paper
Solve the Top 6 Enterprise Storage Issues White Paper
Hitachi Vantara
 
TSS03135-USEN-00_HR
TSS03135-USEN-00_HRTSS03135-USEN-00_HR
TSS03135-USEN-00_HR
Ed Ahl
 
Accenture hana-in-memory-pov
Accenture hana-in-memory-povAccenture hana-in-memory-pov
Accenture hana-in-memory-pov
K Thomas
 
HitVirtualized Tiered Storage Solution Profile
HitVirtualized Tiered Storage Solution ProfileHitVirtualized Tiered Storage Solution Profile
HitVirtualized Tiered Storage Solution Profile
Hitachi Vantara
 
Data warehousing has quickly evolved into a unique and popular busin.pdf
Data warehousing has quickly evolved into a unique and popular busin.pdfData warehousing has quickly evolved into a unique and popular busin.pdf
Data warehousing has quickly evolved into a unique and popular busin.pdf
apleather
 
Data warehouse-optimization-with-hadoop-informatica-cloudera
Data warehouse-optimization-with-hadoop-informatica-clouderaData warehouse-optimization-with-hadoop-informatica-cloudera
Data warehouse-optimization-with-hadoop-informatica-cloudera
Jyrki Määttä
 

Similaire à Enterprise Storage Solutions for Overcoming Big Data and Analytics Challenges (20)

Solve the Top 6 Enterprise Storage Issues White Paper
Solve the Top 6 Enterprise Storage Issues White PaperSolve the Top 6 Enterprise Storage Issues White Paper
Solve the Top 6 Enterprise Storage Issues White Paper
 
In the Age of Unstructured Data, Enterprise-Class Unified Storage Gives IT a ...
In the Age of Unstructured Data, Enterprise-Class Unified Storage Gives IT a ...In the Age of Unstructured Data, Enterprise-Class Unified Storage Gives IT a ...
In the Age of Unstructured Data, Enterprise-Class Unified Storage Gives IT a ...
 
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
 
TSS03135-USEN-00_HR
TSS03135-USEN-00_HRTSS03135-USEN-00_HR
TSS03135-USEN-00_HR
 
Chip ICT | Hgst storage brochure
Chip ICT | Hgst storage brochureChip ICT | Hgst storage brochure
Chip ICT | Hgst storage brochure
 
Accenture hana-in-memory-pov
Accenture hana-in-memory-povAccenture hana-in-memory-pov
Accenture hana-in-memory-pov
 
HitVirtualized Tiered Storage Solution Profile
HitVirtualized Tiered Storage Solution ProfileHitVirtualized Tiered Storage Solution Profile
HitVirtualized Tiered Storage Solution Profile
 
Business Intelligence Solution on Windows Azure
Business Intelligence Solution on Windows AzureBusiness Intelligence Solution on Windows Azure
Business Intelligence Solution on Windows Azure
 
Machine Data Analytics
Machine Data AnalyticsMachine Data Analytics
Machine Data Analytics
 
Architecting a Modern Data Warehouse: Enterprise Must-Haves
Architecting a Modern Data Warehouse: Enterprise Must-HavesArchitecting a Modern Data Warehouse: Enterprise Must-Haves
Architecting a Modern Data Warehouse: Enterprise Must-Haves
 
Data warehousing has quickly evolved into a unique and popular busin.pdf
Data warehousing has quickly evolved into a unique and popular busin.pdfData warehousing has quickly evolved into a unique and popular busin.pdf
Data warehousing has quickly evolved into a unique and popular busin.pdf
 
Positioning IBM Flex System 16 Gb Fibre Channel Fabric for Storage-Intensive ...
Positioning IBM Flex System 16 Gb Fibre Channel Fabric for Storage-Intensive ...Positioning IBM Flex System 16 Gb Fibre Channel Fabric for Storage-Intensive ...
Positioning IBM Flex System 16 Gb Fibre Channel Fabric for Storage-Intensive ...
 
Stream Meets Batch for Smarter Analytics- Impetus White Paper
Stream Meets Batch for Smarter Analytics- Impetus White PaperStream Meets Batch for Smarter Analytics- Impetus White Paper
Stream Meets Batch for Smarter Analytics- Impetus White Paper
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 
Data warehouse-optimization-with-hadoop-informatica-cloudera
Data warehouse-optimization-with-hadoop-informatica-clouderaData warehouse-optimization-with-hadoop-informatica-cloudera
Data warehouse-optimization-with-hadoop-informatica-cloudera
 
Backing Up Mountains of Data to Disk
Backing Up Mountains of Data to DiskBacking Up Mountains of Data to Disk
Backing Up Mountains of Data to Disk
 
Netmagic the-storage-matrix
Netmagic the-storage-matrixNetmagic the-storage-matrix
Netmagic the-storage-matrix
 
The storage matrix netmagic
The storage matrix   netmagicThe storage matrix   netmagic
The storage matrix netmagic
 
Big data architecture
Big data architectureBig data architecture
Big data architecture
 
Flashelastic
FlashelasticFlashelastic
Flashelastic
 

Dernier

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Dernier (20)

CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 

Enterprise Storage Solutions for Overcoming Big Data and Analytics Challenges

  • 1. White Paper Storage for Big Data and Analytics Challenges
  • 2. 2 Big Data and analytics workloads represent a new frontier for organizations. Data is being collected from sources that did not exist 10 years ago. Mobile phone data, machine- generated data, and web site interaction data are all being collected and analyzed. In addition, as IT budgets are already being pressured down, Big Data footprints are getting larger and posing a huge storage challenge. This paper provides information on the issues that Big Data applications pose for storage systems and how choosing the correct storage infrastructure can streamline and consolidate Big Data and analytics applications without breaking the bank. Abstract
  • 3. 3 Introduction: Big Data Stresses Storage Infrastructure Big Data applications have caused an explosion in the amount of data that an organization needs to keep online and analyze. This has caused the cost of storage as a percent of the overall IT budget to explode. A study recently performed among Big Data and analytics-driven organizations discovered these top 5 use cases for this data explosion: 1. Enhanced customer service and support 2. Digital security, intrusion detection, fraud detection and prevention 3. Operational analysis 4. Big Data exploration 5. Data warehouse augmentation Enhanced customer service is always on the minds of organizations, both large and small. “How can I help improve the relationship with my customer? I know my customer will choose to buy from me, and buy more often, if I cultivate and enhance the relationship.” Typically this is preference data gathered from various sources, online navigation, and search criteria. Collected through the lifecycle of the customer, this data is mined and written alongside customer records. Digital security and intrusion detection is very important to customers. This data is collected and analyzed in real time, and is typically machine generated. The analytic results must be returned immediately for this activity to be relevant. This requires fast storage with lots of capacity, as machine-generated and sensor data can consume large capacities. Operational analysis involves collecting data, many times sensor-based from other machines, and using that data to identify areas of operational improvement and conduct fault isolation and break-fix analysis. Manufacturing firms collect up to the second data about robotic activities on their shop floor and want to know not just the status, but how they can improve the process. Like intrusion detection, this data is generated and analyzed in real time, and the results must be stored and sent back up the chain to be actionable. 
Unlike intrusion detection, all data is interesting and shows machine and process trends, which can be of use later. Big Data exploration: How do you know what Big Data is until you find out what you are collecting, what you are not, and identify what is missing. Normally this is done by collecting more and more data.
  • 4. 4 Data Warehouse augmentation: “How do I take my existing analytics data, typically in a Warehouse or Mart form, and augment the data feeds from outside sources to improve accuracy, reduce execution time, and give me the answers I need without reinventing the wheel?” Data warehouse adoption is becoming widespread, even for smaller organizations, as transactional data analysis is becoming a requirement for any organization at any level. All of these use cases require more storage and more compute power. Big Data is now considered production data, so availability, recoverability and performance are just as important as for the transactional systems within the organization. And as stated before, the trend is for IT budgets to get smaller, not bigger. These diametrically opposed forces are creating a change in the storage industry. How can you do more with less while taking into account that you also don’t want to compromise on system and application reliability, efficiency or performance. The demands that Big Data and analytic workloads place on enterprise storage can be summed up as follows: • Must have excellent performance • Must have extremely high density • Must have excellent uptime, high reliability • Must be easy to use, easy to manage, easy to provision • Must have attractively low total cost of ownership This is where a brand-new storage architecture such as INFINIDAT’s InfiniBox™ can help. High Performance Large data sets, heavyweight analytics applications, and time-sensitive demands for results make high performance of the underlying storage a key criteria. In InfiniBox, maximum performance is achieved with no tuning or optimization. InfiniBox uses standard “off the shelf” hardware components (CPU/Memory/HDDs/SSDs) wrapped in sophisticated storage to extract the maximum performance from the 480 Near-Line SAS drives used in the InfiniBox architecture. 
One of the key elements developed in the core system code is the ability to analyze real application profiles and surgically define cache pre-fetch and de-stage algorithms. The system design specifically targets real-life profiles and provides optimum performance under those conditions. This capability is at the core of the InfiniBox architecture.
  • 5. 5 Large Data Sets Large data sets pose a unique and daunting challenge to enterprise storage arrays by providing an I/O profile that is unpredictable, and often overwhelms the storage frame. This results in high latencies, which increase the run time of analytic workloads. Some analytic activities are very latency sensitive, and in many cases will affect the end-user population that the application supports. Many of these workloads will overwhelm storage platforms with limited cache sizes. InfiniBox contains up to 2.3TB of high-speed L1 DIMM cache and 23TB of L2 SSD cache used to improve cache hits and reduce latency. • Consistent Performance • High Capacity • Utmost Uptime and Reliability • Easy to Manage and Provision • Attractive Cost of Ownership Exploration Enhanced Customer Service Security and Fraud Protection Operational Analysis DWH Augmentation Storage Requirements for Big Data and Analytics Big Data and Analytics Use Cases
We have seen many analytic environments exhibit I/O profiles containing all of the following characteristics:

Diverse I/O Profiles and Patterns Used in Big Data

	Data Acquisition: sequential write
	Profiling/Hygiene (cleanup): sequential read/sequential write
	Loading: sequential read/sequential write
	Query/Reporting: random read/sequential read
	Building Interim Data Mart: random read/sequential write
	Analytic Modeling: sequential read/random read

Many of these characteristics can be seen occurring simultaneously, while others are driven by specific activities such as backups or data load/ETL. InfiniBox thrives on supporting a wide range of I/O types, all at the same time. The data architecture virtualizes the storage for each volume by populating each of the 480 spindles in the InfiniBox frame, all in parallel, using a sophisticated dispersed-parity data layout. In addition, InfiniBox utilizes very advanced capabilities to improve writes. Using a unique and patented multi-modal log write mechanism, INFINIDAT significantly improves the efficiency of write I/O de-staged from cache. This is very important for the Data Acquisition and ETL phases of this example.

Block Size Management Matters

Many analytic workloads can change the I/O profile on the fly, but in general the vast majority of Big Data and analytic applications use large-block I/O: loading data in from storage; reducing, sorting, and comparing it; then writing out aggregate data. Large blocks have historically given traditional storage platforms problems, because most storage environments are not designed with large-block support in mind.
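The pressure that large blocks put on a storage platform is easy to see with simple arithmetic: at the same I/O rate, a 1MB block stream moves 128 times more data than an 8KB one, shifting the bottleneck from IOPS to bandwidth. A quick illustration with generic numbers (not InfiniBox measurements):

```python
def throughput_mb_s(iops, block_size_kb):
    """Sustained throughput implied by an IOPS rate at a given block size."""
    return iops * block_size_kb / 1024.0

# The same 10,000 IOPS at two block sizes:
small = throughput_mb_s(10_000, 8)      # OLTP-style 8 KB blocks -> 78.125 MB/s
large = throughput_mb_s(10_000, 1024)   # analytics-style 1 MB blocks -> 10,000 MB/s
# Large-block workloads stress backend bandwidth, not IOPS, which is why
# arrays tuned for small-block transactional I/O struggle with analytics.
```
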
High Density

INFINIDAT can configure a system with over 2.7PB of usable storage in a single 19-inch rack. The InfiniBox storage system is a modern, fully symmetric grid, all-active controller (node) system with an advanced multi-layer caching architecture. The data architecture encompasses a double-parity (wide-stripe) data distribution model. This model uses a unique combination of random data distribution and parity protection, ensuring maximum data availability while minimizing data footprint. Each and every volume created on a single InfiniBox frame stores small pieces of data on each of the 480 drives in the frame. InfiniBox usable storage per frame is the highest in the storage industry.

High Availability and Reliability

Keeping analytics available is critical for every storage system. The InfiniBox architecture provides a robust, highly available storage environment, delivering 99.99999% uptime. That equates to roughly 3 seconds of downtime a year. Our customers report no loss of data, even upon multiple disk failures. InfiniBox offers end-to-end business continuity features, including asynchronous remote mirroring and snapshots. Using snapshots, recovery of a database can be reduced to the amount of time it takes to map the volumes to hosts: minutes, instead of the hours a more traditional backup and recovery process would take.

Easy to Use, Automated Provisioning and Management

The InfiniBox architecture, along with the elegant simplicity of its web-based GUI and built-in command line interface, allows for easy, fast deployment and management of the storage system. The amount of time saved in performing traditional storage administration tasks is huge. And because of the InfiniBox open architecture and strong support for a RESTful API, platforms such as OpenStack and Docker allow storage administration tasks to be performed at the application level, without the need to use the InfiniBox GUI.
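The seven-nines uptime figure quoted above translates into seconds of downtime per year with simple arithmetic:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600   # 31,536,000 seconds in a non-leap year
availability = 0.9999999             # "seven nines"
downtime_s = (1 - availability) * SECONDS_PER_YEAR
# downtime_s works out to about 3.15 seconds of downtime per year
```
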
InfiniBox provides a management system that can isolate storage pools and volumes to specific users. Multi-tenancy features are supported so that application users, such as those deployed in a private cloud environment, can see and manage the storage that has been granted to that user community.
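Because volumes and pools can be driven through a RESTful API, provisioning can be scripted end to end. The sketch below uses only the Python standard library; the endpoint path, JSON field names, and authentication scheme are illustrative placeholders and not the documented InfiniBox API, which should be taken from the vendor's REST reference:

```python
import json
import urllib.request

def volume_payload(name, pool_id, size_gb):
    """Build the JSON body for a volume-create call (field names are
    hypothetical placeholders, not the real API schema)."""
    return {"name": name, "pool_id": pool_id, "size": size_gb * 1024**3}

def create_volume(base_url, token, pool_id, name, size_gb):
    """POST a volume-create request; the endpoint path is a placeholder."""
    req = urllib.request.Request(
        f"{base_url}/api/rest/volumes",
        data=json.dumps(volume_payload(name, pool_id, size_gb)).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Wrapped in a script like this, storage provisioning becomes a step in an application deployment pipeline rather than a manual GUI task.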
www.infinidat.com | info@infinidat.com WP-BIGDATA-160705-US | © INFINIDAT 2016

Very Low Total Cost of Ownership

High performance, extreme availability, the highest data density, and ease of use all add up to an unmatched TCO. This is important for environments where mission-critical databases need to be consolidated into smaller and smaller physical footprints.

Hadoop and InfiniBox

One of the many Big Data use cases is support for Hadoop clusters. Today, the default deployment model for Hadoop clusters is to build a cluster from many inexpensive nodes, each with equal amounts of compute, memory, and dedicated storage. In the initial design phase, most customers launching their first Hadoop environment look for the right mixture of resources to arrive at this node-level design, then string a number of these nodes together to create a large compute environment for their favorite MapReduce application. The problem most customers run into is that they run out of storage long before they run short of compute cycles in the cluster. The only option available to customers using dedicated storage is to keep adding nodes to the cluster. This works, except that it also adds compute power that may not be needed; in many cases this is where the dedicated storage model breaks down. By adopting InfiniBox block-based SAN storage instead of dedicated hard drives per node, you are no longer limited by how much storage each node can provide. You can dynamically add more LUNs, or add more space per LUN, for each cluster node, and compute power is added only when it is actually needed. The level of Hadoop data redundancy can be reduced, as the data will be fully protected with 99.99999% uptime.

Conclusions

InfiniBox bridges the gap between high performance and high capacity for Big Data applications.
InfiniBox allows an organization implementing Big Data and analytics projects to truly attain its business goals: cost reduction, continual and deep capacity scaling, and simple and effective management, without any compromises in performance or reliability. All of this to effectively and efficiently support Big Data applications at a disruptive price point.
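A practical footnote to the Hadoop discussion above: lowering HDFS-level redundancy when the array already protects the data is a one-line change in standard Hadoop configuration. The `dfs.replication` property in `hdfs-site.xml` is standard Hadoop; the value of 2 shown here is an illustrative assumption, as the right setting depends on the deployment:

```xml
<!-- hdfs-site.xml: lower HDFS replication when the SAN protects the data.
     The Hadoop default is 3; the value 2 below is an illustrative choice. -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```
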