Webinar: Big Data Architectures – Beyond the Elephant Ride
June 29, 2012
Question and Answer Session

Q1. What are the differences between Storm and ESBs like Mule?

Storm and ESBs (like Mule) solve different problems, so they are not directly
comparable.

The motivation behind ESBs is to standardize and structure loosely coupled
software components so that they can be independently deployed and run in a
disparate environment. Communication is through message passing, and with an
ESB, heterogeneous components are able to interact with each other.

Storm is for processing large volumes of data in real time. When we use Storm, we do
not attempt to establish any form of common structure for different components to
collaborate. Rather, Storm enables huge amounts of data to be processed through a
chain of processing units.
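
To make the "chain of processing units" idea concrete, here is a minimal sketch in
plain Python. It does not use the Storm API; the function names and the word-count
task are assumptions chosen only for illustration. Each unit consumes records from
the previous unit one at a time and emits results downstream.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def sentence_source(sentences: Iterable[str]) -> Iterator[str]:
    """Source unit: emits one record at a time, as a live stream would."""
    for sentence in sentences:
        yield sentence


def split_words(stream: Iterable[str]) -> Iterator[str]:
    """Processing unit: turns each sentence into a stream of lowercase words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def running_count(stream: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Processing unit: emits an updated count every time a word arrives."""
    counts: Counter = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


def run_pipeline(sentences: Iterable[str]) -> Dict[str, int]:
    """Wire the units into a chain and keep the latest count seen per word."""
    latest: Dict[str, int] = {}
    chain = running_count(split_words(sentence_source(sentences)))
    for word, count in chain:
        latest[word] = count
    return latest


if __name__ == "__main__":
    print(run_pipeline(["storm processes streams", "storm scales"]))
```

In a real deployment each unit would run on many machines in parallel; the sketch
only shows how records flow through the chain.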

So when you have large amounts of data that you want to process in real time, we
advise you to use Storm. On the other hand, when you have numerous components
and you want to write a layer that will enable their interaction, use an ESB.

In fact, Storm and an ESB can in theory be integrated so that Storm handles the
streaming analytics part while the ESB caters to service orchestration and
integration.

Q2. What is the advantage of Giraph and Pregel over more common Graph DBs like
Neo4j or InfiniteGraph?

Giraph is an open-source implementation of Pregel meant for large datasets. It
provides a large-scale graph processing infrastructure over Hadoop. Some of the
advantages I’d like to highlight include:

   1.   Distributed and developed specifically for large-scale graph processing
   2.   Bulk Synchronous Parallel (BSP) as the execution model
   3.   Fault tolerance through checkpointing
   4.   Giraph runs on standard Hadoop infrastructure

© 2012 Impetus Technologies                                                     Page 1
   5.   Computation is executed in memory
   6.   It can be part of a pipeline in the form of a job
   7.   Vertex-centric API

Please also see the answer to Question No. 9.
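
The BSP execution model and the vertex-centric API listed above can be sketched as
follows. This is a single-machine illustration in plain Python, not the Giraph API;
the function name, the graph layout, and the choice of maximum-value propagation as
the example task are all assumptions. In each superstep every vertex reads the
messages sent to it in the previous superstep, updates its own value, and sends
messages to its neighbours; the run halts when no messages are in flight.

```python
from typing import Dict, List


def propagate_max(values: Dict[str, int],
                  edges: Dict[str, List[str]],
                  max_supersteps: int = 50) -> Dict[str, int]:
    """Spread the largest vertex value through the graph, superstep by superstep.

    Every edge target is assumed to be a key of `values`.
    """
    values = dict(values)  # work on a copy so the caller's data is untouched

    # Superstep 0: every vertex sends its value to its out-neighbours.
    inbox: Dict[str, List[int]] = {v: [] for v in values}
    for vertex, value in values.items():
        for neighbour in edges.get(vertex, []):
            inbox[neighbour].append(value)

    for _ in range(max_supersteps):
        outbox: Dict[str, List[int]] = {v: [] for v in values}
        message_sent = False
        # Vertex-centric compute step: each vertex sees only its own messages.
        for vertex in values:
            messages = inbox[vertex]
            if messages and max(messages) > values[vertex]:
                values[vertex] = max(messages)
                for neighbour in edges.get(vertex, []):
                    outbox[neighbour].append(values[vertex])
                    message_sent = True
        if not message_sent:
            break  # all vertices are idle, so the computation halts
        inbox = outbox  # barrier: messages become visible in the next superstep
    return values


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    print(propagate_max({"a": 3, "b": 6, "c": 2}, graph))
```

A distributed implementation would partition the vertices across workers and
checkpoint state between supersteps; the sketch leaves both out.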

Q3. What do you recommend for Reporting on top of NoSQL databases?

Technologies coming under NoSQL are relatively new and still evolving. Furthermore,
there are a lot of these technologies, and it is unlikely that one single tool would
work on all of them.

It would be great if you could share with us the exact NoSQL technology that you are
either using or planning to use, and we'll then be able to suggest the right tool.

There are very few reporting tools, like Intellicus and Jasper, that work on HBase,
but I guess they're still keeping an eye on the market to see the direction it's
going to take.

I strongly believe that you should see some exciting features in these tools in the
next 6 - 12 months.

Q4. What are the differences between Cassandra and Riak and why would you
choose one over the other?

Cassandra and Riak are popular NoSQL solutions, and each is best suited to
different kinds of use cases. So the choice of one over the other depends entirely
on the business use case that we are trying to solve.

Strengths of Riak over Cassandra

- Adding nodes to the Riak cluster is very easy
- The data model doesn't need to be set up in advance
- You can access it using REST or the Protocol Buffers API
- Commercial support is available from Basho




Strengths of Cassandra over Riak

- Cassandra is still more popular because of the bigger community using it
- You can access it using CQL, Cassandra's SQL-like query language
- It scales to petabytes and supports a columnar structure
- Enterprise features like rack-awareness are free, which is helpful in large
  deployments
- Commercial product support is available from DataStax
- Implementation support is available from third-party commercial service providers
  like Impetus (http://wiki.apache.org/cassandra/ThirdPartySupport)

Q5. We planned for a SAN deployment as our storage solution. I have read that
MPP database solutions are optimal on a shared-nothing architecture as DAS
rather than on SAN. Can you please comment on MPP database on SAN vs DAS?

Typically, a SAN can offer higher throughput than DAS, but it can also have higher
latency than DAS under lighter loads. Also, a SAN's available throughput is shared
across all connected nodes. In an MPP data warehousing scenario, multiple nodes
connect to the SAN and therefore share a common bandwidth.

Another point to note is that most queries served by MPP systems involve a large
number of scattered reads across multiple nodes, thus pushing the bandwidth
utilization on the SAN to its limits. However, with a large cache, high-speed HBAs,
and high-speed disks (15K RPM) in the SAN, the SAN should be able to serve a
10-15 node MPP cluster.

On the other hand, DAS can also provide very good throughput and does not have to
share its bandwidth across multiple nodes. The bandwidth offered can be further
improved by using multiple SATA adapters and high-speed disks (10K - 15K RPM). DAS
will probably offer better performance on a cluster with a very high number of
nodes.

To summarize, there is no clear winner, and the choice between SAN and DAS will
depend on factors such as load, the underlying technology in the storage system,
cache, and the number of nodes. Both high-end fibre-based storage technology and
newer SATA-based storage technology (e.g. SATA-3) can offer similar bandwidth. We
suggest conducting a careful study and capacity planning exercise on the underlying
storage system before deciding on the storage solution.
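
As a starting point for that capacity planning, the bandwidth-sharing argument can
be put into rough numbers. The sketch below is a simplification: it assumes the SAN
bandwidth is split evenly across nodes and ignores latency, cache, and contention
effects. The function names and the figures in the usage example are hypothetical,
not measurements.

```python
import math


def san_share_per_node(total_san_mb_s: float, nodes: int) -> float:
    """Bandwidth each node gets if the SAN's total is split evenly."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return total_san_mb_s / nodes


def crossover_nodes(total_san_mb_s: float, das_per_node_mb_s: float) -> int:
    """Smallest cluster size at which a node's SAN share drops below its DAS bandwidth."""
    return math.floor(total_san_mb_s / das_per_node_mb_s) + 1


if __name__ == "__main__":
    # Hypothetical figures: a 6000 MB/s SAN versus 500 MB/s of DAS per node.
    for n in (4, 12, 13, 40):
        print(n, "nodes ->", round(san_share_per_node(6000, n), 1), "MB/s per node")
    print("DAS pulls ahead from", crossover_nodes(6000, 500), "nodes")
```

With these assumed figures, each node's share of the SAN matches DAS at 12 nodes and
falls below it from 13 nodes onward, which is in line with the 10-15 node guidance
above.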

Q6. What architecture components would satisfy the desire to have an integrated
NewSQL environment and be able to marry that data with both adhoc defined user
tables and events detected during unstructured data stream processing?

NewSQL and NoSQL databases/datastores excel in areas where traditional RDBMS
systems have some limitations. In many scenarios, NoSQL/NewSQL databases can
offer significant improvement over an RDBMS. Some such cases are:

1. Very high availability for high-traffic data
2. Storing CLOB/text data that holds denormalized or unstructured content
3. Journal data
4. Performance and scalability

Unstructured data stream processing falls more under the category of CEP (complex
event processing), and eventually we will see NewSQL systems start providing
support for pre-ingestion analytics in addition to the current, traditional
post-ingestion analytics. Currently, you will have to rely on a CEP component to
provide event detection on streaming data while NewSQL acts as the data sink for
this streaming data. NewSQL can also help in rapid event generation by firing
analytical queries much faster than a traditional RDBMS.
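
The arrangement described above, an event detector in front of a SQL data sink, can
be sketched as follows. This is an illustration only: an in-memory SQLite database
stands in for the NewSQL store, and the moving-average rule, table name, and
function name are assumptions rather than part of any CEP product.

```python
import sqlite3
from collections import deque
from typing import Iterable, Tuple


def detect_and_store(readings: Iterable[Tuple[int, float]],
                     window: int,
                     threshold: float,
                     conn: sqlite3.Connection) -> int:
    """Write an event row whenever the moving average exceeds the threshold.

    `readings` is a stream of (timestamp, value) pairs. Returns the number of
    events written to the sink.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS events (ts INTEGER, window_avg REAL)")
    recent: deque = deque(maxlen=window)  # holds only the last `window` values
    stored = 0
    for ts, value in readings:
        recent.append(value)
        if len(recent) == window:
            avg = sum(recent) / window
            if avg > threshold:
                conn.execute("INSERT INTO events VALUES (?, ?)", (ts, avg))
                stored += 1
    conn.commit()
    return stored


if __name__ == "__main__":
    sink = sqlite3.connect(":memory:")  # stand-in for the NewSQL data sink
    stream = [(1, 10.0), (2, 10.0), (3, 50.0), (4, 60.0), (5, 10.0)]
    print(detect_and_store(stream, window=2, threshold=30.0, conn=sink))
    # Detected events can now be joined with user-defined tables using plain SQL.
    print(sink.execute("SELECT ts, window_avg FROM events ORDER BY ts").fetchall())
```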

Q7. Can you compare Neo4j with your recommended Graph Database?

Already answered as part of Q2.

Q8. What is your take on MongoDB?

The RDBMS is still the most commonly used data store for applications built today.
But the flexibility offered by MongoDB provides advantages with respect to
development speed and overall application performance in many use cases. Like any
other document store, instead of storing data in tables with rows and columns,
MongoDB encapsulates data in loosely defined documents.




There are a lot of document-oriented stores, and the underlying implementation
varies from one data store to another. Some represent a document as XML and some
use JSON. The general rule is that documents are not rigidly defined, and you can
expect a high degree of flexibility when defining data.
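
To illustrate what "not rigidly defined" means in practice, here is a small sketch
using plain Python dictionaries and the standard `json` module. It does not use a
MongoDB driver; the sample records and the `find` helper are hypothetical. The two
documents sit in the same collection even though they carry different fields.

```python
import json
from typing import Any, Dict, List

# Two documents in one collection; neither had to match a predefined schema.
documents: List[Dict[str, Any]] = [
    {"_id": 1, "name": "Ada", "email": "ada@example.com"},
    {"_id": 2, "name": "Lin", "phones": ["555-0100", "555-0101"],
     "address": {"city": "Pune"}},
]


def find(docs: List[Dict[str, Any]], **criteria: Any) -> List[Dict[str, Any]]:
    """Return documents whose top-level fields equal every given criterion."""
    return [d for d in docs
            if all(d.get(field) == value for field, value in criteria.items())]


def fields_in_use(docs: List[Dict[str, Any]]) -> List[str]:
    """List every top-level field that appears in at least one document."""
    return sorted({field for d in docs for field in d})


if __name__ == "__main__":
    print(json.dumps(find(documents, name="Lin"), indent=2))
    print(fields_in_use(documents))
```

In a relational table, adding `phones` or a nested `address` would require a schema
change or extra tables; here each document simply carries the fields it needs.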

MongoDB is one of the most popular document stores. It is open source, schema-free,
and written in C++, and it supports a wide array of programming languages as well
as a query language that covers many SQL-like operations.

It is a relatively new technology and has a few challenges as well, but with
attractive pricing and relative ease of use, it is becoming a choice for a range of
small and large companies.

Q9. You didn't mention Neo4j in your graph databases you recommend. Any
particular reason Neo4j wasn't included?

No, there is no particular reason. What is important here is to understand the
difference between these technologies and where each one fits. If you have an OLAP
and data analytics scenario, the Hadoop-based Giraph (a Pregel implementation) will
be a better fit. If you have an OLTP setup where you want to store and query
connected data for online transaction processing, Neo4j comes into the picture.

For an excellent discussion of this topic, see:
http://jim.webber.name/2011/08/24/66f1fb4b-83c3-4f52-af40-ee6382ad2155.aspx

Q10. What is the limiting factor in analyzing all data on a real-time basis? Is it
processing power, storage systems, DB systems or something else?

There are challenges in each of the areas you raised, such as storage and
processing. When you process the data, it usually has to be loaded into main memory,
which is still expensive. The machines have to be powerful enough to get you the
results fast. Hence, both processing and storage systems are the main bottlenecks.

Also, there is a paradigm shift in the way programming is done. In order to process
the data efficiently, we need to come up with parallel algorithms that can work on
this data and utilize the processing power of the machines.
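
As a small illustration of what a parallel algorithm looks like, the sketch below
splits a dataset into chunks, computes a partial result for each chunk
independently, and then combines the partial results. The function names and the
sum-of-squares task are assumptions for illustration. A thread pool is used to keep
the example self-contained; for CPU-bound work in CPython, a process pool or a
cluster framework would be needed to get real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def chunk(data: Sequence[int], parts: int) -> List[Sequence[int]]:
    """Split data into at most `parts` contiguous chunks of near-equal size."""
    size = max(1, -(-len(data) // parts))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_sum_of_squares(piece: Sequence[int]) -> int:
    """Work done independently on one chunk (the 'map' step)."""
    return sum(x * x for x in piece)


def parallel_sum_of_squares(data: Sequence[int], workers: int = 4) -> int:
    """Compute per-chunk results concurrently, then combine them (the 'reduce' step)."""
    pieces = chunk(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, pieces))
    return sum(partials)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10))))
```

The key property is that no chunk depends on another, so adding machines or cores
lets more chunks be processed at the same time.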



So, to summarize, the three limiting factors I would highlight are memory,
processing power, and the right set of algorithms.

Q11. What do you recommend for an in-database but very scalable alternative to
SAS for doing advanced math on large datasets?

Assuming that the reference here is to the SAS language, R can be a good
alternative for working with large datasets, as it integrates well with Hadoop and
can scale by running R scripts through the MapReduce programming interface.
Revolution Analytics offers a commercial product for R over Hadoop.

There are other, non-Hadoop options as well, such as Greenplum or Aster, which
have support for specialized advanced math libraries.

Also, SAS is now providing integration with Hadoop which means that you can reuse
some of your SAS programming investments and use Hadoop as the underlying
scalable processing engine for some of the analytical execution.

Q12. Are there any NewSQL platforms that have mastered the functionality around
Workload Management? For instance, without workload management, the high
resource, intense transactions can get in the way of traditional reporting needs.
In other words, is there a NewSQL environment that can be used for traditional and
advanced analysis on the same platform?

NewSQL platforms are evolving rapidly, with many more being built in stealth mode.
We are not aware of any NewSQL platform that provides advanced workload management
functionality for now, but that may change any day.

However, most NewSQL platforms have been designed to work efficiently with either
an OLTP environment or an OLAP environment, rather than both.

Q14. Is MongoDB a better solution for any of the scenarios discussed?

MongoDB can be a good option for some OLTP use cases, such as the transactional
system we discussed.



Q15. Do you have recommendations for an indexing solution?

Depending on the data size, you can consider Solr and Elastic Solr as options for
indexing. There are commercial solutions as well, but Solr, with its new scalable
version SolrCloud, can compete with the commercial solutions.

            Write to us at bigdata@impetus.com for more information




© 2012 Impetus Technologies                                                       Page 7

Contenu connexe

Tendances

Hadoop in three use cases
Hadoop in three use casesHadoop in three use cases
Hadoop in three use cases
Joey Echeverria
 
1 rh storage - architecture whitepaper
1 rh storage - architecture whitepaper1 rh storage - architecture whitepaper
1 rh storage - architecture whitepaper
Accenture
 

Tendances (20)

Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
 
Big data rmoug
Big data rmougBig data rmoug
Big data rmoug
 
HDFS
HDFSHDFS
HDFS
 
Boston Hadoop Meetup, April 26 2012
Boston Hadoop Meetup, April 26 2012Boston Hadoop Meetup, April 26 2012
Boston Hadoop Meetup, April 26 2012
 
Aug 2012 HUG: Random vs. Sequential
Aug 2012 HUG: Random vs. SequentialAug 2012 HUG: Random vs. Sequential
Aug 2012 HUG: Random vs. Sequential
 
Hadoop in three use cases
Hadoop in three use casesHadoop in three use cases
Hadoop in three use cases
 
1 rh storage - architecture whitepaper
1 rh storage - architecture whitepaper1 rh storage - architecture whitepaper
1 rh storage - architecture whitepaper
 
Actian DataFlow Whitepaper
Actian DataFlow WhitepaperActian DataFlow Whitepaper
Actian DataFlow Whitepaper
 
Indic threads pune12-nosql now and path ahead
Indic threads pune12-nosql now and path aheadIndic threads pune12-nosql now and path ahead
Indic threads pune12-nosql now and path ahead
 
HBaseCon 2012 | Real-Time and Batch HBase for Healthcare at Explorys
HBaseCon 2012 | Real-Time and Batch HBase for Healthcare at ExplorysHBaseCon 2012 | Real-Time and Batch HBase for Healthcare at Explorys
HBaseCon 2012 | Real-Time and Batch HBase for Healthcare at Explorys
 
Hadoop in a Nutshell
Hadoop in a NutshellHadoop in a Nutshell
Hadoop in a Nutshell
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data Engineering
 
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
 
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
 
Bigdata and Hadoop Introduction
Bigdata and Hadoop IntroductionBigdata and Hadoop Introduction
Bigdata and Hadoop Introduction
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
 
Running Cognos on Hadoop
Running Cognos on HadoopRunning Cognos on Hadoop
Running Cognos on Hadoop
 
Introducing the hadoop ecosystem
Introducing the hadoop ecosystemIntroducing the hadoop ecosystem
Introducing the hadoop ecosystem
 
Integrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle DatabaseIntegrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle Database
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
 

En vedette

Doris Devers Resume 033116
Doris Devers Resume 033116Doris Devers Resume 033116
Doris Devers Resume 033116
Doris Devers
 
비종교 적인 깨달음 (Korean)
비종교 적인 깨달음 (Korean)비종교 적인 깨달음 (Korean)
비종교 적인 깨달음 (Korean)
Hitoshi Tsuchiyama
 
Laurie Faith resume
Laurie Faith resumeLaurie Faith resume
Laurie Faith resume
Laurie Faith
 
BITE Social Case Study 1
BITE Social Case Study 1BITE Social Case Study 1
BITE Social Case Study 1
Tamara Wilson
 
Second Mile Mobile Detailing
Second Mile Mobile DetailingSecond Mile Mobile Detailing
Second Mile Mobile Detailing
Allan S. Watson
 
6 Apps to Use With Instagram
6 Apps to Use With Instagram6 Apps to Use With Instagram
6 Apps to Use With Instagram
MindShift Interactive
 

En vedette (12)

Your Awesome Brand + Resume Tips
Your Awesome Brand + Resume TipsYour Awesome Brand + Resume Tips
Your Awesome Brand + Resume Tips
 
Guideline to composition writing
Guideline to composition writingGuideline to composition writing
Guideline to composition writing
 
Doris Devers Resume 033116
Doris Devers Resume 033116Doris Devers Resume 033116
Doris Devers Resume 033116
 
S4 tarea4 mameb
S4 tarea4 mamebS4 tarea4 mameb
S4 tarea4 mameb
 
Eenvoudig dienstbetoon
Eenvoudig dienstbetoonEenvoudig dienstbetoon
Eenvoudig dienstbetoon
 
비종교 적인 깨달음 (Korean)
비종교 적인 깨달음 (Korean)비종교 적인 깨달음 (Korean)
비종교 적인 깨달음 (Korean)
 
flyer
flyerflyer
flyer
 
Laurie Faith resume
Laurie Faith resumeLaurie Faith resume
Laurie Faith resume
 
BITE Social Case Study 1
BITE Social Case Study 1BITE Social Case Study 1
BITE Social Case Study 1
 
Kti mitra tanjung
Kti mitra tanjungKti mitra tanjung
Kti mitra tanjung
 
Second Mile Mobile Detailing
Second Mile Mobile DetailingSecond Mile Mobile Detailing
Second Mile Mobile Detailing
 
6 Apps to Use With Instagram
6 Apps to Use With Instagram6 Apps to Use With Instagram
6 Apps to Use With Instagram
 

Similaire à Webcast Q&A- Big Data Architectures Beyond Hadoop

EOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperEOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - Paper
David Walker
 
Whitepaper_Cassandra_Datastax_Final
Whitepaper_Cassandra_Datastax_FinalWhitepaper_Cassandra_Datastax_Final
Whitepaper_Cassandra_Datastax_Final
Michele Hunter
 

Similaire à Webcast Q&A- Big Data Architectures Beyond Hadoop (20)

No sql database
No sql databaseNo sql database
No sql database
 
Report 2.0.docx
Report 2.0.docxReport 2.0.docx
Report 2.0.docx
 
Report 1.0.docx
Report 1.0.docxReport 1.0.docx
Report 1.0.docx
 
Steps to Modernize Your Data Ecosystem | Mindtree
Steps to Modernize Your Data Ecosystem | Mindtree									Steps to Modernize Your Data Ecosystem | Mindtree
Steps to Modernize Your Data Ecosystem | Mindtree
 
Six Steps to Modernize Your Data Ecosystem - Mindtree
Six Steps to Modernize Your Data Ecosystem  - MindtreeSix Steps to Modernize Your Data Ecosystem  - Mindtree
Six Steps to Modernize Your Data Ecosystem - Mindtree
 
6 Steps to Modernize Data Ecosystem with Mindtree
6 Steps to Modernize Data Ecosystem with Mindtree6 Steps to Modernize Data Ecosystem with Mindtree
6 Steps to Modernize Data Ecosystem with Mindtree
 
Steps to Modernize Your Data Ecosystem with Mindtree Blog
Steps to Modernize Your Data Ecosystem with Mindtree Blog Steps to Modernize Your Data Ecosystem with Mindtree Blog
Steps to Modernize Your Data Ecosystem with Mindtree Blog
 
Agile data lake? An oxymoron?
Agile data lake? An oxymoron?Agile data lake? An oxymoron?
Agile data lake? An oxymoron?
 
EOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperEOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - Paper
 
Comparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and sparkComparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and spark
 
Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop
 
Introducing Mache
Introducing MacheIntroducing Mache
Introducing Mache
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
 
CouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big DataCouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big Data
 
Big Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. SparkBig Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. Spark
 
Comparison of MPP Data Warehouse Platforms
Comparison of MPP Data Warehouse PlatformsComparison of MPP Data Warehouse Platforms
Comparison of MPP Data Warehouse Platforms
 
No sql
No sqlNo sql
No sql
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystem
 
Whitepaper_Cassandra_Datastax_Final
Whitepaper_Cassandra_Datastax_FinalWhitepaper_Cassandra_Datastax_Final
Whitepaper_Cassandra_Datastax_Final
 
Sdn in big data
Sdn in big dataSdn in big data
Sdn in big data
 

Plus de Impetus Technologies

Webinar maturity of mobile test automation- approaches and future trends
Webinar  maturity of mobile test automation- approaches and future trendsWebinar  maturity of mobile test automation- approaches and future trends
Webinar maturity of mobile test automation- approaches and future trends
Impetus Technologies
 

Plus de Impetus Technologies (20)

Data Warehouse Modernization Webinar Series- Critical Trends, Implementation ...
Data Warehouse Modernization Webinar Series- Critical Trends, Implementation ...Data Warehouse Modernization Webinar Series- Critical Trends, Implementation ...
Data Warehouse Modernization Webinar Series- Critical Trends, Implementation ...
 
Future-Proof Your Streaming Analytics Architecture- StreamAnalytix Webinar
Future-Proof Your Streaming Analytics Architecture- StreamAnalytix WebinarFuture-Proof Your Streaming Analytics Architecture- StreamAnalytix Webinar
Future-Proof Your Streaming Analytics Architecture- StreamAnalytix Webinar
 
Building Real-time Streaming Apps in Minutes- Impetus Webinar
Building Real-time Streaming Apps in Minutes- Impetus WebinarBuilding Real-time Streaming Apps in Minutes- Impetus Webinar
Building Real-time Streaming Apps in Minutes- Impetus Webinar
 
Smart Enterprise Big Data Bus for the Modern Responsive Enterprise- StreamAna...
Smart Enterprise Big Data Bus for the Modern Responsive Enterprise- StreamAna...Smart Enterprise Big Data Bus for the Modern Responsive Enterprise- StreamAna...
Smart Enterprise Big Data Bus for the Modern Responsive Enterprise- StreamAna...
 
Impetus White Paper- Handling Data Corruption in Elasticsearch
Impetus White Paper- Handling  Data Corruption  in ElasticsearchImpetus White Paper- Handling  Data Corruption  in Elasticsearch
Impetus White Paper- Handling Data Corruption in Elasticsearch
 
Real-world Applications of Streaming Analytics- StreamAnalytix Webinar
Real-world Applications of Streaming Analytics- StreamAnalytix WebinarReal-world Applications of Streaming Analytics- StreamAnalytix Webinar
Real-world Applications of Streaming Analytics- StreamAnalytix Webinar
 
Real-world Applications of Streaming Analytics- StreamAnalytix Webinar
Real-world Applications of Streaming Analytics- StreamAnalytix WebinarReal-world Applications of Streaming Analytics- StreamAnalytix Webinar
Real-world Applications of Streaming Analytics- StreamAnalytix Webinar
 
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
 
Accelerating Hadoop Solution Lifecycle and Improving ROI- Impetus On-demand W...
Accelerating Hadoop Solution Lifecycle and Improving ROI- Impetus On-demand W...Accelerating Hadoop Solution Lifecycle and Improving ROI- Impetus On-demand W...
Accelerating Hadoop Solution Lifecycle and Improving ROI- Impetus On-demand W...
 
Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...
Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...
Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...
 
SPARK USE CASE- Distributed Reinforcement Learning for Electricity Market Bi...
SPARK USE CASE-  Distributed Reinforcement Learning for Electricity Market Bi...SPARK USE CASE-  Distributed Reinforcement Learning for Electricity Market Bi...
SPARK USE CASE- Distributed Reinforcement Learning for Electricity Market Bi...
 
Enterprise Ready Android and Manageability- Impetus Webcast
Enterprise Ready Android and Manageability- Impetus WebcastEnterprise Ready Android and Manageability- Impetus Webcast
Enterprise Ready Android and Manageability- Impetus Webcast
 
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
 
Leveraging NoSQL Database Technology to Implement Real-time Data Architecture...
Leveraging NoSQL Database Technology to Implement Real-time Data Architecture...Leveraging NoSQL Database Technology to Implement Real-time Data Architecture...
Leveraging NoSQL Database Technology to Implement Real-time Data Architecture...
 
Maturity of Mobile Test Automation: Approaches and Future Trends- Impetus Web...
Maturity of Mobile Test Automation: Approaches and Future Trends- Impetus Web...Maturity of Mobile Test Automation: Approaches and Future Trends- Impetus Web...
Maturity of Mobile Test Automation: Approaches and Future Trends- Impetus Web...
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLab
 
Webinar maturity of mobile test automation- approaches and future trends
Webinar  maturity of mobile test automation- approaches and future trendsWebinar  maturity of mobile test automation- approaches and future trends
Webinar maturity of mobile test automation- approaches and future trends
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph lab
 
The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...
The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...
The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...
 
Performance Testing of Big Data Applications - Impetus Webcast
Performance Testing of Big Data Applications - Impetus WebcastPerformance Testing of Big Data Applications - Impetus Webcast
Performance Testing of Big Data Applications - Impetus Webcast
 

Dernier

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Dernier (20)

WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 

Webcast Q&A- Big Data Architectures Beyond Hadoop

  • 1. Webinar: Big Data Architectures – Beyond the Elephant Ride June 29, 2012 Question and Answer Session Q1. What are the differences between Storm and ESBs like Mule? Storm and ESB (like mule) are very distinct and cannot be compared.. The motivation behind ESBs is to standardize and structure the loosely coupled software components so that they can be independently deployed and run in a disparate environment. The communication is through message passing and using an ESB heterogeneous components are able to interact with each other. Storm is for processing large data in real time. When we use storm, we do not attempt establishing any form of a common structure for different components to collaborate. Rather, Storm enables huge amount of data to be processed through a chain of processing units. So when you’ve large amounts of data than you want to process in real-time, we advise you to use Storm. On the other hand, when you have numerous components and you want to write a layer that will enable their interaction, use an ESB. Infact, Storm and ESB can be theoretically integrated together so that Storm can handle the streaming analytics part while ESB can cater to service orchestration and integrations. Q2. What is the advantage of Giraph and Pregel over more common Graph DBs like Neo or Infinite graphs? Giraph is an opensource implementation of Pregel meant for large datasets. It provides a large-scale graph processing infrastructure over Hadoop. Some of the advantages I’d like to highlight include: 1. Distributed and especially developed for large scale graph processing 2. Bulk Synchronous Parallel (BSP) as execution model 3. Fault tolerance by check pointing 4. Giraph runs on standard Hadoop infrastructure © 2012 Impetus Technologies Page 1
   5.   Computation is executed in memory
   6.   It can be part of a pipeline in the form of a job
   7.   Vertex-centric API

Please see the answer to Q9 as well.

Q3. What do you recommend for reporting on top of NoSQL databases?

NoSQL technologies are relatively new and still evolving. Furthermore, there are a lot of them, and it is unlikely that one single tool would work with all of them. If you can share the exact NoSQL technology you are using or planning to use, we will be able to suggest the right tool.

A few reporting tools, such as Intellicus and Jasper, work on HBase, but I believe they are still watching the market to see which direction it takes. I expect to see some exciting features in these tools over the next 6 to 12 months.

Q4. What are the differences between Cassandra and Riak, and why would you choose one over the other?

Cassandra and Riak are popular NoSQL solutions, and each is best suited to different kinds of use cases. So the choice of one over the other depends entirely on the business use case we are trying to solve.

Strengths of Riak over Cassandra
- Adding nodes to a Riak cluster is very easy
- The data model does not need to be set up in advance
- You can access it using REST or the Protocol Buffers API
- Commercial support is available from Basho

Strengths of Cassandra over Riak
- Cassandra is still more popular because of the bigger community using it
- You can access it using CQL (Cassandra Query Language), a SQL-like language
- Scales to petabytes and supports a columnar structure
- Enterprise features like rack awareness are free, which is helpful in large deployments
- Commercial product support is available from DataStax
- Implementation support is available from third-party commercial service providers like Impetus (http://wiki.apache.org/cassandra/ThirdPartySupport)

Q5. We planned for a SAN deployment as our storage solution. I have read that MPP database solutions are optimal on a shared-nothing architecture with DAS rather than on SAN. Can you please comment on MPP databases on SAN vs DAS?

Typically, a SAN can offer higher throughput than DAS but can also have higher latency for lighter loads. Also, a SAN's available throughput is shared across all connected nodes. In an MPP data warehousing scenario, multiple nodes connect to the SAN and therefore share a common bandwidth. Another point to note is that most queries served by MPP systems involve a large number of scattered reads across multiple nodes, pushing bandwidth utilization on the SAN to its limits. However, with a large cache, high-speed HBAs and high-speed disks (15K RPM) in the SAN, it should be able to serve a 10-15 node MPP cluster.

On the other hand, DAS can also provide very good throughput and does not have to share bandwidth across multiple nodes. The bandwidth offered can be further improved by using multiple SATA adapters and high-speed disks (10K - 15K RPM). DAS will probably offer better performance on a cluster with a very high number of nodes.

To summarize, there is no clear winner, and the choice of SAN vs DAS depends on factors such as load, the underlying technology in the storage system, cache, and number of nodes.
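
The bandwidth-sharing argument in the SAN vs DAS comparison can be made concrete with some back-of-the-envelope arithmetic. The sketch below is only illustrative: the throughput figures (6000 MB/s aggregate for the SAN, 400 MB/s per DAS node) are assumed values chosen for the example, not measurements, and the model ignores cache, latency and contention effects.

```python
def san_per_node_mb_s(san_total_mb_s: float, nodes: int) -> float:
    """Bandwidth each node sees when a SAN's aggregate throughput is shared evenly."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return san_total_mb_s / nodes


def breakeven_nodes(san_total_mb_s: float, das_per_node_mb_s: float) -> float:
    """Cluster size at which shared SAN bandwidth per node equals DAS bandwidth per node."""
    return san_total_mb_s / das_per_node_mb_s


# Assumed, illustrative figures (not taken from any vendor data sheet).
SAN_TOTAL_MB_S = 6000.0
DAS_PER_NODE_MB_S = 400.0

if __name__ == "__main__":
    # Compare per-node bandwidth for a small, medium and large cluster.
    for n in (5, 15, 40):
        print(n, san_per_node_mb_s(SAN_TOTAL_MB_S, n), DAS_PER_NODE_MB_S)
```

Under these assumed numbers the SAN matches DAS per node at 15 nodes and falls behind on larger clusters. Real capacity planning would replace the constants with measured values for the actual storage system.
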
Both high-end fibre-based storage technology and newer SATA-based storage technology (e.g. SATA-3) can offer similar bandwidth. We suggest that a careful study and capacity planning be conducted on the underlying storage system before deciding on the storage solution.

Q6. What architecture components would satisfy the desire to have an integrated NewSQL environment and be able to marry that data with both ad hoc defined user tables and events detected during unstructured data stream processing?

NewSQL and NoSQL databases/datastores excel in areas where traditional RDBMS systems have limitations. In many scenarios, NoSQL/NewSQL databases can offer significant improvement over an RDBMS. Some cases are:

   1.   Very high availability on high-traffic data
   2.   Storing CLOB/text data that holds denormalized/unstructured data
   3.   Journal data
   4.   Performance and scalability

Unstructured data stream processing falls more under the category of CEP (complex event processing), and eventually we will see NewSQL systems start to support pre-ingestion analytics rather than only the traditional post-ingestion analytics. Currently, you will have to rely on a CEP component to provide event detection on streaming data, while NewSQL acts as the data sink for this streaming data. NewSQL can also help in rapid event generation by running analytical queries much faster than a traditional RDBMS.

Q7. Can you compare Neo4j with your recommended graph database?

Already answered as part of Q2.

Q8. What is your take on MongoDB?

RDBMS is still the most commonly used data store for applications built today. But the flexibility offered by MongoDB provides advantages in development speed and overall application performance in many use cases. Like any other document store, instead of storing data in tables with rows and columns, MongoDB encapsulates data in loosely defined documents.
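
To make "loosely defined documents" concrete, here is a minimal sketch using only plain Python dictionaries and the standard `json` module. It does not use MongoDB or its driver; the `customers` list, the field names and the `find` helper are hypothetical stand-ins invented for illustration.

```python
import json

# Hypothetical "collection": a plain list standing in for a document store.
# The two documents share no fixed schema: each carries only the fields it needs.
customers = [
    {"_id": 1, "name": "Asha", "email": "asha@example.com"},
    {
        "_id": 2,
        "name": "Ben",
        "phones": ["555-0100", "555-0101"],           # a list inside a document
        "address": {"city": "Pune", "zip": "411001"},  # a nested sub-document
    },
]


def find(collection, **criteria):
    """Return documents whose top-level fields equal all the given criteria."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


def field_names(collection):
    """Union of top-level field names across all documents, sorted."""
    return sorted({key for doc in collection for key in doc})


if __name__ == "__main__":
    print(json.dumps(find(customers, name="Ben"), indent=2))
    print(field_names(customers))
```

In a relational table, the `phones` and `address` fields would typically need extra columns or separate tables. Here each document simply includes or omits them.
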

There are a lot of document-oriented stores, and the underlying implementation varies between them. Some represent a document as XML and some use JSON. The general rule is that documents are not rigidly defined, and you can expect a high degree of flexibility when defining data.

MongoDB is one of the most popular document stores. It is open source, schema-free and written in C++, and it supports a wide array of programming languages as well as its own document-based query language. It is a relatively new technology and has a few challenges as well, but with attractive pricing and relative ease of use, it is becoming the choice of many small and large companies.

Q9. You didn't mention Neo4j among the graph databases you recommend. Any particular reason Neo4j wasn't included?

No, there is no particular reason. What is important here is to understand the difference between these technologies and where each fits. If you have an OLAP and data analytics scenario, Pregel-style processing with Giraph on Hadoop will be a better fit. If you have an OLTP setup where you want to store and query connected data for online transaction processing, Neo4j comes into the picture. For more detail, see this excellent article: http://jim.webber.name/2011/08/24/66f1fb4b-83c3-4f52-af40-ee6382ad2155.aspx

Q10. What is the limiting factor in analyzing all data on a real-time basis? Is it processing power, storage systems, DB systems or something else?

There are challenges in each of the areas you raised, such as storing and processing. When you process data, it usually has to be loaded into main memory, which is still expensive. The machines have to be powerful enough to return results fast. Hence, both processing and storage systems are the main bottlenecks. Also, there is a paradigm shift in the way programming is done.
So, in order to process the data efficiently, we need to come up with parallel algorithms that can work on this data and utilize the processing power of the machines.
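
As a small illustration of what such a parallel algorithm looks like, the sketch below computes a mean by splitting the data into chunks, computing a partial (count, sum) per chunk concurrently, and then combining the partial results. All function names are hypothetical. It uses a thread pool from the Python standard library only to keep the example self-contained; for CPU-bound work on CPython a process pool or a cluster framework would be needed to get real speed-up, so treat this as a sketch of the decomposition rather than a performance recipe.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def chunked(values, n_chunks):
    """Split a non-empty list into at most n_chunks contiguous slices."""
    if not values:
        raise ValueError("values must not be empty")
    size = -(-len(values) // n_chunks)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_stats(chunk):
    """Map step: summarize one chunk independently of all the others."""
    return (len(chunk), sum(chunk))


def combine(left, right):
    """Reduce step: merge two partial (count, sum) results."""
    return (left[0] + right[0], left[1] + right[1])


def parallel_mean(values, workers=4):
    """Mean of values, computed from per-chunk partial results."""
    chunks = chunked(values, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_stats, chunks))
    count, total = reduce(combine, partials, (0, 0))
    return total / count


if __name__ == "__main__":
    print(parallel_mean(list(range(1, 101))))
```

The key design point is that `partial_stats` needs no knowledge of other chunks and `combine` is associative, so the same decomposition carries over to many machines.
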

To summarize, the three limiting factors I see are memory, processing power and the right set of algorithms.

Q11. What do you recommend as an in-database but very scalable alternative to SAS for doing advanced math on large datasets?

Assuming the reference here is to the SAS language, R scripting can be a good alternative for working with large datasets, as it integrates well with Hadoop and can scale using a MapReduce programming interface over R scripts. Revolution Analytics offers a commercial product for R over Hadoop. There are non-Hadoop options as well, such as Greenplum and Aster, which support specialized advanced math libraries.

Also, SAS now provides integration with Hadoop, which means you can reuse some of your SAS programming investment and use Hadoop as the underlying scalable processing engine for some of the analytical execution.

Q12. Are there any NewSQL platforms that have mastered the functionality around workload management? For instance, without workload management, high-resource, intense transactions can get in the way of traditional reporting needs. In other words, is there a NewSQL environment that can be used for traditional and advanced analysis on the same platform?

NewSQL platforms are evolving every day, with many more being built in stealth mode. We are not aware of any NewSQL platform that currently provides advanced workload management functionality, but that may change any day. Most NewSQL platforms today are designed to work efficiently with either an OLTP or an OLAP environment.

Q14. Is MongoDB a better solution for any of the scenarios discussed?

MongoDB can be a good option in some OLTP use cases, such as the transactional systems we discussed.

Q15. Do you have recommendations for an indexing solution?

Depending on the data size, you can go for Solr and Elastic Solr as options for indexing. There are commercial solutions as well, but Solr, with its new scalable version SolrCloud, can compete with any commercial solution.

Write to us at bigdata@impetus.com for more information.