Webcast Q&A- Big Data Architectures Beyond Hadoop
Webinar: Big Data Architectures – Beyond the Elephant Ride
June 29, 2012
Question and Answer Session
Q1. What are the differences between Storm and ESBs like Mule?
Storm and ESBs like Mule are very distinct and cannot be compared directly.
The motivation behind ESBs is to standardize and structure loosely coupled
software components so that they can be independently deployed and run in
disparate environments. Communication happens through message passing, and
using an ESB, heterogeneous components are able to interact with each other.
Storm is for processing large volumes of data in real time. When we use Storm,
we do not attempt to establish any form of common structure for different
components to collaborate. Rather, Storm enables huge amounts of data to be
processed through a chain of processing units.
So when you have large amounts of data that you want to process in real time, we
advise you to use Storm. On the other hand, when you have numerous components
and you want to write a layer that enables their interaction, use an ESB.
In fact, Storm and an ESB can theoretically be integrated, so that Storm handles
the streaming analytics part while the ESB caters to service orchestration and
integration.
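As a rough illustration of that division of labour, the "chain of processing units" idea can be sketched in plain Python. This is conceptual only, not the actual Storm API (Storm spouts and bolts are Java classes); the function names here are made up for the example.

```python
# Conceptual sketch of a Storm-style topology: a spout emits tuples,
# and bolts transform them in a chain.

def sentence_spout(sentences):
    """Emits raw tuples, like a Storm spout reading from a queue."""
    for s in sentences:
        yield s

def split_bolt(stream):
    """First bolt: splits each sentence into words."""
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    """Second bolt: keeps a running count per word."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

stream = sentence_spout(["big data beyond hadoop", "big data in real time"])
counts = count_bolt(split_bolt(stream))
print(counts["big"])  # 2
```

In real Storm each bolt would run on many machines in parallel; the chained-generator structure here only shows how data flows through the processing units.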
Q2. What is the advantage of Giraph and Pregel over more common Graph DBs like
Neo or Infinite graphs?
Giraph is an open-source implementation of Pregel meant for large datasets. It
provides a large-scale graph processing infrastructure on top of Hadoop. Some of
the advantages I'd like to highlight:
1. Distributed and especially developed for large scale graph processing
2. Bulk Synchronous Parallel (BSP) as execution model
3. Fault tolerance by checkpointing
4. Giraph runs on standard Hadoop infrastructure
© 2012 Impetus Technologies Page 1
5. Computation is executed in memory
6. It can be part of a pipeline in the form of a job
7. Vertex centric API
Please go through the answer to Question 9 as well.
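For readers unfamiliar with the BSP model mentioned in point 2, here is a minimal, illustrative Python sketch of Pregel-style vertex-centric computation (not the Giraph API itself): each vertex updates its state from incoming messages, sends messages to neighbours, and the whole thing proceeds superstep by superstep until no vertex is active. The classic example is propagating the maximum value through a graph.

```python
# Illustrative BSP / vertex-centric sketch (plain Python, not Giraph):
# propagate the maximum value to every vertex.

def max_value_bsp(graph, values):
    """graph: vertex -> list of neighbours; values: vertex -> initial value."""
    # Superstep 0: every vertex broadcasts its own value to its neighbours.
    messages = {v: [] for v in graph}
    for v in graph:
        for n in graph[v]:
            messages[n].append(values[v])
    active = True
    while active:                               # one loop pass == one superstep
        active = False
        outbox = {v: [] for v in graph}
        for v in graph:
            if messages[v] and max(messages[v]) > values[v]:
                values[v] = max(messages[v])    # update vertex state
                for n in graph[v]:              # send the new value onwards
                    outbox[n].append(values[v])
                active = True                   # this vertex stays active
        messages = outbox                       # barrier: swap message queues
    return values

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
result = max_value_bsp(graph, {"a": 3, "b": 6, "c": 2})
print(result)  # every vertex converges to 6
```

In Giraph the per-vertex body above would be a `compute()` method distributed across a Hadoop cluster, with the superstep barrier enforced by the framework.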
Q3. What do you recommend for Reporting on top of NoSQL databases?
Technologies coming under the NoSQL umbrella are relatively new and still
evolving. Furthermore, there are a lot of them, and it is unlikely that one
single tool would work on all of them.
It would be great if you could share with us the exact NoSQL technology you are
either using or planning to use; we will then be able to suggest the right tool.
There are very few reporting tools, like Intellicus and Jasper, that work on
HBase, and I guess they are still keeping an eye on the market to see the
direction it is going to take.
I strongly believe that you should see some exciting features in these tools in the
next 6 - 12 months’ time frame.
Q4. What are the differences between Cassandra and Riak, and why would you
choose one over the other?
Cassandra and Riak are popular NoSQL solutions and are best suited to solving
different kinds of use cases in specific ways. So the choice of one over the
other depends entirely on the business use case you are trying to solve.
Strengths of Riak over Cassandra
- Adding nodes to the Riak cluster is very easy
- The data model doesn't need to be set up in advance
- You can access it using REST or using Protocol Buffer API
- Commercial support is available from Basho
Strengths of Cassandra over Riak
- Cassandra is still more popular because of the bigger community using it
- You can access it using Cassandra CQL; a SQLish language
- Scales to petabytes and supports a columnar structure
- Enterprise features like rack-awareness are free, which is helpful in large
deployments
- Commercial product support is available from Datastax.
- Implementation support is available from 3rd party commercial service providers
like Impetus. (http://wiki.apache.org/cassandra/ThirdPartySupport)
Q5. We planned for a SAN deployment as our storage solution. I have read that
MPP database solutions are optimal on a shared-nothing architecture as DAS
rather than on SAN. Can you please comment on MPP database on SAN vs DAS?
Typically speaking, a SAN can offer higher throughput than DAS, but it can also
have higher latency for lighter loads vis-à-vis DAS. Also, a SAN's available
throughput is shared across all connected nodes. In an MPP data warehousing
scenario, multiple nodes will connect to the SAN, thereby sharing a common
bandwidth.
Another point to note is that most queries served by MPP systems involve a high
amount of scattered reads across multiple nodes, thus pushing the bandwidth
utilization on the SAN to its limits. However, if we have a large cache with
high-speed HBAs and high-speed disks (15K RPM) in the SAN, then the SAN should
be able to serve a 10-15 node MPP cluster.
On the other hand, DAS storage can also provide very good throughput and does not
have to share the bandwidth across multiple nodes. The bandwidth offered can be
further improved by using multiple SATA adapters and high speed disks (10K - 15K
RPM). DAS will probably offer better performance on a cluster with a very high
number of nodes.
To summarize, there is no clear winner; choosing between SAN and DAS will depend
on various factors like load, the underlying technology in the storage system,
cache, number of nodes, etc. Both high-end fibre-based storage technology and
newer SATA-based storage technology (e.g. SATA-3) can offer similar bandwidth.
We suggest that a careful study and capacity planning be conducted on the
underlying storage system before deciding on the storage solution.
Q6. What architecture components would satisfy the desire to have an integrated
NewSQL environment and be able to marry that data with both ad hoc user-defined
tables and events detected during unstructured data stream processing?
NewSQL and NoSQL databases/datastores excel in areas where traditional RDBMS
systems have limitations. In many scenarios, NoSQL/NewSQL databases can offer
significant improvements over an RDBMS. Some examples:
1. Very high availability under heavy traffic
2. Storing CLOB/text fields that hold denormalized or unstructured data
3. Journal data
4. Performance and scalability
Unstructured data stream processing falls more under the category of CEP
(complex event processing), and eventually we will see NewSQL systems start
providing support for pre-ingestion analytics rather than the current,
traditional post-ingestion analytics. For now, you will have to rely on a CEP
component to provide event detection on streaming data, while NewSQL acts as the
data sink for this streaming data. NewSQL can also help with rapid event
generation by firing analytical queries much faster than a traditional RDBMS.
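As a hedged sketch of that pattern, the following plain-Python example (not any particular CEP product's API) detects "spike" events on a stream before the readings reach the data sink; the window size and threshold are made-up parameters for illustration.

```python
# Toy CEP-style pre-ingestion event detection: flag a reading as an event
# when it exceeds the moving average of the last few readings, then hand
# the raw reading on to the data sink (a NewSQL store, in the pattern above).

from collections import deque

def detect_spikes(readings, window=3, threshold=10.0):
    """Returns readings that exceed the moving average of the previous
    `window` readings by more than `threshold`."""
    recent = deque(maxlen=window)
    events = []
    for value in readings:
        if len(recent) == recent.maxlen:
            avg = sum(recent) / len(recent)
            if value - avg > threshold:
                events.append(value)   # pre-ingestion event detection
        recent.append(value)           # then sink the raw reading as usual
    return events

print(detect_spikes([1, 2, 3, 20, 4, 5, 6]))  # [20]
```

A production CEP engine would express such rules declaratively and run them continuously; the point here is only that the event is detected on the stream itself, before ingestion.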
Q7. Can you compare Neo4j with your recommended graph database?
Already answered as part of Q2; the answer to Q9 covers Neo4j as well.
Q8. What is your take on MongoDB?
RDBMS is still the most commonly used data store for applications built today,
but the flexibility offered by MongoDB provides advantages in development speed
and overall application performance in many use cases. Like any other document
store, instead of storing data in tables with rows and columns, MongoDB
encapsulates data in loosely defined documents.
There are a lot of document-oriented stores, and the underlying implementation
varies between them. Some represent documents as XML and some use JSON. The
general rule is that documents are not rigidly defined, so you can expect a high
degree of flexibility when defining data.
MongoDB is one of the most popular document stores. It is open source,
schema-free, written in C++, and comes with support for a wide array of
programming languages as well as a SQL-like query language.
It's a relatively new technology and has a few challenges as well, but with
attractive pricing and relative ease of use, it is definitely becoming a choice
for various small and large companies.
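To make the "loosely defined documents" point concrete, here is a toy plain-Python illustration (not the MongoDB API); the collection, field names, and `find` helper are all hypothetical stand-ins.

```python
# Unlike rows in a fixed table, documents in one collection need not
# share a schema. This list stands in for a document collection.

collection = []

# Two documents with different fields can live side by side.
collection.append({"name": "Alice", "email": "alice@example.com"})
collection.append({"name": "Bob", "tags": ["admin"], "age": 41})

def find(coll, **criteria):
    """Tiny query-by-example helper, in the spirit of a document store's find()."""
    return [doc for doc in coll
            if all(doc.get(k) == v for k, v in criteria.items())]

print(find(collection, name="Bob")[0]["age"])  # 41
```

In an RDBMS, adding the `tags` and `age` fields for Bob alone would require schema changes; in a document store the new fields simply appear on that document.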
Q9. You didn't mention Neo4j in your graph databases you recommend. Any
particular reason Neo4j wasn't included?
No, there is no particular reason. What's important here is to understand the
difference between these technologies and where each one fits. If you have an
OLAP and data analytics scenario, the Hadoop-based Pregel and Giraph will be a
better fit. If you have an OLTP setup where you want to store and query
connected data for online transaction processing, Neo4j comes into the picture.
An excellent read on this topic:
http://jim.webber.name/2011/08/24/66f1fb4b-83c3-4f52-af40-ee6382ad2155.aspx
Q10.What is the limiting factor in analyzing all data in a real-time basis? Is it
processing power, storage systems, DB systems or something else?
There are challenges in each of the areas you raised, like storing and
processing. When you process the data, it usually has to be loaded into main
memory, which is still expensive, and the machines have to be powerful enough to
get you the results fast. Hence, both processing power and the storage system
are the main bottlenecks.
Also, there is a paradigm shift in the way programming is done. In order to
process the data efficiently, we need to come up with parallel algorithms that
can work on this data and utilize the processing power of the machines.
So to summarize, the three limiting factors I see are memory, processing power,
and the right set of algorithms.
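The parallel-algorithms point can be sketched as a tiny map/reduce-style computation in Python. This runs sequentially and is purely illustrative of the split/compute/combine structure, not of Hadoop's actual API.

```python
# Map/reduce structure in miniature: split the data, compute partial
# results independently (the part that can run on many machines), combine.

def chunked(data, n):
    """Split data into n roughly equal chunks."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def partial_sum_sq(chunk):
    """Map step: each worker computes its local sum of squares."""
    return sum(x * x for x in chunk)

def total_sum_sq(data, workers=4):
    # In a real deployment each call below would run on a separate node;
    # here they run one after another to keep the sketch self-contained.
    partials = [partial_sum_sq(c) for c in chunked(data, workers)]
    return sum(partials)               # reduce step

print(total_sum_sq(list(range(10))))  # 285
```

The algorithmic work is in finding a decomposition like this one, where partial results can be computed independently and then merged; not every computation decomposes so cleanly, which is exactly the "right set of algorithms" challenge.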
Q11. What do you recommend for an in-database but very scalable alternative to
SAS for doing advanced math on large datasets?
Assuming that the reference here is to the SAS language, R scripting can be a
good alternative for working with large datasets, as it has good integration
with Hadoop and can scale well using a MapReduce programming interface over R
scripts. Revolution Analytics offers a commercial product for R over Hadoop.
There are other, non-Hadoop options as well, such as Greenplum or Aster, which
have support for specialized advanced math libraries.
Also, SAS is now providing integration with Hadoop, which means that you can
reuse some of your SAS programming investments and use Hadoop as the underlying
scalable processing engine for some of the analytical execution.
Q12. Are there any NewSQL platforms that have mastered the functionality around
workload management? For instance, without workload management, high-resource,
intense transactions can get in the way of traditional reporting needs... in
other words, is there a NewSQL environment that can be used for traditional and
advanced analysis on the same platform?
NewSQL platforms are certainly evolving every day as we speak, with many more
being built in stealth mode. We are not aware of any advanced workload
management functionality being provided with any NewSQL platform for now, but
that may change any day.
However, most NewSQL platforms have been designed to work efficiently with
either an OLTP environment or an OLAP environment, not both on the same
platform.
Q14. Is MongoDB a better solution for any of the scenarios discussed?
MongoDB can be a good option in some use cases of the OLTP or transactional
systems we discussed.
Q15. Do you have recommendations for an indexing solution?
Depending on the data size, you can go for Solr or ElasticSearch as options for
indexing. There are commercial solutions as well, but Solr, with its new
scalable version SolrCloud, can compete with any commercial solution.
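For readers curious what an indexing solution does under the hood, here is a toy inverted index in plain Python (not the Solr API): each term maps to the set of documents containing it, so a lookup avoids scanning every document.

```python
# Minimal inverted index: term -> set of document ids.

from collections import defaultdict

def build_index(docs):
    """docs: dict of doc_id -> text. Returns term -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {1: "big data beyond hadoop", 2: "scalable search with solr"}
index = build_index(docs)
print(sorted(index["big"]))  # [1]
```

Solr and ElasticSearch build on the same idea (via Lucene), adding analysis, ranking, faceting, and, in SolrCloud's case, distribution of the index across a cluster.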
Write to us at bigdata@impetus.com for more information