Presentation: “Big Data and MicroStrategy: Building a Bridge for the Elephant”
Intelligent engineering of an agile business requires the ability to connect the vast array of requirements, technologies and data that build up over time, while avoiding the pitfalls commonly encountered on the road to giving users comprehensive, yet nimble business analytics with MicroStrategy.
The Google generation, armed with iPads and Droid phones, brings big, bold ideas on how “Big Data” will solve the new wave of business problems; traditional users know that addressing those problems requires more than embracing buzzwords like “sentiment”, “R” and “Hadoop.” Overall success requires building a bridge between the stable, proven, mature BI solutions in place today and the disruptive new world. Enabling deeper analytics, predictive modeling and social media analysis in combination with scalable self-service dashboards, reporting and analytics is no longer an idea but a MUST DO.
This informative presentation describes these business challenges and how an organization leveraged the Kognitio Analytical Platform under MicroStrategy to build such a bridge.
30. Hadoop Performance Reality
• Hadoop is batch oriented
• HDFS access is fast but crude
• MapReduce is powerful but has overheads
– ~30 second base response time
– Too much latency in stack and processing model
– Trade-off in optimization and latency
• MapReduce is complex
– Typically multiple Java routines
https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920
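To make the processing model concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases in plain Python. It is an illustration only, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are invented for this example. On a real cluster each phase runs as separately scheduled tasks, which is one source of the fixed overhead listed above.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Chain the three phases in a single process (no cluster involved)."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["big data", "Big bridge"]))
```

Even this toy job needs three cooperating routines, which hints at why a SQL layer is attractive for everyday queries.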
31. SQL to the Rescue
• So MapReduce is complicated
– use Hive (SQL) as the easy way out
[Diagram: Hadoop stack – Pig, Hive, ZooKeeper / Ambari, HBase, MapReduce, Oozie, HCatalog, HDFS]
32. Hive
• Simplifies access
“Hive is great, but Hadoop’s execution engine makes even the smallest queries take minutes!”
• Only basic SQL support
• Concurrency needs careful system admin
• It’s not a silver bullet for interactive BI usage
33. Conclusion
Hadoop is just too slow for interactive BI!
“while hadoop shines as a processing platform, it is painfully slow as a query tool”
…loss of train-of-thought
34. Hive is based on Hadoop which is a batch processing system. Accordingly,
this system does not and cannot promise low latencies on queries. The
paradigm here is strictly of submitting jobs and being notified when the jobs
are completed as opposed to real time queries. As a result it should not be
compared with systems like Oracle where analysis is done on a
significantly smaller amount of data but the analysis proceeds much more
iteratively with the response times between iterations being less than a few
minutes. For Hive queries response times for even the smallest jobs
can be of the order of 5-10 minutes and for larger jobs this may even
run into hours.
I remain skeptical on the practical performance of the Hive query approach
and have yet to talk to any beta customers. A more practical approach is
loading some of the Hadoop data into the in-memory cube with the new
Hadoop connector.
39. Alternative - In-memory Processing
Analytics requires CPU,
Cores do the work!
RAM keeps the data close
Scale with the data
40. Goals: Minimise Disruption, Cut Latency
• Don’t change the existing BI and analytics
• Support more creative and dynamic BI
• Don’t introduce yet more slow disk
– Help the DW investment
• No complex ETL, just pull data as required
• Pull data simply and intelligently from Hadoop
• Simplify – fewer cubes and caches
• Improve sharing of data
• Increase concurrency and throughput
– It’s all about queries per hour!
• Minimal DBA requirement
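The “queries per hour” goal can be put into rough numbers. The sketch below assumes each user session issues queries back to back, so throughput is bounded by sessions × 3600 / latency. The 30-second figure is the MapReduce base response time quoted earlier in the deck; the 2-second figure and the session count are illustrative assumptions, not measurements.

```python
def queries_per_hour(concurrent_sessions, seconds_per_query):
    """Upper bound on throughput when every session runs queries back to back."""
    if seconds_per_query <= 0:
        raise ValueError("seconds_per_query must be positive")
    return concurrent_sessions * 3600 / seconds_per_query


if __name__ == "__main__":
    sessions = 10  # assumed number of concurrent BI users
    print("30 s per query:", queries_per_hour(sessions, 30))
    print(" 2 s per query:", queries_per_hour(sessions, 2))
```

Under these assumptions, cutting latency raises the throughput ceiling in direct proportion, without adding users or hardware.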
42. Kognitio Hadoop Connectors
HDFS Connector
• Connector defines access to the HDFS file system
• External table accesses row-based data in HDFS
• Dynamic access or “pin” data into memory
• Selected HDFS file(s) loaded into memory
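The difference between dynamic access and “pinning” can be sketched in a few lines of Python. This is a conceptual illustration only and does not use or represent the Kognitio API: the `ExternalTable` class, its methods and the CSV source are all hypothetical, and an in-memory string stands in for a file in HDFS.

```python
import csv
import io


class ExternalTable:
    """Hypothetical stand-in for an external table over row-based files."""

    def __init__(self, open_source):
        # open_source: callable returning a fresh text stream of CSV rows
        self._open_source = open_source
        self._pinned = None      # in-memory copy, populated by pin()
        self.source_reads = 0    # how many times the source was scanned

    def _read_source(self):
        self.source_reads += 1
        with self._open_source() as stream:
            return list(csv.DictReader(stream))

    def pin(self):
        """Load the data once and keep it in memory."""
        self._pinned = self._read_source()

    def rows(self):
        """Serve rows from memory if pinned, otherwise re-read the source."""
        if self._pinned is not None:
            return self._pinned
        return self._read_source()


if __name__ == "__main__":
    data = "region,sales\nEMEA,10\nAPAC,7\n"
    table = ExternalTable(lambda: io.StringIO(data))
    table.rows()   # dynamic access: scans the source
    table.pin()    # one more scan, then held in memory
    table.rows()   # served from memory, no further scan
    print("source scans:", table.source_reads)
```

Dynamic access always sees the current source data at the cost of a scan per query, while pinning trades freshness for repeated fast access.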
Filter Agent Connector
• Connector uploads agent to Hadoop nodes
• Query passes selections and relevant predicates to agent
• Data filtering and projection take place locally on each Hadoop node
• Only data of interest is loaded into memory via parallel load streams
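The filter-agent idea (push the predicate and the column list to where the data lives, then load only what survives) can be sketched as follows. This is a single-machine illustration under stated assumptions, not the actual connector: `filter_and_project` and `load_from_nodes` are hypothetical names, each inner list stands in for the rows held on one Hadoop node, and a thread pool stands in for the parallel load streams.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_and_project(rows, predicate, columns):
    """Runs "on the node": keep matching rows and only the requested columns."""
    return [{c: row[c] for c in columns} for row in rows if predicate(row)]


def load_from_nodes(node_data, predicate, columns):
    """Apply the filter to every node's rows in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(node_data))) as pool:
        parts = pool.map(
            lambda rows: filter_and_project(rows, predicate, columns),
            node_data,
        )
        # pool.map yields results in node order; flatten them into one list
        return [row for part in parts for row in part]


if __name__ == "__main__":
    nodes = [
        [{"region": "EMEA", "sales": 10, "note": "x"},
         {"region": "APAC", "sales": 7, "note": "y"}],
        [{"region": "EMEA", "sales": 3, "note": "z"}],
    ]
    loaded = load_from_nodes(
        nodes, lambda r: r["region"] == "EMEA", ["region", "sales"]
    )
    print(loaded)
```

Because filtering and projection happen before the merge, the rows and columns a query does not need never reach the in-memory layer.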
43. BI – Central Governance
Centrally defined data models
Persist data in natural store
Fetch when needed, agile
Available to all tools
Analytical power