Here are the slides from a presentation delivered by Big Data Partnership (@BigDataExperts). This masterclass on Big Data Concepts is an hour-long version of the one-day course run by Big Data Partnership (http://www.bigdatapartnership.com/wp-content/uploads/2013/11/BigData-Concepts.pdf).
"This one-day masterclass is an executive briefing on Big Data designed for senior management and business leaders to learn about Big Data concepts and familiarise themselves with the business and technology trends and opportunities.
Includes extensive guidance in applying the right economic, technological and business criteria to the evaluation of Big
Data adoption in your organisation and how it can help you meet business goals, dispelling the myths around Big Data,
and find out what it is and is not, how to get the biggest benefit for your organisation and guidance on the best of breed approach to initiate a Big Data programme."
If you have any questions or would like to learn more about Big Data (including the consultancy, training and support we offer), please get in touch at contact [at] bigdatapartnership dot com.
Imagine you have a legacy database or data warehouse (DW)
=> Data volumes growing rapidly
=> Running out of space
=> We scale up a little, cost goes up a little – nothing too serious
=> Scale up a little more, cost/GB starts to look a little concerning
=> Scale up again, suddenly we hit a discontinuity
=> cost/GB is prohibitive
=> Or performance drops dramatically
=> Or maybe you want to go to the TB-to-PB range and it’s not even possible
=> Problem is, as data increases we have to SCALE UP because of the architecture choices
=> Cost per GB also increases, so it becomes disproportionately more expensive to grow
=> Interestingly, this also broadly holds for the complexity of integrating more business data sources (Variety, the BI problem), not just adding more Volume (the DW problem)
=> Because again the architecture dictates increasingly higher costs as we add more Variety
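Here is a minimal sketch of that scale-up cost curve; the capacities and prices are invented purely for illustration (they are not vendor figures or numbers from the slides):

# Toy model of scaling UP a single machine: each bigger box costs more per GB,
# and past some point there is simply no bigger box to buy.
# All figures below are made up for illustration.

scale_up_tiers = [
    (1_000,     5_000),   # modest server
    (5_000,    40_000),   # bigger box, cost/GB already creeping up
    (20_000,  400_000),   # high-end hardware, cost/GB jumps sharply
    (100_000,    None),   # the discontinuity: no single machine gets you here
]

for capacity_gb, total_cost in scale_up_tiers:
    if total_cost is None:
        print(f"{capacity_gb:>8,} GB: not reachable by scaling up one machine")
    else:
        print(f"{capacity_gb:>8,} GB: ${total_cost:>8,} total -> ${total_cost / capacity_gb:.2f}/GB")

The point is not the specific numbers but the shape of the curve: cost/GB rises with every tier until the next tier simply isn't there.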
The net result of storing large schema-driven databases is that the individual machines they’re housed on must be:
* high quality
* redundant
* highly available
This translates into:
* high costs for servers, infrastructure and support
Clusters of commodity hardware = a distributed system (scale out rather than up)
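A back-of-the-envelope sketch of why that works: replicate each block of data across several cheap nodes, and the block is unavailable only if every replica is down at once. The 5% per-node downtime and the independence of failures are illustrative assumptions, not figures from the slides.

# Replication on commodity nodes: a block of data is unavailable only when
# all of its replicas are down at the same time.
# Assumes independent failures and a made-up 5% per-node unavailability.

node_unavailability = 0.05

for replicas in (1, 2, 3):
    block_unavailability = node_unavailability ** replicas
    print(f"replication factor {replicas}: block unavailable "
          f"{block_unavailability:.4%} of the time")

# replication factor 1: block unavailable 5.0000% of the time
# replication factor 2: block unavailable 0.2500% of the time
# replication factor 3: block unavailable 0.0125% of the time

This is the trade the distributed-system architecture makes: redundancy in software across many cheap boxes, instead of redundancy in hardware inside one expensive box.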
Netflix challenge – the winning team used a very rudimentary algorithm but won because it appended data about the movies from outside the original data set (IMDb).
Google – showed PageRank could outperform the keyword extraction used by other search engines, by leveraging data from outside the page itself (links to the page, treated as votes cast by other page creators).
Facebook – used detailed data about friendships (the real-world social-network topology) to beat other media companies.