Big Data is going to explode: from 5 exabytes in 2010-11 to 50 zettabytes in 2020. What will enable this? Which data sources will contribute to it? What problems do we need to solve to make it possible?
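To put the headline numbers in perspective, here is a rough sketch of the growth they imply. It assumes decimal units (1 zettabyte = 1000 exabytes) and a 10-year window; neither assumption is stated in the slides.

```python
# Back-of-the-envelope check of the growth claimed in the opening line.
# Assumptions (not in the slides): decimal units and a 10-year span.
EXABYTE = 10**18
ZETTABYTE = 10**21

start_bytes = 5 * EXABYTE     # "5 exabytes in 2010-11"
end_bytes = 50 * ZETTABYTE    # "50 zettabytes in 2020"
years = 10                    # assumed length of the period

# Overall multiplier, then the compound annual rate that produces it.
growth_factor = end_bytes // start_bytes
annual_growth = growth_factor ** (1 / years) - 1

print(f"overall growth: {growth_factor:,}x")
print(f"implied compound annual growth: {annual_growth:.0%}")
```

Under these assumptions the data volume would have to grow roughly 2.5-fold every year for a decade.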
9. software defined network
google fiber
innovations related to:-
Switches
Routers
Packet size
Compression
10. • cheap storage - a forcing function
• storage companies provide free storage
• in return, they have access to user data
• raw data is turned into boutique data
• sold at a premium to interested companies and advertisers
11. • Innovations on rack space
• cheap, bare-metal hardware
• lowers TCO of servers
• operational tasks become easier
• allows companies to offer cloud
12. buttons free
WYSIWYS(tore)
connectivity – most important
and a “given”
tendency to track family
13.
14.
15. 10 TB of Data/Engine/30 minutes
6 hour flight from NY to LA for Twin Engine 737 = 240 TB of Data/flight
28,537 Airliners in US Skies/day
6.5 Exabytes/day (6688 Petabytes/day)
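The arithmetic behind this slide can be reproduced in a few lines. This is a sketch only: it takes the slide's scenario at face value (every airliner flies a 6-hour, twin-engine flight) and assumes binary units (1 PB = 1024 TB), which is the convention under which the slide's 6688 PB figure comes out.

```python
# Reproduces the slide's arithmetic. The scenario (every airliner flying a
# 6-hour, twin-engine flight) is the slide's simplification, not measured data.
TB_PER_ENGINE_PER_HALF_HOUR = 10
ENGINES = 2
FLIGHT_HOURS = 6
AIRLINERS_PER_DAY = 28_537

half_hours = FLIGHT_HOURS * 2
tb_per_flight = TB_PER_ENGINE_PER_HALF_HOUR * ENGINES * half_hours
tb_per_day = tb_per_flight * AIRLINERS_PER_DAY

# Assumption: binary units (1 PB = 1024 TB, 1 EB = 1024 PB).
pb_per_day = tb_per_day / 1024
eb_per_day = pb_per_day / 1024

print(f"{tb_per_flight} TB per flight")
print(f"{pb_per_day:,.0f} PB per day, about {eb_per_day:.1f} EB per day")
```

With decimal units the daily total would be about 6,849 PB instead, so the choice of unit convention shifts the headline figure by a few percent.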
16. “… within the next five years, sensor data will hit the crossover point with unstructured data generated by social media. From there, the sensor data will dominate by factors 10-to-20 times that of social media …”
- Stephen Brobst, CTO, Teradata
17. Pic – coolarcade.org
• ~225 million seventh-generation game consoles sold worldwide by early 2012
• ~700 million Wii games
• 425 million PlayStation 3 games
• 600 million Xbox 360 games
18. GPS data
Innovations in Transportation
Applications
Multiple sources:
• Computers embedded in vehicles
• In-vehicle navigation systems
• Drivers’ cell phones
• Communication networks
• Third-party data like weather
• Traffic
Pic – www.bmwusa.com
19. • roads with sensors
• determine traffic patterns
• sustainable ways to route traffic
• generate data for:-
• law enforcement
• transportation
• insurance companies
• medical agencies
* INTRO – INTelligent ROads – a project of the European Commission
27. 3 V’s*
• volume
• velocity
• variety
* coined by Doug Laney of Gartner Inc
28. 3 I’s
• immediate – do something now!!
• intimidating – what if you don’t?
• ill-defined – what is it, anyway
- Vance Loiselle, CEO, Sumo Logic
29.
30.
31. • near real time
• new data sources
• mobile
• immediately actionable
• big
• agile
• core of business
32. • data scientists lead the “Data Orchestra”
• developers/product mgrs/DBAs/Ops will merge
• Data Techs will emerge
• “behavior”, “intent” and “thought” targeting
• hourly trends will be considered “Jurassic” old
34. • store Exabytes (Petabytes)
• huge compression ratio (80% compression)
• cheap storage (~ 10 cents/GB/month)
• MTTF rate (high failure: ~8%/year observed vs. 0.88%/year implied by vendor MTTF)
• distributed storage
• storage over software defined networking
• read compressed data
• ETL
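Two of the figures on this slide can be sanity-checked with simple arithmetic. The sketch below assumes that "80% compression" means the stored size is 20% of the raw size, that a drive runs 24 hours a day all year, and that the annual failure rate can be approximated as hours-per-year divided by MTTF. The 1,000,000-hour MTTF and the 8% observed rate come from the notes on this slide; the function names and the 1 PB example volume are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, assuming drives run continuously


def annual_failure_rate(mttf_hours: float) -> float:
    """Approximate annual failure rate as hours-per-year / MTTF.

    This simple ratio is only reasonable when MTTF is much longer than a year.
    """
    return HOURS_PER_YEAR / mttf_hours


def monthly_storage_cost(raw_gb: float,
                         compression_saving: float = 0.80,
                         dollars_per_gb_month: float = 0.10) -> float:
    """Monthly cost of storing raw_gb after compression.

    Assumption: an 80% saving means only 20% of the raw bytes are stored.
    """
    stored_gb = raw_gb * (1 - compression_saving)
    return stored_gb * dollars_per_gb_month


vendor_afr = annual_failure_rate(1_000_000)  # vendor MTTF quoted in the notes
observed_afr = 0.08                          # Google's figure quoted in the notes

print(f"vendor-implied AFR: {vendor_afr:.2%}")
print(f"observed AFR is about {observed_afr / vendor_afr:.0f}x higher")

raw_gb = 1_000_000  # hypothetical 1 PB of raw data (decimal units)
print(f"monthly cost for 1 PB raw: ${monthly_storage_cost(raw_gb):,.0f}")
```

The gap between the two failure rates is what motivates distributed storage: at an 8% annual rate, a fleet of 1,000 drives should expect around 80 replacements a year.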
35. • servers and storage merge?
• special CPUs to handle compression?
• encryption?
• better cpu
• bus speed
36. • understand data
• analytical skills
• discover new ways of looking at data
• new containers for data warehouses, including data warehouses in the cloud
• backup and recovery (should not be an issue)
Editor's notes
User generated content – this term was coined sometime in 2005. It is also called conversational media, as opposed to Packaged Goods Media, and also goes by the name of Performance Media. This is the kind of media that has been labeled, somewhat hastily and often derisively, as “User Generated Content,” “Social Media,” or “Consumer Content.” UGC has its fair share of legal and copyright issues but UGC
Google Glasses type devices
http://fiber.google.com/about/
Non-connected will be unheard of
A jet generates 10 TB of data per engine every 30 minutes. A 6-hour flight from NY to LA on a twin-engine 737 produces 240 TB of data per flight. With 28,537 airliners in US skies per day, that comes to 6.5 exabytes (6688 petabytes) per day.
Brobst says that within the next five years … sensor data will hit the crossover point with unstructured data generated by social media. From there, the sensor data will dominate by factors 10-to-20 times that of social media. However, using this data will be difficult for the time being, as there are no standards to ensure the data’s readability beyond those possessing the right software or algorithm. There’s also a question of who owns the data.
Meanwhile, approximately 225 million seventh-generation game consoles (referring to recent units on the market like the Sony PlayStation 3) had been sold worldwide by early 2012, along with about 700 million Wii games, 425 million PlayStation 3 games and 600 million Xbox 360 games. In fact, the global games industry, including hardware and software, had reached the $63 billion per year range. While the global recession of 2008-09 was hard on the games industry, new games and enhanced console technology have put life back into the business.

Apps, including those for magazines, information services such as health site WebMD, games, newspapers, catalogs and ebook readers, to name but a few of the tens of thousands of uses, didn’t really exist before the introduction of the iconic iPhone smartphone a few years ago. For 2011, Gartner estimated global app store revenues at $15.1 billion. That was only an early stage in this soaring business sector. For example, the Apple iTunes App Store launched in July 2008 with only about 500 apps available. By early 2012, Apple had more than 500,000 apps for sale in the iTunes App Store. Analysts at Gartner estimated that 4.5 billion apps were downloaded in 2010 and 17.7 billion in 2011. Gartner predicted volume to grow to 185 billion downloads by 2014 that will produce $58 billion in revenue.

By mid-2011, figures for apps for Apple products alone indicated there were at least 85,000 app creators worldwide. By one estimate, 37% of all apps are free downloads, while the average price for paid apps is $3.64.

Meanwhile, more than 450,000 apps are also available for the Android mobile phone operating system (the world’s leading smartphone platform), as well as thousands more for the Blackberry and other devices. Android is the mobile operating system developed by Google.
On all platforms, the most popular apps include games, such as Angry Birds; tools such as Google Maps; and entertainment and media related apps, such as those for Pandora Internet-based radio and for leading newspapers. At the same time, apps provide tools for business people, travelers, students, hobbyists, wine drinkers, people who like to cook, job seekers, children, sports fans, shoppers, car enthusiasts and myriad other special interest niches.
http://www.plunkettresearch.com/games-apps-social-market-research/industry-and-business-data
Description of work
Three technical strands of research will be conducted:
• Surface safety monitoring: integration and testing of real-time warning systems at network level to achieve a significant decrease in the number of accidents due to ‘surprise effects’ from sudden local changes in weather resulting in low friction and hence skidding. Increasing drivers’ attention to low road friction by only a few percent may result in a significantly higher reduction of accident rates due to its non-linear relationship. Europe’s most advanced driving simulator will be used to optimise driver responses to new types of information.
• Traffic and safety monitoring: combination of different sensor data will enable the estimation of entirely new real-time safety parameters and performance indicators to be used in traffic monitoring and early warning systems.
• Intelligent pavement and intelligent vehicles: innovative use and a combination of new and existing sensor technologies in pavements, bridges and vehicles in order to prevent accidents, enhance traffic flows and significantly extend the lifetimes of existing infrastructure. A prolonged lifetime of high capacity roads could thus be obtained using novel methods for early warning detection of deterioration and damage to road surfaces.
Results
Deliverables:
• Consolidated state of the art focused on the scope of INTRO and focused needs across Europe
• Report on scenarios, structure and potential short-term trends
• Report on implementation strategies
• Model for estimating expectable stopping distances
• Report on the simulator study, including evaluation of impact on safety and drivers
• Data model for road safety-related data
• Report on technical implementation and users’ feedback
• Demonstration of methods for the measurement of condition using probe vehicles
• Report on the assessment of methods to identify pavement conditions using current and novel in situ sensors
• Report on the use of combined probe vehicle and in situ measurements
• Proposals for best practice implementation
• Traffic indicator needs: single source and data fusion estimation models
• Integration of weather effects for traffic indicators forecasting
• Safety indicators needs: simulation-based and field-based models
• Creation of a website
• Report on the launch workshop held in June 2005
• Report: A Vision of Intelligent Roads Final summary report
• Project quality assurance plan
• Project mid-term report
• Project final report
Exploitable product(s) or measure(s):
• guidelines and recommendations for ITS deployment use in future standards
• implemented data model combining static and dynamic skid warnings
• new use of in situ sensors and probe cars
• new methods for data fusion and travel time estimations
Sectors:
• road authorities
• ITS service providers
• traffic management
Car windscreens, train and bus windows, Google glasses; Pranav Mistry and Pattie Maes TED talk demo: http://www.ted.com/talks/pattie_maes_demos_the_sixth_sense.html
Volume: Data Volume is the primary attribute of “Big Data.” Volume is often quantified in terms of terabytes of data. Anything between 3 and 10 terabytes of data falls within the realm of “Big Data”. In addition, data volume can also be quantified by counting records, transactions, tables, and files. A large number of records, transactions, tables, or files can be categorized as “Big Data.” Volume of data is one of the defining characteristics of “Big Data;” however, data velocity and data variety (highlighted below) constitute the other key characteristics/ingredients of “Big Data.”

Velocity: Speed or Velocity of data is another defining characteristic of “Big Data.” Data Velocity encompasses the frequency of data generation and the frequency of data delivery. In today’s hyper-connected and networked society, there is a continuous stream of information coming from a range of devices, from sensors and robotic manufacturing machines to video cameras and mobile gadgets. This ever-increasing amount of data relentlessly flying from devices in real time is causing data volumes to grow, and to do so in a hurry.

Variety: One thing that makes “Big Data” really big is that it’s coming from a greater variety of sources than ever before. Data from Web sources (e.g., Web logs, clickstreams) and social media is remarkably diverse. RFID data from supply chain applications, text data from call center applications, semi-structured data from various business-to-business processes, and geospatial data in logistics make up an eclectic mix of data types that makes variety/diversity an important attribute characterizing “Big Data.”
The 3 I's Of Big Data

Big Data is:
• Immediate – in the sense that you need to do something about it now
• Intimidating – what if you don’t?
• Ill-defined – what is it, anyway?

This is what Vance Loiselle, CEO of analytics company Sumo Logic, recently told me. With a nod to the well-known 3 V’s of Big Data (volume, velocity, and variety), I have coined these the 3 I’s of Big Data.

The definition of Big Data may still be up for debate. But with overall corporate data nearly doubling year over year, the number of Facebook users exceeding 900M, and Twitter tweets blowing through 400M per day, two things about Big Data are certain. As Loiselle put it, “Big Data is not going away and it’s only going to get bigger.”

So let’s explore the 3 I’s of Big Data. As always, I welcome your comments here and at dave@vcdave.com.

1. Ill-defined: What is Big Data?

Gartner analyst Doug Laney has characterized Big Data as “data that’s an order of magnitude greater than data you’re accustomed to.”

Ed Dumbill, program chair for the O’Reilly Strata Conference, describes Big Data as, “data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.”

Another way to view Big Data is that it’s a transformative set of technological advances that have made analyzing data vastly more efficient.

Consumer facing companies like Google and Facebook have driven many of the recent advances in Big Data efficiency. Facebook has some 900 million users and is still growing, while some estimates put the number of search queries Google handles at 3 billion per day. Twitter handles some 400 million tweets per day.

In an ironic twist, highlighted by cloud cost management vendor Cloudyn, increased efficiency doesn’t drive down usage. It increases it.

Known as Jevons Paradox, it’s named for the economist who made the observation about the Industrial Revolution. Similarly, as technological advances make storing and analyzing data more efficient, companies are doing a lot more analysis, not less. This, in a nutshell, is Big Data.

2. Intimidating: How do you make Big Data approachable?

There are lots of challenges in leveraging Big Data, from managing the data to having the right tools to get you the insights that matter.

Fortunately, Big Data Apps are springing up all over the place to make Big Data a lot easier to take advantage of.

Companies like Splunk and Sumo Logic are Big Data Apps for machine data. Marketing relevance company BloomReach is another such example. The company processes more than 100 million web pages, generating 94% average annual incremental traffic as a result.

3. Immediate: What’s actionable about big data?

Technological improvements that increased the efficiency of coal use led to increased consumption of coal in a wide range of industries, fueling the Industrial Revolution. In much the same way, technological advances that are increasing the efficiency of analyzing and storing data are driving a Big Data Revolution:
• A lot more data is being generated. While humans generate a seemingly large amount of data in the form of photos and emails, that data production is limited by the number of people. That amount of data is dwarfed by “sensor” data generated by machines: data from computers and network devices, from airplanes, from cell phones, and from connected GPS devices, for example. And high bandwidth wireless networks are now in place to transport that data back to data centers for storage and analysis.
• Technologies created by companies serving an unprecedented number of consumers have driven efficiencies in how data is stored and analyzed. You now have the ability to store and analyze vastly more data than you could in the past.
• You can set up your own computer resources to store and analyze data, but the availability of scalable cloud computing resources like Amazon Web Services means you can access the resources necessary to do large scale data analysis quickly and easily.

The next step in making big data actionable is to make Big Data truly immediate by reducing the time between when data is collected and when you get insights from that data. As J. Andrew Rogers, founder and CTO of Space Curve put it, “the analytic value of data decays rapidly.” That means being able to analyze your data as fast as possible is critical to gaining competitive advantage.

• Educate. This phase focuses on knowledge gathering and market observations.
• Explore. After completing the education phase, companies will develop a strategy and roadmap based on business needs and challenges.
• Engage. During the third phase, a business will pilot big data initiatives to validate value and requirements.
• Execute. Companies in the fourth phase have deployed two or more big data initiatives and are continuing to apply advanced analytics.
• Store Exabytes (Petabytes)
• Huge compression ratio (80% compression)
• Cheap storage (~10 cents/GB/month)
• MTTF rate (vendor-implied annual failure rate 0.88%; observed 8%)
• Distributed storage
• Storage over software defined networking

Recent independent studies from Google and Carnegie Mellon University have concluded that disk drive failure rates are considerably higher than the rates reported by disk drive manufacturers. But, it turns out, many users may not care.

At a Usenix conference in San Jose, CA, this past February, Google released its study, which found an 8% annual failure rate for drives in service for two years. That's one out of every 12 drives.

Manufacturers claim the mean time to failure (MTTF) of Fibre Channel (FC) and SATA drives ranges between 1,000,000 and 1,500,000 hours, suggesting a normal annual failure rate of 0.88%.

"Typically, this problem does not hit home for me because vendor support contracts offset the cost associated with the drive replacements," says Earl Hartsell, senior IT analyst at Solvay Pharmaceuticals, Marietta, GA. "It would take a relatively large increase in support costs for this problem to become a pain point."

Similarly, Mark Holt, information technology specialist at Media General in Richmond, VA, says failure rates help manufacturers control support costs, but don't mean much to users. "We have very little interest in that magic number," says Holt. "The complexity of systems means a failure generally isn't worth chasing down; we only want to know if the vendor or supplier is going to be there quickly when we do lose a drive, for whatever reason."

Carnegie Mellon's study of approximately 100,000 consumer and enterprise drives