Fred Oh's presentation for SNW Spring, Monday 4/2/12, 1:00–1:45PM
Unstructured data growth is explosive and shows no signs of slowing down. Costs continue to rise along with new regulations mandating longer data retention. Moreover, disparate silos, multivendor storage assets and less-than-optimal use of existing assets have all contributed to 'accidental architectures.' And while these can be key drivers for organizations to explore incremental, innovative solutions to their data challenges, they may provide only short-term gain. Join us for this session as we outline the business benefits of a truly unified, integrated platform to manage all block, file and object data that allows enterprises to make the most of their storage resources. We explore the benefits of an integrated approach to multiprotocol file sharing, intelligent file tiering, federated search and active archiving; how to simplify and reduce the need for backup without the risk of losing availability; and the economic benefits of an integrated architecture approach that lowers total cost of storage ownership (TCSO) by 35% or more.
2. BIG DATA IS NOT JUST ABOUT SIZE
[Diagram: the Four Vs of big data – Volume, Velocity, Variability and Value – mapped against structured, semi-structured and unstructured data; examples include OLTP, email, documents, sensors, satellite images, bioinformatics, M2M and web logs, social, video and audio. Caption: data-intensive processing increases.]
3. BIG OPPORTUNITY – ACROSS INDUSTRIES
Big data impact:
• Telco: $100B opportunity
• Healthcare: $300B opportunity
• Retail: +60% margin (US)
• Manufacturing: +50% production $
• Public administration: €100B opportunity (EU)
Big data examples:
• Science and data science: decoding a genome with 3 billion data pairs can now be done in < 1 hour
• Media and entertainment: video surveillance at airports with facial recognition analysis and real-time reporting to security
• Oil and gas: projects usually require coordinating hundreds of firms with up to 10PB of data to analyze oil locations
4. BIG DATA MARKET MATURITY IS JUST BEGINNING
• 85% of Fortune 500 companies are unable to exploit big data for competitive advantage
• 90% of business leaders say information is a strategic asset, but <10% can quantify its economic value
Prepare now with data quality and event-driven architectures, laying the foundational infrastructure for Big Data later.
5. ONE PLATFORM FOR THE JOURNEY OF BIG DATA AND BIG CONTENT
[Diagram: platform lifecycle – ingest and store data; search across (new HDDS); search and analyze; bring analytics to the data (partners); repurpose and recombine.]
6. HITACHI NAS PLATFORM, POWERED BY BLUEARC® (HNAS) – BIG DATA IN OIL AND GAS
Workflow: data acquisition → data management → seismic processing → visual interpretation → modeling automation → petrophysical analysis → property modeling → simulation. Data workflows and management are increasingly complex.
HNAS provides high-performance scale:
‒ Tremendous need for high-performance storage
‒ High data volume with storage requirements from 200TB to tens of PB
‒ High-frequency data streams – e.g., 10MB/sec times the number of boats
7. BIG DATA, BIG CONTENT
Provide bottomless storage:
‒ 80 nodes and 32 billion objects
‒ 1,000 tenants per system
‒ 70K namespaces in many-to-one systems
Replicate: reduce tape backup.
Distribute content: write once, read everywhere.
8. SEARCH FOR BIG DATA (COMING IN 2012) – NEW HITACHI DATA DISCOVERY SUITE
Super-scale search across regions: a big data search index architecture built with Solr + Hadoop + Hitachi, with parallel indexing in each region (regions 1–3).
Geospatial and wide area search for FCS portfolio.
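The per-region parallel indexing on this slide can be sketched in miniature. This is not the HDDS implementation (which the slide says is built on Solr + Hadoop); it is a toy Python illustration of building one inverted index per regional shard in parallel, then federating a query across regions. All shard contents and region names are made up.

```python
from concurrent.futures import ProcessPoolExecutor
from collections import defaultdict

def build_index(docs):
    """Build an inverted index (term -> set of doc ids) for one region's shard."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)

def federated_search(term, region_indexes):
    """Query every region's index and merge the hits."""
    hits = set()
    for region, index in region_indexes.items():
        for doc_id in index.get(term.lower(), ()):
            hits.add((region, doc_id))
    return hits

if __name__ == "__main__":
    # One document shard per region; in HDDS these would be Solr cores
    # fed by Hadoop indexing jobs (the shard contents here are invented).
    shards = {
        "region-1": {"d1": "seismic survey data", "d2": "well logs"},
        "region-2": {"d3": "seismic interpretation"},
        "region-3": {"d4": "production data"},
    }
    # Index each region in parallel, mirroring the slide's parallel-indexing boxes.
    with ProcessPoolExecutor() as pool:
        indexes = dict(zip(shards, pool.map(build_index, shards.values())))
    print(sorted(federated_search("seismic", indexes)))
    # → [('region-1', 'd1'), ('region-2', 'd3')]
```

The design point is that each region indexes only its own data, so indexing scales out with regions, and only the (small) query and hit lists cross the wide-area network.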
9. BIG DATA ANALYTICS – REAL TIME: HITACHI CONVERGED PLATFORM FOR SAP HANA™
In-memory computing for real-time analytics:
‒ Calculate first, then move results
‒ High-performance apps delegate data-intensive operations to the data layer
‒ Processing massive quantities of real-time data to provide immediate results
Converged Platform provides:
‒ On-demand, nondisruptive scalability
‒ Highest-performing appliance for SAP HANA
MASSIVE SCALE-OUT COMING IN 2012!
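The "calculate first, then move results" principle can be shown with a small sketch. SQLite stands in for the data layer purely for illustration – HANA's in-memory column store works very differently – and the table and numbers are invented; the point is only that delegating the aggregation to the data layer moves one row per group instead of every raw row.

```python
import sqlite3

# Toy "data layer" (the figures are made up for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("east", 50.0), ("west", 75.0)])

# Anti-pattern: move all rows to the application tier, then calculate.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
totals_app = {}
for region, amount in rows:
    totals_app[region] = totals_app.get(region, 0.0) + amount

# "Calculate first, then move results": delegate the data-intensive
# aggregation to the data layer; only one small row per region moves.
totals_db = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

assert totals_app == totals_db  # same answer, far less data transferred
print(totals_db)
```

At a few rows the difference is invisible; at the "massive quantities of real-time data" the slide describes, shipping raw rows to the application tier is what makes results late.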
10. WINNING STRATEGY – TCO VS. RACK-BY-RACK
Competitor deployment is rack-by-rack at lowest possible pricing: 10 NAS nodes with 720TB per rack!
11. WINNING STRATEGY – TCO VS. RACK-BY-RACK
Hitachi Data Systems deployment is TCO-based: 2 HNAS 3090 nodes + 672TB for the 1st rack.
12. GOING FORWARD
Journey: Information Lifecycle Management → Managed Storage → Genomics Information Cloud Solution. National Genomics Database (based on HCP and HDDS), 4PB per year.
[Diagram: fluid content, dynamic infrastructure, sophisticated insight.]
Answer why it's called big data – explain the misnomer. Emphasize information extraction and why analytics is so important. Possible analogy: data warehouse → NAS → distributed dataset.

OLD NOTES
The analysts are all hard at work defining Big Data in their own unique ways, but they pretty much agree on three key characteristics. Along with big volumes of data, we have velocity, which refers to the speed at which the data is streaming in as well as the time sensitivity of delivering the analysis and reacting, and variability, which refers to the data format – typically separated into structured (fits the relational database model), unstructured, and semi-structured (has structure but doesn't fit the relational model). Most would argue that it is a combination of these factors that defines Big Data, or that Big Data analytics refers to problems that we can't solve with traditional data warehouse/analytics technologies.

The chart illustrates the evolution of data available for analytics as three waves: OLTP/traditional data warehousing, human-generated unstructured data (the wave driven by social media), and machine-generated data, which will really take hold with the Internet of Things.

Though the traditional data warehouse has been around for about 30 years, it really took off in the 1990s. Companies needed a way to gain cross-business insight from all the disparate database applications they had rolled out, e.g., ERP, supply chain management, order entry. They did that by loading data from the operational systems into relational data warehouses. In the early days the cost of data warehousing was very high – millions of dollars for mere TBs – so the earliest adopters were big, transaction-heavy businesses with deep pockets, like banks and retailers. The combination of lower technology costs and increased storage and compute capacity spawned usage by companies of all sizes. Data volumes were driven higher by Internet applications, eCommerce and the focus on CRM in the 2000s.

Today the largest data warehouses are in the low PBs, but the average size is still closer to the tens to hundreds of TBs for most businesses – sizeable, but not when compared to the next waves. The data is all captured from and stored in relational databases, so it is highly structured, and though there are real-time applications, data is predominantly loaded as nightly and weekly batch jobs.

The second wave, human-generated unstructured data, started around three years ago but ramped this past year. Social media content, including blogs and Twitter feeds, is a big component here, along with web logs that track the trail of human activity on the Internet. Many of these web log files used to be thrown away, but with the reduced cost of storage and compute power, companies are now starting to glean valuable insight – we'll look at examples in a few slides. Clearly the volumes are huge here (remember, Google generates 20PB daily), the data streams in at a fast rate, and the data does not have the nice predictable structure that we had with the OLTP data.

The final wave is machine data. This will be the biggest wave of all; some estimate that though we are just dipping a toe into analyzing this kind of data, it will overtake social media data in volume within 5 years and quickly surpass it 10- to 20-fold. As we saw from the Boeing example, the data streams will be constant, and the ability not just to gather insight but to analyze and react in real time will be critical for many applications.
*Source: McKinsey Global Institute, 2011 – global projections – Healthcare, Telco, Retail, Manufacturing, Public Admin. (Same source:) "By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions."

Notes:
• Science and data science – 190,000: shortage of data scientists in the U.S. by 2018
• Media – 400B videos viewed online in 2010 (U.S.)
• Oil and gas – 2011: $5 billion in IT spend and $1 billion on storage
Oil & gas – from Bjorn. Videos watched – http://bigdata-io.org/digital-entertainment-52b-in-2014

NOTES – WIP
Should note that search is important to all of these.
Video surveillance at airports in support of national defense:
• Video cameras at all airports (Hitachi Kokusai Ltd)
• Facial recognition SW to identify 'people of interest' (Hitachi Ltd)
• Real-time reporting to security forces before they leave the airport
• 9,420 tweets per second
• Analyze content for 'favorable' characteristics
• Send 'buy now' app to smartphone: 15% off coupon, free shipping, have it before the next game
Gartner:
■ Most organizations will be unable to exploit new analytic capabilities due to poor data quality and latency.
■ Data quality assurance is becoming a high priority, but traditional approaches fail due to increased information volume, velocity, variety and complexity.
■ The desire to increase reliability, consistency, control and agility in information infrastructure is driving organizations to rationalize overlapping tools and technologies, replace custom code, remove data silos and add richer metadata and modeling.
■ Few organizations evaluate the economic potential of information assets with the discipline they demonstrate in managing, deploying and accounting for traditional physical and financial assets.
■ Event data, proliferating rapidly, can be used to improve situation awareness and enable sense-and-respond "smart" systems with rigorous information governance.
Recommendations:
■ Adapt data quality measurement methods to samples, as it will not be possible to measure all. Map expectations to specific uses and expose "confidence factors" to provide business context.
■ Select straightforward approaches to estimate the relative value of information sources using quality, completeness, consistency, integrity, scarcity, timeliness and business problem relevance, for example.
■ Determine a framework and methods (cost, income or market-based) with your CFO to quantify information asset financial value. Consider a supplemental balance sheet to communicate it.
■ Use Gartner's Information Capabilities Framework to identify technology in place that addresses common capabilities and gaps where tools are lacking. Plan to fill critical gaps and rationalize tools.
■ Make event-driven architecture and complex event processing first-class citizens in data modeling work and metadata repositories.
Customer questions: "Do you have a scalable platform for big data?" "How do I find across …?" "How do I perform …?" Answered through partnership with industry vertical and application providers, HANA, and Hitachi Consulting. This is where EMC will position Isilon.
Historically, IT has focused on delivering infrastructure for each application. Our infrastructure cloud approach unifies your server, storage and network silos to improve utilization, simplify management and lower costs. Separating applications from underlying storage allows data to be moved freely according to usage, cost and application requirements with minimal impact to applications.

As unstructured data overtakes structured data, our content cloud approach creates a warehouse to store billions of data objects. Intelligence makes it all indexable, searchable, and discoverable across applications and devices, anytime and anywhere. This allows you to cut costs associated with managing, storing and accessing data, and to automate the information lifecycle.

Infrastructure and content form the foundation for the information cloud, which will help you repurpose and extract more value from your data and content. It integrates data across application silos and serves it up to analytics applications that connect data sets, reveal patterns across them, and surface actionable insights to business users. Underneath it all, our single virtualization platform ensures your organization gets seamless access to all resources, data, content and information.
Super-scale search with the new Hitachi Data Discovery Suite (HDDS):
• Exponentially more scalable and faster
• Billions of objects across geographies
• Hadoop architecture for scale-out indexing
• Leverages distributed platforms for big data
• Key big data use case: support of geospatial (latitude/longitude) search
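A minimal sketch of the geospatial (latitude/longitude) use case mentioned above, assuming a simple bounding-box query. The object names and coordinates are hypothetical, and a real HDDS/Solr deployment would use proper spatial indexing rather than this linear scan; the sketch only shows what the query answers.

```python
def in_bbox(lat, lon, south, west, north, east):
    """True if (lat, lon) falls inside the bounding box."""
    return south <= lat <= north and west <= lon <= east

def bbox_search(objects, south, west, north, east):
    """Return ids of objects whose coordinates fall inside the box."""
    return [oid for oid, (lat, lon) in objects.items()
            if in_bbox(lat, lon, south, west, north, east)]

# Hypothetical object metadata: id -> (latitude, longitude).
objects = {
    "survey-1": (29.7, -95.3),
    "survey-2": (60.4, 5.3),
    "survey-3": (30.1, -94.8),
}
# Query: everything in a lat 28..31, lon -96..-93 box.
print(bbox_search(objects, 28.0, -96.0, 31.0, -93.0))
# → ['survey-1', 'survey-3']
```

For wide-area search the same box query would be fanned out to each region's index and the id lists merged, as with the term search sketched earlier.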
Today's applications execute many data-intensive operations in the application layer, but high-performance apps delegate data-intensive operations to in-memory computing.
HDS unique:
• On-demand, non-disruptive scalability – scale seamlessly from HANA "S" to "M" to "L" configurations with Hitachi blades and storage
• Highest-performing appliance for SAP HANA – the Hitachi solution uses 4-way x86 blade servers with Intel 10-core CPUs
• Best investment protection and lower OPEX – support production and test/dev/QA within a single blade chassis