Big Data in Action: Operations, Analytics and more
1. Big Data in Action: Operations, Analytics and more
2. Agenda
• Meet & Greet Introduction.
• Unfolding the term “Big Data”.
– Evolution of data to Big Data: static to stream.
– The 3 V’s of Big Data.
• Overview of implementing Big Data
– Examples of Big Data implementations
– Implementing Big Data with Hadoop infrastructure
– Implementing Big Data with NoSQL databases like Cassandra & MongoDB.
• Advantages of implementing Big Data solutions.
• Open Forum Discussion/ Networking.
3. Vibhu Bhutani
Technical Project Manager
Started as a Java developer, I have many years of experience developing and managing
state-of-the-art applications. With extensive experience across the phases of the SDLC, I
lead the innovations & mobile excellence team at Softweb Solutions. I am involved in
various innovative implementations, including Big Data systems,
IoT implementations and iBeacon development at Softweb Solutions.
in/vibhuis
Welcome
4. Unfolding the Term Big Data
• IBM reported in a study that every day we create roughly 2.5 quintillion bytes of data from various
sources like climate sensors, GPS signals, social media and online transactions, and that 90% of it
was created in the last couple of years. Big Data is the buzzword for technology that shows the
potential to process huge amounts of data so that we can get valuable
information out of it.
• How old is Big Data?
– It’s as old as data itself, but the parameters change every year. In 2012 it was about a couple of
petabytes; now it’s about a few exabytes.
• Why do we hear about Big Data now?
– Although big data is old, nowadays more industries are learning about its implications. In 2004
Google published a paper explaining the MapReduce technique for analyzing large datasets. After
that many other companies joined in, and the buzzword Big Data came
into existence.
• Static data vs. dynamic data
• Static data VS Dynamic Data
5. Evolution of Data
Strange fact: with just 76 KB of hardwired memory, NASA successfully took
men to the moon and brought them back. An 8 GB iPhone holds over
100,000 times that much memory.
10. Application of Big Data - Cern
• In the 1960s CERN stored data on a mainframe
computer.
• In the 1970s CERN distributed data across several
machines, dividing the mainframe into smaller
pieces of equipment; a CERN network was
introduced to bridge these machines, and travel
was reduced.
• In the 1980s these machines were placed in
different countries across the US and Europe, and
the Internet was used to connect them.
• Due to the enormous increase in data, in 2000 a
CERN grid was introduced, connecting different
smaller computers together to analyze and
process the data.
• Detectors with 150 million sensors are used in the
LHC, where protons collide at nearly the speed of
light. The detector works like a 3D camera, taking
pictures at a rate of 40 million times per second.
The data is now stored in the cloud and analyzed
using big data techniques.
11. Implementation of Big Data - Cern
Proton injection for collision | Collision of particles, recording data
in sensors
12. Other Industries using Big Data
• Government Application:
– The US government has invested heavily in big data applications. Big data analysis played a large
role in Barack Obama's successful 2012 re-election campaign.
– The Utah Data Center is a data center being constructed by the United States National Security
Agency. The exact amount of storage space is unknown, but recent sources claim it will be on the
order of a few exabytes.
– Big data analysis was partly responsible for the BJP and its allies winning the 2014 Indian
General Election.
– The UK government is utilizing big data to improve weather forecasting & new drug release forecasts.
• Manufacturing Industries:
• Vast amount of sensory data such as acoustics, vibration, pressure, current, voltage and controller
data in addition to historical data construct the big data in manufacturing. The generated big data
acts as the input into predictive tools and preventive strategies.
• Technology Industries:
• eBay and Amazon are industry leaders in maintaining large amounts of user-search data and predictive
analysis. This helps identify user needs and provide better results.
• Retail Industries:
• Walmart handles about 2.5 petabytes of data, processing 1 million customer transactions every hour.
• Amazon transacts about USD 80,000 every hour and runs three of the world's largest databases.
13. Big Data Solutions - Hadoop
• Hadoop is an open-source system to reliably
store and process lots of information.
• A Big Data solution that handles the complexity
involved in the volume, variety and velocity of data.
• It transforms commodity hardware into services
that handle petabytes of data in distributed
environments: Pigeon Computing.
• Hadoop is redundant, reliable, powerful,
batch-process centric and distributed.
16. Hadoop Implementation in Real World
• Yahoo:
– In 2008 Yahoo claimed the world’s largest Hadoop production application. Yahoo Search Webmap is a Hadoop
application that runs on Linux with more than 10,000 cores.
• Facebook:
– In 2010 Facebook claimed to have the largest Hadoop cluster in the world, with 21 PB of storage. On
June 13, 2012 they announced the data had grown to 100 PB. On November 8, 2012 they announced the
data gathered in the warehouse grows by roughly half a PB per day.
• As of 2013, Hadoop adoption is widespread. For example, more
than half of the Fortune 50 use Hadoop.
• The New York Times used 100 Amazon EC2 instances and a
Hadoop application to process 4 TB of raw image TIFF data (stored
in S3) into 11 million finished PDFs in the space of 24 hours at a
computation cost of about $240 (not including bandwidth).
18. Introduction to NoSQL
• A NoSQL database provides a mechanism
for storage and retrieval of data that is modeled in ways other
than the tabular relations used in relational databases.
• Types of NoSQL Databases:
– Column: Cassandra, HBase
– Document: Apache CouchDB, MongoDB
– Key-value: Dynamo, Redis
– Graph: Neo4J
– Multi-model: OrientDB, Alchemy Database, CortexDB
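The model families above can be pictured with plain in-memory structures. A minimal, hypothetical sketch (Python stand-ins, not real databases; all keys and values are made up):

```python
# Key-value: an associative array -- one opaque value per key.
kv_store = {"session:42": "user=alice;ttl=3600"}

# Document: a free-form record addressed by an id; fields may vary per record.
doc_store = {"u1": {"name": "Alice", "tags": ["admin", "beta"]}}

# Column family: each row holds a sparse set of named columns.
column_store = {"row1": {"name": "Alice", "email": "a@example.com"}}

# Graph: nodes plus labeled edges between them.
graph = {"nodes": {"u1", "u2"}, "edges": [("u1", "follows", "u2")]}

print(doc_store["u1"]["name"])  # Alice
```

The point of the sketch is only that each family addresses data differently: by key, by document id, by row and column name, or by traversing edges.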
19. High Level Architecture - Cassandra
• Ring based replication
• Only 1 type of server (cassandra)
• All nodes hold data and can answer queries
• No Single Point of Failure
• Built for HA & Scalability
• Multi-DC
• Data is found by key (CQL)
• Runs on JVM
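The ring-based replication bullet can be sketched as follows. This is an assumed simplification of a token ring (real Cassandra uses a configurable partitioner, virtual nodes and replication factors); here every key is owned by the first node at or after the key's hash, wrapping around:

```python
import bisect
import hashlib

def token(s: str) -> int:
    # Hash a string to a position on the ring (MD5 used for illustration).
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Each node owns one position on the ring, sorted by token.
        self.ring = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]

    def node_for(self, key: str) -> str:
        # First node clockwise from the key's token, wrapping at the end.
        i = bisect.bisect_right(self.tokens, token(key)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.node_for("alice"))  # placement is deterministic per key
```

Because every node computes the same ring, any node can answer "which node holds this key?" without a coordinator, which is why there is no single point of failure.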
21. High Level Architecture - Cassandra
Example: Single Row Partition
• Simple User system
• Identified by name (pk)
• 1 Row per partition
22. High Level Architecture - Cassandra
Example: Multiple Rows
• Comments on photos
• Comments are always selected by
the photo_id
• There are only 4 rows in 2 partitions
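Grouping hypothetical comment rows by their partition key shows how 4 rows land in 2 partitions; a plain-Python sketch (made-up records, not CQL):

```python
from collections import defaultdict

# photo_id is the partition key, so all comments for one photo
# are stored together and can be read in a single lookup.
comments = [
    {"photo_id": 1, "user": "alice", "text": "Nice shot"},
    {"photo_id": 1, "user": "bob",   "text": "Agreed"},
    {"photo_id": 2, "user": "carol", "text": "Where is this?"},
    {"photo_id": 2, "user": "dave",  "text": "Looks great"},
]

partitions = defaultdict(list)
for c in comments:
    partitions[c["photo_id"]].append(c)

print(len(comments), "rows in", len(partitions), "partitions")  # 4 rows in 2 partitions
```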
23. High Level Architecture - Cassandra
• Multiple rows are transposed into a single partition
• Partitions vary in size
• Old terminology - "wide row"
• Cassandra is built for fast writes. The data model should be denormalized to do as few reads as
possible.
24. High Level Architecture – MongoDB
• Open-source, Document-oriented, popular for its
agile and scalable approach
• Notable Features :
– JSON/BSON data model with dynamic schema
– Auto-sharding for horizontal scalability
– Built-in replication with automated fail-overs
– Full, flexible index support including secondary
indexes
– Rich document-based queries
– Aggregation framework and Map / Reduce
– GridFS for large file storage
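The dynamic-schema feature above can be illustrated with plain dictionaries standing in for a collection. A hedged sketch with made-up records and a hypothetical `find` helper (not the MongoDB client API):

```python
# Documents in the same collection need not share fields:
# one record has an email, the other has phone numbers.
collection = [
    {"_id": 1, "name": "Alice", "email": "a@example.com"},
    {"_id": 2, "name": "Bob", "phones": ["555-0100", "555-0101"]},
]

def find(coll, query):
    # Return documents whose fields equal every key/value in the query.
    return [d for d in coll if all(d.get(k) == v for k, v in query.items())]

print(find(collection, {"name": "Bob"})[0]["phones"][0])  # 555-0100
```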
25. High Level Architecture – MongoDB
• Ensures High Availability, Redundancy, Automated
Fail-over
• Writes to the Primary, Reads from all
• Asynchronous replication
• In conventional terms, more like Master/Slave
replication
• Members can be configured to be: Secondary-only
/ Non-Voting / Hidden / Arbiters / Delayed
26. When to use: MongoDB
• Unstructured data from multiple suppliers
• GridFS : Stores large binary objects
• Spring Data Services
• Embedding and linking documents
• Easy replication set up for AWS
4. Example of streaming data: imagine an application searching for some text in the emails that we send. Emails can be considered a stream of data; algorithms identify text on the basis of a specific pattern and send an alert if something is found. Nowadays many government agencies are working on this kind of analysis.
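A minimal sketch of that streaming idea, with a made-up pattern and made-up messages:

```python
import re

# Hypothetical alert pattern; in practice this would be far more elaborate.
ALERT_PATTERN = re.compile(r"wire transfer", re.IGNORECASE)

def scan_stream(messages):
    # Treat the messages as a stream and yield an alert (the message
    # index) whenever the pattern matches.
    for i, body in enumerate(messages):
        if ALERT_PATTERN.search(body):
            yield i

stream = ["lunch at noon?", "Please confirm the wire transfer", "ok"]
print(list(scan_stream(stream)))  # [1] -- alert on the second message
```

Because `scan_stream` is a generator, it never needs the whole stream in memory, which is the defining constraint of stream processing.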
5. The image shows how data evolved. Archaeological findings show that around 2000 BC the Phaistos Disc was used to store information. These were clay discs which embedded data and stored it for a long period of time. Later, people wrote on pyramid walls, followed by stone tablets.
6. Necessity is the mother of invention. The human brain always wants to know more, and to know more, we need to process more. The information era gave us the data, and to process this data we created big data.
7. The characteristics of Big Data consist of three V's: Volume, Variety and Velocity. Volume represents the bulk and size of data. Every decade the definition of big data changes: previously it was hard to store kilobytes of data, but now we store huge amounts on a smartphone. The image shows the amount of data stored in different parts of the world.
Next comes Variety: the categorization of big data. By categorizing data we make it easy for data analysts to group interdependent data and get some advantage out of it.
8. Velocity represents the speed at which this data is generated. The image shows the rate at which we generate it.
It is really interesting to think about what happens with this enormous amount of data we are generating, and this leads to the 4th V.
9. Value: what is the value of analyzing the data? The image shows how various industries are utilizing & analyzing this data. Apart from the monetary benefits, many other fields like machine learning, scientific experiments and medicine benefit from Big Data.
10. In 1962 Arthur Samuel wrote a computer program to play checkers. The program was defeated initially, but later Samuel wrote a subprogram to analyze the board and compute the plays for winning. When the subprogram was linked with the checkers program, the computer started to win. This was an early instance of machine learning, where the data generated by the computer was recorded and used to plan the moves.
12. Some car manufacturers are gathering data from sensors in the driver's seat; they identify the pattern when a driver feels sleepy and inform the driver by vibrating the steering wheel. The same technology is being used to identify theft based on sitting patterns.
14. MapReduce is the processing part; it runs the computation and returns the results.
The second part is HDFS. It stores all the data, with files and directories, and is highly scalable and distributed.
15. This is a classic map reduce program for word count.
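The slide's program (classically written in Java against the Hadoop API) is not reproduced in this transcript; the following is a minimal plain-Python sketch of the same map and reduce phases for word count:

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    return [(w.lower(), 1) for w in line.split()]

def reduce_phase(pairs):
    # Reduce: sum the counts per word (the shuffle step groups keys
    # together; a dict plays that role here).
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data in action", "big data"]
pairs = [p for line in lines for p in map_phase(line)]
print(reduce_phase(pairs))  # {'big': 2, 'data': 2, 'in': 1, 'action': 1}
```

In real Hadoop the map and reduce functions run on different machines and HDFS supplies the input splits, but the two-phase logic is exactly this.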
17. In theoretical computer science, the CAP theorem, also known as Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
Consistency (all nodes see the same data at the same time)
Availability (a guarantee that every request receives a response about whether it succeeded or failed)
Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
18. A column in a distributed data store is a NoSQL object of the lowest level in a keyspace. It is a tuple (a key-value pair) consisting of three elements: a unique name, a value & a timestamp.
Document: a trivial example would be scanning paper documents, extracting the title, author, and date from them either by OCR or by having a human locate and enter them, and storing each document in a 4-column relational database, the columns being author, title, date, and a blob full of page images.
Key-value: an associative array (also called a map, symbol table, or dictionary) is an abstract data type composed of a collection of key-value pairs, such that each possible key appears just once in the collection.
Graph: a graph database is a database that uses graph structures for semantic queries, with nodes, edges, and properties to represent and store data.
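The (name, value, timestamp) column triple described above can be written down as a small named tuple; an illustrative sketch with made-up values:

```python
from collections import namedtuple

# The lowest-level NoSQL object in a keyspace: name, value, timestamp.
Column = namedtuple("Column", ["name", "value", "timestamp"])

col = Column(name="email", value="a@example.com", timestamp=1700000000)
print(col.name, col.value)  # email a@example.com
```

Stores use the timestamp to resolve conflicts: when two replicas disagree, the column with the newer timestamp wins.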
27. That is not to say there are no disadvantages:
Issues with finding the right talent.
Issues with finding the proper use case.
Impact on white-collar jobs due to the high demand for data scientists.
Analyzing and finding Good Data within Big Data.