Big Data Myths, Current Mainstream Technologies related to Collecting, Storing, Computing & Stream Processing Data. Real-life experience with E-commerce businesses.
2. Agenda
● Big Data Myths
● Big Data Technologies
● Big Data Applications
@123Mua
3. Big Data Myths
● People talk about Big Data all the time: the 3Vs
○ Volume
○ Variety
○ Velocity
● Business Value in Data
○ Customer Insights
○ Product Insights
4. Big Data Myths
VOLUME
● Data is BIG
● Hard-drive storage capacity has increased massively compared to access speed
5. Big Data Myths
VARIETY
● Different kinds of data
○ Structured
○ Semi-structured: self-describing information (JSON, XML, logs)
○ Unstructured
6. Big Data Myths
VELOCITY
● Characteristics
○ How fast does data become available for processing?
○ How fast is the processing?
● Data accumulates at very high rates
○ Click streams
○ Supermarket transactions
○ Social media interactions
9. ● Scribe is a server for aggregating log data that is streamed in real time from clients.
● It was designed and developed by Facebook.
● No longer actively developed
Scribe
Big Data
Collecting
10. ● Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system: producers send messages over the network to the Kafka cluster, which in turn serves them up to consumers.
Apache Kafka
Big Data
Collecting
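The producer/cluster/consumer flow above can be sketched with a toy in-memory log. This is plain Python, not the Kafka client API: the class `ToyCommitLog` and its `send` and `poll` methods are hypothetical names chosen for illustration, and replication is not modelled.

```python
from zlib import crc32


class ToyCommitLog:
    """In-memory stand-in for a partitioned, append-only log (hypothetical, not Kafka's API)."""

    def __init__(self, num_partitions=3):
        # Each partition is an ordered, append-only list of messages.
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, value):
        """Producer side: append a message to the partition chosen by hashing the key."""
        partition = crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def poll(self, partition, offset):
        """Consumer side: read every message in a partition from `offset` onward."""
        return self.partitions[partition][offset:]


log = ToyCommitLog()
p1, o1 = log.send("user-1", "view:home")
p2, o2 = log.send("user-1", "view:cart")
messages = log.poll(p1, o1)
```

Messages with the same key land in the same partition, so a consumer reading that partition sees them in the order they were sent.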
12. The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
Hadoop Distributed File System (HDFS)
Big Data
Storage
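HDFS copes with unreliable commodity machines by splitting files into blocks and storing several replicas of each block on different nodes. A minimal sketch of that idea follows; the function `plan_blocks`, the node names, and the round-robin placement are assumptions for illustration (real HDFS placement is rack-aware and more involved).

```python
def plan_blocks(file_size, block_size, nodes, replication=3):
    """Split a file into fixed-size blocks and place each block on
    `replication` distinct nodes using simple round-robin (hypothetical policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = -(-file_size // block_size)  # ceiling division
    plan = []
    for block in range(num_blocks):
        replicas = [nodes[(block + r) % len(nodes)] for r in range(replication)]
        plan.append((block, replicas))
    return plan


# Sizes are in MB and purely illustrative.
plan = plan_blocks(file_size=300, block_size=128, nodes=["n1", "n2", "n3", "n4"])
```

With three replicas per block, losing any single node still leaves two copies of every block it held.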
13. ● NoSQL: next-generation databases that mostly address some of these points: being non-relational, distributed, open-source, and horizontally scalable.
● Types:
○ Key-Value Store
○ Document Store
○ Column Store
○ Graph Database
○ Content Delivery Network
NoSQL Datastores
Big Data
Storage
15. A distributed, scalable, versioned, non-relational datastore built on top of HDFS and modeled after Google's Bigtable.
HBase
Big Data
Storage
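"Versioned" here means that each cell can hold several values, distinguished by timestamp. A toy sketch of that data model, in plain Python rather than the HBase client API (the class `ToyVersionedTable` and the row and column names are hypothetical):

```python
from collections import defaultdict


class ToyVersionedTable:
    """row -> column -> {timestamp: value}; a toy model, not HBase's API."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        # A new timestamp adds a version instead of overwriting the old one.
        self.rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (newest overall if None)."""
        versions = self.rows.get(row, {}).get(column, {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        if not candidates:
            return None
        return versions[max(candidates)]


table = ToyVersionedTable()
table.put("user-1", "info:status", "v1", timestamp=1)
table.put("user-1", "info:status", "v2", timestamp=2)
latest = table.get("user-1", "info:status")
older = table.get("user-1", "info:status", timestamp=1)
```

Reads return the latest version by default, while older versions stay addressable by timestamp.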
16. Hadoop MapReduce
Big Data
Computation
● Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
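The programming model can be illustrated with the classic word count, run here in a single process. This is a sketch of the map, shuffle, and reduce steps only; the function names are hypothetical, and none of Hadoop's distribution or fault tolerance is modelled.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data", "Big Data myths"])
```

On a real cluster the map calls run in parallel on different input splits and the reduce calls run in parallel on different keys, which is what lets the same logic scale to many nodes.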
18. Apache Spark
● Fast and general engine
for large-scale data
processing.
● Suitable for iterative
algorithms
Big Data
Computation
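An iterative algorithm makes many passes over the same dataset, so it benefits from keeping that dataset in memory between passes. The sketch below shows the access pattern in plain Python, not the Spark API; `fit_slope`, the sample points, and the learning rate are assumptions for illustration.

```python
def fit_slope(points, iterations=200, learning_rate=0.01):
    """Fit y = w * x by gradient descent, re-reading the same data every iteration."""
    cached = list(points)  # loaded once, then reused by every pass
    w = 0.0
    for _ in range(iterations):
        # One full pass over the cached data per iteration.
        gradient = sum(2 * x * (w * x - y) for x, y in cached) / len(cached)
        w -= learning_rate * gradient
    return w


w = fit_slope([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
```

Each of the 200 iterations scans the full dataset; if every pass had to reload it from disk, the reloads would dominate the runtime.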
19. Apache Samza
● Apache Samza is a distributed
stream processing framework.
● Uses Kafka to guarantee that
messages are processed in the
order they were written to a
partition
● Whenever a machine in the cluster
fails, Samza works with Hadoop
YARN to transparently migrate
your tasks to another machine.
Big Data
Stream
Processing
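The two guarantees above (in-order processing within a partition, and resuming on another machine after a failure) can be sketched with a checkpointed offset. This is a toy model, not the Samza API: `process_partition` and the `fail_at` parameter are hypothetical, and the sketch checkpoints after every message, which is a simplification.

```python
def process_partition(messages, checkpoint, handler, fail_at=None):
    """Handle messages in offset order starting at `checkpoint`.

    Returns the offset of the next unprocessed message. `fail_at` simulates
    a machine crashing just before it handles that offset.
    """
    for offset in range(checkpoint, len(messages)):
        if offset == fail_at:
            return offset  # simulated crash: nothing at or after this offset was handled
        handler(messages[offset])
    return len(messages)


partition = ["click:a", "click:b", "click:c", "click:d"]
seen = []
# "Machine 1" handles offsets 0 and 1, then fails.
checkpoint = process_partition(partition, 0, seen.append, fail_at=2)
# "Machine 2" picks up the task from the saved checkpoint.
checkpoint = process_partition(partition, checkpoint, seen.append)
```

Because the second run starts from the saved offset, the handler still sees every message exactly once and in partition order in this simplified model.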
20. Apache Mahout
Provides open-source implementations of distributed and scalable machine learning algorithms, focused primarily on the following areas:
● Collaborative Filtering
● Classification
● Clustering
● Dimensionality Reduction
Big Data
Mining
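As an illustration of the first area, collaborative filtering, here is a minimal item co-occurrence recommender in plain Python. It is not Mahout's implementation or API; the `recommend` function and the basket data are hypothetical.

```python
from collections import Counter
from itertools import permutations


def recommend(baskets, user_items, top_n=2):
    """Suggest items that were most often bought together with items the user already has."""
    cooccurrence = Counter()
    for basket in baskets:
        # Count every ordered pair of distinct items seen in the same basket.
        for a, b in permutations(set(basket), 2):
            cooccurrence[(a, b)] += 1

    scores = Counter()
    for (a, b), count in cooccurrence.items():
        if a in user_items and b not in user_items:
            scores[b] += count
    return [item for item, _ in scores.most_common(top_n)]


baskets = [
    ["phone", "case"],
    ["phone", "case", "charger"],
    ["laptop", "mouse"],
]
suggestions = recommend(baskets, {"phone"})
```

For an e-commerce catalogue this is the "customers who bought X also bought Y" pattern: the score of a candidate item is the number of baskets it shares with the user's items.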