2. WHAT IS BIG DATA?
Terabytes (1 TB = 1024 GB) of data to be processed (or analyzed).
Gigabytes to terabytes of new data generated daily.
3. IMPLICATIONS OF BIG DATA
Data will be spread across multiple machines.
Data will be in different formats
Structured
CSV files, RDBMS tables
Log files
Unstructured data
Data extracted from web pages, email content
4. ISSUES
Moving data into databases is expensive.
Terabytes of data would have to be uploaded every day, which is cumbersome.
How to handle data errors?
5. POSSIBLE SOLUTION
Analyze the data in the format it is already in.
For example, a text file need not be loaded into a database to be analyzed.
Thus the data need not be uploaded into any other system.
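
As a minimal sketch of analyzing a text file in its own format, the snippet below reads raw CSV text line by line, with no database load step, and skips rows that do not parse instead of failing. The three-column layout (timestamp, url, bytes_sent), the function name summarize_in_place, and the sample rows are all assumptions made for illustration.

```python
import csv
import io

def summarize_in_place(lines):
    """Count valid rows and total bytes from raw CSV text, skipping bad rows."""
    total_bytes = 0
    good_rows = 0
    bad_rows = 0
    for row in csv.reader(lines):
        # Hypothetical layout: timestamp, url, bytes_sent.
        if len(row) != 3 or not row[2].strip().isdigit():
            bad_rows += 1          # one simple way to handle data errors
            continue
        total_bytes += int(row[2])
        good_rows += 1
    return good_rows, bad_rows, total_bytes

# In practice `lines` would be an open file, e.g. open("access.csv", newline="").
sample = io.StringIO(
    "2024-01-01T00:00:00,/index.html,512\n"
    "2024-01-01T00:00:01,/about.html,not_a_number\n"
    "2024-01-01T00:00:02,/index.html,1024\n"
    "truncated line\n"
)
print(summarize_in_place(sample))  # (good_rows, bad_rows, total_bytes)
```

Counting bad rows rather than stopping on the first one is only one possible answer to the data-error question; a real job might write them to a separate file for inspection.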
6. HOW TO ANALYZE DATA?
Your code has to read the data in order to analyze it.
If the code runs on a different machine than the data, a huge data transfer happens again during analysis.
This happens for every analysis.
7. POSSIBLE SOLUTION
Do not move the data out of the box.
Instead, move the code to the box where the data resides. The code is very small compared to the data.
Thus the network contention problem is solved.
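
A rough back-of-the-envelope calculation makes the size difference concrete. The link speed (1 Gbit/s), the data size (1 TB) and the code size (10 KB) are assumed values chosen for illustration, and the formula ignores protocol overhead and contention.

```python
TERABYTE = 1024 ** 4          # bytes; consistent with 1 TB = 1024 GB

def transfer_seconds(size_bytes, link_bits_per_second):
    """Ideal transfer time, ignoring protocol overhead and contention."""
    return size_bytes * 8 / link_bits_per_second

LINK = 1_000_000_000          # assumed 1 Gbit/s network link
DATA_SIZE = 1 * TERABYTE      # assumed size of the data to analyze
CODE_SIZE = 10 * 1024         # assumed 10 KB analysis program

data_time = transfer_seconds(DATA_SIZE, LINK)
code_time = transfer_seconds(CODE_SIZE, LINK)
print(f"move data: {data_time / 3600:.1f} hours")
print(f"move code: {code_time * 1000:.3f} ms")
```

Under these assumptions, moving the data takes hours while moving the code takes a fraction of a millisecond, and the data cost would be paid again for every analysis.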
8. MAPREDUCE FRAMEWORK
The MapReduce framework implements the solution from the previous slide: it moves the code to the machines where the data resides.
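
The idea can be sketched in plain Python with a word count. This is a single-process illustration of the map, shuffle and reduce steps, not the Hadoop API; the function names and the two sample partitions are hypothetical, and each partition stands in for the data held on one machine.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(partitions):
    # Each partition stands in for the lines stored on one machine; in a
    # real cluster the map function would be sent to that machine.
    mapped = []
    for partition in partitions:
        for line in partition:
            mapped.extend(map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

partitions = [
    ["big data is big"],          # lines held on hypothetical machine 1
    ["data stays on the box"],    # lines held on hypothetical machine 2
]
print(word_count(partitions))
```

Only the small (word, count) pairs cross between the map and reduce steps; the raw lines never leave their partition.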
9. HDFS
HDFS (Hadoop Distributed File System) is very similar to an ordinary file system, except that files are replicated across multiple machines for availability and scalability.
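
To illustrate why replication helps availability, here is a toy simulation. It is not HDFS's actual placement policy: the round-robin assignment, the node names and the function names are assumptions made for this sketch. The replication factor of 3 matches the HDFS default.

```python
def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement

def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one live replica."""
    return [
        block for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    ]

nodes = ["node-a", "node-b", "node-c", "node-d"]   # hypothetical machine names
placement = place_replicas(num_blocks=6, nodes=nodes)
# With three replicas per block, two failed machines still leave one copy.
print(readable_blocks(placement, failed_nodes={"node-a", "node-b"}))
```

Because every block lives on three different machines, the file stays readable when any two of them fail, and reads can be served from whichever copy is available.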