Group 2 Handling and Processing of big data (1).pptx
1. HANDLING AND PROCESSING OF BIG DATA
RUBAB TARIQ, AQSA BIBI
22015956-015 ,016
MS-IT
2. OUTLINE
What is big data?
Importance of big data
Why do we need to handle big data?
Handling of big data
Big data handling techniques
Why do we need to preprocess big data?
Processing of big data
Case study
Advanced tools and techniques
Research directions
3. WHAT IS BIG DATA?
Data that is too big, too fast, or too hard for existing tools to process.
Big data emphasizes not only the huge volume of data, but also its diversity, the speed at which it must be managed, and its correctness.
In short, big data is data generated in high volume, variety, and velocity. Many related concepts and theories build on these properties.
4. WHY BIG DATA?
Big Data initiatives were rated “extremely important” by 93% of companies. A Big Data analytics solution helps organizations unlock strategic value and take full advantage of their assets.
It helps organizations:
Understand where, when, and why their customers buy
Protect the company’s client base with improved loyalty programs
Predict market trends
Predict future needs
Become more innovative and competitive
Discover new sources of revenue
5. IMPORTANCE OF BIG DATA
The importance of Big Data does not depend on how much data a company has. It lies in how the company uses the data it has gathered.
Companies in the present market need to collect and analyze data for these benefits:
Cost saving
Time saving
Understanding market conditions
Social media listening
Boosting customer acquisition and retention
Solving advertisers’ problems and offering marketing insights
6. NEED FOR HANDLING BIG DATA
In the past, the focus was on small data for business intelligence and prediction, but today we have a deluge of data everywhere.
The ability to correlate more data allows us to discover new and better information.
From a huge volume of varied data, we may predict future trends, uncover valuable hidden information, and derive preventive actions, which could increase productivity.
7. HANDLING AND PROCESSING BIG DATA
Big Data management is the systematic organization, administration, and governance of massive amounts of data.
The process covers both unstructured and structured data.
The primary objective is to ensure the data is of high quality and accessible for business intelligence and big data analytics applications.
To contend with rapidly growing data pools, government agencies, corporations, and other large organizations have begun implementing Big Data management solutions.
The data can amount to several terabytes or even petabytes, saved in a broad range of file formats.
Effective Big Data management enables an organization to find valuable information with ease, irrespective of how large or unstructured the data is.
The data is gathered from different sources such as call records, system logs, and social media sites.
8. HANDLING OF BIG DATA
1. Outline Your Goals
The first item on the checklist for handling Big Data is knowing which data to gather and which data need not be collected. This requires clearly defined goals. Without them, an organization gathers large amounts of data that is not aligned with the ongoing requirements of the business.
Many enterprises end up collecting unnecessary data because they lack clearly defined goals and well-mapped strategies for achieving them. Organizations should collect data with a sharp focus on business objectives.
9. HANDLING BIG DATA
2. Do Not Ignore Audit Regulations
Offsite database managers should maintain the right database components, especially when an audit is at hand. Whether the data is payment data, credit scores, or data of lesser importance, it should be managed accordingly. Doing so keeps the organization clear of liability and steadily earns the client’s trust.
10. HANDLING OF BIG DATA
3. Secure Data
The next step in managing Big Data is to secure the relevant data collected with a broad range of measures. To keep the data both accessible and secure, protect it with firewalls, spam filtering, malware scanning and removal, and, most importantly, team permission control.
Data has the power to drive your business to new heights of success or to crash it into oblivion. It is therefore unwise to take data management lightly: securing organizational data is the highest priority in Big Data management.
11. HANDLING BIG DATA
4. Keep Data Protected
A database is susceptible not only to human influences and synthetic anomalies but also to damage from natural elements such as heat, humidity, and extreme cold, all of which can easily corrupt data. Whenever data is damaged, system failures are bound to follow, leading to expensive downtime and related overheads.
Organizations have to make considerable efforts to safeguard databases against adverse environmental conditions that would damage data. In addition to implementing safety features, it is essential to create and maintain a backup of the database elsewhere. Backup updates should be planned at frequent intervals.
12. HANDLING OF BIG DATA
5. Data Has to Be Interlinked
Since organizational databases are bound to be accessed through a number of channels, using different software for each required solution is not recommended. In essence, all organizational data must be able to talk to each other. Communication problems between applications and data, in either direction, can lead to huge problems.
A cloud storage solution is well suited to the data interlinking issue. A remote database administrator, among other tools, is also useful here. The objective is seamless data synchronization. This matters all the more when more than one team accesses and works on the same data simultaneously.
13. HANDLING BIG DATA
6. Know the Data You Need to Capture
The key to successful Big Data management is knowing which data suits a particular solution, and therefore which data needs to be collected in different situations.
Organizations need to know which data has to be collected and when. To do this correctly, the objectives must be clearly known and a plan must be formulated for accomplishing them.
14. HANDLING BIG DATA
7. Adapt to the New Changes
One of the most important aspects of Big Data management is keeping up with the latest trends in the field. Software and data in all their forms change constantly, almost daily, across the globe. Keeping up with the newest technologies and adoption strategies enables organizations to stay ahead of the curve and build highly optimized and efficient databases. Being flexible and open to new trends and technologies goes a long way toward giving you an edge over the competition.
16. WHAT IS PREPROCESSING?
Today’s real-world databases are noisy, contain missing values, and are inconsistent because of their huge size.
Good preprocessing of data before data mining not only “improves the quality of mining results” but also “eases the mining process”.
Remember: No quality data, no quality mining results!
Preprocessing is the preliminary processing of data to prepare it for the primary processing or for further analysis.
17. WHY DO WE NEED TO PREPROCESS BIG DATA?
Data preparation is a big issue for both warehousing and mining
Raw data is often:
Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
Noisy: containing errors or outliers
Inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!
Quality decisions must be based on quality data
A data warehouse needs consistent integration of quality data
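The three problems named on this slide (incomplete, noisy, inconsistent) could be checked for roughly as follows. This is only a sketch using pandas: the sample table, its column names, and the 1.5 × IQR outlier rule are illustrative assumptions, not taken from the slides.

```python
import pandas as pd

# Hypothetical sample table; column names and values are made up for illustration.
df = pd.DataFrame({
    "dept": ["Finance", "Finance", "Transport", "Transport", "Finance"],
    "state": ["NY", "NY", "N.Y.", "NY", None],
    "budget": [120.0, 115.0, 130.0, 125.0, 9000.0],
})

# Incomplete: count missing values per column.
missing = df.isna().sum()

# Noisy: flag values more than 1.5 * IQR outside the quartiles
# (one common rule of thumb, assumed here).
q1, q3 = df["budget"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["budget"] < q1 - 1.5 * iqr) | (df["budget"] > q3 + 1.5 * iqr)]

# Inconsistent: list the distinct spellings used for the same code.
state_variants = sorted(df["state"].dropna().unique())

print(missing)
print(outliers)
print(state_variants)
```

A report like this only locates the problems; deciding how to repair each one is a separate cleaning step.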
20. PROBLEMS
What's wrong here?
1'Dept. of Transportation'New York'NY
2'Dept. of Finance'New York'NY
3'Office of Veteran's Affairs'New York'NY
The separator character (') also appears inside the data.
This is easy to miss if you don’t check the number of columns when parsing each row.
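One way to catch this kind of problem is to count the fields in every parsed row. The sketch below uses Python's standard `csv` module on the three rows from the slide; the expected width of four columns and the helper name `find_bad_rows` are assumptions made for illustration.

```python
import csv
import io

# The three rows from the slide, using an apostrophe as the separator.
raw = (
    "1'Dept. of Transportation'New York'NY\n"
    "2'Dept. of Finance'New York'NY\n"
    "3'Office of Veteran's Affairs'New York'NY\n"
)

EXPECTED_COLUMNS = 4  # assumption: id, department, city, state


def find_bad_rows(text, delimiter, expected):
    """Return (line_number, field_count) for every row with the wrong width."""
    # QUOTE_NONE treats every character literally, so only the delimiter splits fields.
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quoting=csv.QUOTE_NONE)
    return [(n, len(row)) for n, row in enumerate(reader, start=1) if len(row) != expected]


bad_rows = find_bad_rows(raw, "'", EXPECTED_COLUMNS)
print(bad_rows)
```

Row 3 is reported with five fields because the apostrophe in "Veteran's" is read as a separator.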
What's wrong here?
1,Dept. of Transportation,New York City, NY
2,Dept. of Finance,City of New York,NY
3,Office of Veteran's Affairs,New York,NY
The same city is written three different ways.
We need standardization and naming conventions.
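A naming convention can be enforced with a small lookup table that maps each known variant to one canonical value. The sketch below is illustrative only: the alias table and the function name `standardize_city` are assumptions, and a real project would maintain a much larger mapping.

```python
import re

# Assumed mapping from the spelling variants on the slide to one canonical name.
CITY_ALIASES = {
    "new york": "New York",
    "new york city": "New York",
    "city of new york": "New York",
}


def standardize_city(value):
    """Collapse whitespace, ignore case, and map known variants to one name."""
    tidy = re.sub(r"\s+", " ", value).strip()
    # Unknown values pass through with whitespace tidied so nothing is silently lost.
    return CITY_ALIASES.get(tidy.lower(), tidy)


cities = ["New York City", "City of New York", "New York", " Albany "]
cleaned = [standardize_city(c) for c in cities]
print(cleaned)
```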
21. STEPS OR TECHNIQUES
Data cleaning: Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration: Integration of multiple databases, data cubes, files, or notes
Data transformation: Normalization (scaling to a specific range)
Data reduction: Obtains a reduced representation in volume that produces the same or similar analytical results
Data discretization: Of particular importance, especially for numerical data
Data aggregation: Dimensionality reduction, data compression, generalization
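Three of the steps listed above (cleaning, normalization, and discretization) can be sketched on a single numeric column with pandas. The sample values, the median fill, and the three bin labels are assumptions chosen for illustration; other strategies are equally valid.

```python
import pandas as pd

# Hypothetical numeric column with one missing value.
prices = pd.Series([10.0, 20.0, None, 40.0, 50.0], name="price")

# Data cleaning: fill the missing value with the median of the observed values.
filled = prices.fillna(prices.median())

# Normalization: min-max scaling to the range [0, 1].
normalized = (filled - filled.min()) / (filled.max() - filled.min())

# Data discretization: split the values into three equal-width bins.
bins = pd.cut(filled, bins=3, labels=["low", "mid", "high"])

print(pd.DataFrame({"filled": filled, "normalized": normalized, "bin": bins}))
```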
24. READING AN EXCEL FILE OF ONLINE RETAIL TRANSACTIONS FROM A URL
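The code for this slide is not included in the text, so the following is only a guess at what such a step might look like with pandas. The URL is a placeholder (the slides do not give the real address) and the function name `load_transactions` is hypothetical.

```python
import pandas as pd

# Placeholder only: the slides do not give the actual address of the file.
DATA_URL = "https://example.com/online_retail.xlsx"


def load_transactions(source=DATA_URL):
    """Read the first worksheet of an Excel workbook into a DataFrame.

    `source` may be a URL or a local path. Reading .xlsx files requires an
    Excel engine such as openpyxl to be installed.
    """
    return pd.read_excel(source)
```

Calling `load_transactions()` with a real address returns a DataFrame that the cleaning steps can then work on.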
25. CLEANING DATASET
Remove duplicate invoices
Remove spaces from the start and end of the description column
Convert member number to string
Remove credit transactions
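The four cleaning steps could look roughly like this in pandas. The column names (`InvoiceNo`, `Description`, `MemberNumber`) and the rule that credit transactions have an invoice number starting with "C" are assumptions; the slides do not show the dataset's actual schema.

```python
import pandas as pd

# Hypothetical sample; the real column names in the dataset may differ.
df = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "C536379", "536380"],
    "Description": ["  WHITE LANTERN ", "  WHITE LANTERN ", "Discount", "RED MUG  "],
    "MemberNumber": [17850, 17850, 14527, 13047],
})

# Remove duplicates (assumption: duplicate invoices show up as repeated rows).
df = df.drop_duplicates()

# Remove spaces from the start and end of the description column.
df["Description"] = df["Description"].str.strip()

# Convert the member number to a string so it is treated as an identifier.
df["MemberNumber"] = df["MemberNumber"].astype(str)

# Remove credit transactions (assumption: their invoice number starts with "C").
df = df[~df["InvoiceNo"].str.startswith("C")].reset_index(drop=True)

print(df)
```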
26. ADVANCED TOOLS AND TECHNIQUES
RapidMiner
Python libraries
pandas
scikit-learn
RStudio
Apache OpenNLP
NLTK (the Natural Language Toolkit)
27. RESEARCH POINT OF VIEW
Many methods have been developed, but preprocessing is still an active area of research.
Overall, the research focus on preprocessing
aims to develop advanced techniques and
methodologies that address the specific
challenges and requirements of
different data types, domains, and analysis
tasks.
These advancements in preprocessing
techniques contribute to improving the
quality and reliability of research findings,
enhancing model performance, and enabling
more accurate and meaningful analysis in
various fields.