HANDLING AND
PROCESSING
OF BIG DATA
RUBAB TARIQ, AQSA BIBI
22015956-015, 016
MS-IT
OUTLINE
 What is big data?
 Importance of big data
 Why do we need to handle big data?
 Handling of big data
 Big data handling techniques
 Why do we need to preprocess big data?
 Processing of big data
 Case study
 Advanced tools and techniques
 Research directions
WHAT IS BIG DATA?
 Data that’s too big, too fast, or too hard for existing tools to
process.
 Big data emphasizes not only the huge volume of data, but
also its diversity, the speed at which it must be managed, and
its correctness.
 In short, big data is data generated in high volume, variety,
and velocity, often called the "three Vs". Correctness is
commonly added as a fourth V, veracity.
WHY BIG
DATA?
 93% of companies rated Big Data initiatives as “extremely
important”. Leveraging a Big Data analytics solution helps
organizations unlock strategic value and take full advantage
of their assets.
 It helps organizations:
 Understand where, when, and why their customers buy
 Protect the company’s client base with improved loyalty
programs
 Predict market trends
 Predict future needs
 Become more innovative and competitive
 Discover new sources of revenue
IMPORTANCE OF
BIG DATA
 The importance of Big Data doesn’t revolve around the
amount of data a company has. It lies in how the company
uses the data it gathers.
 Companies in today’s market need to collect and analyze
data in order to:
 Save costs
 Save time
 Understand market conditions
 Monitor social media (social listening)
 Boost customer acquisition and retention
 Solve advertisers’ problems and offer marketing insights
NEED OF
HANDLING BIG
DATA
 In the past, the focus was on small data for business
intelligence and prediction, but today we have a
deluge of data everywhere.
 The ability to correlate more data allows us to
discover new and better information.
 From a huge volume of varied data, we can forecast
future trends, uncover valuable hidden information, and
work out preventive actions, all of which can increase
productivity.
HANDLING
AND
PROCESSING
BIG DATA
 Big Data management is the systematic organization,
administration as well as governance of massive
amounts of data.
 The process includes management of both
unstructured and structured data.
 The primary objective is to ensure the data is of high
quality and accessible for business intelligence along
with big data analytics applications.
 To contend with the rapidly growing data pools,
government agencies, corporations and other large
organizations have begun implementing Big
Data management solutions.
 The data can amount to several terabytes or even
petabytes, stored in a broad range of file formats.
 Effective Big Data management enables an
organization to find valuable information with ease
irrespective of how large or unstructured the data is.
The data is gathered from different sources such as
call records, system logs and social media sites.
HANDLING OF
BIG DATA
1. Outline Your Goals
 The first item on the checklist for handling Big Data is
knowing which data to gather and which data need not be
collected. This requires clearly defined goals. Without
them, an organization ends up gathering large amounts of
data that are not aligned with the business’s ongoing
requirements.
 Many enterprises collect unnecessary data because they
lack clearly defined goals and well-mapped strategies for
achieving them. Organizations should collect data with a
laser focus on their business objectives.
HANDLING BIG
DATA
2. Do Not Ignore Audit Regulations
 Offsite database managers should maintain the right
database components, especially when an audit is under way.
Whether the data is payment data, credit scores, or data of
lesser importance, it should be managed in line with the
applicable audit regulations. Doing so limits liability and
steadily earns the client’s trust.
HANDLING OF
BIG DATA
3. Secure Data
 The next step in managing Big Data is to secure the
relevant data collected with a broad range of measures.
To keep the data both accessible and secure, protect it
with firewalls, spam filtering, malware scanning and
removal and, most importantly, team permission controls.
 Data has the power to drive a business to new heights of
success or to crash it into oblivion. It is therefore wise
not to take data management lightly: securing
organizational data is the highest priority in Big Data
management.
HANDLING BIG
DATA
4. Keep Data Protected
 A database is susceptible not only to threats from human
influences and synthetic anomalies, but also to damage
from environmental conditions such as heat, humidity, and
extreme cold, all of which can easily corrupt data.
Whenever data is damaged, system failures are bound to
follow, leading to expensive downtime and related
overheads.
 Organizations have to safeguard databases against
adverse environmental conditions that could damage data,
and put considerable effort into protecting it. In
addition to implementing safety features, it is essential
to create and maintain a backup of the database
elsewhere. The backup should be updated at planned,
frequent intervals.
HANDLING OF
BIG DATA
5. Data Has to Be Interlinked
 Since organizational databases are bound to be accessed
through a number of channels, it is not recommended to use
different software for each required solution. In essence,
all organizational data must be able to talk to each other.
Communication problems between applications and data, in
either direction, can lead to huge problems.
 A cloud storage solution is a good answer to the data
interlinking issue. A remote database administrator,
among other tools, is also useful here. The objective is
seamless data synchronization, which matters all the more
when more than one team accesses and works on the same
data simultaneously.
HANDLING BIG
DATA
6. Know the Data You Need to Capture
 The key to successful Big Data management is knowing
which data suits a particular solution, and therefore which
data needs to be collected in each situation.
 Organizations need to know which data has to be collected
and when. To do this correctly, objectives must be clearly
known and a plan must be formulated for accomplishing
them.
HANDLING BIG
DATA
7. Adapt to the New Changes
 One of the most important aspects of Big Data management
is keeping up with the latest trends in the field. Software
and data in all their forms change constantly, almost
daily, across the globe. Keeping up with the newest
technologies and adoption strategies enables organizations
to stay ahead of the curve and build highly optimized,
efficient databases. Being flexible and open to new trends
and technologies goes a long way toward giving you an edge
over the competition.
BIG DATA HANDLING
TECHNIQUES
 Hadoop
 Pig
 Hive
 Column-oriented databases
 Cloud platforms for big data
WHAT IS PREPROCESSING?
 Today’s real-world databases are often noisy, incomplete
(missing values), and inconsistent because of their huge
size.
 Good preprocessing before data mining not only improves
the quality of mining results but also eases the mining
process.
 Remember: No quality data, no quality mining results!
 Preprocessing is the preliminary processing of data to
prepare it for the primary processing or for further
analysis.
WHY DO WE NEED
TO
PREPROCESS
BIG DATA?
 Data preparation is a big issue for both warehousing
and mining. Real-world data is often:
 Incomplete: lacking attribute values, lacking
certain attributes of interest, or containing only
aggregate data
 Noisy: containing errors or outliers
 Inconsistent: containing discrepancies in codes
or names
 No quality data, no quality mining results!
 Quality decisions must be based on quality data
 A data warehouse needs consistent integration
of quality data
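The three problem types above can be illustrated with a minimal Python sketch. The records, the field names, the valid state codes, and the plausible age range are all hypothetical and chosen only for illustration; real rules would come from the data owner.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "state": "NY"},
    {"id": 2, "age": None, "state": "NY"},       # incomplete: missing age
    {"id": 3, "age": 420, "state": "NY"},        # noisy: implausible age
    {"id": 4, "age": 29, "state": "New York"},   # inconsistent: not a code
]

VALID_STATES = {"NY", "NJ", "CT"}  # assumed code list
AGE_RANGE = (0, 120)               # assumed plausible range


def find_problems(record):
    """Return the quality problems found in one record."""
    problems = []
    # Incomplete: at least one attribute value is missing.
    if any(value is None for value in record.values()):
        problems.append("incomplete")
    # Noisy: a value outside the plausible range (error or outlier).
    age = record.get("age")
    if age is not None and not (AGE_RANGE[0] <= age <= AGE_RANGE[1]):
        problems.append("noisy")
    # Inconsistent: a code that does not follow the agreed convention.
    state = record.get("state")
    if state is not None and state not in VALID_STATES:
        problems.append("inconsistent")
    return problems


report = {record["id"]: find_problems(record) for record in records}
for record_id, problems in report.items():
    print(record_id, problems or "ok")
```

Flagging records rather than deleting them keeps the decision about how to repair each problem separate from the detection step.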
MEASURES OF
DATA QUALITY
 Accuracy
 Consistency
 Accessibility
 Completeness
CASE STUDY:
GOVERNMENT
AGENCY DATA
 What do we want?
PROBLEMS
What's wrong here?
 1'Dept. of Transportation'New York'NY
 2'Dept. of Finance'New York'NY
 3'Office of Veteran's Affairs'New York'NY
 The separator character (') also appears inside the data
(“Veteran's”), so row 3 splits into five fields instead of four.
 This is easy to miss if you don’t check the number
of columns when parsing each row
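A column-count check of the kind described here might look like the following Python sketch. The expected count of four columns (id, agency, city, state) is an assumption read off the sample rows, and split_and_check is a hypothetical helper name.

```python
SEPARATOR = "'"
EXPECTED_COLUMNS = 4  # assumed layout: id, agency, city, state

rows = [
    "1'Dept. of Transportation'New York'NY",
    "2'Dept. of Finance'New York'NY",
    "3'Office of Veteran's Affairs'New York'NY",
]


def split_and_check(lines, separator=SEPARATOR, expected=EXPECTED_COLUMNS):
    """Split each line and set aside rows with an unexpected field count."""
    good, bad = [], []
    for line_number, line in enumerate(lines, start=1):
        fields = line.split(separator)
        if len(fields) == expected:
            good.append(fields)
        else:
            # Keep the line number and field count so the row can be reviewed.
            bad.append((line_number, len(fields), line))
    return good, bad


good_rows, bad_rows = split_and_check(rows)
for line_number, field_count, line in bad_rows:
    print(f"row {line_number}: {field_count} fields, expected {EXPECTED_COLUMNS}: {line}")
```

Rows that fail the check are reported instead of silently loaded, which is what makes the stray separator in row 3 visible.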
What's wrong here?
 1,Dept. of Transportation,New York City, NY
 2,Dept. of Finance,City of New York,NY
 3,Office of Veteran's Affairs,New York,NY
 The same city is written three different ways, so we need
standardization / naming conventions
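One way to apply a naming convention is a lookup table from known variants to a canonical name, sketched below in Python. The alias table and the choice of "New York" as the canonical form are assumptions made for this example, not part of the case study.

```python
import csv
import io

raw = (
    "1,Dept. of Transportation,New York City, NY\n"
    "2,Dept. of Finance,City of New York,NY\n"
    "3,Office of Veteran's Affairs,New York,NY\n"
)

# Assumed convention: every variant maps to one canonical city name.
CITY_ALIASES = {
    "new york city": "New York",
    "city of new york": "New York",
    "new york": "New York",
}


def standardize_city(name):
    """Collapse whitespace, ignore case, and map known variants."""
    tidy = " ".join(name.split())
    return CITY_ALIASES.get(tidy.lower(), tidy)


def clean_rows(text):
    """Parse comma-separated rows and standardize the city column."""
    cleaned = []
    for row in csv.reader(io.StringIO(text)):
        agency_id, agency, city, state = (field.strip() for field in row)
        cleaned.append([agency_id, agency, standardize_city(city), state])
    return cleaned


cleaned = clean_rows(raw)
for row in cleaned:
    print(row)
```

Unknown city names pass through unchanged, so new variants can be spotted and added to the table later.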
STEPS OR TECHNIQUES
 Data cleaning: fill in missing values, smooth noisy data,
identify or remove outliers, and resolve inconsistencies
 Data integration: integration of multiple databases,
data cubes, files, or notes
 Data transformation: normalization (scaling to a
specific range)
 Data reduction: obtains a representation that is reduced
in volume but produces the same or similar analytical
results
 Data discretization: of particular importance for
numerical data
 Data aggregation: dimensionality reduction, data
compression, generalization
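Two of the techniques listed above, normalization (scaling to a specific range) and discretization of numerical data, can be sketched in a few lines of Python. Min-max scaling and equal-width binning are used here as common choices; the slides do not say which variants were intended, and the age values are made up.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly into the range [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # All values are equal, so there is no spread to scale.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (value - old_min) * scale for value in values]


def equal_width_bins(values, bin_count):
    """Replace each value with the index of its equal-width bin."""
    low, high = min(values), max(values)
    width = (high - low) / bin_count
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin `bin_count`; clamp it to the last bin.
    return [min(int((value - low) / width), bin_count - 1) for value in values]


ages = [18, 25, 40, 60, 78]  # hypothetical numerical attribute
normalized = min_max_normalize(ages)
bins = equal_width_bins(ages, 3)
print(normalized)
print(bins)
```

Normalization keeps the ordering and relative spacing of the values, while discretization deliberately discards detail in exchange for a small number of categories.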
EXAMPLE: MARKET
BASKET ANALYSIS
READING AN EXCEL FILE FROM A URL THAT CONTAINS
ONLINE RETAIL TRANSACTION DATA
CLEANING THE DATASET
 Remove duplicate invoices
 Remove spaces from the start and end of the
description column
 Convert the member number to a string
 Remove credit transactions
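The four cleaning steps can be sketched in plain Python on a few in-memory rows. The field names, the sample rows, and the rule that a credit transaction is an invoice number starting with "C" are assumptions for illustration; the slides do not show the actual columns or code. Duplicates are treated here as rows that are identical after cleaning.

```python
# Hypothetical sample rows standing in for the Excel data.
transactions = [
    {"invoice": "536365", "description": "  WHITE MUG ", "member": 17850},
    {"invoice": "536365", "description": "  WHITE MUG ", "member": 17850},
    {"invoice": "536366", "description": "RED LAMP  ", "member": 13047},
    {"invoice": "C536379", "description": "Discount", "member": 14527},
]


def clean(rows):
    """Apply the four cleaning steps and return the surviving rows."""
    cleaned, seen = [], set()
    for row in rows:
        tidy = {
            "invoice": str(row["invoice"]),
            # Remove spaces from the start and end of the description.
            "description": row["description"].strip(),
            # Convert the member number to a string.
            "member": str(row["member"]),
        }
        # Remove credit transactions (assumed to be marked with a leading "C").
        if tidy["invoice"].startswith("C"):
            continue
        # Remove duplicates: skip rows already seen with identical content.
        key = tuple(tidy.items())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(tidy)
    return cleaned


cleaned = clean(transactions)
for row in cleaned:
    print(row)
```

With a DataFrame library the same steps are typically one call each; the loop above only makes the order and effect of the steps explicit.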
ADVANCED TOOLS
AND TECHNIQUES
 RapidMiner
Using Python libraries:
 pandas
 scikit-learn
 RStudio
 Apache OpenNLP
 NLTK (the Natural Language Toolkit)
RESEARCH
POINT OF
VIEW
 A lot of methods have been developed, but preprocessing
is still an active area of research
 Overall, the research focus on preprocessing
aims to develop advanced techniques and
methodologies that address the specific
challenges and requirements of
different data types, domains, and analysis
tasks.
 These advancements in preprocessing
techniques contribute to improving the
quality and reliability of research findings,
enhancing model performance, and enabling
more accurate and meaningful analysis in
various fields.
ANY QUESTIONS?

Group 2 Handling and Processing of big data (1).pptx

  • 1. HANDING AND PROCESSING OF BIG DATA RUBAB TARIQ, AQSA BIBI 22015956-015 ,016 MS-IT This Photo by Unknown author is licensed under CC BY-SA.
  • 2. OUTLINE  What is big data?  Importance of big data?  Why we need handling of big data?  Handling of big data  Big data handling techniques  Why we need pre-processing of big data?  Processing of big data  Case study  Advance Tools and Techniques  Research directions
  • 3. WHAT IS BIG DATA?  Data that’s too big, too fast, or too hard for existing tools to process.  Big data emphasizes not only the huge volume of data, but also its diversity and the speed at which it must be managed as well as its correctness.  Basically, big data is data that is generated in high volume, variety, and velocity. There are many other concepts, theories, and facts related to big data and its popularity.
  • 4. WHY BIG DATA?  Big Data initiatives were rated as “extremely important” to 93% of companies. Leveraging a Big Data analytics solution helps organizations to unlock the strategic values and take full advantage of their assets.  It helps organizations:  To understand Where, When and Why their customers buy  Protect the company’s client base with improved loyalty programs  Predict market trends  Predict future needs  Make companies more innovative and competitive  It helps companies to discover new sources of revenue
  • 5. IMPORTANCE OF BIG DATA  Big Data importance doesn’t revolve around the amount of data a company has. Its importance lies in the fact that how the company utilizes the gathered data.  The companies in the present market need to collect it and analyze it because:  Cost saving  Time saving  Understand the market conditions  Social media listening  Boost Customer Acquisition and Retention  Solve Advertisers Problem and Offer Marketing Insights
  • 6. NEED OF HANDLING BIG DATA  In the past, the focus was on small data for business intelligence and prediction, but today we have a deluge of data everywhere.  The ability to correlate more data allows us to discover new and better information.  From the huge volume of various types of data, we may predict the future, generate valuable hidden information and deduce preventive actions, which could increase productivity.
  • 7. HANDLING AND PROCESSING BIG DATA  Big Data management is the systematic organization, administration as well as governance of massive amounts of data.  The process includes management of both unstructured and structured data.  The primary objective is to ensure the data is of high quality and accessible for business intelligence along with big data analytics applications.  To contend with the rapidly growing data pools, government agencies, corporations and other large organizations have begun implementing Big Data management solutions.  The data involves several terabytes or even petabytes of data that has been saved in a broad range of file formats.  Effective Big Data management enables an organization to find valuable information with ease irrespective of how large or unstructured the data is. The data is gathered from different sources such as call records, system logs and social media sites.
  • 8. HANDLING OF BIG DATA 1. Outline Your Goals  The first tick on the checklist when it comes to handling Big Data is knowing what data to gather and the data that need not be collected. To do this one has to determine clearly defined goals. Failure to accomplish this will lead one to gather large amounts of data which isn’t aligned with a business’ continuous requirements.  Many enterprises eventually collect unnecessary data as they would not have clearly defined goals, well mapped strategies for achieving the said goals. It is of paramount importance that organizations should collect data with a laser focus to benefit business objectives.
  • 9. HANDLING BIG DATA 2. Do Not Ignore Audit Regulations  Offsite Database Managers should maintain the right database components especially when an audit is in hand. Irrespective of the data nature being payment data, credit scores or data of lesser importance, the data should be managed accordingly. One should steer clear of liability and progressively earn the client’s trust.
  • 10. HANDLING OF BIG DATA 3. Secure data  The next step in managing Big Data is to ensure the relevant data collected is secured with a broad range of measures. To ensure the data secured is both accessible and secure, it must be protected by firewall security measures, spam filtering, malware scanning and elimination, along with most importantly team permission control.  Since data has the immense power to drive your business to new heights of success, or crash into oblivion. Therefore it is wise not to take data management lightly since securing organizational data is the highest priority in Big Data Management.
  • 11. HANDLING BIG DATA 4. keep data protected  A database is susceptible to threats from not just human influences and synthetic anomalies, but also is prone to damage from the elements of nature such as heat, humidity, and extreme cold. All of which can easily corrupt data. Whenever data is damaged, system failures are bound to follow leading to expensive downtimes and related overheads.  Organizations have to safeguard databases against adverse environmental situations which would damage data and put forth considerable efforts to protect their data. It is essential to create and maintain/update a backup of the database elsewhere, in addition to implementation of safety features. The updates should be at planned at frequent intervals.
  • 12. HANDLING OF BIG DATA 5. Data Has to Be Interlinked  Since organizational databases are accessed through a number of channels, using different software for each solution is not recommended. In essence, all organizational data must be able to talk to each other. Communication problems between applications and data, in either direction, can lead to huge problems.  A cloud storage solution is a good answer to the data interlinking issue. A remote database administrator, among other tools, is also useful here. The objective is seamless data synchronization, which matters all the more when more than one team is accessing and working on the same data simultaneously.
  • 13. HANDLING BIG DATA 6. Know the Data You Need to Capture  The key to successful Big Data management is knowing which data suits a particular solution, and therefore which data needs to be collected in different situations.  Organizations need to know which data has to be collected and when. To do this correctly, the objectives must be clearly known and a plan must be formulated for accomplishing them.
  • 14. HANDLING BIG DATA 7. Adapt to the New Changes  One of the most important aspects of Big Data management is keeping up with its latest trends. Software and data in all their forms change constantly, almost daily, across the globe. Keeping up with the newest technologies and adoption strategies enables organizations to stay ahead of the curve and build highly optimized, efficient databases. Being flexible and open to new trends and technologies goes a long way toward giving an organization an edge over the competition.
  • 15. BIG DATA HANDLING TECHNIQUES  Hadoop  Pig  Hive  Column-oriented databases  Using the cloud for big data
  • 16. WHAT IS PREPROCESSING  Today’s real-world databases are noisy, contain missing values, and are inconsistent because of their huge size.  Good preprocessing before data mining not only “improves the quality of mining results” but also “eases the mining process”.  Remember: No quality data, no quality mining results!  Preprocessing is a preliminary processing of data that prepares it for the primary processing or for further analysis.
  • 17. WHY WE NEED TO PREPROCESS BIG DATA?  Data preparation is a big issue for both warehousing and mining  Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data  Noisy: containing errors or outliers  Inconsistent: containing discrepancies in codes or names  No quality data, no quality mining results!  Quality decisions must be based on quality data  Data warehouse needs consistent integration of quality data
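The three problem types named on this slide (incomplete, noisy, inconsistent) can be illustrated with a minimal sketch. The sample records, the field names, the "10 times the median" outlier rule, and the allowed state codes are all assumptions made for illustration; they do not come from the slides.

```python
from statistics import median

# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"id": 1, "state": "NY", "amount": 120.0},
    {"id": 2, "state": "ny", "amount": None},        # incomplete: missing amount
    {"id": 3, "state": "New York", "amount": 95.0},  # inconsistent: state not a code
    {"id": 4, "state": "NY", "amount": 98000.0},     # noisy: suspiciously large value
    {"id": 5, "state": "NY", "amount": 110.0},
]


def find_incomplete(rows):
    """Return ids of rows where any attribute value is missing (None)."""
    return [r["id"] for r in rows if any(v is None for v in r.values())]


def find_outliers(rows, field, factor=10.0):
    """Return ids of rows whose value exceeds `factor` times the median.

    This is a deliberately simple rule of thumb, not a standard test.
    """
    values = [r[field] for r in rows if r[field] is not None]
    threshold = factor * median(values)
    return [r["id"] for r in rows if r[field] is not None and r[field] > threshold]


def find_inconsistent(rows, field, allowed):
    """Return ids of rows whose value is not one of the agreed codes."""
    return [r["id"] for r in rows if r[field] not in allowed]


print("incomplete:", find_incomplete(records))
print("outliers:", find_outliers(records, "amount"))
print("inconsistent:", find_inconsistent(records, "state", {"NY"}))
```

Each check only flags rows; deciding whether to fill, correct, or drop them is a separate cleaning step.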
  • 18. MEASURES OF DATA QUALITY  Accuracy  Consistency  Accessibility  Completeness
  • 20. PROBLEMS What's wrong here?  1'Dept. of Transportation'New York'NY  2'Dept. of Finance'New York'NY  3'Office of Veteran's Affairs'New York'NY  The separator (') also appears inside the data (Veteran's).  This is easy to miss if you don’t check the number of columns when parsing each row. What's wrong here?  1,Dept. of Transportation, New York City, NY  2,Dept. of Finance,City of New York,NY  3,Office of Veteran's Affairs,New York,NY  The same city is written three different ways, so we need standardization / naming conventions.
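Both problems on this slide can be checked mechanically. The sketch below uses Python's standard `csv` module to count columns per row, then maps city-name variants to one canonical spelling. The function names, the expected column count of 4, and the alias table are assumptions for illustration.

```python
import csv
import io

# The first example from the slide, using ' as the separator.
raw = (
    "1'Dept. of Transportation'New York'NY\n"
    "2'Dept. of Finance'New York'NY\n"
    "3'Office of Veteran's Affairs'New York'NY\n"
)


def rows_with_wrong_width(text, delimiter, expected_columns):
    """Return (line_number, column_count) for rows that do not parse to the expected width."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quoting=csv.QUOTE_NONE)
    bad = []
    for line_number, row in enumerate(reader, start=1):
        if len(row) != expected_columns:
            bad.append((line_number, len(row)))
    return bad


# Hypothetical alias table: every known variant maps to one canonical name.
CITY_ALIASES = {
    "new york city": "New York",
    "city of new york": "New York",
    "new york": "New York",
}


def standardize_city(name):
    """Trim whitespace and map known variants to the canonical city name."""
    cleaned = name.strip()
    return CITY_ALIASES.get(cleaned.lower(), cleaned)


print(rows_with_wrong_width(raw, "'", 4))
print([standardize_city(c) for c in [" New York City", "City of New York", "New York"]])
```

The width check flags the third row because the apostrophe in "Veteran's" is read as an extra separator. Unknown city names pass through unchanged, so the alias table can grow as new variants appear.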
  • 21. STEPS OR TECHNIQUES  Data cleaning: Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies  Data integration: Integration of multiple databases, data cubes, files, or notes  Data transformation: Normalization (scaling to a specific range)  Data reduction: Obtains a reduced representation in volume that produces the same or similar analytical results  Data discretization: A form of data reduction of particular importance for numerical data  Data aggregation: dimensionality reduction, data compression, generalization
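Two of the steps above, normalization (scaling to a specific range) and discretization of numerical data, can be sketched in a few lines. The slide does not say which formulas are meant, so this assumes min-max scaling and equal-width binning; the function names are hypothetical.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; there is no range to scale.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value a bin index from 0 to n_bins - 1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so it is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [10, 20, 30]
print(min_max_scale(ages))
print(equal_width_bins([0, 5, 10], 2))
```

Min-max scaling is sensitive to outliers, since a single extreme value stretches the range. That is one reason cleaning is listed before transformation.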
  • 24. READING AN EXCEL FILE OF ONLINE RETAIL TRANSACTION DATA FROM A URL
  • 25. CLEANING DATASET  Remove duplicate invoices  Strip leading and trailing spaces from the description column  Convert the member number to a string  Remove credit transactions
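The four cleaning steps above can be sketched as follows. The original code from the slides is not available, so this is not the authors' implementation: the column names (`invoice_no`, `description`, `member_no`) are hypothetical, and it assumes a credit transaction is marked by an invoice number starting with "C" and that "duplicate invoices" means repeated invoice numbers. It uses only the standard library so it runs without extra dependencies.

```python
def clean_transactions(rows):
    """Apply the four cleaning steps to a list of transaction dicts.

    Assumptions (not from the slides): credit transactions have an invoice
    number starting with "C", and the first row per invoice number is kept.
    """
    seen_invoices = set()
    cleaned = []
    for row in rows:
        invoice = str(row["invoice_no"])
        # Remove credit transactions.
        if invoice.upper().startswith("C"):
            continue
        # Remove duplicate invoices (keep the first occurrence).
        if invoice in seen_invoices:
            continue
        seen_invoices.add(invoice)
        cleaned.append({
            "invoice_no": invoice,
            # Strip leading and trailing spaces from the description.
            "description": row["description"].strip(),
            # Convert the member number to a string.
            "member_no": str(row["member_no"]),
        })
    return cleaned


# Made-up sample rows for illustration.
sample = [
    {"invoice_no": "1001", "description": "  LANTERN ", "member_no": 501},
    {"invoice_no": "1001", "description": "  LANTERN ", "member_no": 501},
    {"invoice_no": "C1002", "description": "Discount", "member_no": 502},
    {"invoice_no": "1003", "description": "HAND WARMER ", "member_no": 501},
]

for row in clean_transactions(sample):
    print(row)
```

With the Pandas library named later in the deck, the same steps would typically map to `drop_duplicates`, `str.strip`, `astype(str)`, and a boolean filter on the invoice column.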
  • 26. ADVANCED TOOLS AND TECHNIQUES  RapidMiner using Python libraries  Pandas library  scikit-learn  RStudio  Apache OpenNLP  NLTK (the Natural Language Toolkit)
  • 27. RESEARCH POINT OF VIEW  Many methods have been developed, but preprocessing is still an active area of research.  Overall, research on preprocessing aims to develop advanced techniques and methodologies that address the specific challenges and requirements of different data types, domains, and analysis tasks.  These advances improve the quality and reliability of research findings, enhance model performance, and enable more accurate and meaningful analysis in various fields.