SlideShare une entreprise Scribd logo
1  sur  29
DISCOVER . LEARN . EMPOWER
Apex Institute of Technology
Department of Computer Science & Engineering
Introduction to Data Science
(CSF-284)
1
Sukhman Kaur
E13261
Assistant Professor
CSE(AIT), CU
Introduction to Data Science: Course Objectives
2
COURSE OBJECTIVES
The Course aims to:
• This course brings together several key big data problems and solutions.
• To recognize the key concepts of Extraction, Transformation and Loading
• To prepare a sample project in Hadoop Environment
COURSE OUTCOMES
3
On completion of this course, the students shall be able to:-
CO4 Recognize the feasibility of applying technologies for Handling Big Data.
Chapter Index
S. No Reference No Particulars Slide
From-To
1 Learning Objectives 3
2 Topic 1 Distributed and Parallel Computing for Big Data 4-5
3 Topic 2 Introduction to Big Data Technologies 6-9
4 Topic 3 Cloud Computing and Big Data 10-14
5 Topic 4 In-Memory Technology for Big Data 15
6 Topic 5 Big Data Techniques 16-24
7 Let’s Sum Up 25
Learning Objectives
• Explain distributed and parallel computing for Big Data
• Recognise Big Data technologies
• Describe cloud computing in reference to Big Data
• Discuss in-memory technology for Big Data
• Elucidate Big Data techniques
1. Distributed and Parallel Computing for Big Data
• Distributed computing is a method in which multiple computing resources are connected in a network and
computing tasks are distributed across the resources, thereby increasing the computing power. Distributed
computing is faster and more efficient than traditional computing, and, hence, of immense value when it
comes to processing a huge amount of data in a limited time.
• To carry out complex computations, the processing power of a standalone personal computer can also be
enhanced by adding multiple processing units, which can carry out the processing of a complex task by
breaking it up into sub-tasks, and carrying out individual sub-tasks simultaneously. Such systems are
often termed as parallel systems. The greater the processing power, the faster the computing.
2. Distributed and Parallel Computing for Big Data
• Following Table differentiates between distributed and parallel computing systems:
Distributed Computing System Parallel Computing System
An independent, autonomous system connected in a
network for accomplishing specific tasks
A computer system with several
processing units attached to it
Coordination is possible between
connected computers that have
their own memory and CPU
A common shared memory can be directly accessed by
every processing
unit in a network
Loose coupling of computers connected in a network
that provides
access to data and remotely located
Resources
Tight coupling of processing resources that are used for
solving a single, complex problem
1.Introduction to Big Data Technologies
• A Big Data system is vastly different from other solution-providing systems and is based on the seven Vs,
as described in previous chapter, namely: Volume, Velocity, Variety, Veracity, Variability, Value and
Visualisation.
• A system that complies with these properties and happens to be robust to withstand unexpected events
and scalable enough to accommodate future methodologies is qualified to be called as a Big Data system.
• A typical Big Data system consists of a setup that adheres to these seven Vs and provides a great
infrastructure that can withstand the influx of huge datasets with high velocity, meanwhile providing
effective mechanism to process the datasets by cleansing, shaping, filtering and sorting into meaningful
information aimed towards making the data both user- and machine-friendly.
2.Introduction to Big Data Technologies
Hadoop
• Hadoop is an open-source platform that provides analytical technologies and computational power required
to work with such large volumes of data.
• Hadoop platform provides an improved programming model, which is used to create and run distributed
systems quickly and efficiently.
• A Hadoop cluster consists of single MasterNode and multiple worker nodes. The MasterNode contains a
NameNode and JobTracker and a slave or worker node acts as both a DataNode and TaskTracker. Hadoop
requires Java Runtime Environment (JRE) 1.6 or a higher version of JRE.
• There are two main components of Apache Hadoop – the Hadoop Distributed File System (HDFS) and the
MapReduce parallel processing framework.
3.Introduction to Big Data Technologies
Hadoop
• Hadoop distributed file system (HDFS) is a fault-tolerant storage system in Hadoop. It stores large size
files from terabytes to petabytes across different terminals.
• Data is replicated on three nodes: two on the same rack and one on a different rack. The file in HDFS is
split into large blocks size of 64 MB by default (typically 64 to 128 megabytes) and each block of the file is
independently replicated at multiple data nodes.
• The NameNode actively monitors the number of replicas of a block (by default 3 times). When a replica of a
block is lost due to a DataNode failure or disk failure, the NameNode creates another replica of the block.
4.Introduction to Big Data Technologies
R
• R is an open source programming language and an application environment for statistical computing with
graphics, developed by R Foundation for Statistical Computing.
• It is an interpreted language like Python and uses a command line interpreter. It supports procedural as
well as generic functions with OOP.
• R is extensively used by data miners and statisticians, providing a vast variety of graphical and statistical
techniques, with linear and nonlinear modelling, time-series analysis, classical statistical tests, clustering,
classification and others.
• R is easily extendable and implementable through functions and available extensions.
1. Cloud Computing and Big Data
• One of the vital issues that organisations face with the storage and management of Big Data is the huge
amount of investment to get the required hardware setup and software packages.
• Some of these resources may be overutilised or underutilised with varying requirements overtime. We can
overcome these challenges by providing a set of computing resources that can be shared through cloud
computing.
• The cloud computing environment saves costs related to infrastructure in an organisation by providing a
framework that can be optimised and expanded horizontally.
2. Cloud Computing and Big Data
Features of Cloud Computing
Scalability
Elasticity
Resource Pooling
Self-Service
Low Cost
Fault Tolerance
3. Cloud Computing and Big Data
Cloud Deployment Models
Public Cloud (End-User Level Cloud)
Private Cloud (Enterprise-Level Cloud)
Community Cloud
Hybrid Cloud
4. Cloud Computing and Big Data
Cloud Delivery Models
• It is one of the categories of cloud computing services, which
makes available virtualised computing resources on Internet.
Infrastructure as a Service (IaaS)
• It is built above IaaS and is the layer that interacts with the
users, allowing them to deploy and use applications created
using programming and run-time environment platforms that
are supported by the provider.
Platform as a Service (PaaS)
• SaaS is one of the most popular cloud-based models and
comprises applications provided by the service provider.
Software as a Service (SaaS)
5. Cloud Computing and Big Data
Cloud Providers in Big Data Market
Big Data cloud providers have been gearing up to bring the most advanced technologies at competitive prices
in the market. Some providers are established, whereas some of them are relatively new to the field of cloud
services. Some of these providers are rendering services that are relevant to Big Data analytics only. Some
such providers are as follows:
• Amazon
• Google
• Windows Azure
In-Memory Technology for Big Data
• Hardware obstructions and limitations, lag of memory indifferences have to be side-lined and streamlined
with something faster like a cache memory or dynamic access memory so that the data is readily available
for disposal.
• The in-memory big data computing tool supports processing of high velocity data in real-time and also
faster processing of the stationary data.
• Technologies like event-streaming platforms, in-memory databases and analytics, and high-level
messaging structures are witnessing massive growth that resonates with the organisational needs for
better understandings achievable by a wider and deeper data assessment.
1. Big Data Techniques
• To analyse the datasets, there are many techniques available, some of which are as follows:
• Massive Parallelism
• Data Distribution
• High-Performance Computing
• Task and Thread Management
• Data Mining and Analytics
• Data Retrieval
2. Big Data Techniques
Massive Parallelism
• According to the simplest definition available, a parallel system is a system where multiple processors are
involved and associated to carry out the concurrent computations.
• Massive parallelism refers to a parallel system where multiple systems interconnected with each other
pose as a single mighty conjoint processor and carry out tasks received from the data sets parallelly.
• In terms of Big Data dynamics, the systems can not only be processor, but also memory, hardware and even
network conjoint to scale up the operational efficiency posing as a massive system that can eat humongous
datasets parallelly without breaking a sweat.
3. Big Data Techniques
Data Distribution
• There are approaches to data distribution in a Big Data system described as follows:
• Centralised Approach: A central repository is used to store and download the essential
dataset by virtual machines.
• ‰
Semi-Centralised Approach: Semi-centralised approach reduces the stress on the
networking infrastructure.
• Hierarchical Approach: In a hierarchical approach, the data is fetched from the parent
node, i.e., the virtual machine, in the hierarchy.
• P2P Approach: P2P streaming connections are based on hierarchical multi-trees.
4. Big Data Techniques
High-Performance Computing
• High-performance computing is the simultaneous use of supercomputers and parallel processing
techniques for solving intricate computation problems.
• It emphasises on making parallel processing systems and algorithms by joining both parallel and
administrative computational methods.
• The words supercomputing and high-performance computing are often used to resemble each other.
• High-performance computing is used for performing research activities and cracking advanced problems
through computer simulation, modelling and analysis.
5. Big Data Techniques
Task and Thread Management
• Threads are simply the OS-based feature with their own Kernel and memory resources, and allow an
application logic to be segregated into concurrent multiple execution paths. It is a useful feature when
complex applications having multiple tasks need to be performed at the same time.
• Task parallelism refers to the execution of computer programmes throughout the multiple processors on
the different or same machines. It emphasises on performing diverse operations in parallel to best utilise
the accessible computing resources like memory and processors.
• Data parallelism focusses on effective distribution of datasets throughout the multiple calculation
programs.
6. Big Data Techniques
Data Mining and Analytics
• Data mining is the process of data extraction, evaluating it from multiple perspectives and then producing
the information summary in a meaningful form that identifies one or more relationships within the
dataset.
• Data analysis is an experiential activity, where the data scouring gives out some insight.
• Data analytics is about applying an algorithmic or logical process to derive the insights from a given
dataset. For example, looking at the past year’s weather and pest data, for the current month, we can
determine that a particular type of fungus grows often when the humidity levels reach a definite point.
7. Big Data Techniques
Data Retrieval
• Big Data refers to the large amounts of multi-structural data that continuously flows around and within
the organisations, and includes text, video, transactional records and sensor logs.
• Big Data systems utilise the Hadoop and the HDFS architecture to retrieve the data using MapReduce – a
distributed processing framework.
• It helps programmers in solving parallel data problems where the dataset can be divided into small chunks
and handled autonomously.
• MapReduce is important step as it allows normal developers to utilise parallel programming concepts
irrespective of cluster communication details, failure handling and task monitoring.
8. Big Data Techniques
Machine Learning
• Machine leaning formally focusses on the performance, theory and properties of learning algorithms and
systems. Machine learning is considered an ideal research field for taking advantage of the opportunities
available in big data.
• It delivers on the potential of mining the value from huge and different data sources with less dependence
on human instructions. It is data-driven and runs at machine scale and well-suited to the complication of
dealing with different data sources and the enormous range of variables and quantities of data involved.
• Machine learning systems utilise multiple algorithms to discover and show the patterns hidden in the
datasets.
9. Big Data Techniques
Data Visualisation
• Data visualisation is a valuable means through which the larger datasets after being combined may appear
practical, sensible and open to most people. Data visualisation is a trail-blazing method that not only keeps
you enlightened but helps others with the attributes of a typical statistical and computational result that
would’ve otherwise appeared intimidating for normal minds.
• Visual representation is often considered the most effective medium of information and communication
channel. As the saying goes, a picture is worth thousand words, data visualisation is a great example of
that saying. When properly aligned, it can convey critical information of data analysis in probably the
easiest way possible.
Let’s Sum Up
• The distributed computing works on the rules of divide and conquer, performing modules of the parent
tasks on multiple machines and then combining the results.
• Parallel computing refers to the utilisation of a single CPU present in a system or a group of internally
coupled systems by the means of efficient and clever multi-threading operations.
• Distributed computing is considered as the subset of parallel computing, which further is the subset of
concurrent computing.
• A Big Data system is vastly different from other solution-providing systems and is based on the seven Vs,
namely: Volume, Velocity, Variety, Veracity, Variability, Value and Visualisation.
References
• Text Books:
• Principles of Data Science by Sinan Ozdemir, (2016) PACKT.
• Data Science For Dummies by Lillian Pierson (2015)
• https://www.simplilearn.com/math-and-data-science-article
• Reference Books:
• R1. Data Science For Dummies by Lillian Pierson (2015)
• Journals:
• http://www.ijsmsjournal.org/ijsms-v4i4p137.html
• https://www.springer.com/journal/41060
28
Thank you
Please Send Your Queries on:
e-Mail:sukhman.e13261@cumail.in
29

Contenu connexe

Similaire à Lecture 3.31 3.32.pptx

project--2 nd review_2
project--2 nd review_2project--2 nd review_2
project--2 nd review_2Aswini Ashu
 
project--2 nd review_2
project--2 nd review_2project--2 nd review_2
project--2 nd review_2aswini pilli
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKRajesh Jayarman
 
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your MindDeliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your MindAvere Systems
 
Introducing Technologies for Handling Big Data by Jaseela
Introducing Technologies for Handling Big Data by JaseelaIntroducing Technologies for Handling Big Data by Jaseela
Introducing Technologies for Handling Big Data by JaseelaStudent
 
Cloud Computing and Virtualization Overview by Amr Ali
Cloud Computing and Virtualization Overview by Amr AliCloud Computing and Virtualization Overview by Amr Ali
Cloud Computing and Virtualization Overview by Amr AliAmr Ali
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoopMohit Tare
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundationshktripathy
 
Foxvalley bigdata
Foxvalley bigdataFoxvalley bigdata
Foxvalley bigdataTom Rogers
 
Data-Intensive Technologies for Cloud Computing
Data-Intensive Technologies for CloudComputingData-Intensive Technologies for CloudComputing
Data-Intensive Technologies for Cloud Computinghuda2018
 
Big Data Analytics With Hadoop
Big Data Analytics With HadoopBig Data Analytics With Hadoop
Big Data Analytics With HadoopUmair Shafique
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptxbetalab
 
_Cloud_Computing_Overview.pdf
_Cloud_Computing_Overview.pdf_Cloud_Computing_Overview.pdf
_Cloud_Computing_Overview.pdfTyStrk
 
Week 1 Lecture_1-5 CC_watermark.pdf
Week 1 Lecture_1-5 CC_watermark.pdfWeek 1 Lecture_1-5 CC_watermark.pdf
Week 1 Lecture_1-5 CC_watermark.pdfJohn422973
 
Week 1 lecture material cc
Week 1 lecture material ccWeek 1 lecture material cc
Week 1 lecture material ccAnkit Gupta
 
Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)
Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)
Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)Denodo
 
Introduction to Cloud computing and Big Data-Hadoop
Introduction to Cloud computing and  Big Data-HadoopIntroduction to Cloud computing and  Big Data-Hadoop
Introduction to Cloud computing and Big Data-HadoopNagarjuna D.N
 
VTU 6th Sem Elective CSE - Module 4 cloud computing
VTU 6th Sem Elective CSE - Module 4  cloud computingVTU 6th Sem Elective CSE - Module 4  cloud computing
VTU 6th Sem Elective CSE - Module 4 cloud computingSachin Gowda
 

Similaire à Lecture 3.31 3.32.pptx (20)

project--2 nd review_2
project--2 nd review_2project--2 nd review_2
project--2 nd review_2
 
project--2 nd review_2
project--2 nd review_2project--2 nd review_2
project--2 nd review_2
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
 
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your MindDeliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
 
Introducing Technologies for Handling Big Data by Jaseela
Introducing Technologies for Handling Big Data by JaseelaIntroducing Technologies for Handling Big Data by Jaseela
Introducing Technologies for Handling Big Data by Jaseela
 
Cloud Computing and Virtualization Overview by Amr Ali
Cloud Computing and Virtualization Overview by Amr AliCloud Computing and Virtualization Overview by Amr Ali
Cloud Computing and Virtualization Overview by Amr Ali
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundations
 
Foxvalley bigdata
Foxvalley bigdataFoxvalley bigdata
Foxvalley bigdata
 
Data-Intensive Technologies for Cloud Computing
Data-Intensive Technologies for CloudComputingData-Intensive Technologies for CloudComputing
Data-Intensive Technologies for Cloud Computing
 
Big Data Analytics With Hadoop
Big Data Analytics With HadoopBig Data Analytics With Hadoop
Big Data Analytics With Hadoop
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptx
 
_Cloud_Computing_Overview.pdf
_Cloud_Computing_Overview.pdf_Cloud_Computing_Overview.pdf
_Cloud_Computing_Overview.pdf
 
Week 1 Lecture_1-5 CC_watermark.pdf
Week 1 Lecture_1-5 CC_watermark.pdfWeek 1 Lecture_1-5 CC_watermark.pdf
Week 1 Lecture_1-5 CC_watermark.pdf
 
Week 1 lecture material cc
Week 1 lecture material ccWeek 1 lecture material cc
Week 1 lecture material cc
 
Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)
Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)
Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)
 
Cloud computing
Cloud computingCloud computing
Cloud computing
 
Introduction to Cloud computing and Big Data-Hadoop
Introduction to Cloud computing and  Big Data-HadoopIntroduction to Cloud computing and  Big Data-Hadoop
Introduction to Cloud computing and Big Data-Hadoop
 
VTU 6th Sem Elective CSE - Module 4 cloud computing
VTU 6th Sem Elective CSE - Module 4  cloud computingVTU 6th Sem Elective CSE - Module 4  cloud computing
VTU 6th Sem Elective CSE - Module 4 cloud computing
 

Dernier

Internet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxInternet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxVelmuruganTECE
 
Steel Structures - Building technology.pptx
Steel Structures - Building technology.pptxSteel Structures - Building technology.pptx
Steel Structures - Building technology.pptxNikhil Raut
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
The SRE Report 2024 - Great Findings for the teams
The SRE Report 2024 - Great Findings for the teamsThe SRE Report 2024 - Great Findings for the teams
The SRE Report 2024 - Great Findings for the teamsDILIPKUMARMONDAL6
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptMadan Karki
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - GuideGOPINATHS437943
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substationstephanwindworld
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
 
Industrial Safety Unit-I SAFETY TERMINOLOGIES
Industrial Safety Unit-I SAFETY TERMINOLOGIESIndustrial Safety Unit-I SAFETY TERMINOLOGIES
Industrial Safety Unit-I SAFETY TERMINOLOGIESNarmatha D
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionMebane Rash
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncssuser2ae721
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating SystemRashmi Bhat
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating SystemRashmi Bhat
 
System Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingSystem Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingBootNeck1
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 

Dernier (20)

Internet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxInternet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptx
 
Steel Structures - Building technology.pptx
Steel Structures - Building technology.pptxSteel Structures - Building technology.pptx
Steel Structures - Building technology.pptx
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
The SRE Report 2024 - Great Findings for the teams
The SRE Report 2024 - Great Findings for the teamsThe SRE Report 2024 - Great Findings for the teams
The SRE Report 2024 - Great Findings for the teams
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.ppt
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - Guide
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substation
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
Industrial Safety Unit-I SAFETY TERMINOLOGIES
Industrial Safety Unit-I SAFETY TERMINOLOGIESIndustrial Safety Unit-I SAFETY TERMINOLOGIES
Industrial Safety Unit-I SAFETY TERMINOLOGIES
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating System
 
System Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingSystem Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event Scheduling
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 

Lecture 3.31 3.32.pptx

  • 1. DISCOVER . LEARN . EMPOWER Apex Institute of Technology Department of Computer Science & Engineering Introduction to Data Science (CSF-284) 1 Sukhman Kaur E13261 Assistant Professor CSE(AIT), CU
  • 2. Introduction to Data Science: Course Objectives 2 COURSE OBJECTIVES The Course aims to: • This course brings together several key big data problems and solutions. • To recognize the key concepts of Extraction, Transformation and Loading • To prepare a sample project in Hadoop Environment
  • 3. COURSE OUTCOMES 3 On completion of this course, the students shall be able to:- CO4 Recognize the feasibility of applying technologies for Handling Big Data.
  • 4. Chapter Index S. No Reference No Particulars Slide From-To 1 Learning Objectives 3 2 Topic 1 Distributed and Parallel Computing for Big Data 4-5 3 Topic 2 Introduction to Big Data Technologies 6-9 4 Topic 3 Cloud Computing and Big Data 10-14 5 Topic 4 In-Memory Technology for Big Data 15 6 Topic 5 Big Data Techniques 16-24 7 Let’s Sum Up 25
  • 5. Learning Objectives • Explain distributed and parallel computing for Big Data • Recognise Big Data technologies • Describe cloud computing in reference to Big Data • Discuss in-memory technology for Big Data • Elucidate Big Data techniques
  • 6. 1. Distributed and Parallel Computing for Big Data • Distributed computing is a method in which multiple computing resources are connected in a network and computing tasks are distributed across the resources, thereby increasing the computing power. Distributed computing is faster and more efficient than traditional computing, and, hence, of immense value when it comes to processing a huge amount of data in a limited time. • To carry out complex computations, the processing power of a standalone personal computer can also be enhanced by adding multiple processing units, which can carry out the processing of a complex task by breaking it up into sub-tasks, and carrying out individual sub-tasks simultaneously. Such systems are often termed as parallel systems. The greater the processing power, the faster the computing.
  • 7. 2. Distributed and Parallel Computing for Big Data • Following Table differentiates between distributed and parallel computing systems: Distributed Computing System Parallel Computing System An independent, autonomous system connected in a network for accomplishing specific tasks A computer system with several processing units attached to it Coordination is possible between connected computers that have their own memory and CPU A common shared memory can be directly accessed by every processing unit in a network Loose coupling of computers connected in a network that provides access to data and remotely located Resources Tight coupling of processing resources that are used for solving a single, complex problem
  • 8. 1.Introduction to Big Data Technologies • A Big Data system is vastly different from other solution-providing systems and is based on the seven Vs, as described in previous chapter, namely: Volume, Velocity, Variety, Veracity, Variability, Value and Visualisation. • A system that complies with these properties and happens to be robust to withstand unexpected events and scalable enough to accommodate future methodologies is qualified to be called as a Big Data system. • A typical Big Data system consists of a setup that adheres to these seven Vs and provides a great infrastructure that can withstand the influx of huge datasets with high velocity, meanwhile providing effective mechanism to process the datasets by cleansing, shaping, filtering and sorting into meaningful information aimed towards making the data both user- and machine-friendly.
  • 9. 2.Introduction to Big Data Technologies Hadoop • Hadoop is an open-source platform that provides analytical technologies and computational power required to work with such large volumes of data. • Hadoop platform provides an improved programming model, which is used to create and run distributed systems quickly and efficiently. • A Hadoop cluster consists of single MasterNode and multiple worker nodes. The MasterNode contains a NameNode and JobTracker and a slave or worker node acts as both a DataNode and TaskTracker. Hadoop requires Java Runtime Environment (JRE) 1.6 or a higher version of JRE. • There are two main components of Apache Hadoop – the Hadoop Distributed File System (HDFS) and the MapReduce parallel processing framework.
  • 10. 3.Introduction to Big Data Technologies Hadoop • Hadoop distributed file system (HDFS) is a fault-tolerant storage system in Hadoop. It stores large size files from terabytes to petabytes across different terminals. • Data is replicated on three nodes: two on the same rack and one on a different rack. The file in HDFS is split into large blocks size of 64 MB by default (typically 64 to 128 megabytes) and each block of the file is independently replicated at multiple data nodes. • The NameNode actively monitors the number of replicas of a block (by default 3 times). When a replica of a block is lost due to a DataNode failure or disk failure, the NameNode creates another replica of the block.
  • 11. 4.Introduction to Big Data Technologies R • R is an open source programming language and an application environment for statistical computing with graphics, developed by R Foundation for Statistical Computing. • It is an interpreted language like Python and uses a command line interpreter. It supports procedural as well as generic functions with OOP. • R is extensively used by data miners and statisticians, providing a vast variety of graphical and statistical techniques, with linear and nonlinear modelling, time-series analysis, classical statistical tests, clustering, classification and others. • R is easily extendable and implementable through functions and available extensions.
  • 12. 1. Cloud Computing and Big Data • One of the vital issues that organisations face with the storage and management of Big Data is the huge amount of investment to get the required hardware setup and software packages. • Some of these resources may be overutilised or underutilised with varying requirements overtime. We can overcome these challenges by providing a set of computing resources that can be shared through cloud computing. • The cloud computing environment saves costs related to infrastructure in an organisation by providing a framework that can be optimised and expanded horizontally.
  • 13. 2. Cloud Computing and Big Data Features of Cloud Computing Scalability Elasticity Resource Pooling Self-Service Low Cost Fault Tolerance
  • 14. 3. Cloud Computing and Big Data Cloud Deployment Models Public Cloud (End-User Level Cloud) Private Cloud (Enterprise-Level Cloud) Community Cloud Hybrid Cloud
  • 15. 4. Cloud Computing and Big Data Cloud Delivery Models • It is one of the categories of cloud computing services, which makes available virtualised computing resources on Internet. Infrastructure as a Service (IaaS) • It is built above IaaS and is the layer that interacts with the users, allowing them to deploy and use applications created using programming and run-time environment platforms that are supported by the provider. Platform as a Service (PaaS) • SaaS is one of the most popular cloud-based models and comprises applications provided by the service provider. Software as a Service (SaaS)
  • 16. 5. Cloud Computing and Big Data Cloud Providers in Big Data Market Big Data cloud providers have been gearing up to bring the most advanced technologies at competitive prices in the market. Some providers are established, whereas some of them are relatively new to the field of cloud services. Some of these providers are rendering services that are relevant to Big Data analytics only. Some such providers are as follows: • Amazon • Google • Windows Azure
  • 17. In-Memory Technology for Big Data • Hardware obstructions and limitations, lag of memory indifferences have to be side-lined and streamlined with something faster like a cache memory or dynamic access memory so that the data is readily available for disposal. • The in-memory big data computing tool supports processing of high velocity data in real-time and also faster processing of the stationary data. • Technologies like event-streaming platforms, in-memory databases and analytics, and high-level messaging structures are witnessing massive growth that resonates with the organisational needs for better understandings achievable by a wider and deeper data assessment.
  • 18. 1. Big Data Techniques • To analyse the datasets, there are many techniques available, some of which are as follows: • Massive Parallelism • Data Distribution • High-Performance Computing • Task and Thread Management • Data Mining and Analytics • Data Retrieval
  • 19. 2. Big Data Techniques Massive Parallelism • According to the simplest definition available, a parallel system is a system where multiple processors are involved and associated to carry out the concurrent computations. • Massive parallelism refers to a parallel system where multiple systems interconnected with each other pose as a single mighty conjoint processor and carry out tasks received from the data sets parallelly. • In terms of Big Data dynamics, the systems can not only be processor, but also memory, hardware and even network conjoint to scale up the operational efficiency posing as a massive system that can eat humongous datasets parallelly without breaking a sweat.
  • 20. 3. Big Data Techniques Data Distribution • There are approaches to data distribution in a Big Data system described as follows: • Centralised Approach: A central repository is used to store and download the essential dataset by virtual machines. • ‰ Semi-Centralised Approach: Semi-centralised approach reduces the stress on the networking infrastructure. • Hierarchical Approach: In a hierarchical approach, the data is fetched from the parent node, i.e., the virtual machine, in the hierarchy. • P2P Approach: P2P streaming connections are based on hierarchical multi-trees.
  • 21. 4. Big Data Techniques High-Performance Computing • High-performance computing is the simultaneous use of supercomputers and parallel processing techniques for solving intricate computation problems. • It emphasises on making parallel processing systems and algorithms by joining both parallel and administrative computational methods. • The words supercomputing and high-performance computing are often used to resemble each other. • High-performance computing is used for performing research activities and cracking advanced problems through computer simulation, modelling and analysis.
  • 22. 5. Big Data Techniques Task and Thread Management • Threads are simply the OS-based feature with their own Kernel and memory resources, and allow an application logic to be segregated into concurrent multiple execution paths. It is a useful feature when complex applications having multiple tasks need to be performed at the same time. • Task parallelism refers to the execution of computer programmes throughout the multiple processors on the different or same machines. It emphasises on performing diverse operations in parallel to best utilise the accessible computing resources like memory and processors. • Data parallelism focusses on effective distribution of datasets throughout the multiple calculation programs.
  • 23. 6. Big Data Techniques Data Mining and Analytics • Data mining is the process of data extraction, evaluating it from multiple perspectives and then producing the information summary in a meaningful form that identifies one or more relationships within the dataset. • Data analysis is an experiential activity, where the data scouring gives out some insight. • Data analytics is about applying an algorithmic or logical process to derive the insights from a given dataset. For example, looking at the past year’s weather and pest data, for the current month, we can determine that a particular type of fungus grows often when the humidity levels reach a definite point.
  • 24. 7. Big Data Techniques Data Retrieval • Big Data refers to the large amounts of multi-structural data that continuously flows around and within the organisations, and includes text, video, transactional records and sensor logs. • Big Data systems utilise the Hadoop and the HDFS architecture to retrieve the data using MapReduce – a distributed processing framework. • It helps programmers in solving parallel data problems where the dataset can be divided into small chunks and handled autonomously. • MapReduce is important step as it allows normal developers to utilise parallel programming concepts irrespective of cluster communication details, failure handling and task monitoring.
  • 25. 8. Big Data Techniques Machine Learning • Machine leaning formally focusses on the performance, theory and properties of learning algorithms and systems. Machine learning is considered an ideal research field for taking advantage of the opportunities available in big data. • It delivers on the potential of mining the value from huge and different data sources with less dependence on human instructions. It is data-driven and runs at machine scale and well-suited to the complication of dealing with different data sources and the enormous range of variables and quantities of data involved. • Machine learning systems utilise multiple algorithms to discover and show the patterns hidden in the datasets.
  • 26. 9. Big Data Techniques Data Visualisation • Data visualisation is a valuable means through which the larger datasets after being combined may appear practical, sensible and open to most people. Data visualisation is a trail-blazing method that not only keeps you enlightened but helps others with the attributes of a typical statistical and computational result that would’ve otherwise appeared intimidating for normal minds. • Visual representation is often considered the most effective medium of information and communication channel. As the saying goes, a picture is worth thousand words, data visualisation is a great example of that saying. When properly aligned, it can convey critical information of data analysis in probably the easiest way possible.
  • 27. Let’s Sum Up • The distributed computing works on the rules of divide and conquer, performing modules of the parent tasks on multiple machines and then combining the results. • Parallel computing refers to the utilisation of a single CPU present in a system or a group of internally coupled systems by the means of efficient and clever multi-threading operations. • Distributed computing is considered as the subset of parallel computing, which further is the subset of concurrent computing. • A Big Data system is vastly different from other solution-providing systems and is based on the seven Vs, namely: Volume, Velocity, Variety, Veracity, Variability, Value and Visualisation.
  • 28. References • Text Books: • Principles of Data Science by Sinan Ozdemir, (2016) PACKT. • Data Science For Dummies by Lillian Pierson (2015) • https://www.simplilearn.com/math-and-data-science-article • Reference Books: • R1. Data Science For Dummies by Lillian Pierson (2015) • Journals: • http://www.ijsmsjournal.org/ijsms-v4i4p137.html • https://www.springer.com/journal/41060 28
  • 29. Thank you Please Send Your Queries on: e-Mail:sukhman.e13261@cumail.in 29

Notes de l'éditeur

  1. Public Cloud (End-User Level Cloud): A cloud that is owned and managed by a company than the one (which can be either an individual user or a company) using it is known as a public cloud. Private Cloud (Enterprise Level Cloud): The cloud that remains entirely in the ownership of the organisation using it is known as a private cloud. Community Cloud: Community cloud is a type of cloud that is shared among various organisations with a common tie. Hybrid Cloud: The cloud environment in which various internal or external service providers offer services to many organisations is known as a hybrid cloud.