Compiled By - Biniam Behailu
INTRODUCTION TO EMERGING TECHNOLOGIES
(EMTE1012)
CHAPTER – 2
INTRODUCTION TO DATA SCIENCE
 Describe what data science is and the role of data scientists.
➢ Differentiate data and information.
➢ Describe the data processing life cycle.
➢ Understand different data types from diverse perspectives.
 Describe the data value chain in the emerging era of big data.
➢ Understand the basics of Big Data.
➢ Describe the purpose of the Hadoop ecosystem components.
Data Science
What is Data Science?
 Data science is much more than simply analyzing data.
 Data science is a multidisciplinary field that uses scientific methods,
processes, algorithms and systems to extract knowledge and insights
from structured and unstructured data.
 Data science is a "concept to unify statistics, data analysis, machine
learning and their related methods" in order to "understand and
analyze actual phenomena" with data.
 It employs techniques and theories drawn from many fields within the
context of mathematics, statistics, computer science, and information
science.
 As an academic discipline and profession, data science continues to
evolve as one of the most promising and in-demand career paths for
skilled professionals.
 Data scientists possess a strong quantitative background in statistics and linear
algebra as well as programming knowledge with focuses on data
warehousing, mining, and modeling to build and analyze algorithms.
 Data can be described as unprocessed facts and figures.
 It is a representation of facts, concepts, or instructions in a
formalized manner, which should be suitable for communication,
interpretation, or processing, by human or electronic machines.
 It can exist in any form, usable or not. It does not have meaning by itself.
 It is represented with the help of characters such as letters (A-Z, a-z), digits (0-9) or special characters (+, -, /, *, <, >, =, etc.).
 In computer parlance, a spreadsheet generally starts out by holding
data.
 Information is data that has been given meaning by way of relational
connection.
 It is the processed data on which decisions and actions are based.
 Information is interpreted data; created from organized, structured,
and processed data in a particular context.
 In computer parlance, a relational database makes information from
the data stored within it.
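To make the distinction concrete, here is a minimal Python sketch using hypothetical daily sales figures: the raw numbers are data; the summary computed from them in a particular context is information.

```python
# Data vs. information: a minimal sketch using hypothetical daily sales figures.

daily_sales = [120, 95, 143, 88, 110]      # data: unprocessed facts and figures

total = sum(daily_sales)                   # processing gives the data meaning
average = total / len(daily_sales)

# Information: interpreted data in a particular context, usable for decisions.
print(f"Total weekly sales: {total}, daily average: {average:.1f}")
```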
 Data processing is the re-structuring or re-ordering of data by people
or machines to increase their usefulness and add value for a
particular purpose.
 Data processing consists of the following basic steps - input,
processing, and output.
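The three basic steps can be sketched in a few lines of Python; the exam scores and the pass mark below are purely hypothetical.

```python
# The three basic steps of data processing, sketched with hypothetical exam scores.

def read_input():
    # Input: collect raw data (hard-coded here; in practice from files, forms, sensors, ...)
    return [55, 78, 92, 41, 67]

def process(scores):
    # Processing: re-structure/re-order the data to increase its usefulness
    passed = sorted(s for s in scores if s >= 50)
    return {"passed": passed, "pass_rate": len(passed) / len(scores)}

def write_output(result):
    # Output: present the processed result to the user or another system
    print(f"Passing scores: {result['passed']} (pass rate {result['pass_rate']:.0%})")

write_output(process(read_input()))
```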
 Data can take many material forms including numbers, text, symbols,
images, sound, electromagnetic waves, etc.
 These are typically divided into two broad categories.
• Qualitative and
• Quantitative
 Quantitative data consist of numeric records.
 Generally, such data are extensive and relate to the
• Physical properties of phenomena (such as length, height, distance,
weight, area, volume),
• Non-physical characteristics of phenomena (such as social class,
educational attainment, quality of life rankings).
 Qualitative data deals with descriptions.
 Such data can be analyzed using visualizations and a variety of descriptive and inferential statistics, and can be used as inputs to predictive and simulation models.
 In computer science and computer programming, for instance, a data
type is simply an attribute of data that tells the compiler or
interpreter how the programmer intends to use the data.
 Almost all programming languages explicitly include the notion of data
type, though different languages may use different terminology.
 Integers (int) - used to store whole numbers, mathematically known as integers
 Booleans (bool) - used to store a value restricted to one of two states: true or false
 Characters (char) - used to store a single character
 Floating-point numbers (float) - used to store real numbers
 Alphanumeric strings (string) - used to store a combination of characters and numbers
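The same data types can be written as Python literals. This is only an illustration; note that Python has no separate character type, so a one-character string stands in for char, and all variable names below are made up.

```python
# The data types listed above, written as Python literals.

age = 25                 # integer (int): whole number
is_enrolled = True       # boolean (bool): restricted to True or False
grade = "A"              # character: a single character (a length-1 string in Python)
gpa = 3.75               # floating-point number (float): real number
student_id = "ETS0421"   # alphanumeric string (str): mix of characters and digits

for value in (age, is_enrolled, grade, gpa, student_id):
    print(value, type(value).__name__)
```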
 From a data analytics point of view, it is important to understand that there are three common data types or structures:
• Structured,
• Semi-structured, and
• Unstructured data types.
 Structured data are those that can be easily organized, stored and
transferred in a defined data model, such as numbers/text set out in a
table or relational database that have a consistent format (e.g., name,
date of birth, address, gender, etc.).
 Such data can be processed, searched, queried, combined, and
analyzed relatively straightforwardly using calculus and algorithms,
and can be visualized using various forms of graphs and maps, and
easily processed by computers.
 Often structured data is managed using Structured Query Language
(SQL).
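A small sketch of structured data being queried with SQL, using Python's built-in sqlite3 module; the table name, columns, and rows are invented for illustration only.

```python
# Structured data in a relational table, queried with SQL via Python's
# built-in sqlite3 module. Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")       # throwaway in-memory database
conn.execute("CREATE TABLE person (name TEXT, date_of_birth TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [("Abebe", "1998-04-12", "M"), ("Sara", "2001-09-30", "F")],
)

# Because every row follows the same schema, querying is straightforward.
for row in conn.execute("SELECT name, date_of_birth FROM person WHERE gender = 'F'"):
    print(row)
```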
 Unstructured data is information that either does not have a
predefined data model or is not organized in a pre-defined
manner.
 A much bigger percentage of all the data in our world is
unstructured data.
 Unstructured data is data that cannot be contained in a row-column
database and doesn’t have an associated data model.
 Common examples of unstructured data include audio and video files.
 Unstructured data is usually stored in data lakes, NoSQL databases,
and data warehouses.
 Beyond structured and unstructured data, there is a third category, which is essentially a mix of the two.
 Semi-structured data are loosely structured data that have no predefined data model/schema and thus cannot be held in a relational database.
 They contain tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
 JSON and XML are common forms of semi-structured data.
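The hypothetical JSON record below, parsed with Python's standard json module, shows the idea: keys act as tags that separate semantic elements, and nesting expresses a hierarchy, yet there is no fixed relational schema.

```python
# A hypothetical semi-structured JSON record parsed with the standard library.
import json

record = """
{
  "name": "Abebe",
  "contacts": {"email": "abebe@example.com", "phone": "+251-911-000000"},
  "courses": ["EMTE1012", "MATH1041"]
}
"""

data = json.loads(record)                 # parse the semi-structured text
print(data["contacts"]["email"])          # navigate the hierarchy by key
print(len(data["courses"]), "courses")
```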
 Metadata is data about data.
 It is one of the most important elements for Big Data analysis and big
data solutions.
 It provides additional information about a specific set of data.
 In a set of photographs, for example, metadata could describe when
and where the photos were taken.
 The metadata then provides fields for dates and locations which, by
themselves, can be considered structured data.
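A minimal sketch of the photograph example: the image itself is unstructured, but its descriptive fields (hypothetical, EXIF-like names below) are structured and can be filtered like a table.

```python
# Photo metadata sketched as a dictionary. The image is unstructured data,
# but these descriptive fields are structured data about it.
photo_metadata = {
    "file": "IMG_0042.jpg",
    "taken_on": "2023-05-14 16:22:10",
    "location": {"lat": 9.0054, "lon": 38.7636},   # approximate coordinates
    "camera": "Example Model X",
}

# Because the fields are consistent, they can be queried and filtered easily.
print(photo_metadata["taken_on"], photo_metadata["location"])
```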
 The data value chain describes the process of data creation and use
from first identifying a need for data to its final use and possible
reuse.
 A value chain is made up of a series of subsystems each with inputs,
transformation processes, and outputs.
 In a Data Value Chain, information flow is described as a series of
steps needed to generate value and useful insights from data.
 Data Acquisition is the process of gathering, filtering, and cleaning
data before it is put in a data warehouse or any other storage solution
on which data analysis can be carried out.
 Data acquisition is one of the major big data challenges in terms of
infrastructure requirements.
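A toy sketch of acquisition on hypothetical sensor readings: gather the raw values, filter out obviously invalid ones, and clean them before they reach storage. The sentinel values and the list standing in for storage are assumptions for illustration.

```python
# Data acquisition sketched on hypothetical sensor readings:
# gather -> filter invalid values -> clean/normalize -> store.

raw_readings = ["23.5", "24.1", "n/a", "-999", "22.8"]   # gathered from a source

filtered = [r for r in raw_readings if r not in ("n/a", "-999")]   # filtering
cleaned = [round(float(r), 1) for r in filtered]                   # cleaning/typing

storage = []          # stand-in for a data warehouse or other storage solution
storage.extend(cleaned)
print(storage)        # [23.5, 24.1, 22.8]
```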
 Data Analysis is concerned with making the raw data acquired
amenable to use in decision-making as well as domain-specific usage.
 Data analysis is the process of evaluating data using analytical and
statistical tools to discover useful information and aid in business
decision making.
 Data analysis involves exploring, transforming, and modeling data
with the goal of highlighting relevant data, synthesizing and extracting
useful hidden information with high potential from a business point of
view.
 Related areas include data mining, business intelligence, and
machine learning.
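A minimal analysis sketch using only Python's standard statistics module on hypothetical monthly revenue figures: descriptive statistics plus one simple derived insight of the kind that supports a business decision.

```python
# A minimal data-analysis sketch: descriptive statistics on hypothetical
# monthly revenue figures, using only the standard library.
import statistics

monthly_revenue = [12_500, 13_100, 11_800, 14_750, 15_200, 14_900]

print("mean:  ", statistics.mean(monthly_revenue))
print("median:", statistics.median(monthly_revenue))
print("stdev: ", round(statistics.stdev(monthly_revenue), 1))

# A simple derived insight: is revenue higher in the second half of the period?
first_half, second_half = monthly_revenue[:3], monthly_revenue[3:]
print("growing:", statistics.mean(second_half) > statistics.mean(first_half))
```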
 Data Curation is the active management of data over its life cycle to
ensure it meets the necessary data quality requirements for its
effective usage.
 Data curation is the organization and integration of data collected
from various sources.
 Data curation processes can be categorized into different activities
such as content creation, selection, classification, transformation,
validation, and preservation.
 It involves annotation, publication and presentation of the data such
that the value of the data is maintained over time, and the data
remains available for reuse and preservation.
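A few of these curation activities (selection, transformation, validation) can be sketched on hypothetical, inconsistently formatted records; the field names and rules below are assumptions, not a standard curation pipeline.

```python
# Curation activities sketched on hypothetical records:
# transformation, validation, and selection of the valid results.

records = [
    {"name": " abebe ", "dob": "1998-04-12"},
    {"name": "SARA", "dob": "2001/09/30"},
    {"name": "", "dob": "unknown"},
]

def curate(rec):
    name = rec["name"].strip().title()        # transformation (normalize casing/spacing)
    dob = rec["dob"].replace("/", "-")        # transformation (normalize date format)
    if not name or len(dob) != 10:            # validation (reject incomplete records)
        return None
    return {"name": name, "dob": dob}

curated = [c for c in (curate(r) for r in records) if c]   # selection of valid records
print(curated)
```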
 Data Storage is the persistence and management of data in a scalable
way that satisfies the needs of applications that require fast access to
the data.
 Relational Database Management Systems (RDBMS) have been the
main, and almost unique, solution to the storage paradigm for nearly
40 years.
 NoSQL technologies have been designed with the scalability goal in
mind and present a wide range of solutions based on alternative data
models.
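To illustrate what an "alternative data model" can look like, here is a toy key-value/document-style store built from a plain Python dictionary. This is not an actual NoSQL engine, only a sketch of the schema-free model many NoSQL systems are built around.

```python
# Not a real NoSQL engine: a toy illustration of the document/key-value
# data model that many NoSQL systems use instead of relational tables.

document_store = {}   # key -> schema-free document

document_store["user:1"] = {"name": "Abebe", "courses": ["EMTE1012"]}
document_store["user:2"] = {"name": "Sara", "email": "sara@example.com"}  # different fields

# Lookups by key are fast, and documents need not share a schema,
# which is part of what makes horizontal scaling easier.
print(document_store["user:2"]["email"])
```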
 Data Usage covers the data-driven business activities that need access
to data, its analysis, and the tools needed to integrate the data
analysis within the business activity.
 The process of decision-making includes reporting, exploration of
data (browsing and lookup), and exploratory search (finding
correlations, comparisons, what-if scenarios, etc.).
 Data has not only become the lifeblood of any organization, but is also
growing exponentially.
 Data generated today is several magnitudes larger than what was
generated just a few years ago.
 Big Data is not simply a large amount of data.
 Big data is the term for a collection of data sets so large and complex
that it becomes difficult to process using on-hand database
management tools or traditional data processing applications.
 Leading IT industry research group Gartner defines Big Data as:
“Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”
 The Big Data definition is based on the three Vs:
 Volume: the size of the data (how big it is)
 Velocity: how fast the data is being generated
 Variety: the variation of data types, including source, format, and structure (data can be unstructured, semi-structured, or structured)
Importance of Big Data
 New generation data is changing in both quantity (volume) and format
(variety).
 Explosive growth (velocity) is the most obvious example of data
change.
• IBM estimates 2.5 quintillion bytes of data are generated each day.
• Ninety percent of the data in the world is less than two years old.
 The data explosion is driven by new technologies that generate and collect vast amounts of data.
 These sources include
• Scientific sensors such as global mapping, meteorological tracking,
medical imaging, and DNA research
• Point of Sale (POS) tracking and inventory control systems
• Social media such as Facebook posts and Twitter Tweets
• Internet and intranet websites across the world
Clustered Computing
 Resource Pooling: Combining the available storage space to hold data is a clear benefit, but CPU and memory pooling are also extremely important.
 High Availability: Clusters can provide varying levels of fault tolerance and availability guarantees to prevent hardware or software failures from affecting access to data and processing.
 Easy Scalability: Clusters make it easy to scale horizontally by adding additional machines to the group.
 Using clusters requires a solution for managing cluster membership,
coordinating resource sharing, and scheduling actual work on
individual nodes.
 Cluster membership and resource allocation can be handled by
software like Hadoop’s YARN (which stands for Yet Another Resource
Negotiator).
 YARN allows the data stored in HDFS (Hadoop Distributed File System) to be processed by various data processing engines, supporting batch processing, stream processing, interactive processing, and graph processing.
Hadoop
 Open-source software from Apache Software Foundation to store and process large non-relational data sets via a large, scalable distributed model.
 It is a scalable fault-tolerant system for processing large datasets across a cluster of commodity servers.
 The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
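The best-known of these "simple programming models" is MapReduce. The single-machine Python sketch below only imitates the map and reduce phases for a word count on two made-up documents; it does not use Hadoop itself.

```python
# A single-machine sketch of the MapReduce idea: map emits (word, 1) pairs,
# the shuffle/reduce step sums them per word. Illustrative only; no Hadoop used.
from collections import defaultdict

documents = ["big data is big", "data science uses big data"]

# Map phase: each document independently emits (key, value) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle + reduce phase: group by key and aggregate the values.
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))   # {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'uses': 1}
```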
Four characteristics of Hadoop
 Economical: Its systems are highly economical as ordinary computers
can be used for data processing.
 Reliable: It is reliable as it stores copies of the data on different
machines and is resistant to hardware failure.
 Scalable: It is easily scalable, both horizontally and vertically. A few extra nodes help in scaling up the framework.
 Flexible: It is flexible; you can store as much structured and unstructured data as you need and decide how to use it later.
THANK YOU