Compiled By - Biniam Behailu
INTRODUCTION TO EMERGING TECHNOLOGIES
(EMTE1012)
CHAPTER – 2
INTRODUCTION TO DATA SCIENCE
 Describe what data science is and the role of data scientists.
➢ Differentiate data and information.
➢ Describe the data processing life cycle.
➢ Understand different data types from diverse perspectives.
 Describe the data value chain in the emerging era of big data.
➢ Understand the basics of Big Data.
➢ Describe the purpose of the Hadoop ecosystem components.
Data Science
What is Data Science?
 Data science is much more than simply analyzing data.
 Data science is a multidisciplinary field that uses scientific methods,
processes, algorithms and systems to extract knowledge and insights
from structured and unstructured data.
 Data science is a "concept to unify statistics, data analysis, machine
learning and their related methods" in order to "understand and
analyze actual phenomena" with data.
 It employs techniques and theories drawn from many fields within the
context of mathematics, statistics, computer science, and information
science.
 As an academic discipline and profession, data science continues to
evolve as one of the most promising and in-demand career paths for
skilled professionals.
 Data scientists possess a strong quantitative background in statistics and linear
algebra as well as programming knowledge with focuses on data
warehousing, mining, and modeling to build and analyze algorithms.
 Data can be described as unprocessed facts and figures.
 It is a representation of facts, concepts, or instructions in a
formalized manner, which should be suitable for communication,
interpretation, or processing, by human or electronic machines.
 It can exist in any form, usable or not. It does not have meaning by itself.
 It is represented with the help of characters such as letters (A-Z, a-z), digits (0-9) or special characters (+, -, /, *, <, >, =, etc.).
 In computer parlance, a spreadsheet generally starts out by holding
data.
 Information is data that has been given meaning by way of relational
connection.
 It is the processed data on which decisions and actions are based.
 Information is interpreted data; created from organized, structured,
and processed data in a particular context.
 In computer parlance, a relational database makes information from
the data stored within it.
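To make the distinction concrete, here is a minimal Python sketch using hypothetical daily sales figures: the raw numbers are data; the summary computed from them in a particular context is information.

```python
# Data vs. information: a minimal sketch using hypothetical daily sales figures.

daily_sales = [120, 95, 143, 88, 110]      # data: unprocessed facts and figures

total = sum(daily_sales)                   # processing gives the data meaning
average = total / len(daily_sales)

# Information: interpreted data in a particular context, usable for decisions.
print(f"Total weekly sales: {total}, daily average: {average:.1f}")
```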
 Data processing is the re-structuring or re-ordering of data by people
or machines to increase their usefulness and add value for a
particular purpose.
 Data processing consists of the following basic steps - input,
processing, and output.
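The three basic steps can be sketched in a few lines of Python; the exam scores and the pass mark below are purely hypothetical.

```python
# The three basic steps of data processing, sketched with hypothetical exam scores.

def read_input():
    # Input: collect raw data (hard-coded here; in practice from files, forms, sensors, ...)
    return [55, 78, 92, 41, 67]

def process(scores):
    # Processing: re-structure/re-order the data to increase its usefulness
    passed = sorted(s for s in scores if s >= 50)
    return {"passed": passed, "pass_rate": len(passed) / len(scores)}

def write_output(result):
    # Output: present the processed result to the user or another system
    print(f"Passing scores: {result['passed']} (pass rate {result['pass_rate']:.0%})")

write_output(process(read_input()))
```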
 Data can take many material forms including numbers, text, symbols,
images, sound, electromagnetic waves, etc.
 These are typically divided into two broad categories.
• Qualitative and
• Quantitative
 Quantitative data consist of numeric records.
 Generally, such data are extensive and relate to the
• Physical properties of phenomena (such as length, height, distance,
weight, area, volume),
• Non-physical characteristics of phenomena (such as social class,
educational attainment, quality of life rankings).
 Qualitative data deals with descriptions.
 Such data can be analyzed using visualizations and a variety of descriptive and inferential statistics, and can be used as inputs to predictive and simulation models.
 In computer science and computer programming, for instance, a data
type is simply an attribute of data that tells the compiler or
interpreter how the programmer intends to use the data.
 Almost all programming languages explicitly include the notion of data
type, though different languages may use different terminology.
 Integers (int) - used to store whole numbers, mathematically known as integers
 Booleans (bool) - used to store a value restricted to one of two states: true or false
 Characters (char) - used to store a single character
 Floating-point numbers (float) - used to store real numbers
 Alphanumeric strings (string) - used to store a combination of characters and numbers
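The same data types can be written as Python literals. This is only an illustration; note that Python has no separate character type, so a one-character string stands in for char, and all variable names below are made up.

```python
# The data types listed above, written as Python literals.

age = 25                 # integer (int): whole number
is_enrolled = True       # boolean (bool): restricted to True or False
grade = "A"              # character: a single character (a length-1 string in Python)
gpa = 3.75               # floating-point number (float): real number
student_id = "ETS0421"   # alphanumeric string (str): mix of characters and digits

for value in (age, is_enrolled, grade, gpa, student_id):
    print(value, type(value).__name__)
```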
 From a data analytics point of view, it is important to understand that there are three common data types or structures:
• Structured,
• Semi-structured, and
• Unstructured data types.
 Structured data are those that can be easily organized, stored and
transferred in a defined data model, such as numbers/text set out in a
table or relational database that have a consistent format (e.g., name,
date of birth, address, gender, etc.).
 Such data can be processed, searched, queried, combined, and
analyzed relatively straightforwardly using calculus and algorithms,
and can be visualized using various forms of graphs and maps, and
easily processed by computers.
 Often structured data is managed using Structured Query Language
(SQL).
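A small sketch of structured data being queried with SQL, using Python's built-in sqlite3 module; the table name, columns, and rows are invented for illustration only.

```python
# Structured data in a relational table, queried with SQL via Python's
# built-in sqlite3 module. Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")       # throwaway in-memory database
conn.execute("CREATE TABLE person (name TEXT, date_of_birth TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [("Abebe", "1998-04-12", "M"), ("Sara", "2001-09-30", "F")],
)

# Because every row follows the same schema, querying is straightforward.
for row in conn.execute("SELECT name, date_of_birth FROM person WHERE gender = 'F'"):
    print(row)
```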
 Unstructured data is information that either does not have a
predefined data model or is not organized in a pre-defined
manner.
 A much bigger percentage of all the data in our world is
unstructured data.
 Unstructured data is data that cannot be contained in a row-column
database and doesn’t have an associated data model.
 Common examples of unstructured data include audio and video files.
 Unstructured data is usually stored in data lakes, NoSQL databases,
and data warehouses.
 Beyond structured and unstructured data, there is a third category, which is essentially a mix of the two.
 Semi-structured data are loosely structured data that have no predefined data model/schema and thus cannot be held in a relational database.
 They contain tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
 JSON and XML are common forms of semi-structured data.
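The hypothetical JSON record below, parsed with Python's standard json module, shows the idea: keys act as tags that separate semantic elements, and nesting expresses a hierarchy, yet there is no fixed relational schema.

```python
# A hypothetical semi-structured JSON record parsed with the standard library.
import json

record = """
{
  "name": "Abebe",
  "contacts": {"email": "abebe@example.com", "phone": "+251-911-000000"},
  "courses": ["EMTE1012", "MATH1041"]
}
"""

data = json.loads(record)                 # parse the semi-structured text
print(data["contacts"]["email"])          # navigate the hierarchy by key
print(len(data["courses"]), "courses")
```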
 Metadata is data about data.
 It is one of the most important elements for Big Data analysis and big
data solutions.
 It provides additional information about a specific set of data.
 In a set of photographs, for example, metadata could describe when
and where the photos were taken.
 The metadata then provides fields for dates and locations which, by
themselves, can be considered structured data.
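A minimal sketch of the photograph example: the image itself is unstructured, but its descriptive fields (hypothetical, EXIF-like names below) are structured and can be filtered like a table.

```python
# Photo metadata sketched as a dictionary. The image is unstructured data,
# but these descriptive fields are structured data about it.
photo_metadata = {
    "file": "IMG_0042.jpg",
    "taken_on": "2023-05-14 16:22:10",
    "location": {"lat": 9.0054, "lon": 38.7636},   # approximate coordinates
    "camera": "Example Model X",
}

# Because the fields are consistent, they can be queried and filtered easily.
print(photo_metadata["taken_on"], photo_metadata["location"])
```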
 The data value chain describes the process of data creation and use
from first identifying a need for data to its final use and possible
reuse.
 A value chain is made up of a series of subsystems each with inputs,
transformation processes, and outputs.
 In a Data Value Chain, information flow is described as a series of
steps needed to generate value and useful insights from data.
 Data Acquisition is the process of gathering, filtering, and cleaning
data before it is put in a data warehouse or any other storage solution
on which data analysis can be carried out.
 Data acquisition is one of the major big data challenges in terms of
infrastructure requirements.
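A toy sketch of acquisition on hypothetical sensor readings: gather the raw values, filter out obviously invalid ones, and clean them before they reach storage. The sentinel values and the list standing in for storage are assumptions for illustration.

```python
# Data acquisition sketched on hypothetical sensor readings:
# gather -> filter invalid values -> clean/normalize -> store.

raw_readings = ["23.5", "24.1", "n/a", "-999", "22.8"]   # gathered from a source

filtered = [r for r in raw_readings if r not in ("n/a", "-999")]   # filtering
cleaned = [round(float(r), 1) for r in filtered]                   # cleaning/typing

storage = []          # stand-in for a data warehouse or other storage solution
storage.extend(cleaned)
print(storage)        # [23.5, 24.1, 22.8]
```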
 Data Analysis is concerned with making the raw data acquired
amenable to use in decision-making as well as domain-specific usage.
 Data analysis is the process of evaluating data using analytical and
statistical tools to discover useful information and aid in business
decision making.
 Data analysis involves exploring, transforming, and modeling data
with the goal of highlighting relevant data, synthesizing and extracting
useful hidden information with high potential from a business point of
view.
 Related areas include data mining, business intelligence, and
machine learning.
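A minimal analysis sketch using only Python's standard statistics module on hypothetical monthly revenue figures: descriptive statistics plus one simple derived insight of the kind that supports a business decision.

```python
# A minimal data-analysis sketch: descriptive statistics on hypothetical
# monthly revenue figures, using only the standard library.
import statistics

monthly_revenue = [12_500, 13_100, 11_800, 14_750, 15_200, 14_900]

print("mean:  ", statistics.mean(monthly_revenue))
print("median:", statistics.median(monthly_revenue))
print("stdev: ", round(statistics.stdev(monthly_revenue), 1))

# A simple derived insight: is revenue higher in the second half of the period?
first_half, second_half = monthly_revenue[:3], monthly_revenue[3:]
print("growing:", statistics.mean(second_half) > statistics.mean(first_half))
```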
 Data Curation is the active management of data over its life cycle to
ensure it meets the necessary data quality requirements for its
effective usage.
 Data curation is the organization and integration of data collected
from various sources.
 Data curation processes can be categorized into different activities
such as content creation, selection, classification, transformation,
validation, and preservation.
 It involves annotation, publication and presentation of the data such
that the value of the data is maintained over time, and the data
remains available for reuse and preservation.
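A few of these curation activities (selection, transformation, validation) can be sketched on hypothetical, inconsistently formatted records; the field names and rules below are assumptions, not a standard curation pipeline.

```python
# Curation activities sketched on hypothetical records:
# transformation, validation, and selection of the valid results.

records = [
    {"name": " abebe ", "dob": "1998-04-12"},
    {"name": "SARA", "dob": "2001/09/30"},
    {"name": "", "dob": "unknown"},
]

def curate(rec):
    name = rec["name"].strip().title()        # transformation (normalize casing/spacing)
    dob = rec["dob"].replace("/", "-")        # transformation (normalize date format)
    if not name or len(dob) != 10:            # validation (reject incomplete records)
        return None
    return {"name": name, "dob": dob}

curated = [c for c in (curate(r) for r in records) if c]   # selection of valid records
print(curated)
```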
 Data Storage is the persistence and management of data in a scalable
way that satisfies the needs of applications that require fast access to
the data.
 Relational Database Management Systems (RDBMS) have been the
main, and almost unique, solution to the storage paradigm for nearly
40 years.
 NoSQL technologies have been designed with the scalability goal in
mind and present a wide range of solutions based on alternative data
models.
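To illustrate what an "alternative data model" can look like, here is a toy key-value/document-style store built from a plain Python dictionary. This is not an actual NoSQL engine, only a sketch of the schema-free model many NoSQL systems are built around.

```python
# Not a real NoSQL engine: a toy illustration of the document/key-value
# data model that many NoSQL systems use instead of relational tables.

document_store = {}   # key -> schema-free document

document_store["user:1"] = {"name": "Abebe", "courses": ["EMTE1012"]}
document_store["user:2"] = {"name": "Sara", "email": "sara@example.com"}  # different fields

# Lookups by key are fast, and documents need not share a schema,
# which is part of what makes horizontal scaling easier.
print(document_store["user:2"]["email"])
```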
 Data Usage covers the data-driven business activities that need access
to data, its analysis, and the tools needed to integrate the data
analysis within the business activity.
 The process of decision-making includes reporting, exploration of
data (browsing and lookup), and exploratory search (finding
correlations, comparisons, what-if scenarios, etc.).
 Data has not only become the lifeblood of any organization, but is also
growing exponentially.
 Data generated today is several magnitudes larger than what was
generated just a few years ago.
 Big Data is not simply a large amount of data.
 Big data is the term for a collection of data sets so large and complex
that it becomes difficult to process using on-hand database
management tools or traditional data processing applications.
 Leading IT industry research group Gartner defines Big Data as:
“Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”
 The Big Data definition is based on the three Vs:
 Volume: the size of the data (how big it is)
 Velocity: how fast the data is being generated
 Variety: the variation of data types, including source, format, and structure (data can be unstructured, semi-structured, or structured)
Importance of Big Data
 New generation data is changing in both quantity (volume) and format
(variety).
 Explosive growth (velocity) is the most obvious example of data
change.
• IBM estimates 2.5 quintillion bytes of data are generated each day.
• Ninety percent of the data in the world is less than two years old.
 The data explosion is driven by new technologies that generate and collect vast amounts of data.
 These sources include
• Scientific sensors such as global mapping, meteorological tracking,
medical imaging, and DNA research
• Point of Sale (POS) tracking and inventory control systems
• Social media such as Facebook posts and Twitter Tweets
• Internet and intranet websites across the world
Clustered Computing
 Resource Pooling: Combining the available storage space to hold data is a clear benefit, but CPU and memory pooling are also extremely important.
 High Availability: Clusters can provide varying levels of fault tolerance and availability guarantees to prevent hardware or software failures from affecting access to data and processing.
 Easy Scalability: Clusters make it easy to scale horizontally by adding additional machines to the group.
 Using clusters requires a solution for managing cluster membership,
coordinating resource sharing, and scheduling actual work on
individual nodes.
 Cluster membership and resource allocation can be handled by
software like Hadoop’s YARN (which stands for Yet Another Resource
Negotiator).
 YARN allows the data stored in HDFS (Hadoop Distributed File System) to be processed by various data processing engines, supporting batch processing, stream processing, interactive processing, and graph processing.
Hadoop
 Open-source software from Apache Software Foundation to store and process large non-relational data sets via a large, scalable distributed model.
 It is a scalable fault-tolerant system for processing large datasets across a cluster of commodity servers.
 The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
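The best-known of these "simple programming models" is MapReduce. The single-machine Python sketch below only imitates the map and reduce phases for a word count on two made-up documents; it does not use Hadoop itself.

```python
# A single-machine sketch of the MapReduce idea: map emits (word, 1) pairs,
# the shuffle/reduce step sums them per word. Illustrative only; no Hadoop used.
from collections import defaultdict

documents = ["big data is big", "data science uses big data"]

# Map phase: each document independently emits (key, value) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle + reduce phase: group by key and aggregate the values.
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))   # {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'uses': 1}
```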
Four characteristics of Hadoop
 Economical: Its systems are highly economical as ordinary computers
can be used for data processing.
 Reliable: It is reliable as it stores copies of the data on different
machines and is resistant to hardware failure.
 Scalable: It is easily scalable, both horizontally and vertically. A few extra nodes help in scaling up the framework.
 Flexible: It is flexible; you can store as much structured and unstructured data as you need and decide how to use it later.
THANK YOU