Data Analytics (KIT-601)
Unit-1: Introduction to Data Analytics & Data
Analytics Lifecycle
Dr. Radhey Shyam
Professor
Department of Information Technology
SRMCEM Lucknow
(Affiliated to Dr. A.P.J. Abdul Kalam Technical University, Lucknow)
Unit-1 has been prepared and compiled by Dr. Radhey Shyam, with grateful acknowledgment to those who
made their course contents freely available or contributed directly or indirectly. Feel free to use this
study material for your own academic purposes. For any query, communication can be made through this
email: shyam0058@gmail.com.
March 14, 2023
Data Analytics (KIT-601)
Course Outcome (CO) and Bloom's Knowledge Level (KL)
At the end of the course, the student will be able to:
CO 1 Discuss various concepts of the data analytics pipeline (K1, K2)
CO 2 Apply classification and regression techniques (K3)
CO 3 Explain and apply mining techniques on streaming data (K2, K3)
CO 4 Compare different clustering and frequent pattern mining algorithms (K4)
CO 5 Describe the concept of R programming and implement analytics on Big Data using R (K2, K3)
DETAILED SYLLABUS (3-0-0)

Unit I (Proposed Lectures: 08)
Introduction to Data Analytics: Sources and nature of data, classification of data (structured, semi-structured, unstructured), characteristics of data, introduction to Big Data platform, need of data analytics, evolution of analytic scalability, analytic process and tools, analysis vs reporting, modern data analytic tools, applications of data analytics.
Data Analytics Lifecycle: Need, key roles for successful analytic projects, various phases of data analytics lifecycle – discovery, data preparation, model planning, model building, communicating results, operationalization.

Unit II (Proposed Lectures: 08)
Data Analysis: Regression modeling, multivariate analysis, Bayesian modeling, inference and Bayesian networks, support vector and kernel methods, analysis of time series: linear systems analysis & nonlinear dynamics, rule induction, neural networks: learning and generalisation, competitive learning, principal component analysis and neural networks, fuzzy logic: extracting fuzzy models from data, fuzzy decision trees, stochastic search methods.

Unit III (Proposed Lectures: 08)
Mining Data Streams: Introduction to streams concepts, stream data model and architecture, stream computing, sampling data in a stream, filtering streams, counting distinct elements in a stream, estimating moments, counting ones in a window, decaying window, Real-time Analytics Platform (RTAP) applications, case studies – real-time sentiment analysis, stock market predictions.

Unit IV (Proposed Lectures: 08)
Frequent Itemsets and Clustering: Mining frequent itemsets, market-basket modelling, Apriori algorithm, handling large data sets in main memory, limited-pass algorithms, counting frequent itemsets in a stream, clustering techniques: hierarchical, K-means, clustering high-dimensional data, CLIQUE and ProCLUS, frequent-pattern-based clustering methods, clustering in non-Euclidean space, clustering for streams and parallelism.

Unit V (Proposed Lectures: 08)
Frameworks and Visualization: MapReduce, Hadoop, Pig, Hive, HBase, MapR, Sharding, NoSQL databases, S3, Hadoop Distributed File System, Visualization: visual data analysis techniques, interaction techniques, systems and applications.
Introduction to R – R graphical user interfaces, data import and export, attribute and data types, descriptive statistics, exploratory data analysis, visualization before analysis, analytics for unstructured data.
Text books and References:
1. Michael Berthold, David J. Hand, Intelligent Data Analysis, Springer.
2. Anand Rajaraman and Jeffrey David Ullman, Mining of Massive Datasets, Cambridge University Press.
3. John Garrett, Data Analytics for IT Networks: Developing Innovative Use Cases, Pearson Education.
Curriculum & Evaluation Scheme IT & CSI (V & VI semester) 23
Part-I: Introduction to Data Analytics
1 Introduction To Big Data
What Is Big Data?
• Big Data is often described as extremely large data sets that have grown beyond the ability to manage and analyze them with traditional data processing tools.
• Big Data defines a situation in which data sets have grown to such enormous sizes that conventional information technologies can no longer effectively handle either the size of the data set or the scale and growth of the data set.
• In other words, the data set has grown so large that it is difficult to manage and even harder to garner value out of it.
• The primary difficulties are the acquisition, storage, searching, sharing, analytics, and visualization of data.
• Big Data has its roots in the scientific and medical communities, where the complex analysis of massive amounts of data has been done for drug development, physics modeling, and other forms of research, all of which involve large data sets.
These 4Vs of Big Data (see Figure 1) [13]¹ lay out the path to analytics, with each having intrinsic value
in the process of discovering value. Nevertheless, the complexity of Big Data does not end with just four
¹ Volume—Organizations collect data from a variety of sources, including transactions, smart (IoT) devices, industrial equipment, videos, images, audio, social media and more. In the past, storing all that data would have been too costly, but cheaper storage using data lakes, Hadoop and the cloud has eased the burden.
Velocity—With the growth of the Internet of Things, data streams into businesses at an unprecedented speed and must be handled in a timely manner. RFID tags, sensors and smart meters are driving the need to deal with these torrents of data in near-real time.
Variety—Data comes in all types of formats, from structured, numeric data in traditional databases to unstructured text documents, emails, videos, audio, stock ticker data and financial transactions.
Veracity—Veracity refers to the quality of data. Because data comes from so many different sources, it is difficult to link, match, cleanse and transform data across systems. Businesses need to connect and correlate relationships, hierarchies and multiple data linkages; otherwise, their data can quickly spiral out of control.
Value—This refers to the value that big data can provide, and it relates directly to what organizations can do with that collected data. It is often quantified as the potential social or economic value that the data might create.
Figure 1: Illustration of Big Data [14].
dimensions. There are other factors at work as well: the processes that Big Data drives. These processes
are a conglomeration of technologies and analytics that are used to define the value of data sources, which
translates to actionable elements that move businesses forward.
Three more Vs are sometimes added:
Volatility—It deals with "How long is the data valid?"
Validity—It refers to the accuracy and correctness of data. Any data picked up for analysis needs to be accurate.
Variability—In addition to the increasing velocities and varieties of data, data flows are unpredictable, changing often and varying greatly. It is challenging, but businesses need to know when something is trending on social media and how to manage daily, seasonal and event-triggered peak data loads.
Many of those technologies or concepts are not new but have come to fall under the umbrella of Big
Data. Best defined as analysis categories, these technologies and concepts include the following:
Traditional business intelligence (BI). This consists of a broad category of applications and technologies
for gathering, storing, analyzing, and providing access to data. BI delivers actionable information, which
helps enterprise users make better business decisions using fact-based support systems. BI works by using
an in-depth analysis of detailed business data, provided by databases, application data, and other tangible
data sources.
In some circles, BI can provide historical, current, and predictive views of business operations.
Data mining. This is a process in which data are analyzed from different perspectives and then turned
into summary data that are deemed useful. Data mining is normally used with data at rest or with archival
data. Data mining techniques focus on modeling and knowledge discovery for predictive, rather than purely
descriptive, purposes—an ideal process for uncovering new patterns from large data sets.
Statistical applications. These look at data using algorithms based on statistical principles and normally
concentrate on data sets related to polls, census, and other static data sets. Statistical applications ideally
deliver sample observations that can be used to study populated data sets for the purpose of estimating,
testing, and predictive analysis. Empirical data, such as surveys and experimental reporting, are the primary
sources for analyzable information.
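As a toy illustration of how statistical applications work from samples, the Python sketch below (synthetic data, all figures made up for illustration) estimates a population mean from a random sample:

```python
import random
import statistics

random.seed(42)

# A "population": 10,000 synthetic survey responses (ages), standing in
# for a full data set that a statistical application rarely sees whole.
population = [random.randint(18, 80) for _ in range(10_000)]

# Draw sample observations and use them to estimate a population parameter.
sample = random.sample(population, 500)

estimate = statistics.mean(sample)
actual = statistics.mean(population)

print(f"sample mean:     {estimate:.1f}")
print(f"population mean: {actual:.1f}")
# With a reasonably sized random sample, the estimate lands close to the
# population value, which is what makes estimating and testing on samples viable.
```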
Predictive analysis. This is a subset of statistical applications in which data sets are examined to come up with predictions, based on trends and information gleaned from databases. Predictive analysis tends to be big in the financial and scientific worlds, where trending tends to drive predictions once external elements are added to the data set. One of the main goals of predictive analysis is to identify the risks and opportunities for business processes, markets, and manufacturing.
Data modeling. This is a conceptual application of analytics in which multiple "what-if" scenarios can be applied via algorithms to multiple data sets. Ideally, the modeled information changes based on the information made available to the algorithms, which then provide insight into the effects of the change on the data sets. Data modeling works hand in hand with data visualization, in which uncovering information can help with a particular business endeavor.
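The "what-if" style of data modeling described above can be sketched in a few lines of Python; the model, figures, and scenario names below are entirely hypothetical:

```python
# Hypothetical monthly profit model: profit = units * price - fixed costs.
# A "what-if" scenario changes one input and re-runs the same model.

def revenue_model(units_sold, unit_price, fixed_costs):
    """A deliberately simple model of monthly profit (illustrative only)."""
    return units_sold * unit_price - fixed_costs

baseline = {"units_sold": 1_000, "unit_price": 25.0, "fixed_costs": 8_000.0}

scenarios = {
    "baseline": {},
    "price_cut_10pct": {"unit_price": 22.5},    # what if we cut price 10%?
    "demand_up_20pct": {"units_sold": 1_200},   # what if demand rises 20%?
}

for name, overrides in scenarios.items():
    params = {**baseline, **overrides}          # apply the scenario's changes
    print(f"{name:>16}: profit = {revenue_model(**params):,.0f}")
```

The modeled output changes as each scenario alters the information made available to the algorithm, which is the essence of the what-if approach.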
The preceding analysis categories constitute only a portion of where Big Data is headed and why it has
intrinsic value to business. That value is driven by the never ending quest for a competitive advantage,
encouraging organizations to turn to large repositories of corporate and external data to uncover trends,
statistics, and other actionable information to help them decide on their next move. This has helped the
concept of Big Data to gain popularity with technologists and executives alike, along with its associated
tools, platforms, and analytics.
1.1 ARRIVAL OF ANALYTICS
• As analytics and research were applied to large data sets, scientists came to the conclusion that more is better—in this case, more data, more analysis, and more results.
• Researchers started to incorporate related data sets, unstructured data, archival data, and real-time data into the process.
• In the business world, Big Data is all about opportunity.
• According to IBM, every day we create 2.5 quintillion (2.5 × 10¹⁸) bytes of data, so much that 90 percent of the data in the world today has been created in the last two years.
• These data come from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos posted online, transaction records of online purchases, and cell phone GPS signals, to name just a few.
• That is the catalyst for Big Data, along with the more important fact that all of these data have intrinsic value that can be extrapolated using analytics, algorithms, and other techniques.
• NOAA uses Big Data approaches to aid in climate, ecosystem, weather, and commercial research, while NASA uses Big Data for aeronautical and other research. Pharmaceutical companies and energy companies have leveraged Big Data for more tangible results, such as drug testing and geophysical analysis.
• The New York Times has used Big Data tools for text analysis and Web mining, while the Walt Disney Company uses them to correlate and understand customer behavior in all of its stores, theme parks, and Web properties.
• Big Data is full of challenges, ranging from the technical to the conceptual to the operational, any of which can derail the ability to discover value and leverage what Big Data is all about.
2 Characteristics of Data
• Data is a collection of details in the form of figures, text, symbols, descriptions, etc.
• Data contains raw figures and facts. Information, unlike data, provides insights analyzed through the data collected. Data has three characteristics:
1. Composition:—The composition of data deals with the structure of data, i.e., the sources of data, the granularity, and the types and nature of data as to whether it is static or real-time streaming.
2. Condition:—The condition of data deals with the state of data, i.e., "Can one use this data as is for analysis?" or "Does it require cleaning for further enhancement and enrichment?"
3. Context:—The context of data deals with "Where has this data been generated?", "Why was this data generated?", "How sensitive is this data?", and "What are the events associated with this data?"
3 Data Classification
The volume and overall size of the data set is only one portion of the Big Data equation. There is a growing
consensus that both semi-structured and unstructured data sources contain business-critical information and
must therefore be made accessible for both BI and operational needs. It is also clear that the amount of
relevant unstructured business data is not only growing but will continue to grow for the foreseeable future.
Data can be classified under several categories:
1. Structured data:—Structured data are normally found in traditional databases (SQL or others)
where data are organized into tables based on defined business rules. Structured data usually prove
to be the easiest type of data to work with, simply because the data are defined and indexed, making
access and filtering easier. For example, Database, Spread sheets, OLTP systems.
2. Semi-structured data:—Semi-structured data fall between unstructured and structured data. Semi-structured data do not have a formal structure like a database with tables and relationships. However, unlike unstructured data, semi-structured data have tags or other markers to separate the elements and provide a hierarchy of records and fields, which define the data. For example, XML, JSON, E-mail.
3. Unstructured data:—Unstructured data, in contrast, normally have no BI behind them. Unstructured data are not organized into tables and cannot be natively used by applications or interpreted by a database. A good example of unstructured data would be a collection of binary image files. For example, memos, chat rooms, PowerPoint presentations, images, videos, letters, research papers, white papers, the body of an email, etc.
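A small Python sketch makes the semi-structured case concrete. The JSON record below (made up for illustration) has markers (keys) that define a hierarchy of records and fields, but no rigid table schema:

```python
import json

# A semi-structured record: keys act as the tags/markers that define the
# fields and their hierarchy; fields may be nested and need not be uniform.
raw = '''
{
  "order_id": 1001,
  "customer": {"name": "A. Kumar", "email": "a.kumar@example.com"},
  "items": [
    {"sku": "B-17", "qty": 2},
    {"sku": "C-02", "qty": 1}
  ]
}
'''

order = json.loads(raw)

# Navigate the hierarchy of records and fields:
print(order["customer"]["name"])                     # A. Kumar
total_qty = sum(item["qty"] for item in order["items"])
print(total_qty)                                     # 3
```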
4 Introduction to Big Data Platform
Big data platforms refer to software technologies that are designed to manage and process large volumes of
data, often in real-time or near-real-time. These platforms are typically used by businesses and organizations
that generate or collect massive amounts of data, such as social media companies, financial institutions, and
healthcare providers.
There are several key components of big data platforms, including:
• Data storage: Big data platforms provide large-scale data storage capabilities, often utilizing distributed file systems or NoSQL² databases to accommodate large amounts of data.
• Data processing: Big data platforms offer powerful data processing capabilities, often utilizing parallel processing, distributed computing, and real-time streaming processing to analyze and transform data.
• Data analytics: Big data platforms provide advanced analytics capabilities, often utilizing machine learning algorithms, statistical models, and visualization tools to extract insights from large datasets.
• Data integration: Big data platforms allow for integration with other data sources, such as databases, APIs, and streaming data sources, to provide a unified view of data.
Some of the most popular big data platforms include Hadoop, Apache Spark, Apache Cassandra, Apache
Storm, and Apache Kafka. These platforms are open source and freely available, making them accessible to
organizations of all sizes.
² To overcome the rigidity of normalized RDBMS schemas, big data systems adopt NoSQL. NoSQL, also known as "Not Only SQL" [15], is an approach to managing and storing unstructured and non-relational data; the HBase database is one example.
5 Need of Data Analytics
Data analytics is the process of examining and analyzing large sets of data to uncover useful insights, patterns,
and trends. There are several reasons why organizations and businesses need data analytics:
1. Better decision-making: Data analytics can provide valuable insights that enable organizations to
make better-informed decisions. By analyzing data, organizations can identify patterns and trends
that may not be visible through intuition or traditional methods of analysis.
2. Improved efficiency: Data analytics can help organizations optimize their operations and improve
efficiency. By analyzing data on business processes, organizations can identify areas for improvement
and streamline operations to reduce costs and increase productivity.
3. Enhanced customer experience: Data analytics can help organizations gain a better understanding
of their customers and their preferences. By analyzing customer data, organizations can tailor their
products and services to better meet customer needs, resulting in a more satisfying customer experience.
4. Competitive advantage: Data analytics can provide organizations with a competitive advantage
by enabling them to make better-informed decisions and identify new opportunities for growth. By
leveraging data analytics, organizations can stay ahead of their competitors and position themselves
for success.
5. Risk management: Data analytics can help organizations identify potential risks and mitigate them
before they become major issues. By analyzing data on business processes and operations, organizations
can identify potential areas of risk and take steps to prevent them from occurring.
In summary, data analytics is essential for organizations looking to improve their decision-making, ef-
ficiency, customer experience, competitive advantage, and risk management. By leveraging the insights
provided by data analytics, organizations can stay ahead of the curve and position themselves for long-term
success.
6 Evolution of Data Analytics Scalability
The evolution of data analytics scalability has been driven by the need to process and analyze ever-increasing
volumes of data. Here are some of the key stages in the evolution of data analytics scalability:
1. Traditional databases: In the early days of data analytics, traditional databases were used to store
and analyze data. These databases were limited in their ability to handle large volumes of data, which
made them unsuitable for many analytics use cases.
2. Data warehouses: To address the limitations of traditional databases, data warehouses were devel-
oped in the 1990s. Data warehouses were designed to store and manage large volumes of structured
data, providing a more scalable solution for data analytics.
3. Hadoop and MapReduce: In the mid-2000s, Hadoop and MapReduce were developed as open-
source solutions for big data processing. These technologies enabled organizations to store and analyze
massive volumes of data in a distributed computing environment, making data analytics more scalable
and cost-effective.
4. Cloud computing: With the rise of cloud computing in the 2010s, organizations were able to scale
their data analytics infrastructure more easily and cost-effectively. Cloud-based data analytics plat-
forms such as Amazon Web Services (AWS) and Microsoft Azure provided scalable storage and pro-
cessing capabilities for big data.
5. Real-time analytics: With the growth of the Internet of Things (IoT) and other real-time data
sources, the need for real-time analytics capabilities became increasingly important. Technologies such
as Apache Kafka and Apache Spark Streaming were developed to enable real-time processing and
analysis of streaming data.
6. Machine learning and AI: In recent years, machine learning and artificial intelligence (AI) have
become key components of data analytics scalability. These technologies enable organizations to analyze
and make predictions based on massive volumes of data, providing valuable insights for decision-making
and business optimization.
Overall, the evolution of data analytics scalability has been driven by the need to process and analyze
increasingly large and complex datasets. With the development of new technologies and approaches, orga-
nizations are now able to derive insights from data at a scale that would have been unimaginable just a few
decades ago.
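The MapReduce idea from stage 3 (map each record to key-value pairs, group by key, then reduce each group) can be sketched as a single-process toy in Python. This shows only the programming model, not Hadoop itself, and the documents are made up:

```python
from collections import defaultdict
from itertools import chain

documents = [
    "big data needs big tools",
    "data tools evolve",
]

# Map phase: each document emits (word, 1) pairs.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle phase: group all emitted values by key.
grouped = defaultdict(list)
for key, value in chain.from_iterable(map_phase(d) for d in documents):
    grouped[key].append(value)

# Reduce phase: combine the values for each key.
counts = {word: sum(ones) for word, ones in grouped.items()}
print(counts)  # e.g. {'big': 2, 'data': 2, ...}
```

In a real cluster, the map and reduce functions look much the same, but the framework distributes them across machines and handles the shuffle over the network.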
Figure 2: Illustration of types of analytics.
7 What is Data Analytics?
Data analytics is the process of examining large sets of data to extract insights, identify patterns, and make
informed decisions. It involves using various techniques, including statistical analysis, machine learning, and
data visualization, to analyze data and draw conclusions from it.
Data analytics can be applied to different types of data, including structured data (e.g., data stored in
databases) and unstructured data (e.g., social media posts, emails, and images). The goal of data analytics
is to turn raw data into meaningful and actionable insights that can help organizations make better decisions
and improve their operations.
Data analytics is used in many different fields, including business, healthcare, finance, marketing, and
social sciences. It can help businesses identify opportunities for growth, optimize their marketing strategies,
reduce costs, and improve customer experiences. In healthcare, data analytics can be used to predict and
prevent diseases, improve patient outcomes, and optimize resource allocation.
Overall, data analytics is a powerful tool that enables organizations to make informed decisions and gain
a competitive edge in today’s data-driven world.
7.1 Types of Data Analytics
There are five types of data analytics (see Figure 2):
1. Descriptive Analytics:—What is happening in your business? It gives insight into whether everything is going well in the business, without explaining the root cause.
2. Diagnostic Analytics:—Why is it happening in your business? It explains the root cause behind the outcome of descriptive analytics.
3. Predictive Analytics:—Explains what is likely to happen in the future based on previous trends and patterns. It utilizes various statistical and machine learning algorithms to provide recommendations and to answer questions about what might happen next, beyond what BI can answer.
4. Prescriptive Analytics:—Helps you determine the best course of action to bypass or eliminate future issues. You can use prescriptive analytics to advise users on possible outcomes and on what they should do to maximize their key business metrics.
5. Cognitive Analytics:—Combines a number of intelligent techniques, such as AI, ML, and DL, to apply human-brain-like intelligence to perform certain tasks.
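To make the descriptive and predictive types concrete, the minimal Python sketch below (synthetic sales figures) first summarizes what happened, then fits a least-squares trend line and extrapolates one month ahead. Real predictive analytics would use far richer models:

```python
# Synthetic monthly sales figures (illustrative only).
sales = [100, 108, 115, 121, 130, 138]

# Descriptive analytics: what is happening?
mean = sum(sales) / len(sales)
growth = sales[-1] - sales[0]
print(f"average monthly sales: {mean:.1f}, total growth: {growth}")

# Predictive analytics (simplest possible form): fit a straight line
# through the points by least squares and extrapolate one month ahead.
n = len(sales)
xs = range(n)
x_mean = sum(xs) / n
slope = (sum((x - x_mean) * (y - mean) for x, y in zip(xs, sales))
         / sum((x - x_mean) ** 2 for x in xs))
intercept = mean - slope * x_mean
forecast = slope * n + intercept
print(f"trend per month: {slope:.1f}, forecast for next month: {forecast:.1f}")
```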
8 Analytic Processes and Tools
There are several analytic processes and tools used in data analytics to extract insights from data. Here are
some of the most commonly used:
1. Data collection: This involves gathering relevant data from various sources, including databases,
data warehouses, and data lakes.
2. Data cleaning: Once the data is collected, it needs to be cleaned and preprocessed to remove any
errors, duplicates, or inconsistencies.
3. Data integration: This involves combining data from different sources into a single, unified dataset
that can be used for analysis.
4. Data analysis: This is the core of data analytics, where various techniques such as statistical analysis,
machine learning, and data mining are used to extract insights from the data.
5. Data visualization: Once the data has been analyzed, it is often visualized using graphs, charts, and
other visual aids to make it easier to understand and communicate the findings.
6. Business intelligence (BI) tools: These are software tools that help organizations make sense of
their data by providing dashboards, reports, and other tools for data visualization and analysis.
7. Big data tools: These are specialized tools designed to handle large volumes of data and process it
efficiently. Examples include Apache Hadoop, Apache Spark, and Apache Storm.
8. Machine learning tools: These are tools that use algorithms to learn from data and make predictions
or decisions based on that learning. Examples include scikit-learn, TensorFlow, and Keras.
Overall, the tools and processes used in data analytics are constantly evolving, driven by advances in tech-
nology and the increasing demand for data-driven insights in various industries.
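Steps 2 and 3 above (cleaning and integration) can be sketched in plain Python; the records and field names below are made up for illustration:

```python
# Two made-up sources with overlapping, messy customer records.
source_a = [
    {"id": 1, "name": " Alice ", "age": "34"},
    {"id": 2, "name": "Bob", "age": "not available"},  # invalid value
    {"id": 1, "name": " Alice ", "age": "34"},          # duplicate row
]
source_b = [
    {"id": 3, "name": "Chen", "age": "29"},
]

def clean(record):
    """Trim whitespace and validate types; return None for unusable rows."""
    try:
        age = int(record["age"])
    except ValueError:
        return None  # drop rows whose age field is not numeric
    return {"id": record["id"], "name": record["name"].strip(), "age": age}

# Integration: combine both sources into one unified dataset,
# cleaning each row and de-duplicating by id (first occurrence wins).
unified = {}
for rec in source_a + source_b:
    cleaned = clean(rec)
    if cleaned is not None and cleaned["id"] not in unified:
        unified[cleaned["id"]] = cleaned

print(sorted(unified.values(), key=lambda r: r["id"]))
# [{'id': 1, 'name': 'Alice', 'age': 34}, {'id': 3, 'name': 'Chen', 'age': 29}]
```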
9 Analysis vs Reporting
Analysis and reporting are two important aspects of data management and interpretation, but they serve
different purposes.
Reporting involves the presentation of information in a standardized format, typically using charts,
graphs, or tables. The purpose of reporting is to provide a clear and concise overview of data and to
communicate key insights to stakeholders. Reporting is often used to provide regular updates on business
performance, highlight trends, or share key metrics with stakeholders.
Analysis, on the other hand, involves the exploration and interpretation of data to gain insights and make
informed decisions. Analysis involves digging deeper into the data to identify patterns, relationships, and
trends that may not be immediately apparent from simple reporting. Analysis often involves using statistical
techniques, modeling, and machine learning to extract insights from the data.
In summary, reporting is focused on presenting data in a clear and concise way, while analysis is focused
on exploring and interpreting the data to gain insights and make decisions. Both reporting and analysis are
important for effective data management, but they serve different purposes and require different skills and
tools.
10 Modern Data Analytic Tools
There are many modern data analytic tools available today that are designed to help organizations analyze
and interpret large volumes of data. Here are some of the most popular ones:
1. Tableau: This is a popular data visualization tool that allows users to create interactive dashboards
and reports from their data. It supports a wide range of data sources and is used by many organizations
to quickly visualize and explore data.
2. Power BI: This is a business analytics service provided by Microsoft that allows users to create
interactive visualizations and reports from their data. It integrates with other Microsoft products like
Excel and SharePoint, making it a popular choice for organizations that use these tools.
3. Google Analytics: This is a free web analytics service provided by Google that allows users to track
and analyze website traffic. It provides a wealth of data on user behavior, including pageviews, bounce
rates, and conversion rates.
4. Apache Spark: This is a fast and powerful open-source data processing engine that can be used for
large-scale data processing, machine learning, and graph processing. It supports multiple programming
languages, including Java, Scala, and Python.
5. Python: This is a popular programming language for data analysis and machine learning. It has a
large and active community that has developed many libraries and tools for data analysis, including
pandas, NumPy, and scikit-learn.
6. R: This is another popular programming language for data analysis and statistical computing. It has a
large library of statistical and graphical techniques and is used by many researchers and data analysts.
Overall, these are just a few examples of the many modern data analytic tools available today. Organizations
can choose the tools that best fit their needs and use them to gain insights and make informed decisions
based on their data.
11 Applications of Data Analytics
Data analytics has a wide range of applications across industries and organizations. Here are some of the
most common applications:
1. Business intelligence: Data analytics is used to analyze data and generate insights that help orga-
nizations make data-driven decisions. Business intelligence tools and techniques are used to track key
performance indicators (KPIs), monitor business processes, and identify trends and patterns.
2. Marketing: Data analytics is used to analyze customer behavior, preferences, and demographics to de-
velop targeted marketing campaigns. This includes analyzing website traffic, social media engagement,
and email marketing campaigns.
3. Healthcare: Data analytics is used in healthcare to analyze patient data and improve patient out-
comes. This includes analyzing electronic health records (EHRs) to identify disease patterns and
improve treatment plans, as well as analyzing clinical trial data to develop new treatments and drugs.
4. Finance: Data analytics is used in finance to analyze financial data and identify trends and patterns.
This includes analyzing stock prices, predicting market trends, and identifying fraudulent activity.
5. Manufacturing: Data analytics is used in manufacturing to optimize production processes and im-
prove product quality. This includes analyzing sensor data from production lines, predicting equipment
failures, and identifying quality issues.
6. Human resources: Data analytics is used in human resources to analyze employee data and identify
areas for improvement. This includes analyzing employee performance, identifying training needs, and
predicting employee turnover.
7. Transportation: Data analytics is used in transportation to optimize logistics and improve cus-
tomer service. This includes analyzing shipping data to optimize routes and delivery times, as well as
analyzing customer data to improve the customer experience.
Overall, data analytics has a wide range of applications across industries and organizations, and is increas-
ingly seen as a critical tool for success in the modern business world.
Part-II: Data Analytics Life-cycle
1 What is Data Analytics Life Cycle?
Data is precious in today’s digital environment. It goes through several life stages, including creation,
testing, processing, consumption, and reuse. These stages are mapped out in the Data Analytics Life Cycle
for professionals working on data analytics initiatives. Each stage has its significance and characteristics.
1.1 Key Roles for Successful Analytic Projects
There are several key roles that are essential for successful analytic projects. These roles are:
• Project Sponsor: The project sponsor is the person who champions the project and is responsible for securing funding and resources. They are the driving force behind the project and are accountable for its success.
• Project Manager: The project manager is responsible for the overall planning, coordination, and execution of the project. They ensure that the project is completed on time, within budget, and meets the required quality standards.
• Data Analyst: The data analyst is responsible for collecting, analyzing, and interpreting data. They use statistical methods and software tools to identify patterns and relationships in the data, and to develop insights and recommendations.
• Data Scientist: The data scientist is responsible for developing predictive models and algorithms. They use machine learning and other advanced techniques to analyze complex data sets and to uncover hidden patterns and trends.
• Subject Matter Expert: The subject matter expert (SME) is an individual who has deep knowledge and expertise in a particular domain. They provide insights into the context and meaning of the data, and help to ensure that the project aligns with the business objectives.
• IT Specialist: The IT specialist is responsible for managing the technical infrastructure that supports the project. They ensure that the necessary hardware and software are in place, and that the system is secure, scalable, and reliable.
• Business Analyst: The business analyst is responsible for understanding the business requirements
and translating them into technical specifications. They work closely with the project manager and
data analyst to ensure that the project meets the needs of the business.
• Quality Assurance Specialist: The quality assurance specialist is responsible for testing the project
deliverables against the required quality standards. They perform various tests and evaluations to
identify defects and ensure that the system functions as intended.
Each of these roles is essential for the success of analytic projects, and the team must work together closely
to achieve the project objectives.
1.2 Importance of Data Analytics Life Cycle
In today’s digital-first world, data is of immense importance. It undergoes various stages throughout its life,
during its creation, testing, processing, consumption, and reuse. Data Analytics Lifecycle maps out these
stages for professionals working on data analytics projects. These phases are arranged in a circular structure
that forms a Data Analytics Lifecycle. (See Figure 3). Each step has its significance and characteristics.
The Data Analytics Lifecycle is designed for significant big data projects. Because the cycle is iterative, it
represents real projects accurately. A step-by-step approach is needed to organize the activities and tasks
involved in gathering, processing, analyzing, and reusing data so that the various needs for assessing big
data can be explored. Data analysis is the process of cleaning, transforming, and processing raw data to
obtain useful, significant information that supports business decision-making.
1.3 Data Analytics Lifecycle Phases
There is no single, fixed structure for the phases of the Data Analytics life cycle, so these steps may vary.
Some data professionals follow additional steps, while others skip some stages altogether or work on
different phases simultaneously. Let us discuss the various phases of the data analytics life cycle.
This guide covers the fundamental phases of the data analytics process, which are the ones most likely
to appear in the lifecycle of a data analytics project. The Data Analytics lifecycle primarily consists of six
phases.
Figure 3: Illustration of phases of data analytics lifecycle [12].
1.3.1 Phase 1: Data Discovery
This phase is all about defining the data’s purpose and how to achieve it by the end of the data analytics
lifecycle. The stage consists of identifying critical objectives a business is trying to discover by mapping out
the data. During this process, the team learns about the business domain and checks whether the business
unit or organization has worked on similar projects to refer to any learnings.
The team also evaluates technology, people, data, and time in this phase. For example, the team can
use Excel when dealing with a small dataset, but heftier tasks demand more robust tools for data
preparation and exploration, such as Python, R, Tableau Desktop or Tableau Prep, and other data-cleaning
tools.
This phase’s critical activities include framing the business problem, formulating initial hypotheses to
test, and beginning data learning.
1.3.2 Phase 2: Data Preparation
In this phase, the experts’ focus shifts from business requirements to information requirements. One of the
essential aspects of this phase is ensuring data availability for processing. The stage encompasses collecting,
processing, and cleansing the accumulated data.
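The cleansing step can be sketched in a few lines of Python; the records, field names, and cleaning rules below are hypothetical illustrations, not a prescribed procedure:

```python
# Minimal data-cleansing sketch: drop duplicate records and normalize a text
# field. The "records" data and field names are hypothetical.
records = [
    {"customer_id": 1, "city": " Lucknow "},
    {"customer_id": 1, "city": " Lucknow "},   # duplicate row
    {"customer_id": 2, "city": "KANPUR"},
]

seen = set()
cleaned = []
for row in records:
    key = row["customer_id"]
    if key in seen:                            # skip duplicate customers
        continue
    seen.add(key)
    row["city"] = row["city"].strip().title()  # normalize whitespace and case
    cleaned.append(row)

print(cleaned)
```

In practice the same idea scales up via tools such as pandas or dedicated data-preparation software, but the logic (deduplicate, then standardize fields) is the same.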
1.3.3 Phase 3: Model Planning
This phase needs the availability of an analytic sandbox for the team to work with data and perform analytics
throughout the project duration. The team can load data in several ways.
• Extract, Transform, Load (ETL) – transforms the data according to a set of business rules before loading
it into the sandbox.
• Extract, Load, Transform (ELT) – loads the data into the sandbox first and then transforms it according
to a set of business rules.
• Extract, Transform, Load, Transform (ETLT) – a combination of ETL and ELT with two
transformation levels.
The team identifies variables for categorizing data, and finds and corrects data errors. Data errors can
be anything, including missing data, illogical values, duplicates, and spelling errors. For example, the team
may impute the category average for missing values, which enables more efficient data processing
without skewing the data.
After cleaning the data, the team determines the techniques, methods, and workflow for building a model
in the next phase. The team explores the data, identifies relations between data points to select the key
variables, and eventually devises a suitable model.
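The mean-imputation approach mentioned above can be sketched in Python using only the standard library; the score column below is a hypothetical example:

```python
import statistics

# Sketch of mean imputation: missing values (None) are replaced with the
# mean of the observed values. The score values are hypothetical.
scores = [4.0, None, 3.5, None, 5.0]

observed = [v for v in scores if v is not None]
mean = statistics.mean(observed)                      # mean of observed values
imputed = [mean if v is None else v for v in scores]  # fill the gaps

print(imputed)
```

Mean imputation preserves the column average, which is why it avoids skewing the data, though it does shrink the variance slightly.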
1.3.4 Phase 4: Model Building
The team develops training, testing, and production datasets in this phase. Further, the team builds and
executes the models planned during the model planning phase, testing them against the data to answer
the given objectives. Various statistical modeling methods are used, such as regression techniques,
decision trees, random forests, and neural networks, and a trial run determines whether the model
fits the datasets.
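The dataset-splitting step at the start of this phase can be sketched as follows; the 80/20 ratio and toy records are assumptions for illustration, and in practice a library routine such as scikit-learn's train_test_split is typically used:

```python
import random

# Sketch of splitting a dataset into training and testing sets before
# model building. The 100 toy records and 80/20 ratio are assumptions.
random.seed(42)                   # fixed seed for a reproducible split
data = list(range(100))           # 100 example record identifiers
random.shuffle(data)              # shuffle so the split is random

split = int(0.8 * len(data))      # 80% for training
train, test = data[:split], data[split:]

print(len(train), len(test))      # 80 20
```

Keeping the test set untouched until evaluation is what makes the trial run a fair check of whether the model generalizes to unseen data.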
1.3.5 Phase 5: Communication and Publication of Results
This phase aims to determine whether the project results are a success or failure and to begin collaborating
with key stakeholders. The team identifies the vital findings of its analysis, measures the associated
business value, and creates a summarized narrative to convey the results to stakeholders.
1.3.6 Phase 6: Operationalize/Measuring of Effectiveness
In this final phase, the team presents an in-depth report with coding, briefing, key findings, and technical
documents and papers to the stakeholders. Besides this, the data is moved to a live environment and
monitored to measure the analysis’s effectiveness. If the findings are in line with the objective, the results
and reports are finalized. On the other hand, if they deviate from the set intent, the team moves backward
in the lifecycle to any previous phase to change the input and get a different outcome.
1.4 Data Analytics Lifecycle Example
Consider an example of a retail store chain that wants to optimize its products’ prices to boost its revenue.
The store chain has thousands of products over hundreds of outlets, making it a highly complex scenario.
Once you identify the store chain’s objective, you find the data you need, prepare it, and go through the
Data Analytics lifecycle process.
You observe different types of customers, such as ordinary customers and contractors who buy in bulk.
You suspect that treating the various customer types differently could provide the solution, but you do
not have enough information about it and need to discuss this with the client team.
In this case, you need to refine the problem definition, find the data, and conduct hypothesis testing to
check whether the various customer types impact the model results. Once you are satisfied with the
model results, you can deploy the model, integrate it into the business, and roll out the prices you
consider optimal across the outlets of the store.
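One way to carry out the hypothesis test in this example is a simple permutation test, sketched below in Python; the basket sizes for the two customer types are made-up illustration data:

```python
import random
import statistics

# Sketch of a permutation test: does average purchase size differ between
# ordinary customers and bulk-buying contractors? All data is hypothetical.
random.seed(0)
ordinary = [5, 7, 6, 4, 8, 5, 6]        # items per visit, ordinary customers
contractors = [20, 25, 18, 30, 22]      # items per visit, contractors

observed = statistics.mean(contractors) - statistics.mean(ordinary)

# Under the null hypothesis (customer type does not matter), labels are
# exchangeable, so we shuffle the pooled data and recompute the difference.
pooled = ordinary + contractors
n = len(contractors)
trials = 5000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[:n]) - statistics.mean(pooled[n:])
    if diff >= observed:
        count += 1

p_value = count / trials   # one-sided p-value; small -> customer type matters
print(p_value)
```

A small p-value supports treating the customer types separately in the pricing model; a large one suggests a single model may suffice.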
Printed Page: 1 of 2
Subject Code: KIT601
Roll No: 0 0 0 0 0 0 0 0 0 0 0 0 0
BTECH
(SEM VI) THEORY EXAMINATION 2021-22
DATA ANALYTICS
Time: 3 Hours Total Marks: 100
Note: Attempt all Sections. If you require any missing data, then choose suitably.
SECTION A
1. Attempt all questions in brief. 2*10 = 20
Qno Questions CO
(a) Discuss the need of data analytics. 1
(b) Give the classification of data. 1
(c) Define neural network. 2
(d) What is multivariate analysis? 2
(e) Give the full form of RTAP and discuss its application. 3
(f) What is the role of sampling data in a stream? 3
(g) Discuss the use of limited pass algorithm. 4
(h) What is the principle behind hierarchical clustering technique? 4
(i) List five R functions used in descriptive statistics. 5
(j) List the names of any 2 visualization tools. 5
SECTION B
2. Attempt any three of the following: 10*3 = 30
Qno Questions CO
(a) Explain the process model and computation model for Big data
platform.
1
(b) Explain the use and advantages of decision trees. 2
(c) Explain the architecture of data stream model. 3
(d) Illustrate the K-means algorithm in detail with its advantages. 4
(e) Differentiate between NoSQL and RDBMS databases. 5
SECTION C
3. Attempt any one part of the following: 10*1 = 10
Qno Questions CO
(a) Explain the various phases of data analytics life cycle. 1
(b) Explain modern data analytics tools in detail. 1
4. Attempt any one part of the following: 10 *1 = 10
Qno Questions CO
(a) Compare various types of support vector and kernel methods of data
analysis.
2
(b) Given data= {2,3,4,5,6,7;1,5,3,6,7,8}. Compute the principal
component using PCA algorithm.
2
5. Attempt any one part of the following: 10*1 = 10
Qno Questions CO
(a) Explain any one algorithm to count number of distinct elements in a
data stream.
3
(b) Discuss the case study of stock market predictions in detail. 3
6. Attempt any one part of the following: 10*1 = 10
Qno Questions CO
(a) Differentiate between CLIQUE and ProCLUS clustering. 4
(b) A database has 5 transactions. Let min_sup=60% and min_conf=80%.
TID Items_Bought
T100 {M, O, N, K, E, Y}
T200 {D, O, N, K, E, Y}
T300 {M, A, K, E}
T400 {M, U, C, K, Y}
T500 {C, O, O, K, I, E}
i) Find all frequent itemsets using Apriori algorithm.
ii) List all the strong association rules (with support s and confidence
c).
4
7. Attempt any one part of the following: 10*1 = 10
Qno Questions CO
(a) Explain the HIVE architecture with its features in detail. 5
(b) Write R function to check whether the given number is prime or not. 5
2 References
[1] https://www.jigsawacademy.com/blogs/hr-analytics/data-analytics-lifecycle/
[2] https://statacumen.com/teach/ADA1/ADA1_notes_F14.pdf
[3] https://www.youtube.com/watch?v=fDRa82lxzaU
[4] https://www.investopedia.com/terms/d/data-analytics.asp
[5] http://egyankosh.ac.in/bitstream/123456789/10935/1/Unit-2.pdf
[6] http://epgp.inflibnet.ac.in/epgpdata/uploads/epgp_content/computer_science/16._data_analytics/03._evolution_of_analytical_scalability/et/9280_et_3_et.pdf
[7] https://bhavanakhivsara.files.wordpress.com/2018/06/data-science-and-big-data-analy-nieizv_book.pdf
[8] https://www.researchgate.net/publication/317214679_Sentiment_Analysis_for_Effective_Stock_Market_Prediction
[9] https://snscourseware.org/snscenew/files/1569681518.pdf
[10] http://csis.pace.edu/ctappert/cs816-19fall/books/2015DataScience&BigDataAnalytics.pdf
[11] https://www.youtube.com/watch?v=mccsmoh2_3c
[12] https://mentalmodels4life.net/2015/11/18/agile-data-science-applying-kanban-in-the-analytics-li
[13] https://www.sas.com/en_in/insights/big-data/what-is-big-data.html#:~:text=Big%20data%20refers%20to%20data,around%20for%20a%20long%20time.
[14] https://www.javatpoint.com/big-data-characteristics
[15] Liu, S., Wang, M., Zhan, Y., & Shi, J. (2009). Daily work stress and alcohol use: Testing the cross-level moderation effects of neuroticism and job involvement. Personnel Psychology, 62(3), 575–597. http://dx.doi.org/10.1111/j.1744-6570.2009.01149.x
********************
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptx
 

KIT-601 Lecture Notes-UNIT-1.pdf

Data Analytics (KIT 601)

Course Outcomes (CO) with Bloom's Knowledge Level (KL). At the end of the course, the student will be able to:
CO 1: Discuss various concepts of the data analytics pipeline (K1, K2)
CO 2: Apply classification and regression techniques (K3)
CO 3: Explain and apply mining techniques on streaming data (K2, K3)
CO 4: Compare different clustering and frequent pattern mining algorithms (K4)
CO 5: Describe the concept of R programming and implement analytics on Big Data using R (K2, K3)

DETAILED SYLLABUS (3-0-0)

Unit I (08 lectures)
Introduction to Data Analytics: sources and nature of data, classification of data (structured, semi-structured, unstructured), characteristics of data, introduction to the Big Data platform, need of data analytics, evolution of analytic scalability, analytic process and tools, analysis vs reporting, modern data analytic tools, applications of data analytics.
Data Analytics Lifecycle: need, key roles for successful analytic projects, various phases of the data analytics lifecycle – discovery, data preparation, model planning, model building, communicating results, operationalization.

Unit II (08 lectures)
Data Analysis: regression modeling, multivariate analysis, Bayesian modeling, inference and Bayesian networks, support vector and kernel methods, analysis of time series: linear systems analysis and nonlinear dynamics, rule induction, neural networks: learning and generalisation, competitive learning, principal component analysis and neural networks, fuzzy logic: extracting fuzzy models from data, fuzzy decision trees, stochastic search methods.

Unit III (08 lectures)
Mining Data Streams: introduction to stream concepts, stream data model and architecture, stream computing, sampling data in a stream, filtering streams, counting distinct elements in a stream, estimating moments, counting oneness in a window, decaying window, Real-Time Analytics Platform (RTAP) applications, case studies – real-time sentiment analysis, stock market predictions.

Unit IV (08 lectures)
Frequent Itemsets and Clustering: mining frequent itemsets, market-based modelling, Apriori algorithm, handling large data sets in main memory, limited-pass algorithms, counting frequent itemsets in a stream, clustering techniques: hierarchical, K-means, clustering high-dimensional data, CLIQUE and ProCLUS, frequent-pattern-based clustering methods, clustering in non-Euclidean space, clustering for streams and parallelism.

Unit V (08 lectures)
Frameworks and Visualization: MapReduce, Hadoop, Pig, Hive, HBase, MapR, sharding, NoSQL databases, S3, Hadoop Distributed File System, visualization: visual data analysis techniques, interaction techniques, systems and applications. Introduction to R – R graphical user interfaces, data import and export, attribute and data types, descriptive statistics, exploratory data analysis, visualization before analysis, analytics for unstructured data.

Text books and references:
1. Michael Berthold and David J. Hand, Intelligent Data Analysis, Springer.
2. Anand Rajaraman and Jeffrey David Ullman, Mining of Massive Datasets, Cambridge University Press.
3. John Garrett, Data Analytics for IT Networks: Developing Innovative Use Cases, Pearson Education.
Part-I: Introduction to Data Analytics

1 Introduction to Big Data

What is Big Data?
• Big Data is often described as extremely large data sets that have grown beyond the ability to manage and analyze them with traditional data processing tools.
• Big Data describes a situation in which data sets have grown to such enormous sizes that conventional information technologies can no longer effectively handle either the size of the data set or its scale and growth.
• In other words, the data set has grown so large that it is difficult to manage and even harder to garner value out of it.
• The primary difficulties are the acquisition, storage, searching, sharing, analytics, and visualization of data.
• Big Data has its roots in the scientific and medical communities, where complex analysis of massive amounts of data has long been done for drug development, physics modeling, and other forms of research involving large data sets.

These 4Vs (see Figure 1) [13] of Big Data lay out the path to analytics, with each having intrinsic value in the process of discovering value:
• Volume: Organizations collect data from a variety of sources, including transactions, smart (IoT) devices, industrial equipment, videos, images, audio, social media and more. In the past, storing all that data would have been too costly, but cheaper storage using data lakes, Hadoop and the cloud has eased the burden.
• Velocity: With the growth of the Internet of Things, data streams into businesses at an unprecedented speed and must be handled in a timely manner. RFID tags, sensors and smart meters are driving the need to deal with these torrents of data in near-real time.
• Variety: Data comes in all types of formats, from structured, numeric data in traditional databases to unstructured text documents, emails, videos, audio, stock ticker data and financial transactions.
• Veracity: Veracity refers to the quality of data. Because data comes from so many different sources, it is difficult to link, match, cleanse and transform data across systems. Businesses need to connect and correlate relationships, hierarchies and multiple data linkages; otherwise, their data can quickly spiral out of control.
• Value: This refers to the value that big data can provide, and it relates directly to what organizations can do with the collected data. It is often quantified as the potential social or economic value that the data might create.

Figure 1: Illustration of Big Data [14].

Nevertheless, the complexity of Big Data does not end with these dimensions. Related dimensions are sometimes added:
• Volatility: How long is the data valid?
• Validity: The accuracy and correctness of the data; any data picked up for analysis needs to be accurate.
• Variability: In addition to increasing velocities and varieties of data, data flows are unpredictable, changing often and varying greatly. It is challenging, but businesses need to know when something is trending in social media and how to manage daily, seasonal and event-triggered peak data loads.

There are other factors at work as well: the processes that Big Data drives. These processes are a conglomeration of technologies and analytics that are used to define the value of data sources, which translates into actionable elements that move businesses forward. Many of these technologies or concepts are not new but have come to fall under the umbrella of Big Data. Best defined as analysis categories, they include the following:

Traditional business intelligence (BI). This consists of a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data. BI delivers actionable information that helps enterprise users make better business decisions using fact-based support systems. BI works through in-depth analysis of detailed business data provided by databases, application data, and other tangible data sources. In some circles, BI can provide historical, current, and predictive views of business operations.

Data mining. This is a process in which data are analyzed from different perspectives and then turned into summary data that are deemed useful. Data mining is normally used with data at rest or with archival data. Data mining techniques focus on modeling and knowledge discovery for predictive, rather than purely descriptive, purposes: an ideal process for uncovering new patterns from large data sets.

Statistical applications. These look at data using algorithms based on statistical principles and normally concentrate on data sets related to polls, censuses, and other static data sets. Statistical applications ideally deliver sample observations that can be used to study population data sets for the purpose of estimation, testing, and predictive analysis. Empirical data, such as surveys and experimental reporting, are the primary sources of analyzable information.

Predictive analysis. This is a subset of statistical applications in which data sets are examined to come up with predictions based on trends and information gleaned from databases. Predictive analysis tends to be big in the financial and scientific worlds, where trending tends to drive predictions once external elements are added to the data set. One of its main goals is to identify the risks and opportunities for business processes, markets, and manufacturing.

Data modeling. This is a conceptual application of analytics in which multiple "what-if" scenarios can be applied via algorithms to multiple data sets. Ideally, the modeled information changes based on the information made available to the algorithms, which then provide insight into the effects of the change on the data sets. Data modeling works hand in hand with data visualization, in which uncovering information can help with a particular business endeavor.
The preceding analysis categories constitute only a portion of where Big Data is headed and why it has intrinsic value to business. That value is driven by the never-ending quest for a competitive advantage, encouraging organizations to turn to large repositories of corporate and external data to uncover trends, statistics, and other actionable information to help them decide on their next move. This has helped the concept of Big Data gain popularity with technologists and executives alike, along with its associated tools, platforms, and analytics.

1.1 ARRIVAL OF ANALYTICS

• As analytics and research were applied to large data sets, scientists came to the conclusion that more is better: more data, more analysis, and more results.
• Researchers started to incorporate related data sets, unstructured data, archival data, and real-time data into the process.
• In the business world, Big Data is all about opportunity.
• According to IBM, every day we create 2.5 quintillion (2.5 × 10^18) bytes of data, so much that 90 percent of the data in the world today has been created in the last two years.
• These data come from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos posted online, transaction records of online purchases, and cell phone GPS signals, to name just a few.
• That is the catalyst for Big Data, along with the more important fact that all of these data have intrinsic value that can be extrapolated using analytics, algorithms, and other techniques.
• NOAA uses Big Data approaches to aid in climate, ecosystem, weather, and commercial research, while NASA uses Big Data for aeronautical and other research. Pharmaceutical companies and energy companies have leveraged Big Data for more tangible results, such as drug testing and geophysical analysis.
• The New York Times has used Big Data tools for text analysis and Web mining, while the Walt Disney Company uses them to correlate and understand customer behavior across its stores, theme parks, and Web properties.
• Big Data is full of challenges, ranging from the technical to the conceptual to the operational, any of which can derail the ability to discover value and leverage what Big Data is all about.

2 Characteristics of Data

• Data is a collection of details in the form of figures, text, symbols, descriptions, etc.
• Data contains raw figures and facts. Information, unlike data, provides insights analyzed from the data collected.

Data has three characteristics:
1. Composition: The composition of data deals with the structure of data, i.e., the sources of data, the granularity, and the types and nature of data, such as whether it is static or real-time streaming.
2. Condition: The condition of data deals with the state of data, i.e., "Can one use this data as-is for analysis?" or "Does it require cleaning for further enhancement and enrichment?"
3. Context: The context of data deals with "Where was this data generated?", "Why was this data generated?", "How sensitive is this data?", and "What are the events associated with this data?"

3 Data Classification

The volume and overall size of the data set is only one portion of the Big Data equation. There is a growing consensus that both semi-structured and unstructured data sources contain business-critical information and must therefore be made accessible for both BI and operational needs. It is also clear that the amount of relevant unstructured business data is not only growing but will continue to grow for the foreseeable future. Data can be classified into several categories:
1. Structured data: Structured data are normally found in traditional databases (SQL or others), where data are organized into tables based on defined business rules. Structured data usually prove to be the easiest type of data to work with, simply because the data are defined and indexed, making access and filtering easier. Examples: databases, spreadsheets, OLTP systems.
2. Semi-structured data: Semi-structured data fall between unstructured and structured data. They do not have a formal structure like a database with tables and relationships. However, unlike unstructured data, semi-structured data have tags or other markers to separate the elements and provide a hierarchy of records and fields, which define the data. Examples: XML, JSON, e-mail.
3. Unstructured data: Unstructured data, in contrast, normally have no BI behind them. They are not organized into tables and cannot be natively used by applications or interpreted by a database. A good example of unstructured data would be a collection of binary image files. Examples: memos, chat rooms, PowerPoint presentations, images, videos, letters, research papers, white papers, the body of an email, etc.

4 Introduction to Big Data Platform

Big data platforms are software technologies designed to manage and process large volumes of data, often in real time or near-real time. These platforms are typically used by businesses and organizations that generate or collect massive amounts of data, such as social media companies, financial institutions, and healthcare providers. There are several key components of big data platforms:
• Data storage: Big data platforms provide large-scale data storage capabilities, often utilizing distributed file systems or NoSQL databases to accommodate large amounts of data. (To overcome the rigidity of normalized RDBMS schemas, big data systems accept NoSQL, also known as "Not Only SQL" [15], a method to manage and store unstructured and non-relational data; the HBase database is one example.)
• Data processing: Big data platforms offer powerful data processing capabilities, often utilizing parallel processing, distributed computing, and real-time streaming processing to analyze and transform data.
• Data analytics: Big data platforms provide advanced analytics capabilities, often utilizing machine learning algorithms, statistical models, and visualization tools to extract insights from large datasets.
• Data integration: Big data platforms allow integration with other data sources, such as databases, APIs, and streaming data sources, to provide a unified view of data.

Some of the most popular big data platforms include Hadoop, Apache Spark, Apache Cassandra, Apache Storm, and Apache Kafka. These platforms are open source and freely available, making them accessible to organizations of all sizes.
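The structured/semi-structured distinction from Section 3 can be made concrete in a few lines of Python. The sketch below uses made-up customer records: the JSON carries its own field markers but no fixed schema (one record has an email, the other a phone), and flattening it into a table with fixed columns exposes exactly the kind of cleaning that structured storage demands.

```python
import json

# A semi-structured record set (hypothetical example): tags mark the
# fields, but there is no fixed schema -- the two records carry
# different fields.
raw = '''
[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "phone": "+91-9000000000"}
]
'''

records = json.loads(raw)

# Flattening into a structured (tabular) form forces a fixed set of
# columns; fields absent from a record become None, which is the gap
# a structured store would require us to handle.
columns = ["id", "name", "email", "phone"]
table = [[rec.get(col) for col in columns] for rec in records]

print(columns)
for row in table:
    print(row)
```

An unstructured counterpart, such as the body of an email, would offer no such markers at all; extracting fields from it requires text mining rather than simple flattening.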
5 Need of Data Analytics

Data analytics is the process of examining and analyzing large sets of data to uncover useful insights, patterns, and trends. There are several reasons why organizations and businesses need data analytics:
1. Better decision-making: Data analytics can provide valuable insights that enable organizations to make better-informed decisions. By analyzing data, organizations can identify patterns and trends that may not be visible through intuition or traditional methods of analysis.
2. Improved efficiency: Data analytics can help organizations optimize their operations and improve efficiency. By analyzing data on business processes, organizations can identify areas for improvement and streamline operations to reduce costs and increase productivity.
3. Enhanced customer experience: Data analytics can help organizations gain a better understanding of their customers and their preferences. By analyzing customer data, organizations can tailor their products and services to better meet customer needs, resulting in a more satisfying customer experience.
4. Competitive advantage: Data analytics can provide organizations with a competitive advantage by enabling them to make better-informed decisions and identify new opportunities for growth. By leveraging data analytics, organizations can stay ahead of their competitors and position themselves for success.
5. Risk management: Data analytics can help organizations identify potential risks and mitigate them before they become major issues. By analyzing data on business processes and operations, organizations can identify potential areas of risk and take steps to prevent problems from occurring.

In summary, data analytics is essential for organizations looking to improve their decision-making, efficiency, customer experience, competitive advantage, and risk management. By leveraging the insights provided by data analytics, organizations can stay ahead of the curve and position themselves for long-term success.

6 Evolution of Data Analytics Scalability

The evolution of data analytics scalability has been driven by the need to process and analyze ever-increasing volumes of data. Here are some of the key stages in that evolution:
1. Traditional databases: In the early days of data analytics, traditional databases were used to store and analyze data. These databases were limited in their ability to handle large volumes of data, which made them unsuitable for many analytics use cases.
2. Data warehouses: To address the limitations of traditional databases, data warehouses were developed in the 1990s. Data warehouses were designed to store and manage large volumes of structured data, providing a more scalable solution for data analytics.
3. Hadoop and MapReduce: In the mid-2000s, Hadoop and MapReduce were developed as open-source solutions for big data processing. These technologies enabled organizations to store and analyze massive volumes of data in a distributed computing environment, making data analytics more scalable and cost-effective.
4. Cloud computing: With the rise of cloud computing in the 2010s, organizations were able to scale their data analytics infrastructure more easily and cost-effectively. Cloud-based data analytics platforms such as Amazon Web Services (AWS) and Microsoft Azure provided scalable storage and processing capabilities for big data.
5. Real-time analytics: With the growth of the Internet of Things (IoT) and other real-time data sources, real-time analytics capabilities became increasingly important. Technologies such as Apache Kafka and Apache Spark Streaming were developed to enable real-time processing and analysis of streaming data.
6. Machine learning and AI: In recent years, machine learning and artificial intelligence (AI) have become key components of data analytics scalability. These technologies enable organizations to analyze and make predictions based on massive volumes of data, providing valuable insights for decision-making and business optimization.

Overall, the evolution of data analytics scalability has been driven by the need to process and analyze increasingly large and complex datasets. With the development of new technologies and approaches, organizations are now able to derive insights from data at a scale that would have been unimaginable just a few decades ago.
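Stage 3 above, Hadoop and MapReduce, rests on a simple programming model: a map step emits per-record key-value pairs and a reduce step merges them, which is what lets the work spread across a cluster. The classic word count can be sketched in plain Python (run locally here, whereas Hadoop would distribute both phases; the two documents are made up for illustration):

```python
from collections import Counter
from functools import reduce

# Hypothetical input "documents"; on a real cluster each would live on
# a different node.
documents = [
    "big data platforms process large volumes of data",
    "data analytics turns data into insight",
]

def map_phase(doc):
    # Map: emit a partial count of (word, occurrences) per document.
    return Counter(doc.split())

# Reduce: merge the partial counts into one global count. Counter
# addition sums matching keys, mirroring the shuffle-and-reduce step.
partials = [map_phase(d) for d in documents]
total = reduce(lambda a, b: a + b, partials, Counter())

print(total.most_common(3))
```

Because each map call touches only its own document, the map phase parallelizes trivially; the reduce phase only has to combine small partial results.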
Figure 2: Illustration of types of analytics.

7 What is Data Analytics?

Data analytics is the process of examining large sets of data to extract insights, identify patterns, and make informed decisions. It involves using various techniques, including statistical analysis, machine learning, and data visualization, to analyze data and draw conclusions from it. Data analytics can be applied to different types of data, including structured data (e.g., data stored in databases) and unstructured data (e.g., social media posts, emails, and images). The goal of data analytics is to turn raw data into meaningful, actionable insights that help organizations make better decisions and improve their operations.

Data analytics is used in many different fields, including business, healthcare, finance, marketing, and the social sciences. It can help businesses identify opportunities for growth, optimize their marketing strategies, reduce costs, and improve customer experiences. In healthcare, data analytics can be used to predict and prevent diseases, improve patient outcomes, and optimize resource allocation.

Overall, data analytics is a powerful tool that enables organizations to make informed decisions and gain a competitive edge in today's data-driven world.
7.1 Types of Data Analytics

There are five types of data analytics (see Figure 2):
1. Descriptive analytics: What is happening in your business? It gives insight into whether everything is going well in the business, without explaining the root cause.
2. Diagnostic analytics: Why is it happening in your business? It explains the root cause behind the outcomes surfaced by descriptive analytics.
3. Predictive analytics: Explains what is likely to happen in the future based on previous trends and patterns. It uses various statistical and machine learning algorithms to provide recommendations and answer questions about what might happen next, going beyond what traditional BI can answer.
4. Prescriptive analytics: Helps you determine the best course of action to bypass or eliminate future issues. Prescriptive analytics can advise users on possible outcomes and what they should do to maximize their key business metrics.
5. Cognitive analytics: Combines a number of intelligent techniques, such as AI, ML, and DL, to apply human-brain-like intelligence to perform certain tasks.

8 Analytic processes and tools

There are several analytic processes and tools used in data analytics to extract insights from data. Here are some of the most commonly used:
1. Data collection: Gathering relevant data from various sources, including databases, data warehouses, and data lakes.
2. Data cleaning: Once the data is collected, it needs to be cleaned and preprocessed to remove any errors, duplicates, or inconsistencies.
3. Data integration: Combining data from different sources into a single, unified dataset that can be used for analysis.
4. Data analysis: The core of data analytics, where techniques such as statistical analysis, machine learning, and data mining are used to extract insights from the data.
5. Data visualization: Once the data has been analyzed, it is often visualized using graphs, charts, and other visual aids to make it easier to understand and communicate the findings.
6. Business intelligence (BI) tools: Software tools that help organizations make sense of their data by providing dashboards, reports, and other facilities for data visualization and analysis.
7. Big data tools: Specialized tools designed to handle large volumes of data and process them efficiently. Examples include Apache Hadoop, Apache Spark, and Apache Storm.
8. Machine learning tools: Tools that use algorithms to learn from data and make predictions or decisions based on that learning. Examples include scikit-learn, TensorFlow, and Keras.

Overall, the tools and processes used in data analytics are constantly evolving, driven by advances in technology and the increasing demand for data-driven insights in various industries.

9 Analysis vs Reporting

Analysis and reporting are two important aspects of data management and interpretation, but they serve different purposes.

Reporting involves the presentation of information in a standardized format, typically using charts, graphs, or tables. The purpose of reporting is to provide a clear and concise overview of data and to communicate key insights to stakeholders. Reporting is often used to provide regular updates on business performance, highlight trends, or share key metrics with stakeholders.

Analysis, on the other hand, involves the exploration and interpretation of data to gain insights and make informed decisions. Analysis involves digging deeper into the data to identify patterns, relationships, and trends that may not be immediately apparent from simple reporting. Analysis often involves using statistical techniques, modeling, and machine learning to extract insights from the data.
In summary, reporting is focused on presenting data in a clear and concise way, while analysis is focused on exploring and interpreting the data to gain insights and make decisions. Both reporting and analysis are important for effective data management, but they serve different purposes and require different skills and tools.
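The process steps listed in Section 8 (collection, cleaning, integration, analysis) can be sketched end to end in a few lines of plain Python. The monthly sales figures below are made up for illustration: two sources are collected, duplicates and missing values are dropped, the sources are merged by month, and simple descriptive statistics are computed.

```python
import statistics

# Step 1: collection -- raw (month, amount) records from two
# hypothetical sources; note the duplicate and the missing value.
sales_a = [("2024-01", 120), ("2024-01", 120), ("2024-02", 95), ("2024-03", None)]
sales_b = [("2024-02", 110), ("2024-03", 130)]

# Step 2: cleaning -- drop exact duplicates and records with missing
# amounts.
cleaned = []
seen = set()
for row in sales_a + sales_b:
    if row[1] is None or row in seen:
        continue
    seen.add(row)
    cleaned.append(row)

# Step 3: integration -- a single unified dataset, grouped by month.
by_month = {}
for month, amount in cleaned:
    by_month.setdefault(month, []).append(amount)

# Step 4: analysis -- descriptive statistics per month.
for month in sorted(by_month):
    vals = by_month[month]
    print(month, "total:", sum(vals), "mean:", statistics.mean(vals))
```

In practice a library such as pandas would replace the hand-rolled loops (e.g. `drop_duplicates`, `groupby`), but the four stages are the same.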
10 Modern Data Analytic Tools

There are many modern data analytic tools available today that are designed to help organizations analyze and interpret large volumes of data. Here are some of the most popular ones:
1. Tableau: A popular data visualization tool that allows users to create interactive dashboards and reports from their data. It supports a wide range of data sources and is used by many organizations to quickly visualize and explore data.
2. Power BI: A business analytics service provided by Microsoft that allows users to create interactive visualizations and reports from their data. It integrates with other Microsoft products such as Excel and SharePoint, making it a popular choice for organizations that use those tools.
3. Google Analytics: A free web analytics service provided by Google that allows users to track and analyze website traffic. It provides a wealth of data on user behavior, including pageviews, bounce rates, and conversion rates.
4. Apache Spark: A fast and powerful open-source data processing engine that can be used for large-scale data processing, machine learning, and graph processing. It supports multiple programming languages, including Java, Scala, and Python.
5. Python: A popular programming language for data analysis and machine learning. It has a large and active community that has developed many libraries and tools for data analysis, including pandas, NumPy, and scikit-learn.
6. R: Another popular programming language for data analysis and statistical computing. It has a large library of statistical and graphical techniques and is used by many researchers and data analysts.

Overall, these are just a few examples of the many modern data analytic tools available today. Organizations can choose the tools that best fit their needs and use them to gain insights and make informed decisions based on their data.
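As a small illustration of what such tools do, predictive analysis at its simplest is fitting a trend and extrapolating. Libraries such as scikit-learn handle this and far more; the closed-form ordinary least squares below is a dependency-free sketch using hypothetical monthly revenue figures.

```python
# Fit a trend line y = a + b*x by ordinary least squares and
# extrapolate one step ahead. All figures are made up for illustration.
months = [1, 2, 3, 4, 5]             # hypothetical time index
revenue = [100, 104, 109, 113, 118]  # hypothetical observations

n = len(months)
mean_x = sum(months) / n
mean_y = sum(revenue) / n

# Slope: covariance of (x, y) divided by variance of x.
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(months, revenue))
     / sum((x - mean_x) ** 2 for x in months))
a = mean_y - b * mean_x  # intercept

forecast = a + b * 6  # predict month 6
print(f"slope={b:.2f}, forecast for month 6: {forecast:.1f}")
```

Real predictive work adds held-out validation, more features, and uncertainty estimates, but the core idea of learning parameters from history and applying them forward is the same.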
11 Applications of Data Analytics

Data analytics has a wide range of applications across industries and organizations. Here are some of the most common:
1. Business intelligence: Data analytics is used to analyze data and generate insights that help organizations make data-driven decisions. Business intelligence tools and techniques are used to track key performance indicators (KPIs), monitor business processes, and identify trends and patterns.
2. Marketing: Data analytics is used to analyze customer behavior, preferences, and demographics to develop targeted marketing campaigns. This includes analyzing website traffic, social media engagement, and email marketing campaigns.
3. Healthcare: Data analytics is used in healthcare to analyze patient data and improve patient outcomes. This includes analyzing electronic health records (EHRs) to identify disease patterns and improve treatment plans, as well as analyzing clinical trial data to develop new treatments and drugs.
4. Finance: Data analytics is used in finance to analyze financial data and identify trends and patterns. This includes analyzing stock prices, predicting market trends, and identifying fraudulent activity.
5. Manufacturing: Data analytics is used in manufacturing to optimize production processes and improve product quality. This includes analyzing sensor data from production lines, predicting equipment failures, and identifying quality issues.
6. Human resources: Data analytics is used in human resources to analyze employee data and identify areas for improvement. This includes analyzing employee performance, identifying training needs, and predicting employee turnover.
7. Transportation: Data analytics is used in transportation to optimize logistics and improve customer service. This includes analyzing shipping data to optimize routes and delivery times, as well as analyzing customer data to improve the customer experience.

Overall, data analytics has a wide range of applications across industries and organizations, and is increasingly seen as a critical tool for success in the modern business world.
Part-II: Data Analytics Life-cycle

1 What is the Data Analytics Life Cycle?

Data is precious in today's digital environment. It goes through several life stages, including creation, testing, processing, consumption, and reuse. These stages are mapped out in the Data Analytics Life Cycle for professionals working on data analytics initiatives. Each stage has its own significance and characteristics.

1.1 Key Roles for Successful Analytic Projects

Several key roles are essential for successful analytic projects:

• Project Sponsor: The project sponsor champions the project and is responsible for securing funding and resources. They are the driving force behind the project and are accountable for its success.

• Project Manager: The project manager is responsible for the overall planning, coordination, and execution of the project. They ensure that the project is completed on time, within budget, and meets the required quality standards.

• Data Analyst: The data analyst is responsible for collecting, analyzing, and interpreting data. They use statistical methods and software tools to identify patterns and relationships in the data, and to develop insights and recommendations.

• Data Scientist: The data scientist is responsible for developing predictive models and algorithms. They use machine learning and other advanced techniques to analyze complex data sets and to uncover hidden patterns and trends.

• Subject Matter Expert: The subject matter expert (SME) has deep knowledge and expertise in a particular domain. They provide insights into the context and meaning of the data, and help to ensure that the project aligns with the business objectives.

• IT Specialist: The IT specialist is responsible for managing the technical infrastructure that supports the project. They ensure that the necessary hardware and software are in place, and that the system is secure, scalable, and reliable.
• Business Analyst: The business analyst is responsible for understanding the business requirements and translating them into technical specifications. They work closely with the project manager and data analyst to ensure that the project meets the needs of the business.

• Quality Assurance Specialist: The quality assurance specialist is responsible for testing the project deliverables to ensure that they meet the required quality standards. They perform various tests and evaluations to identify defects and ensure that the system functions as intended.

Each of these roles is essential to the success of analytic projects, and the team must work together closely to achieve the project objectives.

1.2 Importance of the Data Analytics Life Cycle

In today's digital-first world, data is of immense importance. It undergoes various stages throughout its life: creation, testing, processing, consumption, and reuse. The Data Analytics Lifecycle maps out these stages for professionals working on data analytics projects. The phases are arranged in a circular structure that forms the lifecycle (see Figure 3), and each step has its own significance and characteristics.

The Data Analytics Lifecycle is designed for significant big data projects. To portray an actual project correctly, the cycle is iterative. A step-by-step technique is needed to organize the actions and tasks involved in gathering, processing, analyzing, and reusing data, and to explore the various needs for assessing information on big data. Data analysis is the process of modifying, processing, and cleaning raw data to obtain useful, significant information that supports business decision-making.

1.3 Data Analytics Lifecycle Phases

There is no single defined structure for the phases in the life cycle of Data Analytics; thus, these steps may not be uniform across projects.
Some data professionals follow additional steps, while others skip some stages altogether or work on different phases simultaneously. This guide discusses the fundamental phases of the data analytics process, which are therefore the ones most likely to be present in most data analytics projects' lifecycles. The Data Analytics lifecycle primarily consists of six phases.
Figure 3: Illustration of the phases of the data analytics lifecycle [12].

1.3.1 Phase 1: Data Discovery

This phase is all about defining the data's purpose and how to achieve it by the end of the data analytics lifecycle. It consists of identifying the critical objectives a business is trying to reach by mapping out the data. During this process, the team learns about the business domain and checks whether the business unit or organization has worked on similar projects whose learnings can be reused. The team also evaluates technology, people, data, and time in this phase. For example, the team can use Excel when dealing with a small dataset; heftier tasks demand more robust tools for data preparation and exploration, such as Python, R, Tableau Desktop or Tableau Prep, and other data-cleaning tools.

This phase's critical activities include framing the business problem, formulating initial hypotheses to test, and beginning to learn the data.
1.3.2 Phase 2: Data Preparation

In this phase, the experts' focus shifts from business requirements to information requirements. One of the essential aspects of this phase is ensuring data availability for processing. The stage encompasses collecting, processing, and cleansing the accumulated data.

1.3.3 Phase 3: Model Planning

This phase requires an analytic sandbox in which the team can work with data and perform analytics throughout the project. The team can load data in several ways:

• Extract, Transform, Load (ETL) – transforms the data based on a set of business rules before loading it into the sandbox.

• Extract, Load, Transform (ELT) – loads the data into the sandbox and then transforms it based on a set of business rules.

• Extract, Transform, Load, Transform (ETLT) – a combination of ETL and ELT with two transformation levels.

The team identifies variables for categorizing the data, and identifies and corrects data errors. Data errors can include missing data, illogical values, duplicates, and spelling errors. For example, the team may impute the average score of a category for its missing values, which enables more efficient processing without skewing the data.

After cleaning the data, the team determines the techniques, methods, and workflow for building a model in the next phase. The team explores the data, identifies relations between data points to select the key variables, and eventually devises a suitable model.

1.3.4 Phase 4: Model Building

The team develops testing, training, and production datasets in this phase. Further, the team builds and executes models meticulously, as planned during the model planning phase. They test the data and try to find answers to the given objectives.
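As a minimal sketch of developing training and testing sets (the data below is an illustrative stand-in, not from the text), one can shuffle the records with a fixed seed and hold out a fraction for testing:

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle the rows, then hold out the last fraction as a test set."""
    rows = rows[:]                      # copy, so the caller's list is untouched
    random.Random(seed).shuffle(rows)   # seeded for reproducibility
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

data = list(range(100))                 # stand-in for real records
train, test = train_test_split(data)
print(len(train), len(test))            # 75 25
```

Seeding the shuffle keeps repeated trial runs comparable, which matters when the team iterates between model planning and model building.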
They use various statistical modeling methods, such as regression techniques, decision trees, random forests, and neural networks, and perform a trial run to determine whether the model corresponds to the datasets.
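As a minimal illustration of such a trial run, the sketch below fits an ordinary least-squares line on the training portion of a small synthetic dataset (the numbers are invented for illustration) and measures its error on the held-out points:

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# Train on the first four points; hold out the last two as a test set.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]
a, b = fit_line(x[:4], y[:4])
holdout_error = mean(abs((a + b * xi) - yi)
                     for xi, yi in zip(x[4:], y[4:]))
print(round(b, 2))  # slope ≈ 2.0 on this synthetic data
```

A small holdout error suggests the model corresponds to the dataset; a large one sends the team back to model planning, reflecting the iterative nature of the lifecycle.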
1.3.5 Phase 5: Communication and Publication of Results

This phase aims to determine whether the project results are a success or a failure, and to start collaborating with significant stakeholders. The team identifies the vital findings of its analysis, measures the associated business value, and creates a summarized narrative to convey the results to the stakeholders.

1.3.6 Phase 6: Operationalize / Measure Effectiveness

In this final phase, the team presents an in-depth report with code, briefings, key findings, and technical documents and papers to the stakeholders. Besides this, the model is moved to a live environment and monitored to measure the effectiveness of the analysis. If the findings are in line with the objective, the results and reports are finalized. On the other hand, if they deviate from the set intent, the team moves backward in the lifecycle to a previous phase to change the input and get a different outcome.

1.4 Data Analytics Lifecycle Example

Consider a retail store chain that wants to optimize its products' prices to boost revenue. The chain has thousands of products across hundreds of outlets, making this a highly complex scenario. Once you identify the store chain's objective, you find the data you need, prepare it, and go through the Data Analytics lifecycle process.

You observe different types of customers, such as ordinary customers and customers like contractors who buy in bulk. You suspect that treating the various types of customers differently could give you the solution. However, you don't have enough information about them and need to discuss this with the client team. In this case, you need to get the definitions, find the data, and conduct hypothesis testing to check whether the various customer types impact the model results, so as to get the right output.
Once you are satisfied with the model results, you can deploy the model, integrate it into the business, and roll out the prices you consider optimal across the store chain's outlets.
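One simple way to run the customer-type check mentioned above is a two-sample comparison; the basket sizes below are hypothetical numbers for illustration, and a Welch t-statistic well above about 2 in magnitude suggests the two customer types behave differently and may deserve separate treatment in the model:

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    na, nb = len(sample_a), len(sample_b)
    return (mean(sample_a) - mean(sample_b)) / sqrt(
        variance(sample_a) / na + variance(sample_b) / nb)

# Hypothetical basket sizes (items per visit) for the two customer types.
ordinary = [12, 15, 11, 14, 13, 12]
contractors = [45, 52, 48, 50, 47, 49]
t = welch_t(ordinary, contractors)
print(abs(t) > 2)  # True: the groups differ markedly
```

In practice the team would also compute a p-value from the t-distribution and use real transaction data, but the statistic alone already shows whether the segmentation hypothesis is worth pursuing.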
BTECH (SEM VI) THEORY EXAMINATION 2021-22
DATA ANALYTICS (Subject Code: KIT601)
Time: 3 Hours    Total Marks: 100
Note: Attempt all Sections. If you require any missing data, then choose suitably.

SECTION A
1. Attempt all questions in brief. (2 x 10 = 20)
(a) Discuss the need of data analytics. (CO1)
(b) Give the classification of data. (CO1)
(c) Define neural network. (CO2)
(d) What is multivariate analysis? (CO2)
(e) Give the full form of RTAP and discuss its application. (CO3)
(f) What is the role of sampling data in a stream? (CO3)
(g) Discuss the use of limited pass algorithm. (CO4)
(h) What is the principle behind hierarchical clustering technique? (CO4)
(i) List five R functions used in descriptive statistics. (CO5)
(j) List the names of any 2 visualization tools. (CO5)

SECTION B
2. Attempt any three of the following: (10 x 3 = 30)
(a) Explain the process model and computation model for Big data platform. (CO1)
(b) Explain the use and advantages of decision trees. (CO2)
(c) Explain the architecture of data stream model. (CO3)
(d) Illustrate the K-means algorithm in detail with its advantages. (CO4)
(e) Differentiate between NoSQL and RDBMS databases. (CO5)

SECTION C
3. Attempt any one part of the following: (10 x 1 = 10)
(a) Explain the various phases of data analytics life cycle. (CO1)
(b) Explain modern data analytics tools in detail. (CO1)

4. Attempt any one part of the following: (10 x 1 = 10)
(a) Compare various types of support vector and kernel methods of data analysis. (CO2)
(b) Given data = {2,3,4,5,6,7; 1,5,3,6,7,8}. Compute the principal component using the PCA algorithm. (CO2)
5. Attempt any one part of the following: (10 x 1 = 10)
(a) Explain any one algorithm to count the number of distinct elements in a data stream. (CO3)
(b) Discuss the case study of stock market predictions in detail. (CO3)

6. Attempt any one part of the following: (10 x 1 = 10)
(a) Differentiate between CLIQUE and ProCLUS clustering. (CO4)
(b) A database has 5 transactions. Let min_sup = 60% and min_conf = 80%. (CO4)

    TID    Items_Bought
    T100   {M, O, N, K, E, Y}
    T200   {D, O, N, K, E, Y}
    T300   {M, A, K, E}
    T400   {M, U, C, K, Y}
    T500   {C, O, O, K, I, E}

    i) Find all frequent itemsets using the Apriori algorithm.
    ii) List all the strong association rules (with support s and confidence c).

7. Attempt any one part of the following: (10 x 1 = 10)
(a) Explain the HIVE architecture with its features in detail. (CO5)
(b) Write an R function to check whether a given number is prime or not. (CO5)
2 References

[1] https://www.jigsawacademy.com/blogs/hr-analytics/data-analytics-lifecycle/
[2] https://statacumen.com/teach/ADA1/ADA1_notes_F14.pdf
[3] https://www.youtube.com/watch?v=fDRa82lxzaU
[4] https://www.investopedia.com/terms/d/data-analytics.asp
[5] http://egyankosh.ac.in/bitstream/123456789/10935/1/Unit-2.pdf
[6] http://epgp.inflibnet.ac.in/epgpdata/uploads/epgp_content/computer_science/16._data_analytics/03._evolution_of_analytical_scalability/et/9280_et_3_et.pdf
[7] https://bhavanakhivsara.files.wordpress.com/2018/06/data-science-and-big-data-analy-nieizv_book.pdf
[8] https://www.researchgate.net/publication/317214679_Sentiment_Analysis_for_Effective_Stock_Market_Prediction
[9] https://snscourseware.org/snscenew/files/1569681518.pdf
[10] http://csis.pace.edu/ctappert/cs816-19fall/books/2015DataScience&BigDataAnalytics.pdf
[11] https://www.youtube.com/watch?v=mccsmoh2_3c
[12] https://mentalmodels4life.net/2015/11/18/agile-data-science-applying-kanban-in-the-analytics-li
[13] https://www.sas.com/en_in/insights/big-data/what-is-big-data.html
[14] https://www.javatpoint.com/big-data-characteristics
[15] Liu, S., Wang, M., Zhan, Y., & Shi, J. (2009). Daily work stress and alcohol use: Testing the cross-level moderation effects of neuroticism and job involvement. Personnel Psychology, 62(3), 575-597. http://dx.doi.org/10.1111/j.1744-6570.2009.01149.x