INTRODUCTION TO BIG DATA
ANALYTICS
Utkarsh Sharma
Asst. Prof. (CSE)
Jaypee University Of Engineering & Technology
Big Data Overview
Several industries have led the way in developing their ability to
gather and exploit data:
• Credit card companies monitor every purchase their customers make and
can identify fraudulent purchases with a high degree of accuracy using
rules derived by processing billions of transactions.
• Mobile phone companies analyze subscribers' calling patterns to
determine whether a subscriber is at risk of defecting, for example when a
rival network is offering an attractive promotion.
• For companies such as LinkedIn and Facebook, data itself is their primary
product.
Big Data Overview
Three attributes stand out as defining Big Data characteristics:
• Huge volume of data: Rather than thousands or millions of rows, Big Data can be billions of rows
and millions of columns.
• Complexity of data types and structures: Big Data reflects the variety of new data sources, formats,
and structures, including digital traces being left on the web and other digital repositories for
subsequent analysis.
• Speed of new data creation and growth: Big Data can describe high velocity data, with rapid data
ingestion and near real time analysis.
Another definition of Big Data comes from the McKinsey Global report from 2011:
• Big Data is data whose scale, distribution, diversity, and/or timeliness
require the use of new technical architectures and analytics to enable
insights that unlock new sources of business value.
McKinsey's definition of Big Data implies that organizations will need new data architectures and
analytic sandboxes, new tools, new analytical methods, and an integration of multiple skills into the
new role of the data scientist.
Data Deluge
An Example (Genomic Sequencing)
While data has grown, the cost to perform this work has fallen dramatically. The cost to sequence one
human genome has fallen from $100 million in 2001 to $10,000 in 2011, and the cost continues to drop. Now,
websites such as 23andMe offer genotyping for less than $100.
Data Structures
• Big data can come in multiple forms, including structured and
non-structured data such as financial data, text files, multimedia
files, and genetic mappings.
• Most Big Data is unstructured or semi-structured in
nature, which requires different techniques and tools to process
and analyze.
• Distributed computing environments and massively parallel
processing (MPP) architectures that enable parallelized data
ingest and analysis are the preferred approach to process such
complex data.
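As a rough illustration of the parallelized ingest-and-analyze idea above, the sketch below splits a dataset into partitions, processes each partition in its own worker, and merges the partial results. The `count_words` helper, the partition count, and the sample records are assumptions made for illustration; a real MPP system spreads this work across many machines rather than across threads in one process.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(partition):
    """Count word occurrences within one partition of records."""
    counts = Counter()
    for record in partition:
        counts.update(record.lower().split())
    return counts


def parallel_word_count(records, n_partitions=4):
    """Split records into partitions, count each in parallel, then merge."""
    # Round-robin partitioning: record i goes to partition i % n_partitions.
    partitions = [records[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# Hypothetical sample records, used only to exercise the sketch.
records = ["big data big value", "data at scale", "big scale"]
totals = parallel_word_count(records)
```

The merge step is what makes the approach scale: each worker only needs its own partition, and the partial counts are small compared with the raw data.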
Data Structures
Structured Data
• Data containing a defined data type, format, and structure (for example, transaction data, online analytical
processing [OLAP] data cubes, traditional RDBMS tables, CSV files, and even simple spreadsheets).
Semi-structured data
• Textual data files with a discernible pattern that enables parsing (such as Extensible Markup
Language [XML] data files that are self-describing and defined by an XML schema).
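To make the "discernible pattern that enables parsing" concrete, here is a minimal sketch that reads a small self-describing XML document with Python's standard `xml.etree.ElementTree` module. The `<orders>` document, its element names, and the `parse_orders` helper are hypothetical and not taken from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical semi-structured input: tags describe the data they contain.
XML_TEXT = """
<orders>
  <order id="1001"><customer>Asha</customer><amount>250.00</amount></order>
  <order id="1002"><customer>Ravi</customer><amount>99.50</amount></order>
</orders>
"""


def parse_orders(xml_text):
    """Turn each <order> element into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for order in root.findall("order"):
        rows.append({
            "id": order.get("id"),
            "customer": order.findtext("customer"),
            "amount": float(order.findtext("amount")),
        })
    return rows


orders = parse_orders(XML_TEXT)
```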
Quasi-structured data
• Textual data with erratic data formats that can be formatted with effort, tools, and time (for instance,
web clickstream data that may contain inconsistencies in data values and formats).
• Consider the following example. A user attends the EMC World conference and subsequently runs
a Google search online to find information related to EMC and Data Science. This would produce a
URL such as https://www.google.com/#q=EMC+data+science
• After doing this search, the user may choose the second link, to read more about the headline "Data
Scientist - EMC Education, Training, and Certification." This brings the user to an emc.com site
focused on this topic and a new URL, https://education.emc.com/guest/campaign/data_science.aspx
• Arriving at this site, the user may decide to click to learn more about the process of becoming
certified in data science. The user chooses a link toward the top of the page on Certifications,
bringing the user to a new URL:
https://education.emc.com/guest/certification/framework/stf/data_science.aspx
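The clickstream in this example can be given some structure with a little effort. The sketch below, which assumes the three URLs from the example in cleaned-up form, uses the standard `urllib.parse` module to pull out the host, the path segments, and any search terms. The `parse_click` helper and the output field names are illustrative assumptions.

```python
from urllib.parse import parse_qs, urlsplit

# The three URLs visited in the example, in visit order.
CLICKSTREAM = [
    "https://www.google.com/#q=EMC+data+science",
    "https://education.emc.com/guest/campaign/data_science.aspx",
    "https://education.emc.com/guest/certification/framework/stf/data_science.aspx",
]


def parse_click(url):
    """Extract host, path segments, and search terms from one clicked URL."""
    parts = urlsplit(url)
    # The search URL in the example carries its query in the fragment (#q=...),
    # so fall back to the fragment when the query string is empty.
    params = parse_qs(parts.query or parts.fragment)
    return {
        "host": parts.netloc,
        "path_segments": [seg for seg in parts.path.split("/") if seg],
        "search_terms": params.get("q", [""])[0].split(),
    }


visits = [parse_click(url) for url in CLICKSTREAM]
```

Real clickstream logs are messier than this: the same information can show up in the query string, the fragment, or the path, which is why the slides describe such data as needing effort, tools, and time to format.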
Unstructured data
• Data that has no inherent structure, which may include text
documents, PDFs, images, and video.
• All of these heterogeneous types of data structures create the need
for specialized data storage and retrieval techniques, such as
data warehouses and analytics sandboxes.
Data Warehouse
• A data warehouse is a central repository of information that can be analyzed to make more informed
decisions.
• Data flows into a data warehouse from transactional systems, relational databases, and other sources,
typically on a regular cadence.
• Business analysts, data engineers, data scientists, and decision makers access the data
through business intelligence (BI) tools, SQL clients, and other analytics applications.
Intro. to Data Warehouse
• The term "Data Warehouse" was coined by Bill Inmon in 1990. According to Inmon, a data
warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data. This
data helps analysts make informed decisions in an organization.
• An operational database changes frequently, every day, as transactions take place,
whereas a data warehouse also keeps historical data.
• A data warehouse provides generalized and consolidated data in a multidimensional view.
Along with this generalized and consolidated view of data, a data warehouse also provides Online
Analytical Processing (OLAP) tools.
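The idea of a consolidated, multidimensional view can be sketched with a tiny in-memory table and a roll-up query. The example below uses Python's standard `sqlite3` module; the `sales` table, its columns, the sample rows, and the `roll_up` helper are hypothetical, and a real warehouse would use a dedicated OLAP engine rather than SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (year INTEGER, region TEXT, product TEXT, amount REAL)"
)
# Hypothetical historical facts with three dimensions: year, region, product.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (2022, "North", "Laptop", 1200.0),
        (2022, "South", "Phone", 800.0),
        (2023, "North", "Phone", 950.0),
        (2023, "South", "Laptop", 1500.0),
    ],
)

ALLOWED_DIMENSIONS = {"year", "region", "product"}


def roll_up(connection, dimension):
    """Aggregate total sales along one dimension (a basic OLAP roll-up)."""
    if dimension not in ALLOWED_DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension}")
    # The column name is checked against a whitelist before interpolation.
    query = (
        f"SELECT {dimension}, SUM(amount) FROM sales "
        f"GROUP BY {dimension} ORDER BY {dimension}"
    )
    return connection.execute(query).fetchall()


by_year = roll_up(conn, "year")
by_region = roll_up(conn, "region")
```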
Understanding a Data Warehouse
• A data warehouse is a database, which is kept separate from the organization's operational
database.
• There is no frequent updating done in a data warehouse.
• It possesses consolidated historical data, which helps the organization to analyze its business.
• A data warehouse helps executives to organize, understand, and use their data to make strategic
decisions.
• Data warehouse systems help integrate diverse application systems.
• A data warehouse system helps in consolidated historical data analysis.
Analytics sandbox
• A workspace in which data assets are gathered from multiple sources
and technologies for analysis.
• To lessen the performance burden of the analysis, the workspace may
use in-database processing and is considered to be owned by the
analysts rather than database administrators.
• Often, this workspace is created by using a sampling of the dataset
rather than the entire dataset.
• The sandbox may also reduce the stove-piped and partial versions of
the true data that may have been developed in business units.
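Building a sandbox from a sample rather than the whole dataset can be sketched as follows. The `sample_rows` helper, the 5% fraction, the fixed seed, and the synthetic customer rows are all assumptions for illustration; the slides do not prescribe a sampling method.

```python
import random


def sample_rows(rows, fraction, seed=42):
    """Return a reproducible simple random sample of the given fraction."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not rows:
        return []
    k = max(1, round(len(rows) * fraction))
    # A seeded generator keeps the sandbox contents stable between runs.
    return random.Random(seed).sample(rows, k)


# Hypothetical full dataset standing in for a production table.
full_dataset = [{"customer_id": i, "spend": i * 10} for i in range(1000)]
sandbox = sample_rows(full_dataset, 0.05)
```

Fixing the seed is a design choice: it lets several analysts work on the same sample, at the cost of always seeing the same rows.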
Analytics sandbox
Types of Data Repositories
Business Intelligence vs Data Science
Examples of Big Data Analytics
• As mentioned earlier, Big Data presents many opportunities to improve sales and marketing
analytics.
• An example of this is the U.S. retailer Target. After analysing consumer purchasing behavior,
Target's statisticians determined that the retailer made a great deal of money from three main life-
event situations.
• Marriage, when people tend to buy many new products.
• Divorce, when people buy new products and change their spending habits.
• Pregnancy, when people have many new things to buy and have an urgency to buy them.
• Target determined that the most lucrative of these life-events is the third situation: pregnancy. Using
data collected from shoppers, Target was able to identify this fact and predict which of its shoppers
were pregnant. In one case, Target knew a female shopper was pregnant even before her family
knew.
Data Science Project Lifecycle
Data Science Project Lifecycle
• 1. Obtain Data
• Skills required
• how to use MySQL, PostgreSQL or MongoDB
• 2. Scrub Data
• Skills required
• You will need scripting tools like Python or R to help you to scrub the data.
• 3. Explore Data
• Skills required
• If you are using Python, then NumPy, Matplotlib, pandas, or SciPy; if you are using R, then
ggplot2 or dplyr, the Swiss Army knife of data exploration. On top of that, you need knowledge
of and skills in inferential statistics and data visualization.
• 4. Model Data
• Skills required
• In machine learning, you will need skills in both supervised and unsupervised algorithms.
• 5. Interpreting Data
• Skills required
• You will need strong business domain knowledge to present your findings in a way that can
answer the business questions you set out to answer
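The "Scrub Data" step is the easiest of the five to show in a few lines. The sketch below is a minimal example, assuming records with hypothetical `name` and `age` fields: it trims whitespace, normalizes capitalization, drops rows with missing or non-numeric values, and removes duplicates.

```python
def scrub(records):
    """Normalize names, validate ages, and drop duplicate or unusable rows."""
    cleaned = []
    seen = set()
    for rec in records:
        name = (rec.get("name") or "").strip().title()
        raw_age = str(rec.get("age", "")).strip()
        # Skip rows with a missing name or an age that is not a whole number.
        if not name or not raw_age.isdigit():
            continue
        key = (name, int(raw_age))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": name, "age": int(raw_age)})
    return cleaned


# Hypothetical raw input showing typical quality problems.
raw = [
    {"name": "  alice ", "age": "34"},
    {"name": "Alice", "age": 34},       # duplicate after normalization
    {"name": "bob", "age": "n/a"},      # unusable age
    {"name": "", "age": "51"},          # missing name
    {"name": "carol", "age": " 29 "},
]
clean = scrub(raw)
```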
The Analytics Process
An Analysis process contains all or some of the following phases:
• Business understanding: Identifying and understanding the business objectives
• Data Collection: Collection of data from different sources and its representation
in terms of its application.
• Data Preparation: Removing the unnecessary and unwanted data
• Data Modelling: Create a model to analyse the different relationships between
the objects.
• Data Evaluation: Evaluation and preparation
of analysis report
• Deployment: Finalizing the plan for
deployment
Types of Analytics
On the basis of problem description, four types of data analytics are used:
• Descriptive Analytics
• Diagnostic Analytics
• Predictive Analytics
• Prescriptive Analytics
Descriptive analytics : What is happening?
• This is the most common of all forms. In business it provides the analyst a view of
key metrics and measures within the business.
• Descriptive analytics juggles raw data from
multiple data sources to give valuable insights
into the past.
• However, these findings simply signal that something
is wrong or right, without explaining why.
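A view of "key metrics and measures" often starts with simple summary statistics. The sketch below uses Python's standard `statistics` module; the `describe` helper and the daily sales figures are hypothetical.

```python
import statistics


def describe(values):
    """Summarize a metric with basic descriptive statistics."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


# Hypothetical daily sales for one week.
daily_sales = [120, 135, 128, 90, 142, 150, 131]
summary = describe(daily_sales)
```

The summary shows that one day (90) sits well below the rest, but, as the last bullet notes, it does not say why.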
Diagnostic: Why is it happening?
• At this stage, historical data can be measured against other data to answer the question
of why something happened.
• Diagnostic analytics gives in-depth insights into a
particular problem.
• On assessment of the descriptive data, diagnostic
analytical tools will empower an analyst to drill down
and in so doing isolate the root-cause of a problem.
Predictive: What is likely to happen?
• Predictive analytics tells what is likely to happen. It uses the findings
of descriptive and diagnostic analytics to detect clusters and
exceptions, and to predict future trends.
• Predictive models typically utilize
a variety of variable data to make
the prediction.
• Predictive analytics belongs to
advanced analytics types and brings
many advantages like sophisticated
analysis based on machine or deep
learning.
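One of the simplest predictive models is a straight-line trend fitted by ordinary least squares. The sketch below is a minimal example of that one technique, not the method the slides have in mind; the `fit_line` helper and the monthly sales figures are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: sales for months 1 through 5.
months = [1, 2, 3, 4, 5]
sales = [110, 120, 130, 140, 150]

slope, intercept = fit_line(months, sales)
forecast_month_6 = slope * 6 + intercept
```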
Prescriptive: What do I need to do?
• The purpose of prescriptive analytics is to literally prescribe what action to take to
eliminate a future problem or take full advantage of a promising trend.
• The prescriptive model utilizes an understanding of what has
happened, why it has happened and a variety of
“what-might-happen” analysis to help the user determine
the best course of action to take.
• Besides, this state-of-the-art type of data analytics requires not
only historical internal data but also external information due
to the nature of algorithms it’s based on.
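Choosing "the best course of action" from what-might-happen scenarios can be sketched as an expected-value comparison. This is only one simple decision rule, and the action names, probabilities, and payoffs below are made up for illustration.

```python
def expected_value(outcomes):
    """Sum of probability * payoff over the possible outcomes of an action."""
    return sum(probability * payoff for probability, payoff in outcomes)


def best_action(actions):
    """Return the name of the action with the highest expected payoff."""
    return max(actions, key=lambda name: expected_value(actions[name]))


# Hypothetical what-might-happen scenarios: (probability, payoff) pairs.
actions = {
    "offer_discount": [(0.7, 50.0), (0.3, -20.0)],
    "do_nothing": [(0.4, 60.0), (0.6, -40.0)],
    "loyalty_call": [(0.5, 80.0), (0.5, -10.0)],
}
recommended = best_action(actions)
```

In practice the probabilities would come from a predictive model and the payoffs from business data, which is why prescriptive analytics builds on the earlier three types.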
Big Data Analytics (One more categorization)
• Basic Analytics
Slicing & Dicing
Basic monitoring
Anomaly identification
• Advanced Analytics
Predictive Modelling
Text Analytics
Statistics and data mining algorithms
• Operational Analytics
• Monetized Analytics
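Anomaly identification, listed under basic analytics above, can be as simple as flagging values that lie far from the mean. The sketch below uses a z-score rule with an assumed threshold of 2 standard deviations; the `find_anomalies` helper and the readings are hypothetical, and other rules are equally valid.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical sensor readings with one obvious outlier.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
anomalies = find_anomalies(readings)
```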
Data Analytics Lifecycle
Brief Overview
• The Data Analytics Lifecycle is designed specifically for Big Data problems and data
science projects.
• The lifecycle has six phases, and project work can occur in several phases at once.
• For most phases in the lifecycle, the movement can be either forward or backward.
• In recent years, substantial attention has been placed on the emerging role of the data
scientist.
• Despite this strong focus on the emerging role of the data scientist specifically, there are
actually seven key roles that need to be fulfilled for a high-functioning data science team
to execute analytic projects successfully.
Key Roles for a Successful Analytics Project
• For a small, versatile team, the seven roles may be fulfilled by only 3 people, but a very large
project may require 20 or more people. The seven roles follow:
Key Roles for a Successful Analytics Project
• Business User :- business analyst, line manager, or deep subject matter expert in the project
domain.
• Project Sponsor :- Provides the funding and gauges the degree of value from the final outputs of the
working team.
• Project Manager :- Ensures that key milestones and objectives are met on time and at the expected
quality.
• Business Intelligence Analyst :- Provides business domain expertise based on a deep
understanding of the data and key performance indicators (KPIs).
• Database Administrator (DBA) :- Provisions and configures the database environment to support
the analytics needs of the working team.
• Data Engineer :- Leverages deep technical skills to assist with tuning SQL queries for data
management and data extraction, and provides support for data ingestion into the analytic sandbox.
• Data Scientist :- Provides subject matter expertise for analytical techniques, data modeling, and
applying valid analytical techniques to given business problems.
Data Analytics Lifecycle
Phase 1- Discovery
• Learning the Business Domain
• Resources
• Framing the Problem
• Identifying Key Stakeholders
• Interviewing the Analytics Sponsor
• Developing Initial Hypotheses
Phase 2: Data Preparation
• Preparing the Analytic Sandbox
• Performing ETLT
• Learning About the Data
• Data Conditioning
• Survey and Visualize
Phase 3: Model Planning
• Data Exploration and Variable Selection
• Model Selection
Phase 4: Model Building
• The team develops data sets for testing, training, and production purposes.
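Developing separate training and testing data sets usually means shuffling the rows and cutting them at a chosen fraction. The sketch below assumes an 80/20 split and a fixed seed; both values and the `split_dataset` helper are illustrative assumptions.

```python
import random


def split_dataset(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and split it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical dataset of 100 row identifiers.
rows = list(range(100))
train, test = split_dataset(rows)
```

Keeping the test rows out of model fitting is what lets the team check, in the next phase, whether the model generalizes beyond the data it was built on.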
Phase 5: Communicate Results
• The team, in collaboration with major stakeholders, determines if the results of the project
are a success or a failure based on the criteria developed in Phase 1.
Phase 6: Operationalize
• The team delivers final reports, briefings, code, and technical documents.
• In addition, the team may run a pilot project to implement the models in a production
environment.
Key Outputs from a Successful Analytic Project
Big Data Pre-processing
• The set of techniques applied before a data mining
method is used is called data preprocessing.
• Larger volumes of collected data require more sophisticated
mechanisms for analysis.
• Data preprocessing adapts the data to the requirements
posed by each data mining algorithm, making it possible to process data that
would otherwise be infeasible to handle.
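Two common preprocessing steps are filling in missing values and rescaling features to a common range. The sketch below shows mean imputation followed by min-max scaling; these are just two examples of adapting data to an algorithm's requirements, and the helper names and sample values are assumptions.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the present values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature column with one missing value.
raw = [10.0, None, 30.0, 20.0]
filled = impute_mean(raw)
scaled = min_max_scale(filled)
```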
  • 1. INTRODUCTION TO BIG DATA ANALYTICS Utkarsh Sharma Asst. Prof. (CSE) Jaypee University Of Engineering & Technology
  • 2. Big Data Overview Several industries have led the way in developing their ability to gather and exploit data: • Credit card companies monitor every purchase their customers make and can identify fraudulent purchases with a high degree of accuracy using rules derived by processing billions of transactions. • Mobile phone companies analyze subscriber’s calling patterns to determine, If that rival network is offering an attractive promotion that might cause the subscriber to defect. • For companies such as Linked In and Facebook, data itself is their primary product.
  • 3. Big Data Overview Three attributes stand out as defining Big Data characteristics: • Huge volume of data: Rather than thousands or millions of rows, Big Data can be billions of rows and millions of columns. • Complexity of data types and structures: Big Data reflects the variety of new data sources, formats, and structures, including digital traces being left on the web and other digital repositories for subsequent analysis. • Speed of new data creation and growth: Big Data can describe high velocity data, with rapid data ingestion and near real time analysis.
  • 4. Another definition of Big Data comes from the McKinsey Global report from 2011: • Big Data is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value. McKinsey's definition of Big Data implies that organizations will need new data architectures and analytic sandboxes, new tools, new analytical methods, and an integration of multiple skills into the new role of the data scientist.
  • 6. An Example (Genomic Sequencing) While data has grown, the cost to perform this work has fallen dramatically. The cost to sequence one human genome has fallen from $100 million in 2001 to $10,000 in 2011, and the cost continues to drop. Now, websites such as 23andMe offer genotyping for less than $100.
  • 7. Data Structures • Big data can come in multiple forms, including structured and non-structured data such as financial data, text files, multimedia files, and genetic mappings. • Most of the Big Data is unstructured or semi-structured in nature, which requires different techniques and tools to process and analyze. • Distributed computing environments and massively parallel processing (MPP) architectures that enable parallelized data ingest and analysis are the preferred approach to process such complex data.
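The parallelized ingest and analysis described on this slide can be sketched on a single machine. This is only an illustration: threads stand in for the nodes of an MPP system, and the names `partition`, `count_words`, `merge_counts`, and `parallel_word_count` are hypothetical, not part of any particular platform.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_parts):
    """Split the records into n_parts roughly equal partitions."""
    return [records[i::n_parts] for i in range(n_parts)]


def count_words(lines):
    """Work done independently on one partition."""
    counts = {}
    for line in lines:
        for word in line.lower().split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def merge_counts(partials):
    """Combine the per-partition results into one answer."""
    total = {}
    for partial in partials:
        for word, n in partial.items():
            total[word] = total.get(word, 0) + n
    return total


def parallel_word_count(lines, n_workers=4):
    # Each worker processes its own partition; results are merged at the end.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_words, partition(lines, n_workers)))
    return merge_counts(partials)


# Made-up sample input for illustration.
LINES = ["big data big value", "data in motion", "big insights"]
word_counts = parallel_word_count(LINES)
```

The split, process, and merge steps are the part that carries over to a real distributed environment; the thread pool is an assumption made to keep the sketch self-contained.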
  • 9. Structured Data • Data containing a defined data type, format, and structure (that is, transaction data, online analytical processing [OLAP] data cubes, traditional RDBMS, CSV files, and even simple spreadsheets).
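As a small illustration of structured data, the sketch below loads a CSV file whose columns have a known type and format. The sample rows, the `SCHEMA` mapping, and the `load_structured` helper are assumptions made for this example.

```python
import csv
import io

# Hypothetical CSV content; a real file would be opened from disk instead.
CSV_TEXT = """txn_id,customer,amount
1,Ana,25.50
2,Raj,99.00
"""

# Because the data is structured, every column has a defined type.
SCHEMA = {"txn_id": int, "customer": str, "amount": float}


def load_structured(csv_text, schema):
    """Parse CSV text and cast each column to the type named in the schema."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{col: cast(row[col]) for col, cast in schema.items()} for row in reader]


transactions = load_structured(CSV_TEXT, SCHEMA)
```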
  • 10. Semi-structured data • Textual data files with a discernible pattern that enables parsing (such as Extensible Markup Language [XML] data files that are self-describing and defined by an XML schema).
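A minimal sketch of parsing semi-structured data, assuming a small self-describing XML document. The element names (`orders`, `order`, `customer`, `total`) and the `parse_orders` helper are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; the tags themselves describe the data.
XML_TEXT = """
<orders>
  <order id="1001"><customer>Ana</customer><total>25.50</total></order>
  <order id="1002"><customer>Raj</customer><total>99.00</total></order>
</orders>
"""


def parse_orders(xml_text):
    """Turn each <order> element into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    return [
        {
            "id": order.get("id"),
            "customer": order.findtext("customer"),
            "total": float(order.findtext("total")),
        }
        for order in root.findall("order")
    ]


orders = parse_orders(XML_TEXT)
```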
  • 11. Quasi-structured data • Textual data with erratic data formats that can be formatted with effort, tools, and time (for instance, web clickstream data that may contain inconsistencies in data values and formats). • Consider the following example. A user attends the EMC World conference and subsequently runs a Google search online to find information related to EMC and Data Science. This would produce a URL such as https://www.google.com/#q=EMC+data+science • After doing this search, the user may choose the second link, to read more about the headline "Data Scientist - EMC Education, Training, and Certification." This brings the user to an emc.com site focused on this topic and a new URL, https://education.emc.com/guest/campaign/data_science.aspx • Arriving at this site, the user may decide to click to learn more about the process of becoming certified in data science. The user chooses a link toward the top of the page on Certifications, bringing the user to a new URL: https://education.emc.com/guest/certification/framework/stf/data_science.aspx
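The clickstream in this example can be given structure with a little parsing effort. The sketch below uses cleaned-up forms of the three URLs from the slide; the `parse_click` helper and the field names it returns are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# The three URLs visited in the example, in order.
CLICKSTREAM = [
    "https://www.google.com/#q=EMC+data+science",
    "https://education.emc.com/guest/campaign/data_science.aspx",
    "https://education.emc.com/guest/certification/framework/stf/data_science.aspx",
]


def parse_click(url):
    """Extract host, path, and any search terms from one clickstream URL."""
    parts = urlsplit(url)
    # In the search URL the terms sit in the fragment (after '#'),
    # not in the regular query string.
    search = parse_qs(parts.fragment).get("q", [""])[0]
    return {"host": parts.netloc, "path": parts.path, "search": search}


clicks = [parse_click(url) for url in CLICKSTREAM]
```

Each visit becomes a record with the same fields, which is the step that turns quasi-structured text into something that can be queried.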
  • 12. Unstructured data • Data that has no inherent structure, which may include text documents, PDFs, images, and video. • These heterogeneous types of data structures created the need for specialized data storage and retrieval techniques, such as data warehouses and analytics sandboxes.
  • 13. Data Warehouse • A data warehouse is a central repository of information that can be analyzed to make more informed decisions. • Data flows into a data warehouse from transactional systems, relational databases, and other sources, typically on a regular cadence. • Business analysts, data engineers, data scientists, and decision makers access the data through business intelligence (BI) tools, SQL clients, and other analytics applications.
  • 14. Intro. to Data Warehouse • The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon, a data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data. This data helps analysts make informed decisions in an organization. • An operational database undergoes frequent changes on a daily basis on account of the transactions that take place, whereas a data warehouse also keeps historical data. • A data warehouse provides generalized and consolidated data in a multidimensional view. Along with this generalized and consolidated view of data, a data warehouse also provides Online Analytical Processing (OLAP) tools.
  • 15. Understanding a Data Warehouse • A data warehouse is a database that is kept separate from the organization's operational database. • A data warehouse is not updated frequently. • It holds consolidated historical data, which helps the organization analyze its business. • A data warehouse helps executives organize, understand, and use their data to make strategic decisions. • Data warehouse systems help integrate diverse application systems. • A data warehouse system supports analysis of consolidated historical data.
  • 16. Analytics sandbox • A workspace in which data assets are gathered from multiple sources and technologies for analysis. • To lessen the performance burden of the analysis, the workspace may use in-database processing and is considered to be owned by the analysts rather than database administrators. • Often, this workspace is created by using a sampling of the dataset rather than the entire dataset. • The sandbox may also reduce the stove-piped and partial versions of the true data that may have been developed in business units.
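Building the sandbox from a sampling of the dataset can be sketched as a simple random sample. The `sample_rows` helper, the 5% fraction, and the generated rows are assumptions made for illustration.

```python
import random


def sample_rows(rows, fraction, seed=0):
    """Return a reproducible random sample holding about `fraction` of the rows."""
    rng = random.Random(seed)  # fixed seed so the sandbox can be rebuilt
    k = max(1, int(len(rows) * fraction))
    return rng.sample(rows, k)


# Hypothetical full dataset of 1,000 transactions.
full_dataset = [{"txn_id": i, "amount": i * 1.5} for i in range(1000)]
sandbox = sample_rows(full_dataset, 0.05)
```

A fixed seed is used here so that analysts working in the sandbox see the same sample each time; other sampling schemes (stratified, time-based) may suit a given project better.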
  • 18. Types of Data Repositories
  • 19. Business Intelligence vs Data Science
  • 20. Examples of Big Data Analytics • As mentioned earlier, Big Data presents many opportunities to improve sales and marketing analytics. • An example of this is the U.S. retailer Target. After analysing consumer purchasing behavior, Target's statisticians determined that the retailer made a great deal of money from three main life-event situations: • Marriage, when people tend to buy many new products. • Divorce, when people buy new products and change their spending habits. • Pregnancy, when people have many new things to buy and have an urgency to buy them. • Target determined that the most lucrative of these life events is the third situation: pregnancy. Using data collected from shoppers, Target was able to identify this fact and predict which of its shoppers were pregnant. In one case, Target knew a female shopper was pregnant even before her family knew.
  • 21. Data Science Project Lifecycle
  • 22. Data Science Project Lifecycle • 1. Obtain Data • Skills required • Knowing how to use MySQL, PostgreSQL, or MongoDB. • 2. Scrub Data • Skills required • You will need scripting tools like Python or R to help you scrub the data. • 3. Explore Data • Skills required • If you are using Python: NumPy, Matplotlib, Pandas, or SciPy; if you are using R: ggplot2 or dplyr, the Swiss Army knife of data exploration. On top of that, you need knowledge and skills in inferential statistics and data visualization. • 4. Model Data • Skills required • In machine learning, you will need skills in both supervised and unsupervised algorithms. • 5. Interpret Data • Skills required • You will need strong business domain knowledge to present your findings in a way that answers the business questions you set out to answer.
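The "Scrub Data" step can be illustrated with a short Python sketch. The record layout (`name`, `age`), the cleaning rules, and the `scrub` helper are assumptions chosen for this example; a real project would pick rules that fit its own data.

```python
def scrub(records):
    """Normalize names, drop incomplete records, and remove duplicates."""
    seen = set()
    clean = []
    for rec in records:
        name = (rec.get("name") or "").strip().title()
        age = rec.get("age")
        if not name or age is None:
            continue  # drop records with a missing name or age
        key = (name, age)
        if key in seen:
            continue  # drop exact duplicates after normalization
        seen.add(key)
        clean.append({"name": name, "age": int(age)})
    return clean


# Made-up raw records with typical problems: stray spaces, duplicates, gaps.
RAW = [
    {"name": "  alice ", "age": "34"},
    {"name": "Alice", "age": "34"},
    {"name": "bob", "age": None},
    {"name": "", "age": "51"},
    {"name": "carol", "age": "29"},
]
cleaned = scrub(RAW)
```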
  • 23. The Analytics Process An Analysis process contains all or some of the following phases: • Business understanding: Identifying and understanding the business objectives • Data Collection: Collection of data from different sources and its representation in terms of its application. • Data Preparation: Removing the unnecessary and unwanted data • Data Modelling: Create a model to analyse the different relationships between the objects. • Data Evaluation: Evaluation and preparation of analysis report • Deployment: Finalizing the plan for deployment
  • 24. Types of Analytics On the basis of problem description, four types of data analytics are used: • Descriptive Analytics • Diagnostic Analytics • Predictive Analytics • Prescriptive Analytics
  • 25. Descriptive analytics : What is happening? • This is the most common of all forms. In business it provides the analyst a view of key metrics and measures within the business. • Descriptive analytics juggles raw data from multiple data sources to give valuable insights into the past. • However, these findings simply signal that something is wrong or right, without explaining why.
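A minimal sketch of descriptive analytics: summarizing a key metric so the analyst can see what is happening. The daily sales figures and the `describe` helper are made up for illustration.

```python
import statistics

# Hypothetical daily sales for one week.
DAILY_SALES = [120, 135, 128, 150, 110, 142, 138]


def describe(values):
    """Summarize a metric with a few standard descriptive measures."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


summary = describe(DAILY_SALES)
```

The summary reports the level and spread of the metric but, as the slide notes, says nothing about why the numbers look the way they do.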
  • 26. Diagnostic: Why is it happening? • At this stage, historical data can be measured against other data to answer the question of why something happened. • Diagnostic analytics gives in-depth insights into a particular problem. • On assessment of the descriptive data, diagnostic analytical tools will empower an analyst to drill down and in so doing isolate the root-cause of a problem.
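Drilling down to isolate a root cause can be sketched as comparing two periods along one dimension. The sales rows, the choice of `region` as the dimension, and the `change_by` helper are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical monthly revenue by region; total revenue fell from Jan to Feb.
SALES = [
    {"month": "Jan", "region": "North", "revenue": 100},
    {"month": "Jan", "region": "South", "revenue": 100},
    {"month": "Feb", "region": "North", "revenue": 98},
    {"month": "Feb", "region": "South", "revenue": 60},
]


def change_by(rows, dimension, before, after):
    """Return the revenue change between two months for each value of a dimension."""
    totals = defaultdict(lambda: {before: 0, after: 0})
    for row in rows:
        if row["month"] in (before, after):
            totals[row[dimension]][row["month"]] += row["revenue"]
    return {key: t[after] - t[before] for key, t in totals.items()}


deltas = change_by(SALES, "region", "Jan", "Feb")
# The region with the largest drop is the first place to look for a cause.
root_cause = min(deltas, key=deltas.get)
```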
  • 27. Predictive: What is likely to happen? • Predictive analytics tells what is likely to happen. It uses the findings of descriptive and diagnostic analytics to detect clusters and exceptions, and to predict future trends. • Predictive models typically utilize a variety of variable data to make the prediction. • Predictive analytics belongs to advanced analytics types and brings many advantages like sophisticated analysis based on machine or deep learning.
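One of the simplest predictive models is a straight-line trend fitted by ordinary least squares. The sketch below is only an illustration with made-up monthly figures; `fit_line` is a hypothetical helper, and real predictive work would usually involve more variables and a proper validation step.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical units sold in months 1 through 5.
MONTHS = [1, 2, 3, 4, 5]
UNITS = [10, 12, 14, 16, 18]

slope, intercept = fit_line(MONTHS, UNITS)
forecast_month_6 = slope * 6 + intercept  # extrapolate the trend one month ahead
```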
  • 28. Prescriptive: What do I need to do? • The purpose of prescriptive analytics is to literally prescribe what action to take to eliminate a future problem or take full advantage of a promising trend. • The prescriptive model utilizes an understanding of what has happened, why it has happened and a variety of “what-might-happen” analysis to help the user determine the best course of action to take. • Besides, this state-of-the-art type of data analytics requires not only historical internal data but also external information due to the nature of algorithms it’s based on.
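The "what-might-happen" analysis behind a prescriptive recommendation can be sketched as scoring each candidate action across scenarios and picking the best one. All numbers, the two actions, and the helpers `expected_profit` and `recommend` are assumptions invented for this example.

```python
# Hypothetical candidate actions with expected unit sales per demand scenario.
ACTIONS = {
    "no_discount": {
        "price": 10.0,
        "expected_units": {"low_demand": 80, "high_demand": 120},
    },
    "ten_percent_off": {
        "price": 9.0,
        "expected_units": {"low_demand": 100, "high_demand": 150},
    },
}
SCENARIO_PROBABILITY = {"low_demand": 0.5, "high_demand": 0.5}
UNIT_COST = 6.0


def expected_profit(action):
    """Probability-weighted profit of one action across all scenarios."""
    margin = action["price"] - UNIT_COST
    return sum(
        prob * margin * action["expected_units"][scenario]
        for scenario, prob in SCENARIO_PROBABILITY.items()
    )


def recommend(actions):
    """Prescribe the action with the highest expected profit."""
    return max(actions, key=lambda name: expected_profit(actions[name]))


best_action = recommend(ACTIONS)
```

With these made-up inputs the discount sells more units but at a thinner margin, so the sketch recommends keeping the current price; different scenario estimates would change the recommendation.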
  • 29. Big Data Analytics(One more categorization) • Basic Analytics Slicing & Dicing Basic monitoring Anomaly identification • Advanced Analytics Predictive Modelling Text Analytics Statistics and data mining algorithms • Operational Analytics • Monetized Analytics
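Anomaly identification, listed under basic analytics above, can be sketched with a z-score rule: flag any value that lies more than a chosen number of standard deviations from the mean. The response-time readings, the threshold of 2.0, and the `find_anomalies` helper are assumptions for illustration.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    """Return (index, value) pairs whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [
        (i, v) for i, v in enumerate(values) if abs(v - mean) / spread > threshold
    ]


# Hypothetical response times in milliseconds with one obvious spike.
RESPONSE_MS = [100, 102, 98, 101, 99, 103, 97, 250]
anomalies = find_anomalies(RESPONSE_MS)
```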
  • 30. Data Analytics Lifecycle Brief Overview • The Data Analytics Lifecycle is designed specifically for Big Data problems and data science projects. • The lifecycle has six phases, and project work can occur in several phases at once. • For most phases in the lifecycle, the movement can be either forward or backward. • In recent years, substantial attention has been placed on the emerging role of the data scientist. • Despite this strong focus on the emerging role of the data scientist specifically, there are actually seven key roles that need to be fulfilled for a high-functioning data science team to execute analytic projects successfully.
  • 31. Key Roles for a Successful Analytics Project • For a small, versatile team, the seven roles may be fulfilled by only 3 people, but a very large project may require 20 or more people. The seven roles follow:
  • 32. Key Roles for a Successful Analytics Project • Business User :- Business analyst, line manager, or deep subject matter expert in the project domain. • Project Sponsor :- provides the funding and gauges • Project Manager :- Ensures that key milestones and objectives are met on time and at the expected quality. • Business Intelligence Analyst :- Provides business domain expertise based on a deep understanding of the data and key performance indicators (KPIs). • Database Administrator (DBA) :- Provisions and configures the database environment to support the analytics needs of the working team. • Data Engineer :- Leverages deep technical skills to assist with tuning SQL queries for data management and data extraction, and provides support for data ingestion into the analytic sandbox. • Data Scientist :- Provides subject matter expertise for analytical techniques, data modeling, and applying valid analytical techniques to given business problems.
  • 34. Phase 1: Discovery • Learning the Business Domain • Resources • Framing the Problem • Identifying Key Stakeholders • Interviewing the Analytics Sponsor • Developing Initial Hypotheses
  • 35. Phase 2: Data Preparation • Preparing the Analytic Sandbox • Performing ETLT • Learning About the Data • Data Conditioning • Survey and Visualize
  • 36. Phase 3: Model Planning • Data Exploration and Variable Selection • Model Selection Phase 4: Model Building • The team develops data sets for testing, training, and production purposes.
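Developing separate training and testing data sets in Phase 4 can be sketched as a shuffled split. The 80/20 ratio, the fixed seed, and the `train_test_split` helper defined here are assumptions for illustration (this is a hand-written function, not the one from any library).

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into training and testing sets."""
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset of 100 row identifiers.
rows = list(range(100))
train_set, test_set = train_test_split(rows)
```

Keeping the test rows out of model fitting is what allows the team to judge the model on data it has not seen.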
  • 37. Phase 5: Communicate Results • The team, in collaboration with major stakeholders, determines if the results of the project are a success or a failure based on the criteria developed in Phase 1. Phase 6: Operationalize • The team delivers final reports, briefings, code, and technical documents. • In addition, the team may run a pilot project to implement the models in a production environment.
  • 38. Key Outputs from a Successful Analytic Project
  • 39. Big Data Pre-processing • The set of techniques applied before a data mining method is used is called data preprocessing for data mining. • Larger amounts of collected data require more sophisticated mechanisms to analyze them. • Data preprocessing adapts the data to the requirements posed by each data mining algorithm, making it possible to process data that would otherwise be infeasible to handle.
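Two common preprocessing steps, filling in missing values and rescaling to a fixed range, can be sketched as follows. Mean imputation and min-max scaling are chosen here only as examples; the slides do not name specific techniques, and the helpers `impute_mean` and `min_max_scale` are hypothetical.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the known values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column, avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical income column (in thousands) with one missing value.
RAW_INCOME = [20.0, None, 40.0, 60.0]
scaled = min_max_scale(impute_mean(RAW_INCOME))
```

Many data mining algorithms cannot accept missing values or are sensitive to the scale of their inputs, which is why steps like these come before modeling.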