1. Introduction to Data Science
- Big Data & Data Analytics -
Yasas Senarath
Graduate Assistant Researcher at DataSEARCH
University of Moratuwa
2. Outline
● Introduction to Big Data and Data Science
● Data Driven Decision Making / D3M
● Importance of Big Data in Telehealth Services
● Data to Knowledge Process
● Techniques and Tools
3. Data is the new science.
Big data holds the answers.
-Pat Gelsinger, CEO, VMware
4. What is Big Data?
Big data is a term used to refer to data sets
that are too large or complex for traditional
data-processing application software to
adequately deal with.
--Wikipedia
5. How to identify Big Data?
● Attributes that define big data (the 4 V’s)
○ Volume
○ Velocity
○ Variety
○ Veracity
6. Where does Big Data come from?
● Mobile Devices
● Internet of Things (IoT)
● Social Media
● Satellite Imagery
8. Data Science
● Emerging Discipline
● No exact definition (different definitions exist from different perspectives)
Data science is the empirical synthesis of
actionable knowledge from raw data
through the data lifecycle process
-NIST†
†National Institute of Standards and Technology
9. Why Data Science?
● Extract new value, insights, and hypotheses
● Derive new knowledge from existing data
● Understand customers’ behaviour
● Facilitate the demand market to suppliers
● Build Recommender systems
● Build predictive systems.
10. Data-driven decision making (DDDM/D3M)
Data-driven decision making (DDDM) involves making decisions that are
backed up by hard data rather than making decisions that are intuitive or
based on observation alone.
MIT Sloan School of Management professors Andrew McAfee and Erik
Brynjolfsson explain in a Wall Street Journal article that companies that
were mostly data-driven had 4% higher productivity and 6% higher
profits than the average.
12. Big Data in Telehealth Services
● Predict Admission Rates
○ Big data is helping to solve the problem of anticipating patient numbers, at least at a few hospitals in Paris
○ A Forbes article† details how four hospitals which are part of the Assistance
Publique-Hôpitaux de Paris have been using data from a variety of sources to
come up with daily and hourly predictions of how many patients are expected
to be at each hospital
● Electronic Health Records (EHRs)
○ Trigger warnings and reminders when a patient should get a new lab test or
track prescriptions to see if a patient has been following doctors’ orders
○ Hospitals adopting EHR?
†https://bit.ly/2FSzTZk
13. Big Data in Telehealth Services
● Real-Time Alerting
○ Wearables will collect data from patients and send this data to the cloud
○ React whenever the readings are alarming
[Diagram labels: Send data periodically; Alert the doctor; Administer measures; Analyze; Better treatment plans]
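The alerting step above can be sketched as a simple range check on each incoming reading. This is only an illustration: the `Reading` class, the `needs_alert` and `process` helpers, and the heart-rate range are all assumptions made for the example, not part of the original slides, and real thresholds would have to come from clinicians.

```python
from dataclasses import dataclass

# Hypothetical safe range for heart rate (beats per minute).
# Real thresholds would come from clinicians, not from this sketch.
HEART_RATE_RANGE = (40, 120)


@dataclass
class Reading:
    patient_id: str
    heart_rate: int  # beats per minute, as reported by the wearable


def needs_alert(reading, safe_range=HEART_RATE_RANGE):
    """Return True when the reading falls outside the safe range."""
    low, high = safe_range
    return not (low <= reading.heart_rate <= high)


def process(readings):
    """Return the ids of patients whose readings should reach a doctor."""
    return [r.patient_id for r in readings if needs_alert(r)]


# Made-up readings standing in for data sent periodically to the cloud.
readings = [Reading("p1", 72), Reading("p2", 135), Reading("p3", 38)]
alerts = process(readings)
print(alerts)
```

In practice the check would run on a stream rather than a list, and an alert would likely consider trends over time rather than a single value.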
14. Big Data in Telehealth Services
● Patient Satisfaction Monitoring
○ Collect data on sentiment of the patient on Doctor / Hospital
○ For example,
■ Whether the doctor explained the treatment understandably
■ Whether the patient had confidence and trust in the treating physician
○ Analyze and use it to improve the quality of health services
● Minimizing Waiting Time
○ Predict the time at which the patient should be available to see the doctor
15. Big Data in Business
● Sentiment / Opinion Analysis
○ Analyze Social Media Posts and forums
○ Learn how customers feel about your products
○ Give attention where required
● Understanding, Targeting And Serving Customers
○ Analyze usage patterns and understand the customer base (e.g. demographics)
○ Targeted Advertising
○ Improved service
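As a rough illustration of sentiment analysis on posts, the sketch below counts positive and negative words and flags posts that score below zero. The word lists, the example posts, and the `sentiment_score` helper are hypothetical; production systems typically rely on much larger lexicons or trained models.

```python
import re

# Tiny hypothetical word lists, for illustration only.
POSITIVE = {"good", "great", "love", "helpful"}
NEGATIVE = {"bad", "slow", "rude", "broken"}


def sentiment_score(post):
    """Count of positive words minus count of negative words in the post."""
    words = re.findall(r"[a-z']+", post.lower())
    positives = sum(1 for w in words if w in POSITIVE)
    negatives = sum(1 for w in words if w in NEGATIVE)
    return positives - negatives


# Made-up posts standing in for social media or forum content.
posts = [
    "Great service, very helpful staff",
    "The app is slow and support was rude",
    "Delivery arrived on Tuesday",
]
scores = [sentiment_score(p) for p in posts]

# Posts with a negative score are the ones that may need attention.
needs_attention = [p for p, s in zip(posts, scores) if s < 0]
print(scores, needs_attention)
```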
17. Data to Knowledge Process [contd...]
Stages: Data Manipulation (Data Acquisition, Data Storage, Data Cleaning), Analytics, Communication & Visualization
Data Manipulation: Data Acquisition
● Electronic Medical Records (EMRs)
● User-generated data (Fitbit, iWatch)
● Doctor Channelling Records
● System Logs
● Patient Details
...
● Data acquisition and data formats
● Privacy and ethical issues
18. Data to Knowledge Process [contd...]
Data Manipulation: Data Storage
● Big Data
● CSV, TSV, XLS
● Databases (MySQL, NoSQL)
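A minimal sketch of moving delimited text (CSV, TSV) into a database table is shown below. The records, the `read_rows` helper, and the table layout are made up for the example, and an in-memory SQLite database is used only as a stand-in for a database server such as MySQL.

```python
import csv
import io
import sqlite3

# Made-up records standing in for files on disk.
csv_text = "patient_id,age\np1,34\np2,61\n"
tsv_text = "patient_id\tage\np3\t47\n"


def read_rows(text, delimiter):
    """Parse delimited text into dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [{"patient_id": r["patient_id"], "age": int(r["age"])} for r in reader]


rows = read_rows(csv_text, ",") + read_rows(tsv_text, "\t")

# SQLite (in memory) is only a stand-in for a real database server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (patient_id TEXT, age INTEGER)")
conn.executemany("INSERT INTO patients VALUES (:patient_id, :age)", rows)

over_40 = conn.execute(
    "SELECT patient_id FROM patients WHERE age > 40 ORDER BY patient_id"
).fetchall()
print(over_40)
```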
19. Data to Knowledge Process [contd...]
Data Manipulation: Data Cleaning
● Missing Values
● Outliers
● Human Error
● Machine Error
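One possible way to handle missing values and outliers is sketched below: outliers are flagged with Tukey's 1.5 × IQR rule, and both missing values and outliers are replaced with the median of the remaining values. The readings are made up, and the choice of rule and of median imputation is an assumption for illustration; the right treatment depends on the data and the domain.

```python
import statistics

# Made-up systolic blood pressure readings. None marks a missing value and
# 400 is an implausible entry (for example a typing or sensor error).
raw = [118, 122, None, 125, 400, 119, None, 121]

present = [v for v in raw if v is not None]

# Tukey's rule: flag values more than 1.5 * IQR outside the quartiles.
q1, _, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in present if not (low <= v <= high)]

# Replace missing values and outliers with the median of the inliers.
inliers = [v for v in present if low <= v <= high]
median = statistics.median(inliers)
cleaned = [median if (v is None or v in outliers) else v for v in raw]
print(outliers, cleaned)
```

Whether an extreme value is an error or a genuine observation is a judgment call, so flagged values are usually reviewed before being replaced.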
20. Data to Knowledge Process [contd...]
Analytics: Exploratory Data Analysis
● Descriptive Statistics
● Clustering
● Looking for patterns
● Hypothesis testing
● Data tendency
● Groups, subgroups
● Looking for abnormality
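The descriptive statistics and subgroup items above might look like the following in practice. The sample of (treatment group, age) pairs is made up for the example; only Python's standard `statistics` module is used.

```python
import statistics
from collections import defaultdict

# Made-up (treatment group, age) pairs.
sample = [("A", 34), ("A", 41), ("A", 29), ("B", 52), ("B", 47), ("B", 60)]

ages = [age for _, age in sample]
summary = {
    "n": len(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "stdev": statistics.stdev(ages),  # sample standard deviation
}

# Summarise each subgroup separately.
by_group = defaultdict(list)
for group, age in sample:
    by_group[group].append(age)
group_means = {g: statistics.mean(v) for g, v in by_group.items()}

print(summary)
print(group_means)
```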
21. Data to Knowledge Process [contd...]
Analytics: Dependency and Relationship
● Association
- Do changes in X (seem to) coincide with changes in Y?
● Correlation
- How to quantify the association between X and Y?
● Agreement
- Do X and Y agree?
● Causation
- Do changes in X cause changes in Y?
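To make the correlation and agreement questions concrete, here is a small sketch of three common measures written from their textbook definitions: Pearson's correlation coefficient, Spearman's rank correlation (Pearson applied to ranks), and Cohen's kappa for agreement between two raters. The function names and the data are assumptions for the example; none of these measures establishes causation on its own.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson's correlation coefficient between two numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson's coefficient on the ranks."""
    return pearson(ranks(x), ranks(y))


def cohen_kappa(a, b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(a)
    observed = sum(1 for p, q in zip(a, b) if p == q) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Made-up data: y grows monotonically but not linearly with x.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
r = pearson(x, y)
rho = spearman(x, y)

# Made-up labels from two raters on the same four cases.
rater_1 = ["yes", "yes", "no", "no"]
rater_2 = ["yes", "no", "no", "no"]
kappa = cohen_kappa(rater_1, rater_2)
print(r, rho, kappa)
```

Here the rank correlation is perfect because the relationship is monotonic, while Pearson's coefficient is slightly lower because the relationship is not linear.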
22. Data to Knowledge Process [contd...]
Analytics: Machine Learning
24. Q & A
Hiding within those mounds of data is
knowledge that could change the life of a
patient, or change the world.
--Atul Butte, Stanford
ysenarath wayasas wayasas ypsenarath
25. Challenges
● Privacy and Security
● Data collection and management
○ Complex Data
○ Noisy Data
○ Distributed Data
○ Data Integration
● Performance
● Background Knowledge
Editor's notes
Data veracity: uncertain or imprecise data.
Data veracity is the degree to which data is accurate, precise and trusted. Data is often viewed as certain and reliable. The reality of problem spaces, data sets and operational environments is that data is often uncertain, imprecise and difficult to trust. The following are illustrative examples of data veracity.
https://www.datapine.com/blog/big-data-examples-in-healthcare/
identify asthma trends both on an individual level and looking at larger populations
DS:
Descriptive statistics aims to summarize a sample, rather than use the data to learn about the population that the sample of data is thought to represent.
For example, in papers reporting on human subjects, typically a table is included giving the overall sample size, sample sizes in important subgroups (e.g., for each treatment or exposure group), and demographic or clinical characteristics such as the average age, the proportion of subjects of each sex, the proportion of subjects with related comorbidities, etc.
Association: scatter plots
Correlation: Pearson’s correlation coefficient for two MC data types (assumed normal); Spearman’s rank correlation coefficient when either or both variables are ordinal (normality not assumed)
Agreement: Cohen’s kappa coefficient
Causation: linear regression, structural equation modelling
Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
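A minimal sketch of logistic regression with one independent variable, fitted by plain gradient descent on the log-loss, is given below. The data, the `fit_logistic` and `predict` helpers, and the learning rate and epoch count are assumptions chosen for the example; in practice an established library would normally be used.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit weight w and intercept b by gradient descent on the mean log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(x, w, b):
    """Estimated probability that the binary outcome is 1 for input x."""
    return sigmoid(w * x + b)


# Made-up data: one independent variable and a binary dependent variable.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)
p_low = predict(1, w, b)
p_high = predict(8, w, b)
print(w, b, p_low, p_high)
```

A positive fitted weight means larger values of the independent variable are associated with a higher estimated probability of the outcome being 1.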