1. 7 Computational Giants of Massive Data Analysis
Instructor: Assoc. Prof. PhD. Nguyễn Thanh Bình
Master students:
Đoàn Đức Thế Anh (22C01001)
Võ Nam Thục Đoan (22C01004)
Nguyễn Ngọc Bảo Trân (22C01021)
Trần Trung Hiếu (22C01009)
CHAPTER 10
2. Massive data analysis
• cannot be processed using a stand-alone computer
• requires use of existing (distributed and parallel) hardware platforms
• poses challenges to traditional statistical methods and algorithms
• shapes the overall system architecture
4. The “7 Computational Giants” of Data (computational problem types)
1. Basic statistics
2. Generalized N-body problem
3. Graph-theoretic computations
4. Linear-algebraic computations
5. Optimization
6. Integration
7. Alignment problems
5. Basic statistics
• Descriptive statistics: summarize the data and provide insights into its
– central tendency: mean, median, mode
– variability: variance, standard deviation, count, min/max, quartiles, skewness, and kurtosis
– frequency distribution
N data points → O(N) calculations (see the sketch below)
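As a concrete illustration, here is a minimal Python sketch of these O(N) summaries; the synthetic data and the numpy/scipy choices are ours, not the slides'.

```python
# A minimal sketch (not from the slides) of the O(N) descriptive statistics
# listed above, computed with numpy/scipy over synthetic data.
import numpy as np
from scipy import stats

x = np.random.default_rng(0).normal(size=1_000_000)   # N data points

summary = {
    "mean": x.mean(), "median": np.median(x), "variance": x.var(),
    "std": x.std(), "min": x.min(), "max": x.max(),
    "quartiles": np.percentile(x, [25, 50, 75]),
    "skewness": stats.skew(x), "kurtosis": stats.kurtosis(x),
}
print(summary)   # each statistic is a single O(N) pass over the data
```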
6. Basic statistics
• Inferential statistics:
– generalize results to larger populations based on small samples
– look at how things change over time
– use sampling methods to find samples that are representative of the whole population
– determine what is happening
N data points → O(N²) calculations
7. Why is statistical computing important in research and decision-making?
• Enables evidence-based analysis
• Explores relationships between variables
• Evaluates the effectiveness of interventions
• Contributes to improved outcomes
• Plays a vital role in fields such as healthcare, finance, marketing, and the social sciences
8. Basic statistics - Challenges
• High dimensionality → noise accumulation, spurious correlations, incidental endogeneity → false scientific conclusions
• High dimensionality + large sample size → heavy computational cost, algorithmic instability → wrong statistical inference
• Big² Data (from multiple sources, at different time points, using different technologies) → heterogeneity, experimental variations, statistical biases
9. Basic statistics - Solutions
• New statistical thinking: variable selection, dimension reduction (a PCA sketch follows below), new regularization methods, independence screening
• New computational methods: the development of new computational infrastructure and data storage methods
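As one hedged illustration of dimension reduction (the slides do not prescribe a method; PCA via scikit-learn is our choice, and the data is synthetic):

```python
# A small dimension-reduction sketch with scikit-learn PCA; the data is
# synthetic and the component count (10) is an arbitrary illustrative choice.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(500, 100))  # 500 samples, 100 dims
Z = PCA(n_components=10).fit_transform(X)             # keep 10 components
print(Z.shape)  # (500, 10): a lower-dimensional representation
```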
10. Generalized N-body problem
• In the 17th century, Sir Isaac Newton formulated:
– the laws of motion
– the law of universal gravitation
describing the behavior of objects and their interactions.
• Origin of the N-body problem: predicting the motions of N celestial objects interacting with each other gravitationally.
• Karl Fritiof Sundman solved the case n = 3
• L. K. Babadzanjanz and Qiudong Wang generalized the solution to n > 3
11. N-body problem
[Figure: example periodic orbits]
• Three bodies with equal mass [published 2000]
• Three bodies of unequal mass
• Two pairs of bodies orbiting about each other
• An orbit discovered in 2008 by Tiancheng Ouyang, Duokui Yan, and Skyler Simmons at BYU
12. Generalized N-body problem - Challenges
• Numerical approximations
• Chaotic behavior
• Interdisciplinary nature
• Main obstacle: O(N²) pairwise interactions
13. Generalized N-body problem - Solutions
• Naive evaluation: summing K(x, xᵢ) over all pairs costs N(N-1)/2 = O(N²)
• Barnes-Hut Algorithm [Barnes and Hut, 87]: approximate all points in a tree node R by its centroid x_R:
Σ_{xᵢ ∈ R} K(x, xᵢ) ≈ |R| · K(x, x_R)  if s/r < θ
where s is the size of node R and r is the distance from x to x_R → O(N log N) (a sketch follows below)
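A compact Barnes-Hut sketch under stated assumptions: we pick the kernel K(x, y) = 1/‖x − y‖ with unit weights and θ = 0.5; none of these specifics come from the slides.

```python
# A compact 2-D Barnes-Hut sketch (assumptions: K(x,y) = 1/||x-y||,
# unit weights, theta = 0.5 in the s/r < theta opening rule).
import numpy as np

class Node:
    def __init__(self, center, size, points):
        self.size, self.count = size, len(points)  # side length s, |R|
        self.com = points.mean(axis=0)             # centroid x_R of the node
        self.children = []
        if len(points) > 1:
            for dx in (0, 1):
                for dy in (0, 1):
                    mask = ((points[:, 0] >= center[0]) == bool(dx)) & \
                           ((points[:, 1] >= center[1]) == bool(dy))
                    if mask.any():
                        child_center = center + 0.25 * size * np.array([2*dx - 1, 2*dy - 1])
                        self.children.append(Node(child_center, size / 2, points[mask]))

def approx_sum(node, x, theta=0.5):
    """Approximate sum_i K(x, x_i) over the points below `node`."""
    r = np.linalg.norm(x - node.com)
    if r > 0 and (node.size / r < theta or not node.children):
        return node.count / r                 # treat the whole node as x_R
    return sum(approx_sum(c, x, theta) for c in node.children)

rng = np.random.default_rng(0)
pts = rng.random((5000, 2))
root = Node(np.array([0.5, 0.5]), 1.0, pts)
x = np.array([1.5, 1.5])                      # query point outside the cloud
exact = np.sum(1.0 / np.linalg.norm(pts - x, axis=1))   # naive O(N) per query
print(exact, approx_sum(root, x))             # close, at ~O(log N) per query
```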
14. Generalized N-body problem - Solutions
• Fast Multipole Method [Greengard and Rokhlin 1987]: approximate Σᵢ K(x, xᵢ) by a multipole/Taylor expansion of order p over a quadtree → O(N)
• [Callahan-Kosaraju 95]: O(N) is impossible for a log-depth tree
• Naive cost remains N(N-1)/2 = O(N²)
15. Linear Algebraic computations
Problems involve matrix operations: solving linear systems, finding eigenvalues and eigenvectors, inverses, orthogonality, ...
Examples: linear regression, SVD, PCA, clustering, graph analysis, image processing (edge detection, compression, blurring, ...)
[Figures: linear regression, SVD, PCA, clustering, kernel for edge detection]
16. Linear Algebraic computations - Challenges
• Matrices with slowly decaying spectra → high computational complexity, sensitivity to noise.
• Nearly singular matrices, det(M) ≈ 0 → nearly non-invertible, sensitive to small changes in matrix entries.
→ Some solution approaches (see the sketch below):
• Truncated SVD, regularization, pseudoinverse via SVD
• Random sampling + statistical methods
E.g.: choose a random submatrix, drawn under a suitable probability distribution from the given matrix, to approximate the SVD of the whole.
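A minimal sketch of the truncated-SVD pseudoinverse idea on a nearly singular system; the cutoff `tol` and the toy matrix are illustrative assumptions.

```python
# Solving a nearly singular system, det(A) ~ 0, with a truncated-SVD
# pseudoinverse instead of a direct inverse; `tol` is an assumed cutoff.
import numpy as np

def truncated_pinv_solve(A, b, tol=1e-8):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s > tol * s[0]                     # drop tiny singular values
    s_inv = np.zeros_like(s)
    s_inv[keep] = 1.0 / s[keep]
    return Vt.T @ (s_inv * (U.T @ b))

A = np.array([[1.0, 1.0], [1.0, 1.0 + 1e-12]])   # nearly singular
b = np.array([2.0, 2.0])
print(truncated_pinv_solve(A, b))   # stable minimum-norm solution, ~[1, 1]
```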
17. Linear Algebraic computations - Challenges
Other challenges:
• Optimization problems: generic LA approaches yield high training accuracy, which can cause overfitting
→ Gradient descent (a sketch follows below), random sampling
• Data grows so massive that it cannot be stored or handled by a single device
→ Distributed linear algebra
[Figures: gradient descent; matrices checkerboard-distributed on TPU during multiplication]
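A bare-bones gradient-descent sketch for least squares; the synthetic data, step size, and iteration count are our illustrative choices.

```python
# Gradient descent on mean squared error for linear regression.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.0, 3.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
for _ in range(500):
    grad = 2.0 / len(y) * X.T @ (X @ w - y)   # gradient of the MSE
    w -= 0.1 * grad                           # step against the gradient
print(w)   # ~ true_w
```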
18. Optimization
• Appears in statistical methods from early on and frequently, e.g., semidefinite programming in manifold learning.
• Optimization generally focuses on minimizing or maximizing an objective function.
• Ranges from unconstrained to constrained problems, both convex and non-convex.
• Examples: linear programming (a tiny LP sketch follows below), quadratic programming
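To make "linear programming" concrete, a tiny scipy sketch; the toy problem itself is our assumption, not from the slides.

```python
# Minimize -x - 2y subject to x + y <= 4 and x, y >= 0 (a toy LP).
from scipy.optimize import linprog

res = linprog(c=[-1, -2], A_ub=[[1, 1]], b_ub=[4], bounds=[(0, None)] * 2)
print(res.x, res.fun)   # optimum (0, 4) with objective value -8
```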
19. Optimization - Challenges
• A large number of variables and constraints
• Finding a global solution for non-convex problems is an open problem
• Problems with integer constraints (integer programming)
• Challenging problems, such as high-dimensional nonlinear objective problems, may contain multiple local optima in which deterministic optimization algorithms may get stuck
20. Optimization
Some approaches:
• Exploit the particular mathematical form of certain problems to find more effective optimizers
E.g.: Sequential Minimal Optimization decomposes the SVM problem into sub-problems by iteratively selecting two Lagrange multipliers to solve
• Stochastic optimization (introduce randomness) + online learning
E.g.: Stochastic Gradient Descent iteratively updates parameters with a random subset of the data instead of the entire data set (see the sketch below)
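A minimal SGD sketch: each step uses a random mini-batch rather than the full data set; the batch size, step size, and synthetic data are illustrative assumptions.

```python
# Stochastic gradient descent: each step uses a random mini-batch instead
# of the full data set (batch size 32 and step size 0.05 are illustrative).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 5))
true_w = np.array([1.0, -2.0, 0.0, 3.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100_000)

w = np.zeros(5)
for _ in range(2000):
    idx = rng.integers(0, len(y), size=32)           # random subset of data
    g = 2.0 / 32 * X[idx].T @ (X[idx] @ w - y[idx])  # noisy gradient estimate
    w -= 0.05 * g
print(w)   # close to true_w at a fraction of the full-gradient cost
```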
21. Optimization
Some approaches:
• Distributed optimization: distribute the optimization process a) across processors, b) across multiple nodes
E.g.: TensorFlow, PyTorch
22. Graph-Theoretic Computations
• Graph-theoretic computations involve traversing graphs, which can be the data itself or represent statistical models.
• Common statistical computations on graphs include betweenness centrality and commute distances, used to identify nodes or communities of interest (see the sketch below).
• Large-scale, sparse graphs present computational challenges.
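A small example of one of the graph statistics named above, betweenness centrality, using networkx (an assumed dependency) on a classic small graph.

```python
# Betweenness centrality on a classic small graph via networkx;
# Brandes' algorithm runs in O(V*E) for unweighted graphs.
import networkx as nx

G = nx.karate_club_graph()
bc = nx.betweenness_centrality(G)
print(sorted(bc, key=bc.get, reverse=True)[:3])   # most central nodes
```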
23. Challenges and Approaches
• Challenges: high interconnectivity in graphs, large maximal clique size, and memory constraints.
• Notable approaches:
– Sampling and disk-based methods for handling large graphs.
– Parallel/distributed approaches using sparse linear algebra or graph concepts.
– Graph partitioning and linear-algebraic reconditioning for efficient computations.
– Transformation of graphical-model inference problems into optimization or variational methods.
– Sampling and parallel/distributed approaches for graphical-model inference.
24. Additional Applications
• Manifold learning methods: Isomap requires an all-pairs-shortest-paths computation.
• Single-linkage hierarchical clustering: equivalent to computing a minimum spanning tree (see the sketch below).
• These examples highlight the intersection between graph computations and distance-based or N-body-type problems.
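A brief sketch of the single-linkage/MST equivalence noted above, using scipy's minimum spanning tree on a small synthetic distance matrix.

```python
# Single-linkage clusters fall out of the minimum spanning tree: build the
# MST of pairwise distances (scipy), then cut its longest edges.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

X = np.random.default_rng(0).random((6, 2))   # 6 points in the plane
mst = minimum_spanning_tree(squareform(pdist(X)))
print(mst.toarray())   # cutting the k-1 largest edges yields k clusters
```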
25. Integration in Data Analysis
• Integration is a key computation in data analysis, essential for Bayesian inference and statistical modeling.
• Challenges arise with high-dimensional integrals, requiring specialized approaches.
26. Approaches to High-Dimensional Integration
1. Markov Chain Monte Carlo (MCMC)
– Default approach for high-dimensional integration.
– Utilizes a sequence of random samples to approximate the integral (see the sketch below).
– Widely used in Bayesian inference and random-effects models.
2. Approximate Bayesian Computation (ABC) Methods
– Operate on summary data to accelerate computation.
– Useful for cases where exact inference is challenging.
– Achieve acceleration by working with population means or variances.
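A minimal Metropolis-Hastings sketch; as stated assumptions, the target is a standard normal (a stand-in for a posterior) and the proposal is a Gaussian random walk with step 1.0.

```python
# Metropolis-Hastings approximating an integral E[x^2] under the target.
import numpy as np

rng = np.random.default_rng(1)
log_target = lambda x: -0.5 * x * x      # unnormalized log-density of N(0,1)

x, samples = 0.0, []
for _ in range(50_000):
    prop = x + rng.normal(scale=1.0)     # random-walk proposal
    if np.log(rng.random()) < log_target(prop) - log_target(x):
        x = prop                         # accept; otherwise keep x
    samples.append(x)
print(np.mean(np.square(samples)))       # ~ E[x^2] = 1 for N(0, 1)
```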
27. Alternative Approaches and Strategies
1. Population Monte Carlo
– A form of adaptive importance sampling.
– Enhances the efficiency of Monte Carlo integration.
– Particularly useful for certain sequential models, such as particle filtering.
2. Variational Methods
– Convert integration problems into optimization problems.
– Provide a general framework for approximate inference.
– Offer an alternative strategy for high-dimensional integration challenges.
3. Optimization-Based Point Estimation
– Skirts the full integration problem.
– Used in approaches like maximum a posteriori inference and empirical Bayesian inference.
– Involves optimizing point estimates rather than performing full Bayesian inference.
29. Genomic data science
Genomic data science emerged as a field in the 1990s to bring together two laboratory activities:
• Experimentation: generating genomic information by studying the genomes of living organisms
• Data analysis: using statistical and computational tools to analyze and visualize genomic data, which includes processing and storing data and using algorithms and software to make predictions based on available genomic data
Facts:
• Data about a single human genome sequence alone would take up 200 gigabytes
• An estimated 40 exabytes will be needed to store the genome-sequence data generated worldwide by 2025
30. DNA to RNA to Protein, Illustrating the Genetic Code
35. Questions about sequences
1. Biological question: “How similar are the genomes of humans and chimpanzees?”
– Computational question: Given two sequences r and s, compute their similarity sim(r,s)
2. Biological question: “This gene causes obesity in mice. Do humans have the same gene?”
– Computational question: Given a sequence r (the mouse gene) and a database D of sequences (all human genes), find sequences s in D where sim(r,s) is above a threshold
36. Questions about sequences
3. Biological question: “We know some mutations of this gene cause sickle-cell anemia. We have the sequences of 100 patients and 100 normal people. Let's find the disease-causing mutations.”
– Computational question: Given two sets of sequences of different lengths, find an alignment that maximizes the overall similarity. Then look for mutations that are unique to one group.
Patients: ACGCGT → ACGCGT
          CGCGT  → _CGCGT
          ACGCGA → ACGCGA
Control:  AGCTT  → A_GCTT
          ACGCTT → ACGCTT
          ACGCTA → ACGCTA
Performing alignment makes it easy to compute the similarity between two sequences.
37. Scoring function
To compare the similarity of two strings up to changes such as mutation, insertion, and deletion. For the string AGGCCTC:
• Mutation: AGGACTC (C → A)
• Insertion: AGGGCCTC (a G inserted)
• Deletion: AGG_CTC (a C removed)
Scores:
• Match: +m
• Mismatch: -s
• Gap: -d
Simple scoring function: F = (#matches) × m - (#mismatches) × s - (#gaps) × d
The total score reflects the quality of the alignment (a Needleman-Wunsch sketch follows below).
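A compact sketch of global alignment with the scoring scheme above, in the style of the Needleman-Wunsch demo cited in the notes; the concrete values m = s = d = 1 are illustrative, not from the slides.

```python
# Needleman-Wunsch global alignment with match +m, mismatch -s, gap -d.
def needleman_wunsch(a, b, m=1, s=1, d=1):
    n, k = len(a), len(b)
    F = [[0] * (k + 1) for _ in range(n + 1)]        # DP score table
    for i in range(1, n + 1): F[i][0] = -i * d       # leading gaps in b
    for j in range(1, k + 1): F[0][j] = -j * d       # leading gaps in a
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            diag = F[i-1][j-1] + (m if a[i-1] == b[j-1] else -s)
            F[i][j] = max(diag, F[i-1][j] - d, F[i][j-1] - d)
    # traceback to recover one optimal alignment ("_" marks a gap)
    out_a, out_b, i, j = "", "", n, k
    while i or j:
        if i and j and F[i][j] == F[i-1][j-1] + (m if a[i-1] == b[j-1] else -s):
            out_a, out_b, i, j = a[i-1] + out_a, b[j-1] + out_b, i - 1, j - 1
        elif i and F[i][j] == F[i-1][j] - d:
            out_a, out_b, i = a[i-1] + out_a, "_" + out_b, i - 1
        else:
            out_a, out_b, j = "_" + out_a, b[j-1] + out_b, j - 1
    return F[n][k], out_a, out_b

print(needleman_wunsch("ACGCGT", "AGCTT"))   # score and aligned strings
```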
Notes
Massive data refers to data too large to process using traditional tools like spreadsheets or text processors. It can exist in structured or unstructured form and amounts to petabytes or exabytes. Big data can be analyzed for insights that improve decisions and give confidence for strategic business moves.
Processing massive data, also known as big data, presents several common challenges: storage, processing speed, data quality, security, data integration, cost, and scalability.
Introduce massive data → system architecture.
Dimensionality reduction can be used for noise reduction, data visualization, cluster analysis, or as an intermediate step to facilitate other analyses.
Its inverse may be highly sensitive to small changes in the matrix entries. Nearly non-invertible → iterative methods.
Dividing the computational workload and data across multiple processing units.
Linear programming: determine the best outcome in a linear mathematical model, given a set of linear constraints.
LA computations are a special case (2nd-order optimization).
Quadratic programming: quadratic objective function and linear constraints.
Second-order cone programming: linear objective, with linear constraints including second-order cone constraints.
Semidefinite programming: deals with the optimization of linear objective functions subject to linear matrix inequality constraints. It generalizes linear programming to handle optimization problems involving positive semidefinite matrices.
Manifold learning: learning the structure in high-dimensional data and representing it in fewer dimensions.
Optimization problems are expressed as mathematical models.
Training an SVM requires solving a very large QP, which takes a lot of time.
A stochastic program is an optimization problem in which some or all problem parameters are uncertain, but follow known probability distributions. This framework contrasts with deterministic optimization, in which all problem parameters are assumed to be known exactly.
SMO exploits the particular structure of the SVM's quadratic optimization problem by iteratively selecting two Lagrange multipliers and solving a sub-problem to update them.
The objective function aims to maximize the margin between the decision boundary and the support vectors while minimizing the classification errors. The Lagrange multipliers (α values) are the variables to be optimized. The constraints ensure that the sum of the Lagrange multipliers weighted by the corresponding target variables is zero and that the Lagrange multipliers are within a specified range (0 ≤ α[i] ≤ C).
The GD algorithm in deep learning.
Online learning receives a sequence of data points one at a time and updates its model iteratively.
Stochastic optimization: the use of randomness in the objective function or in the optimization algorithm.
To compare the similarity between two strings with changes such as mutation, insertion, or deletion, e.g., the string AGGCCTC.
Interactive demo for Needleman–Wunsch algorithm (mostafa.io)
Criteria for evaluating an alignment.
To solve the problem and achieve computational efficiency, one can pursue the following directions: sampling, parallel/distributed computing, algorithms.