computational giants

- 1. 7 Computational Giants of Massive Data Analysis (Chapter 10). Instructor: Assoc. Prof. PhD. Nguyễn Thanh Bình. Master students: Đoàn Đức Thế Anh, Võ Nam Thục Đoan, Nguyễn Ngọc Bảo Trân, Trần Trung Hiếu. Student IDs: 22C01001, 22C01004, 22C01021, 22C01009.
- 2. Massive data analysis: massive data cannot be processed on a stand-alone computer. Topics: use of existing (distributed and parallel) hardware platforms; challenges to traditional statistical methods and algorithms; overall system architecture.
- 3. Tasks of machine learning / data mining
  • Querying: orthogonal range-search, nearest-neighbor O(N); all-nearest-neighbors O(N^2)
  • Density estimation: mixture of Gaussians, kernel density estimation O(N^2); kernel conditional density estimation O(N^3)
  • Classification: decision tree, nearest-neighbor classifier O(N^2); support vector machine O(N^3)
  • Regression: linear regression, LASSO, kernel regression O(N^2); Gaussian process regression O(N^3)
  • Dimension reduction: PCA, non-negative matrix factorization, kernel PCA O(N^3); maximum variance unfolding O(N^3)
  • Clustering: k-means, mean-shift O(N^2); hierarchical (FoF) clustering O(N^3)
  • Testing and matching: MST O(N^3); bipartite cross-matching O(N^3); n-point correlation, 2-sample testing O(N^n)
- 4. The “7 Computational Giants” of Data (computational problem types): 1. Basic statistics; 2. Generalized N-body problem; 3. Graph-theoretic computations; 4. Linear-algebraic computations; 5. Optimization; 6. Integration; 7. Alignment problems
- 5. Basic statistics • Descriptive statistics: summarize the data and provide insights into its – central tendency: mean, median, mode – variability: variance, standard deviation, count, min, max, quartiles, skewness and kurtosis – frequency distribution. N data points: O(N) calculations
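The O(N) claim for descriptive statistics can be illustrated with a single pass over the data. The following is a minimal sketch, not part of the original slides: the function name `streaming_stats` is hypothetical, and it covers only the count, extremes, mean and sample variance (using Welford's update). Median, mode and quartiles need other techniques and are not shown.

```python
import math

def streaming_stats(values):
    """One pass over the data: count, min, max, mean, sample variance."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    lo, hi = math.inf, -math.inf
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # Welford's numerically stable update
        lo, hi = min(lo, x), max(hi, x)
    variance = m2 / (n - 1) if n > 1 else 0.0
    return {"count": n, "min": lo, "max": hi, "mean": mean, "variance": variance}

stats = streaming_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Because each value is touched once and only a few scalars are kept, the same loop works on data streamed from disk or from several machines.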
- 6. Basic statistics • Inferential statistics: – generalize results to larger populations based on small samples – look at how things change over time – use sampling methods to find samples that are representative of the whole population – determine what is happening. N data points: O(N^2) calculations
- 7. Why is statistical computing important in research and decision-making? • Evidence-based analysis • Exploring relationships between variables • Evaluating the effectiveness of interventions • Contributing to improved outcomes • A vital role in fields such as healthcare, finance, marketing, and social sciences
- 8. Basic statistics - Challenges
  • High dimensionality: noise accumulation, spurious correlations, incidental homogeneity
  • High dimensionality + large sample size: heavy computational cost, algorithmic instability
  • Big Data drawn from multiple sources, at different time points, using different technologies: heterogeneity, experimental variations, statistical biases
  • Consequences: wrong statistical inference and false scientific conclusions
- 9. Basic statistics - Solutions • New statistical thinking: variable selection, dimension reduction, new regularization methods, independence screening • New computational methods: the development of new computational infrastructure and data storage methods
- 10. Generalized N-body problem • In the 17th century, Sir Isaac Newton formulated: – the laws of motion – the law of universal gravitation. Together they describe the behavior of objects and their interactions. • Origin of the N-body problem: predicting the motions of N celestial objects interacting with each other gravitationally • Karl Fritiof Sundman: solved the problem for n = 3 • L. K. Babadzanjanz and Qiudong Wang: generalized the solution to n > 3
- 11. N-body problem • Three bodies with equal mass [published 2000] • Three bodies of unequal mass • Two pairs of bodies orbiting about each other • An orbit discovered in 2008 by Tiancheng Ouyang, Duokui Yan, and Skyler Simmons at BYU
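- 12. Generalized N-body problem - Challenges • Numerical approximations • Chaotic behavior • Interdisciplinary nature • Main obstacle: O(N^2)
- 13. Generalized N-body problem - Solutions • Barnes-Hut algorithm [Barnes and Hut, 1986]: $\sum_{i \in R} K(x, x_i) \approx N_R \, K(x, \mu_R)$ if $s/r < \theta$, where R is a tree cell holding $N_R$ points with centroid $\mu_R$, s is the cell size, r is the distance from x to the cell, and $\theta$ is the accuracy threshold. This reduces the N(N-1)/2 = O(N^2) pairwise evaluations to O(N log N).
- 14. Generalized N-body problem - Solutions • Fast Multipole Method [Greengard and Rokhlin 1987]: $\forall x$, the sum $\sum_i K(x, x_i)$ is approximated by a multipole/Taylor expansion of order p, reducing N(N-1)/2 = O(N^2) to O(N). • Quadtree [Callahan-Kosaraju 95]: O(N) is impossible for a log-depth tree.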
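The core idea of Barnes-Hut, replacing a distant group of points by a single point at its centroid, can be checked numerically. This is a minimal sketch under stated assumptions, not a full tree code: the kernel 1/||x - y||, the threshold `theta=0.5`, and all function names are illustrative choices that do not come from the slides.

```python
import numpy as np

def kernel(x, y):
    """Gravity-like kernel K(x, y) = 1 / ||x - y|| (an illustrative choice)."""
    return 1.0 / np.linalg.norm(x - y, axis=-1)

def exact_sum(x, points):
    """Direct summation over every point in the cell: O(N_R) per query."""
    return float(np.sum(kernel(x, points)))

def monopole_sum(x, points):
    """Approximate the whole cell by N_R copies of its centroid: O(1) per query."""
    center = points.mean(axis=0)
    return len(points) * float(kernel(x, center))

def well_separated(x, points, theta=0.5):
    """Opening criterion: cell size s over distance r must be below theta."""
    center = points.mean(axis=0)
    s = float(np.max(points.max(axis=0) - points.min(axis=0)))
    r = float(np.linalg.norm(x - center))
    return s / r < theta

rng = np.random.default_rng(0)
cluster = rng.uniform(-0.5, 0.5, size=(1000, 2))  # a cell of side about 1
query = np.array([10.0, 0.0])                     # a far-away query point

exact = exact_sum(query, cluster)
approx = monopole_sum(query, cluster)
rel_err = abs(exact - approx) / exact
```

A real implementation would store the centroid and count of every tree cell once and, for each query, descend the tree only into cells that fail the `well_separated` test.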
- 15. Linear Algebraic computations. Problems involve matrix operations, solving linear systems, and finding eigenvalues, eigenvectors, inverses, orthogonality, ... Examples: linear regression, SVD, PCA, clustering, graph analysis, image processing (edge detection, compression, blurring, ...). [Figures: linear regression, SVD, PCA, clustering, kernel for edge detection]
- 16. Linear Algebraic computations - Challenges - Matrices with slowly decaying spectra → high computational complexity, sensitive to noise. - Nearly singular matrices, det(M) ≈ 0 → nearly non-invertible, sensitive to small changes in matrix entries. → Some solution approaches: - Truncated SVD, regularization, pseudoinverse using SVD - Random sampling + statistical methods. E.g.: choose a random submatrix of the given matrix, based on suitable probability distributions, to approximate the SVD of the whole.
- 17. Linear Algebraic computations - Challenges. Other challenges: - Optimization problems: generic LA approaches yield high training accuracy, which can cause overfitting → gradient descent, random sampling - The data grows so massive that it cannot be stored or handled by a single device → distributed linear algebra. [Figures: gradient descent; matrices are checkerboard-distributed on TPU during multiplication]
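As an illustration of combining randomness with a truncated SVD, here is a minimal sketch. Note the assumption: it uses a Gaussian random projection to find the range of the matrix, which is a related but different randomized technique from the random submatrix sampling named on the slide. The function name `randomized_svd` and the `oversample` parameter are hypothetical.

```python
import numpy as np

def randomized_svd(A, rank, oversample=10, seed=0):
    """Approximate rank-`rank` SVD of A via a random projection of its range."""
    rng = np.random.default_rng(seed)
    k = min(rank + oversample, min(A.shape))
    omega = rng.standard_normal((A.shape[1], k))  # random test matrix
    Q, _ = np.linalg.qr(A @ omega)                # orthonormal basis for the sampled range
    B = Q.T @ A                                   # small (k x n) matrix
    U_small, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small                               # lift back to the original row space
    return U[:, :rank], S[:rank], Vt[:rank]

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 80))  # exactly rank 5
U, S, Vt = randomized_svd(A, rank=5)
A_hat = U @ np.diag(S) @ Vt
rel_err = np.linalg.norm(A - A_hat) / np.linalg.norm(A)
```

The expensive decomposition is applied only to the small matrix `B`, so the cost is dominated by two passes over `A`. Accuracy degrades when the spectrum decays slowly, which is the difficulty the slide points out.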
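The checkerboard idea can be simulated on a single machine: the output matrix is cut into a grid of tiles, and each tile needs only one block-row of the left matrix and one block-column of the right matrix. This is a rough sketch with hypothetical names (`checkerboard_matmul`, `grid`); it performs no real device placement or communication and is not how any particular TPU runtime is implemented.

```python
import numpy as np

def checkerboard_matmul(A, B, grid=2):
    """Compute A @ B tile by tile, as if each output tile lived on its own device."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    row_blocks = np.array_split(np.arange(n), grid)
    col_blocks = np.array_split(np.arange(m), grid)
    inner_blocks = np.array_split(np.arange(k), grid)
    C = np.zeros((n, m))
    for rows in row_blocks:
        for cols in col_blocks:
            # Work for one (rows, cols) tile: accumulate partial products
            # over the blocks of the shared inner dimension.
            tile = np.zeros((len(rows), len(cols)))
            for inner in inner_blocks:
                tile += A[np.ix_(rows, inner)] @ B[np.ix_(inner, cols)]
            C[np.ix_(rows, cols)] = tile
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 5))
B = rng.standard_normal((5, 7))
C = checkerboard_matmul(A, B, grid=2)
```

Each tile is independent of the others, so the two outer loops are the part a distributed system would run in parallel.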
- 18. Optimization. Optimization problems appear in statistical methods from early on and frequently, e.g. semidefinite programming in manifold learning. → Optimization generally focuses on minimizing or maximizing an objective function. Problem classes: linear programming, quadratic programming; from unconstrained to constrained, both convex and non-convex.
- 19. Optimization - Challenges - A large number of variables and constraints - Finding a global solution for non-convex problems is an open problem. - Problems with integer constraints (integer programming). - Challenging problems, such as high-dimensional nonlinear objectives, may contain multiple local optima in which deterministic optimization algorithms can get stuck.
- 20. Optimization - Some approaches: - Exploit the particular mathematical form of certain problems to find more effective optimizers. E.g.: Sequential Minimal Optimization decomposes SVM training into sub-problems by iteratively selecting 2 Lagrange multipliers to solve for. - Stochastic optimization (introducing randomness) + online learning. E.g.: Stochastic Gradient Descent iteratively updates the parameters using a random subset of the data instead of the entire data set.
- 21. Optimization - Some approaches: - Distributed optimization, e.g. TensorFlow, PyTorch: distribute the optimization process a) across processors b) across multiple nodes.
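The stochastic gradient descent idea above can be sketched for least-squares linear regression. This is a minimal illustration, not the presenters' code: the function name, learning rate, batch size and synthetic data are all assumptions chosen so the example runs quickly.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.05, epochs=50, batch_size=16, seed=0):
    """Minimize mean squared error with mini-batch stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)                 # fresh random order each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # a random subset of the data
            residual = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx)
            w -= lr * grad                         # step against the batch gradient
    return w

rng = np.random.default_rng(42)
X = rng.standard_normal((500, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.standard_normal(500)
w = sgd_linear_regression(X, y)
```

Each update costs time proportional to the batch size rather than to N, which is why the method scales to data sets that a full gradient pass cannot handle comfortably.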
- 22. Graph-Theoretic Computations • Graph-theoretic computations involve traversing graphs, which can be the data itself or represent statistical models. • Common statistical computations on graphs include betweenness centrality and commute distances, used to identify nodes or communities of interest. • Large-scale, sparse graphs present computational challenges for these computations.
- 23. Challenges and Approaches • Challenges: high interconnectivity in graphs, large maximal clique size, and memory constraints. • Notable approaches: • Sampling and disk-based methods for handling large graphs. • Parallel/distributed approaches using sparse linear algebra or graph concepts. • Graph partitioning and linear algebraic reconditioning for efficient computations. • Transformation of graphical model inference problems into optimization or variational methods. • Sampling and parallel/distributed approaches for graphical model inference.
- 24. Additional Applications: • Manifold learning methods: Isomap requires an all-pairs-shortest-paths computation. • Single-linkage hierarchical clustering: equivalent to computing a minimum spanning tree. • These examples highlight the intersection between graph computations and distance-based or N-body-type problems.
- 25. Integration in Data Analysis • Integration is a key computation in data analysis, essential for Bayesian inference and statistical modeling. • Challenges arise with high-dimensional integrals, requiring specialized approaches.
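The link between single-linkage clustering and the minimum spanning tree can be sketched with Kruskal's algorithm stopped early. This is a minimal illustration with hypothetical names; it builds all O(N^2) pairwise edges for clarity, so it shows the equivalence rather than a scalable method.

```python
def single_linkage_clusters(points, n_clusters):
    """Merge along the shortest edges (Kruskal) until n_clusters components remain."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise Euclidean distances, sorted from shortest to longest.
    edges = sorted(
        (sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5, i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    components = n
    forest_edges = []  # the spanning-tree edges accepted so far
    for dist, i, j in edges:
        if components == n_clusters:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
            forest_edges.append((i, j, dist))
    labels = [find(i) for i in range(n)]
    return labels, forest_edges

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (5.0, 5.3)]
labels, forest_edges = single_linkage_clusters(points, n_clusters=2)
```

Running the loop to completion with `n_clusters=1` yields the full minimum spanning tree, and cutting its longest edges gives the same clusters as stopping early.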
- 26. Approaches to High-Dimensional Integration 1. Markov Chain Monte Carlo (MCMC) – Default approach for high-dimensional integration. – Utilizes a sequence of random samples to approximate the integral. – Widely used in Bayesian inference and random effects models. 2. Approximate Bayesian Computation (ABC) Methods – Operate on summary data to accelerate computation. – Useful for cases where exact inference is challenging. – Achieves acceleration by working with population means or variances.
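To make the MCMC item concrete, the sketch below estimates an integral (the second moment of a standard normal, whose true value is 1) from a random-walk Metropolis-Hastings chain. It is a one-dimensional toy under stated assumptions: the target density, step size, burn-in length and all names are illustrative and do not come from the slides.

```python
import math
import random

def metropolis_hastings(log_density, n_samples, step=1.0, x0=0.0, seed=0):
    """Random-walk Metropolis-Hastings for a one-dimensional target density."""
    rng = random.Random(seed)
    x = x0
    log_p = log_density(x)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)   # symmetric Gaussian proposal
        log_p_new = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            x, log_p = proposal, log_p_new
        samples.append(x)                     # a rejected move repeats the current state
    return samples

def log_std_normal(x):
    """Log-density of N(0, 1) up to an additive constant."""
    return -0.5 * x * x

samples = metropolis_hastings(log_std_normal, n_samples=50000, step=2.0)
burned = samples[1000:]                       # discard an initial burn-in segment
second_moment = sum(x * x for x in burned) / len(burned)
```

Only ratios of densities are used, so the normalizing constant of the target is never needed, which is what makes the approach practical for Bayesian posteriors.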
- 27. Alternative Approaches and Strategies 1. Population Monte Carlo – A form of adaptive importance sampling. – Enhances the efficiency of Monte Carlo integration. – Particularly useful for certain sequential models, such as particle filtering. 2. Variational Methods – Convert integration problems into optimization problems. – Provide a general framework for approximate inference. – Offer an alternative strategy for high-dimensional integration challenges. 3. Optimization-Based Point Estimation – Skirts the full integration problem. – Used in approaches like maximum a posteriori inference and empirical Bayesian inference. – Involves optimizing point estimates rather than performing full Bayesian inference.
- 28. Alignment
- 29. Genomic data science. Genomic data science emerged as a field in the 1990s to bring together two laboratory activities: • Experimentation: generating genomic information from studying the genomes of living organisms • Data analysis: using statistical and computational tools to analyze and visualize genomic data, which includes processing and storing data and using algorithms and software to make predictions based on available genomic data. Facts: • The data for a single human genome sequence alone would take up 200 gigabytes • An estimated 40 exabytes will be needed to store the genome-sequence data generated worldwide by 2025
- 30. DNA to RNA to Protein, Illustrating the Genetic Code
- 35. Questions about sequences 1. Biological question: “How similar are the genomes of humans and chimpanzees?” – Computational question: Given two sequences r and s, compute their similarity, sim(s,r) 2. Biological question: “This gene causes obesity in mice. Do humans have the same gene?” – Computational question: Given a sequence r (the mouse gene) and a database D of sequences (all human genes), find sequences s in D where sim(r,s) is above a threshold
- 36. Questions about sequences 3. Biological question: “We know some mutations of this gene cause sickle-cell anemia. We have the sequences of 100 patients and 100 normal people. Let's find the disease-causing mutations.” – Computational question: Given two sets of sequences of different lengths, find an alignment that maximizes the overall similarity. Then look for mutations that are unique to one group.
  Example (raw → aligned):
  Patients: ACGCGT → ACGCGT; CGCGT → _CGCGT; ACGCGA → ACGCGA
  Control: AGCTT → A_GCTT; ACGCTT → ACGCTT; ACGCTA → ACGCTA
  Performing alignment makes it easy to compute the similarity between two sequences.
- 37. Scoring function. Used to compare the similarity of two strings up to changes such as mutation, insertion, and deletion. For the string AGGCCTC: Mutations: AGG A CTC; Insertions: AGG G CTCT; Deletions: AGG . CTC. Scores: Match: +m; Mismatch: -s; Gap: -d. Simple scoring function: F = (#matches) × m - (#mismatches) × s - (#gaps) × d. The total score reflects the quality of the alignment.
- 38. Alignment criterion: the highest score?
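The simple scoring function, and the search for the highest-scoring alignment, can be sketched as follows. The values m = s = d = 1 and the function names are assumptions for illustration, since the slides leave the parameters symbolic. The second function is the standard Needleman–Wunsch dynamic program for the optimal global alignment score, shown here without traceback.

```python
def alignment_score(a, b, m=1, s=1, d=1, gap="_"):
    """F = (#matches) * m - (#mismatches) * s - (#gaps) * d for two aligned strings."""
    assert len(a) == len(b), "aligned strings must have equal length"
    score = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            score -= d
        elif x == y:
            score += m
        else:
            score -= s
    return score

def needleman_wunsch_score(r, t, m=1, s=1, d=1):
    """Best global alignment score of r and t, keeping only two DP rows."""
    prev = [-d * j for j in range(len(t) + 1)]  # aligning the empty prefix of r
    for i in range(1, len(r) + 1):
        curr = [-d * i] + [0] * len(t)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (m if r[i - 1] == t[j - 1] else -s)
            curr[j] = max(diag, prev[j] - d, curr[j - 1] - d)
        prev = curr
    return prev[-1]

given = alignment_score("ACGCGT", "_CGCGT")       # five matches, one gap
best = needleman_wunsch_score("ACGCGT", "CGCGT")  # the best any alignment can do
```

The dynamic program takes time proportional to the product of the two sequence lengths, which is what makes alignment against large databases a computational challenge.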
- 39. Problems
- 40. Solutions
- 41. Thank you for your time 😊

- Massive data refers to an amount of data too large to process with traditional tools such as spreadsheets or text processors. It can be structured or unstructured and reaches petabytes and exabytes in size. Big data can be analyzed for insights that improve decisions and give confidence for making strategic business moves. Processing massive data, also known as big data, presents several common challenges: storage, processing speed, data quality, security, data integration, cost, and scalability.
- Introduce massive data, then the system architecture.
- Dimension reduction can be used for noise reduction, data visualization, cluster analysis, or as an intermediate step that facilitates other analyses.
- The inverse of a nearly singular matrix may be highly sensitive to small changes in the matrix entries. Nearly non-invertible → iterative methods.
- dividing the computational workload and data across multiple processing units
- Linear programming: determine the best outcome in a linear mathematical model, given a set of linear constraints. LA computations are a special case (2nd-order optimization). Quadratic programming: quadratic objective function and linear constraints. 2nd-order cone programming: linear objective, with linear constraints that include 2nd-order cone constraints. Semidefinite programming deals with the optimization of linear objective functions subject to linear matrix inequality constraints; it generalizes linear programming to handle optimization problems involving positive semidefinite matrices. Manifold learning: learn the structure in high-dimensional data and produce a lower-dimensional representation.
- Optimization problems are expressed as mathematical models. Training an SVM requires solving a very large QP, which takes a long time. A stochastic program is an optimization problem in which some or all problem parameters are uncertain but follow known probability distributions. This framework contrasts with deterministic optimization, in which all problem parameters are assumed to be known exactly.
- SMO exploits the particular structure of the SVM quadratic optimization problem by iteratively selecting two Lagrange multipliers and solving a sub-problem to update them. The objective function aims to maximize the margin between the decision boundary and the support vectors while minimizing the classification errors. The Lagrange multipliers (α values) are the variables to be optimized. The constraints ensure that the sum of the Lagrange multipliers weighted by the corresponding target variables is zero and that the Lagrange multipliers stay within a specified range (0 ≤ α[i] ≤ C). The GD algorithm in deep learning: online learning receives a sequence of data points one at a time and updates its model iteratively. Stochastic optimization: the use of randomness in the objective function or in the optimization algorithm.
- To compare the similarity of two strings under changes such as mutation, insertion, or deletion. Example string: AGGCCTC. Interactive demo for the Needleman–Wunsch algorithm (mostafa.io)
- Criteria for evaluating an alignment
- To solve the problem and achieve computational efficiency, the following directions can be pursued: sampling, parallel/distributed computing, algorithms