Cognitive Toolkit - Deep Learning framework from Microsoft
1. Cognitive Toolkit - Deep Learning framework from Microsoft
Łukasz Grala
lukasz@tidk.pl | lukasz.grala@cs.put.poznan.pl
2. Łukasz Grala
• Data architect at TIDK
• Creator of "Data Scientist as a Service"
• Microsoft Certified Trainer and university lecturer
• Author of advanced training courses and workshops, as well as numerous publications and webcasts
• Recipient of the Microsoft Data Platform MVP award since 2010
• PhD student at Poznań University of Technology, Faculty of Computing (databases, data mining, machine learning)
• Speaker at numerous conferences in Poland and abroad
• Holds numerous certifications (MCT, MCSE, MCSA, MCITP, …)
• Board member of the Polish Information Processing Society, Wielkopolska Branch
• Member and leader of Data Community Poland (formerly Polish SQL Server User Group (PLSSUG))
• Passionate about data analysis, storage and processing; fan of jazz and MTB
email: lukasz@tidk.pl, lukasz.grala@cs.put.poznan.pl; blog: grala.it
6. Machine Learning
Timeline of machine learning milestones:
1763 - The Underpinnings of Bayes' Theorem
1805 - Least Squares
1812 - Bayes' Theorem
1913 - Markov Chains
1950 - Turing's Learning Machine
1951 - First Neural Network Machine
1958 - Single-layer neural network on a room-size computer
1967 - Nearest Neighbor
1982 - Recurrent Neural Network
1995 - Random Forest Algorithm; Support Vector Machines
1997 - IBM Deep Blue Beats Kasparov
2012 - Recognizing Cats on YouTube
2016 - AlphaGo
7. Machine Learning
“Can machines do what we (as thinking entities)
can do?”
Alan Turing, “Computing Machinery and Intelligence”, Mind, 1950
Turing’s test
8. Machine Learning
“Machine Learning is a field of computer science that
gives computers the ability to learn without being explicitly
programmed.”
Arthur Samuel, “Some Studies in Machine Learning Using the Game of
Checkers”, IBM Journal of Research and Development, 1959
9. Machine Learning
“A computer program is said to learn from experience
E with respect to some class of tasks T and
performance measure P if its performance at tasks in
T, as measured by P, improves with experience E.”
Tom M. Mitchell, “Machine Learning”, McGraw Hill, 1997
10. Supervised
Each point in the training data is associated with a label or output
Task is to learn a model/hypothesis that predicts output for points not
in the training dataset
Classification
Given a set of features, predict discrete outputs (Fraud/Not Fraud)
Regression
Given a set of features, predict continuous outputs (Credit score, item price, …)
Recommendation
Given a set of {user, item, rating} triplets and optionally features about users and items,
predict ratings for an item, items similar to a given item, users similar to a given user
Anomaly detection
Given a set of features for “normal” examples,
predict normal vs anomaly
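The supervised setting above can be sketched with a 1-nearest-neighbor classifier. This is only an illustration: the slide does not prescribe an algorithm, and the function name, feature layout and data below are hypothetical.

```python
import math

def predict_1nn(train_points, train_labels, query):
    """Return the label of the training point closest to `query`."""
    best_index = min(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    return train_labels[best_index]

# Hypothetical labelled training data: (amount, hour of day) -> label.
train_points = [(10.0, 12.0), (12.0, 14.0), (900.0, 3.0), (950.0, 2.0)]
train_labels = ["not fraud", "not fraud", "fraud", "fraud"]

# Predict outputs for points that are not in the training dataset.
print(predict_1nn(train_points, train_labels, (15.0, 13.0)))
print(predict_1nn(train_points, train_labels, (920.0, 4.0)))
```

The "model" here is simply the stored training set; real learners generalize by fitting parameters instead.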
11. Unsupervised
Points on a training dataset are not associated with known output values
Creates a model that learns inherent structure in training data
Clustering
Given a training dataset, find a small number of centers, ‘k’, that are “close” to points in the dataset
Each point in the dataset is associated with at most a single center
Principal Component Analysis
Given a training dataset with ‘N’ features, find a set of ‘k’ features that approximates the data with bounded
error
The set of ‘k’ principal components is representative of the original dataset but with much lower
dimensionality
Time Series
ARIMA, ETS,…
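The clustering task on this slide can be sketched with Lloyd's k-means algorithm. This is a minimal illustration under simplifying assumptions (the first k points serve as initial centers, the iteration count is fixed); the function name and data are hypothetical.

```python
import math

def kmeans(points, k, iterations=10):
    """Lloyd's algorithm; the first k points are used as initial centers."""
    centers = [points[i] for i in range(k)]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point is associated with its nearest center.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centers, assignment

points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centers, assignment = kmeans(points, k=2)
print(centers)
print(assignment)
```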
16. Artificial Neural Networks
ANNs are processing devices (algorithms or actual hardware) that are loosely modeled after the neuronal
structure of the mammalian cerebral cortex but on much smaller scales.
The simplest definition of a neural network, more properly referred to as an 'artificial' neural network
(ANN), is provided by the inventor of one of the first neurocomputers, Dr. Robert Hecht-Nielsen. He
defines a neural network as:
"...a computing system made up of a number of simple, highly interconnected processing
elements, which process information by their dynamic state response to external inputs."
19. Overfitting
The green line represents an overfitted
model and the black line represents a
regularized model. While the green line
best follows the training data, it is too
dependent on that data and it is likely to
have a higher error rate on new unseen
data, compared to the black line.
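One common way to obtain a regularized fit like the black line is to add an L2 penalty to the training loss. The slide does not say which regularizer was used, so this is an assumption; the symbols $f_\theta$ (the model), $\lambda$ (the penalty strength) and $n$ (the number of training points) are introduced here for illustration.

```latex
\[
J(\theta) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - f_\theta(x_i)\bigr)^2 + \lambda\,\lVert\theta\rVert_2^2
\]
```

With $\lambda = 0$ the model is free to follow the training data closely; a larger $\lambda$ penalizes large weights and usually gives a smoother fit that generalizes better.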
20. Deep Learning Use Cases
Text
• Sentiment Analysis
• Augmented Search
• Fraud Detection
• NLP
Video and Image
• Facial Recognition
• Emotion Recognition
• Image Search
• Photo Clustering
• Tags
• Motion Detection
Sound
• Voice Recognition
• Voice Search
• Sentiment Analysis
• Flaw Detection
Time Series
• Prediction
• Recommendation
• Risk Detection
21. Convolutional network
Convolutional neural network (CNN, or ConvNet) is a class of deep,
feed-forward artificial neural networks that has successfully been
applied to analyzing visual imagery.
24. ImageNet CNN
• The model that won the ImageNet competition in 2012
• 5 convolutional layers and 2 fully connected layers
• ReLU units and Dropout in the top layer; 60 million parameters
• 1.2 million training images
• Classification into 1000 classes
• Trained on two GPUs for a week
• Error 16.4% (second place: 26.2%)
26. Recurrent Neural Networks
A recurrent neural network (RNN) is a class of artificial neural
network where connections between units form a directed cycle.
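The directed cycle means that a unit's output at one time step feeds back into it at the next. A minimal sketch with a single scalar unit follows; the tanh nonlinearity, the parameter names and the values are assumptions chosen for illustration, not taken from the slides.

```python
import math

def rnn_forward(inputs, w_x, w_h, bias, h0=0.0):
    """Scalar recurrence: h_t = tanh(w_x * x_t + w_h * h_{t-1} + bias)."""
    h = h0
    states = []
    for x in inputs:
        # The previous state h re-enters the unit: this is the cycle.
        h = math.tanh(w_x * x + w_h * h + bias)
        states.append(h)
    return states

# An impulse at t=0 keeps echoing through the hidden state afterwards.
states = rnn_forward([1.0, 0.0, 0.0], w_x=1.0, w_h=0.5, bias=0.0)
print(states)
```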
40. Learners
Algorithm             Strengths
rxFastLinear          Fast, accurate linear learner with auto L1 & L2
rxLogisticRegression  Logistic regression with L1 & L2
rxFastTree            Boosted decision tree from Bing. Competitive with XGBoost. Most accurate learner for most cases
rxFastForest          Random forest
rxNeuralNet           GPU-accelerated Net# DNNs with convolutions
rxOneClassSvm         Anomaly or unbalanced binary classification
41. Learners - Scalability
• Streaming (not RAM bound)
• Billions of features
• Multi-proc
• GPU acceleration for DNNs
• Distributed on Hadoop/Spark via ensembling
45. ONNX
ONNX is a community project created by Facebook and Microsoft.
ONNX provides a definition of an extensible computation graph model, as well as
definitions of built-in operators and standard data types.
Each computation dataflow graph is structured as a list of nodes that form an
acyclic graph. Nodes have one or more inputs and one or more outputs. Each
node is a call to an operator. The graph also has metadata to help document its
purpose, author, etc.
Operators are implemented externally to the graph, but the set of built-in
operators is portable across frameworks. Every framework supporting ONNX will
provide implementations of these operators on the applicable data types.
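The graph structure described above can be illustrated with plain Python data. This is a toy sketch and not the ONNX file format or API: the dictionary layout, `OPERATORS` and `run_graph` are hypothetical names; only the ideas (a node list, named inputs and outputs, externally implemented operators, metadata) come from the text.

```python
# Operator implementations live outside the graph (hypothetical table).
OPERATORS = {
    "Add": lambda a, b: a + b,
    "Mul": lambda a, b: a * b,
    "Relu": lambda a: max(a, 0.0),
}

# The graph: metadata plus a list of nodes in topological order.
graph = {
    "doc": "y = relu(x * w + b)",
    "inputs": ["x", "w", "b"],
    "outputs": ["y"],
    "nodes": [
        {"op": "Mul", "inputs": ["x", "w"], "outputs": ["xw"]},
        {"op": "Add", "inputs": ["xw", "b"], "outputs": ["z"]},
        {"op": "Relu", "inputs": ["z"], "outputs": ["y"]},
    ],
}

def run_graph(graph, feeds):
    """Evaluate every node once; each node is a call to an operator."""
    values = dict(feeds)
    for node in graph["nodes"]:
        args = [values[name] for name in node["inputs"]]
        values[node["outputs"][0]] = OPERATORS[node["op"]](*args)
    return [values[name] for name in graph["outputs"]]

print(run_graph(graph, {"x": 2.0, "w": 3.0, "b": -10.0}))
print(run_graph(graph, {"x": 2.0, "w": 3.0, "b": 1.0}))
```

Because the graph only names operators, any runtime that supplies its own `Add`, `Mul` and `Relu` could evaluate the same description.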
48. Cognitive Toolkit
• FFN, CNN, RNN/LSTM, Batch normalization, Sequence-to-Sequence with
attention and more
• Reinforcement learning, generative adversarial networks, supervised and
unsupervised learning
• Ability to add new user-defined core-components on the GPU from Python
• Automatic hyperparameter tuning
• Built-in readers optimized for massive datasets
• Full APIs for defining networks, learners, readers, training and evaluation from
Python, C++, C#, BrainScript
• Evaluate models with Python, C++, C#, R and BrainScript
• Automatic shape inference based on your data
49. CNTK – Layers Library
Simple model with one hidden layer, built with the Dense() function
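The computation behind a fully connected layer can be sketched in plain Python, without CNTK. The helper `dense` below is a hypothetical stand-in that computes activation(x · W + b); it is not CNTK's Dense(), which learns its weights, and all weights and inputs here are made-up values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(x, weights, bias, activation=None):
    """Compute activation(x @ weights + bias) for one input vector.

    `weights` is a list of rows with shape (len(x), output_dim).
    """
    output_dim = len(bias)
    z = [
        sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
        for j in range(output_dim)
    ]
    return z if activation is None else [activation(v) for v in z]

# One hidden layer with 3 units, followed by a linear output layer.
x = [1.0, 2.0]
w_hidden = [[0.5, -0.5, 0.0], [0.25, 0.25, 1.0]]
b_hidden = [0.0, 0.0, -2.0]
w_out = [[1.0], [1.0], [1.0]]
b_out = [0.0]

hidden = dense(x, w_hidden, b_hidden, activation=sigmoid)
output = dense(hidden, w_out, b_out)
print(hidden, output)
```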
50. CNTK – Layers Library
Alternative Sequential()
2011-style feed-forward speech-recognition network
with 6 hidden sigmoid layers of identical dimensions
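The idea of composing layers in sequence can be sketched without CNTK. The helpers `sequential` and `make_dense` are hypothetical stand-ins, the layer sizes are shrunk to toy values, and every weight is fixed to the same constant so that the example is deterministic; a real network would learn its weights.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_dense(in_dim, out_dim, activation, weight=0.1):
    """Hypothetical layer factory: all weights equal `weight`, bias is 0."""
    def layer(x):
        assert len(x) == in_dim
        z = weight * sum(x)  # identical for every output unit
        return [activation(z) for _ in range(out_dim)]
    return layer

def sequential(layers):
    """Compose layers left to right into a single function."""
    def model(x):
        for layer in layers:
            x = layer(x)
        return x
    return model

# 6 hidden sigmoid layers of identical dimension, then a linear output layer.
hidden_dim = 4
layers = [make_dense(3, hidden_dim, sigmoid)]
layers += [make_dense(hidden_dim, hidden_dim, sigmoid) for _ in range(5)]
layers.append(make_dense(hidden_dim, 2, lambda z: z))

model = sequential(layers)
output = model([1.0, 2.0, 3.0])
print(output)
```

Building the repeated layers with a loop keeps the definition short however many identical hidden layers are stacked.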
62. Łukasz Grala, Microsoft MVP
CEO, Data Architect
Lukasz.grala@cs.put.poznan.pl
+48 663832323
http://tidk.pl
Lukasz.grala@tidk.pl
http://dsaas.co
Editor's notes
AI:
ML (Bayesian learning, decision trees, rule sets)
Expert networks
Neural networks
Theorem proving
Decision making with incomplete data
Logical/rational reasoning
Fuzzy logic
Evolutionary algorithms
Microsoft Azure VM description:
https://docs.microsoft.com/pl-pl/azure/machine-learning/data-science-virtual-machine/overview
VM:
https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-ads.linux-data-science-vm-ubuntu?tab=Overview
https://azuremarketplace.microsoft.com/en-us/marketplace/apps?search=Data%20Science%20Virtual%20Machine&page=1
Dense:
shape: output dimension of this layer
activation (default: None): pass a function here to be used as the activation function, such as activation=relu
input_rank: if given, number of trailing dimensions that are transformed by Dense() (map_rank must not be given)
map_rank: if given, the number of leading dimensions that are not transformed by Dense() (input_rank must not be given)
init (default: glorot_uniform()): initializer descriptor for the weights. See cntk.initializer for a full list of random-initialization options.
bias: if False, do not include a bias parameter
init_bias (default: 0): initializer for the bias
Embedding:
shape: the dimension of the desired embedding vector. Must not be None unless weights are passed
init: initializer descriptor for the weights to be learned. See cntk.initializer for a full list of initialization options.
weights (numpy array): if given, embeddings are not learned but specified by this array (which could be, e.g., loaded from a file) and not updated further during training
Convolution:
filter_shape: shape of receptive field of the filter, e.g. (5,5) for a 2D filter (not including the input feature-map depth)
num_filters: number of output channels (number of filters)
activation: optional non-linearity, e.g. activation=relu
init: initializer descriptor for the weights, e.g. glorot_uniform(). See cntk.initializer for a full list of random-initialization options.
pad: if False (default), the filter is shifted over the “valid” area of the input only, that is, no value outside the area is used. If pad is True, the filter is applied to all input positions, and values outside the valid region are treated as zero.
strides: increment when sliding the filter over the input, e.g. (2,2) to reduce the dimensions by 2
bias: if False, do not include a bias parameter
init_bias: initializer for the bias
use_correlation: currently always True and cannot be changed. It indicates that Convolution() actually computes the cross-correlation rather than the true convolution
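The effect of pad and strides on the output size can be sketched in plain Python for a single-channel 2D input. The function `correlate2d` is a hypothetical illustration, not CNTK code; in particular, how the zero padding is split between the borders is an assumption. As the use_correlation note says, the kernel is not flipped.

```python
def correlate2d(image, kernel, stride=1, pad=False):
    """Cross-correlate a 2-D image with a 2-D kernel (no kernel flip)."""
    kh, kw = len(kernel), len(kernel[0])
    if pad:
        # Zero-pad so the filter can be applied at every input position.
        top, left = (kh - 1) // 2, (kw - 1) // 2
        bottom, right = kh - 1 - top, kw - 1 - left
        width = len(image[0]) + left + right
        image = (
            [[0.0] * width for _ in range(top)]
            + [[0.0] * left + list(row) + [0.0] * right for row in image]
            + [[0.0] * width for _ in range(bottom)]
        )
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    return [
        [
            sum(
                image[r * stride + i][c * stride + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

image = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]
kernel = [[1.0] * 3 for _ in range(3)]

valid = correlate2d(image, kernel)                       # 2 x 2 output
same = correlate2d(image, kernel, pad=True)              # 4 x 4 output
strided = correlate2d(image, kernel, stride=2, pad=True) # 2 x 2 output
print(valid)
```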
MaxPooling / AveragePooling:
filter_shape: receptive field (window) to pool over, e.g. (2,2) (not including the input feature-map depth)
strides: increment when sliding the pool over the input, e.g. (2,2) to reduce the dimensions by 2
pad: if False (default), the pool is shifted over the “valid” area of the input only, that is, no value outside the area is used. If pad is True, the pool is applied to all input positions, and values outside the valid region are treated as zero. For average pooling, the count used for the average does not include padded values.
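Pooling over the “valid” area (pad=False) can be sketched in plain Python for a single-channel 2D input. The helper `pool2d` is hypothetical and only illustrates how filter_shape and strides determine the output; it is not CNTK code.

```python
def pool2d(image, filter_shape, strides, reducer):
    """Apply `reducer` to each window of the image (no padding)."""
    fh, fw = filter_shape
    sh, sw = strides
    out_h = (len(image) - fh) // sh + 1
    out_w = (len(image[0]) - fw) // sw + 1
    return [
        [
            reducer(
                [image[r * sh + i][c * sw + j] for i in range(fh) for j in range(fw)]
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def mean(values):
    return sum(values) / len(values)

image = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]

# A (2,2) window with (2,2) strides halves each spatial dimension.
max_pooled = pool2d(image, (2, 2), (2, 2), max)
avg_pooled = pool2d(image, (2, 2), (2, 2), mean)
print(max_pooled)
print(avg_pooled)
```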
LSTM:
shape: dimension of the output
cell_shape (optional): the dimension of the LSTM’s cell. If None, the cell shape is identical to shape. If specified, an additional linear projection will be inserted to project from the cell dimension to the output shape.
use_peepholes (optional): if True, then use peephole connections in the LSTM
init: initializer descriptor for the weights. See cntk.initializer for a full list of initialization options.
enable_self_stabilization (optional): if True, insert a Stabilizer() for the hidden state and cell
BatchNormalization:
map_rank: if given then normalize only over this many leading dimensions. E.g. 1 to tie all (h,w) in a (C, H, W)-shaped input. Currently, the only allowed values are None (no pooling) and 1 (e.g. pooling across all pixel positions of an image)
normalization_time_constant (default 5000): time constant in samples of the first-order low-pass filter that is used to compute mean/variance statistics for use in inference
initial_scale: initial value of scale parameter
epsilon: small value that gets added to the variance estimate when computing the inverse
use_cntk_engine: if True, use CNTK’s native implementation. If False, use cuDNN’s implementation (GPU only).
disable_regularization: if True then disable regularization in BatchNormalization.
LayerNormalization:
initial_scale: initial value of scale parameter
initial_bias: initial value of bias parameter
Stabilizer:
steepness: sharpness of the knee of the softplus function