The document discusses methods for improving reproducibility in artificial intelligence research. It begins by introducing AI projects the author has worked on, then discusses causes of non-reproducibility, such as lack of access to data and code. It examines potential solutions: reproducibility frameworks, benchmarking, and standalone methods. It focuses on the author's MultiAffect framework, which standardizes data processing, feature extraction, training, evaluation, and reporting, aiming to make research reproducible and accessible. The framework is demonstrated on affect recognition and action recognition tasks, achieving results comparable to other works.
3. Some AI projects that I've done
● Hum2Song: Composes the musical accompaniment for a melody produced by a human voice.
● MultiAffect: Reproducible Research Framework for Multimodal Affect and Action Recognition
● AutomEditor: An AI-based video editor.
● DeepStab: Real-time video object stabilization tool using deep learning
● DeepPiracy: Video piracy detection system using Longest Common Subsequence and deep learning
● VR-360-musi: Transforms a YouTube video into five stems using AI and places them in a room.
● ReputationAgent: System that detects inaccurate and unfair reviews given to gig workers.
● TaskBot: Research and development of a bot that helps teams delegate tasks
● ExpertTwin: A workspace enhanced by an AI agent that provides content to knowledge workers
● LivenessDetection: Design and development of machine vision algorithms to validate identity
● QuantumDrugDiscovery: Drug discovery using quantum computing.
● Awesome Machine Learning Jupyter Notebooks for Colab: Curated list of notebooks
● Awesome Robotic Process Automation: Curated list of notebooks
● Artificial Intelligence By Example, Second Edition (book)
● Explainable AI (book)
● Among others ...
5. What is Reproducibility?
Reproducibility means obtaining consistent computational
results using the same input data, computational steps,
methods, code, and conditions of analysis.
Replicability means obtaining consistent results across
studies aimed at answering the same scientific question,
each of which has obtained its own data.
7. Causes
Researchers over the years have investigated the factors that affect reproducibility in
data-science-related studies. Some common findings point out that non-reproducible
studies:
● Lack information about, or access to, the dataset in its original form and order
● Do not document the software environment used
● Do not control randomization
● Do not provide the actual implementation of the proposed techniques
● Require an amount of computational resources that not everybody can afford
8. Looking for solutions ...
During my work in academia, I have explored three different solutions:
● Reproducibility framework
● Reproducible benchmarking
● Reproducible standalone methods
12. My journey
I will explain what is needed
to produce and use any of
these approaches.
13. Reproducibility framework
A reproducible research framework standardizes:
● Data processing
● Feature engineering
● Training methods
● Evaluation methods
● Research document formatting
● Administration interface
14. Inclusiveness
Additionally, it should be accessible in order to have a
broader impact. Some of the desired features are:
● No client-side requirements (runs online)
● No special hardware requirements
● No extra configuration
● Free of charge
15. MultiAffect: Reproducible Research
Framework for Multimodal Video
Classification and Regression Tasks at
utterance-level with spatio-temporal
feature fusion by using Face, Body,
Audio, Text, and Emotion features
So, with this in mind, I created MultiAffect.
16. MultiAffect
framework
The main goal of MultiAffect is to give guidance on how to
reproduce research experiments in a fixed setting.
These are the 5 main components:
● Platform Setup: Ensures that the machine is
properly configured
● Feature Extractor: Monitors the feature extraction
and manages the extracted features
● Model Trainer: Defines, trains, and fine-tunes the
model
● Evaluator: Calculates and reports the performance
metrics.
● Research Paper Template: Defines the minimum set
of sections and mandatory citations
17. Platform Setup
Preparing a host machine to replicate machine learning research is usually
challenging, time-consuming, and expensive. One reason is that most of the
models available today require a large-scale dataset for training, and
multimedia datasets have high storage requirements. In machine learning
tasks, the feature extraction step reduces the dimensionality of the data
and helps the model focus on its most significant or discriminative
parameters. However, extracting features from multimedia samples is a
highly demanding task in terms of computation.
18. Dealing with faulty code and
compiled libraries
Some of the tools required to perform the data extraction need to be
compiled for the host operating system. Scientific tools are commonly
built from multiple libraries and sometimes depend on specific versions
of certain libraries for certain operating systems; this makes them
prone to compilation errors. Sometimes the code is not provided, and
extra effort is needed to implement the instructions described in the
publication. Even when the code is available, it is often not ready to
reproduce, and significant effort is required to make it work, when it
works at all.
19. The solution is a
virtual machine
The software challenges can be mitigated by using virtual machines or
containers, which provide a base operating system with the proper
configuration built in. These approaches can run on top of the host
operating system or on online infrastructure. The hardware challenges can
be overcome by investing in sufficiently powerful on-site infrastructure
or by using online on-demand infrastructure. As we have explored,
conventional research paper replication depends on multiple factors.
20. MultiAffect over
Google
Colaboratory
The MultiAffect framework uses Google
Colaboratory to publish the Jupyter interactive
notebook and to perform the computation in
the attached virtual machine. Google
Colaboratory is a free research tool that enables
users with a Google account to host and run code
on Google's infrastructure. Google
Colaboratory offers users the ability to execute
their code segments on CPUs, GPUs, and TPUs
(an AI-accelerator application-specific integrated
circuit). At the time this work was published, Google
Colaboratory offered a virtual machine with a Tesla
K80 GPU, 12 GB of RAM, and 350 GB of storage.
This platform provides enough resources to
perform video action recognition.
22. Ubuntu as
Operating
System
This platform includes a Debian-based operating system, so the provided
instructions are platform-specific. Local replication of our framework
requires an Ubuntu 18.04 operating system in order to install all the
libraries successfully. Our platform is agnostic to the Python version:
all the code executed in the notebook is written in Python and can run
on either version 2 or 3 of the interpreter. Our framework is able
to set up and run the experiment from the online platform, enabling
users to deploy and execute the code in a free-of-charge environment
and without special requirements on the client side.
23. Fine tuning the setup
process
The definition of the setup was an incremental
process of three main steps: (1) Initial setup:
The first functional version; (2) Packing
components: Uploading components in
batches to cloud storage; and (3) Optimal
setup: A version that loads faster.
24. Initial setup
In this step, the libraries were downloaded and compiled
directly from the notebook by running shell commands
from the notebook cells. Pre-requisites, missing
dependencies, and additional packages were installed in
the same notebook. The dataset and the pre-trained
models were downloaded from their original sources to
the virtual machine. The feature extraction, training,
and evaluation code were directly inserted into the
notebook in separate cells. The first version was tested
until it successfully extracted the features, trained, and
evaluated the models from the notebook. A backup of this
notebook was documented and set as the initial version.
25. Packing components
Each individual compiled library was packaged into a zip file that contains
the binary files as well as the configuration files. The pre-trained models
that were individually downloaded from their original sources were packed
together into a single file. Latency is often lower when downloading a
single large file from a high-speed source than when downloading multiple
large files from sources with varying bandwidth. The outcome of this task
is a collection of zip files that were uploaded to a Google Drive account.
The files were shared with public access so they could be downloaded from
Google Colaboratory notebooks logged in with different accounts.
26. Optimal
setup
After packaging and storing the files from the initial setup
in the cloud, we started a branch of the initial setup that
loads these files. The optimal setup notebook was a
simplified version of the initial notebook: instead of a
long section documenting the setup process, it has a
download-prerequisites section. The files were downloaded
using a Python tool called GDown that is already installed
in Google Colaboratory. It is worth mentioning that the
virtual machine attached to Google Colaboratory notebooks
already has an Ubuntu distribution with the most common
machine learning tools and libraries installed. This optimal
version is tailored to Google Colaboratory only.
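For illustration, here is a minimal sketch of this download step, assuming a public Google Drive file (FILE_ID and the archive name are placeholders, not the real MultiAffect packages):

import gdown
import zipfile

# Download one packed, pre-compiled component from a shared Drive file.
url = "https://drive.google.com/uc?id=FILE_ID"  # hypothetical public file ID
gdown.download(url, "packed_library.zip", quiet=False)

# Unpack the binaries and configuration files into the working directory.
with zipfile.ZipFile("packed_library.zip") as zf:
    zf.extractall("tools/")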
27. Optimizing the loading time
For each of the libraries installed, we measured the time it takes to
install the prerequisites plus the compilation time. On average, the
overall setup of each library was five times slower than downloading
and extracting a previously compiled and zipped version of the library.
The total setup time for the Google Colaboratory environment was
reduced from 43 minutes to 6 minutes after implementing the pre-compiled
tools strategy and downloading the files from the same Google
infrastructure.
28. Feature extractors
MultiAffect includes a feature extraction module as an independent component.
Multimodal feature extraction is often a highly demanding task, as it requires
certain pre-processing of the videos before features can be extracted. Some
common pre-processing tasks are: separating the audio, extracting frames,
identifying faces, cropping faces, removing the background, skeleton detection
(pose), and emotion detection, among many other procedures. Our feature extraction
methodology is based on the common ground found in submissions. Our feature
extraction process aims to keep factors such as person descriptors (i.e., gender,
age, race), scale, position, background, and language invariant. Our approach
considers ten features from five different modalities: face, body, audio,
text, and emotions.
29. Audio features
OpenSMILE (1582 features): The audio is
extracted from the videos and processed by
openSMILE, which extracts audio features such
as loudness, pitch, jitter, etc.
Features were computed over the whole video clip
(general) and over 20 fragments (temporal).
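A hedged sketch of this pipeline, assuming ffmpeg extracts the audio track and the standard SMILExtract CLI computes the features; the config name (emobase2010, which commonly yields 1582 features) is an assumption, not necessarily the exact MultiAffect configuration:

import subprocess

# Extract a mono 16 kHz WAV track from the video clip.
subprocess.run(["ffmpeg", "-y", "-i", "clip.mp4", "-ac", "1",
                "-ar", "16000", "clip.wav"], check=True)

# Run openSMILE on the extracted audio to produce the feature vector.
subprocess.run(["SMILExtract", "-C", "config/emobase2010.conf",
                "-I", "clip.wav", "-O", "clip_features.csv"], check=True)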
30. Text features
Opinion Lexicon (6 features): Based on the ratio of
sentiment words (adjectives, adverbs, verbs, and
nouns) that express positive or negative sentiments.
Subjectivity Lexicon (4 features): Uses the
subjectivity lexicon from MPQA (Multi-Perspective
Question Answering), which models sentiment by its
type and intensity.
Word vectors (GloVe) and BERT embeddings
31. Face features
OpenFace (709 features): A facial behavior analysis tool
that provides accurate facial landmark detection,
head pose estimation, facial action unit recognition,
and eye-gaze estimation. We get points that
represent the face.
VGG16 FC6 (4096 features): The faces are cropped
(224×224×3), aligned, and background-zeroed, then
passed through a pretrained VGG16 to obtain a
4096-dimensional feature vector from the FC6 layer.
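A minimal sketch of the FC6 extraction with Keras, assuming the faces are already cropped and aligned; in Keras' pretrained VGG16 the first fully connected layer (FC6) is named "fc1":

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model

vgg = VGG16(weights="imagenet", include_top=True)
fc6_extractor = Model(inputs=vgg.input, outputs=vgg.get_layer("fc1").output)

face = np.random.rand(1, 224, 224, 3).astype("float32")  # stand-in for a cropped face
features = fc6_extractor.predict(preprocess_input(face * 255.0))
print(features.shape)  # (1, 4096)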
32. Body Features
OpenPose (BODY_25) (11
features): The normalized angles
between the joints. I did not use
the raw computed features because
they were 25×224×224.
VGG16 FC6 skeleton image (4096
features): I drew the skeleton (neck
at the center) on a black
background, fed it to a VGG16, and
extracted a feature vector from the
FC6 layer.
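A hedged sketch of the angle computation, assuming 2-D OpenPose keypoints; the joint triplet shown is illustrative, not one of the exact eleven used here:

import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (radians) formed by the points a-b-c."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Example: elbow angle from shoulder, elbow, and wrist (x, y) keypoints.
shoulder, elbow, wrist = np.array([0.2, 0.5]), np.array([0.3, 0.3]), np.array([0.5, 0.35])
print(joint_angle(shoulder, elbow, wrist) / np.pi)  # normalized to [0, 1]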
33. Emotion features
EmoPy (7 features): A deep neural net toolkit for
emotion analysis via Facial Expression Recognition
(FER).
Other (28 features): Four other models from different FER
contest participants.
Seven categories per model, 35 features in total.
20 samples per video clip were predicted (temporal), and
from these I computed their normalized sum (general).
34. Model trainer
The MultiAffect models use different deep
learning models to recognize affect. Among
them we find RNNs (Recurrent Neural
Networks), CNNs (Convolutional Neural
Networks), and simple DNNs (Deep Neural
Networks) such as MLPs (Multilayer Perceptrons).
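As an illustration only, here is a minimal Keras sketch of the recurrent variant, assuming 20 temporal steps of the 1582 openSMILE features and seven emotion classes; the actual MultiAffect architectures and hyperparameters differ:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential([
    LSTM(64, input_shape=(20, 1582)),  # 20 temporal steps of audio features
    Dense(32, activation="relu"),
    Dense(7, activation="softmax"),    # e.g., seven emotion classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()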
36. Evaluator
The MultiAffect framework is designed to perform classification and regression tasks.
Depending on the task performed, the platform is adjusted to display meaningful
evaluations. The classification task reports accuracy, F1-score, recall, precision,
AUC, and other metrics for the training, validation, and testing sets. For a
regression task, the framework computes the MSE (Mean Squared Error) and the CCC
(Concordance Correlation Coefficient), which describes how well a new test or
measurement reproduces a gold-standard test.
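The CCC can be computed directly from its definition; a minimal numpy sketch, assuming 1-D arrays of predictions and gold-standard labels:

import numpy as np

def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    return 2 * cov / (y_true.var() + y_pred.var() + (mu_t - mu_p) ** 2)

print(ccc(np.array([0.1, 0.5, 0.9]), np.array([0.2, 0.4, 0.8])))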
37. Plotting the
results
For the classification task, our reproducible framework
produces two plots, one visualizing the accuracy during
training and one showing the training and testing loss,
plus a confusion matrix obtained while evaluating the
model on the test data. For regression tasks, the
results are displayed in a scatter plot that shows the
correlation between the predicted and gold-standard labels.
38. Experimentation
In order to test its generalizability, we performed experiments on
two main tasks: affect recognition and video action recognition.
The video action and affect recognition tasks are addressed by
training and testing classification and regression models,
respectively. One of the main goals of the proposed framework is to
be able to perform both tasks by only configuring a new set of
variables, without making any change to the code. Another goal was
to deliver results comparable to existing work.
45. Let's switch
approaches to
Benchmarking
You can use MultiAffect as a tool for any video
categorization and regression task. You can try it out
at this URL: http://bit.ly/multiaffect
Now, if you want to compare which of the existing
techniques works best for your problem, you will
need a tool that benchmarks all the methods. This is
why I adapted an existing text classification
benchmarking tool for use in the cloud; you can
find it here: http://bit.ly/ai-text-workshop
46. Text Classification
Benchmarking tool
This is a Google Colaboratory notebook
with instructions that includes these
methods:
● Word n-gram + LR (Logistic
Regression)
● Char n-gram + LR
● (Word + char n-gram) + LR
● RNN, no embedding
● RNN + GloVe embedding
● CNN (multi-channel)
● RNN + CNN
● Google BERT
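For illustration, a minimal sketch of the first baseline (word n-grams + LR) with scikit-learn; the toy texts and settings are assumptions, not the notebook's exact configuration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great product", "terrible service", "really great support"]
labels = [1, 0, 1]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # word unigrams and bigrams
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(["great service"]))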
49. The last approach: independent ML/DL methods
Sometimes you may already know the best algorithm to
use for your requirements. For that case, I adapted >100
notebooks so they can be used as tools to train
models from the cloud by only uploading your data. You
can find them here: http://bit.ly/awesome-ai
The process of adapting a notebook is: 1) open it in Colab
from GitHub; 2) add the extra libraries; 3) download the
data from Drive.
Let's do a quick recap of all the ML/DL/RL methods to
identify which one fits your problem best.
53. When to use it?
● Simple regression problems
○ How much the rent should cost in a certain area
○ How much I should charge for a specific amount of work
● Problems where we want to define a rule that separates two
similar categories, e.g., Premium or Basic pricing for customers
under certain parameters (number of rooms vs. number of cars)
https://colab.research.google.com/drive/1-dTb2vCiZHa-DnyqlVFGOnMSNjvkIOTP
https://colab.research.google.com/drive/1Z20iJspQm2Y_wLI51wgE6nXGOSu1kG4W
https://colab.research.google.com/drive/1-yk3m6p3ylNLtTaEf3nya6exO_wv8f_L
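A minimal sketch for the rent example, assuming this slide covers linear models; the areas and prices are made up for illustration:

from sklearn.linear_model import LinearRegression

areas = [[40], [55], [70], [90]]  # apartment size in square meters
rents = [400, 520, 690, 880]      # monthly rent

model = LinearRegression().fit(areas, rents)
print(model.predict([[60]]))      # estimated rent for a 60 m2 unit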
58. When to use it?
● When we need to know what decisions the machine is taking
● When we need to explain to others how the features are evaluated
● When there are not many features
https://colab.research.google.com/drive/1Fc8qs1fwdcpoZ_-tTj32OBl-tCGlAe5c
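A hedged sketch, assuming this slide covers decision trees, whose learned rules can be printed and explained to others:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3).fit(iris.data, iris.target)

# Print the learned rules so the decisions can be explained.
print(export_text(tree, feature_names=list(iris.feature_names)))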
60. Code
# Load a sample dataset, train a random forest, and report test accuracy.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('accuracy is', accuracy_score(y_test, y_pred))
61. When to use it?
● When we want to know alternative ways to evaluate a problem.
● When we want to manually discard flows that are biased.
● When we want to manage ensembles from one single method.
https://colab.research.google.com/drive/1WMOOtaHAMZPi-enVM8RRM_CC-grEtm9P
https://colab.research.google.com/drive/1jDdWp-CJybMJDX17jBmG5qoPPg9qj1sm
https://colab.research.google.com/drive/1-uDIRl1aYqmJX59rAJumHY1T20QqBJiQ
63. When to use it?
● When we want to know the probabilities of the different cases.
● When we need a probabilistic model.
● When we need an easy way to prove it on paper.
https://colab.research.google.com/drive/1qOCllKsBBrLeUnP-XAXHefXCtbuBWl69
https://colab.research.google.com/drive/11FiWH00vzygQp1T_pD0MCfMFg6FYsd01
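A hedged sketch, assuming this slide covers a Naive Bayes-style probabilistic classifier; predict_proba exposes the per-class probabilities the bullets refer to:

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
nb = GaussianNB().fit(iris.data, iris.target)
print(nb.predict_proba(iris.data[:2]))  # class probabilities per sample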
65. When to use it?
● When intuition says that the problem can be solved by picking the
most similar option.
● When the information is not exhaustive.
● When we want to justify the decision of the algorithm through common
human reasoning.
https://colab.research.google.com/drive/1GeUVjDW74SxFxz2Nh3rqOlte-S2dblYv
https://colab.research.google.com/drive/1X12qds10ZfN7QCrmpRR2OXxa--PTyS5e
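A hedged sketch, assuming this slide covers k-Nearest Neighbors; the retrieved neighbors themselves justify the decision:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=5).fit(iris.data, iris.target)
print(knn.predict(iris.data[:1]))
# Indices of the most similar training samples that justify the decision:
print(knn.kneighbors(iris.data[:1], return_distance=False))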
67. When to use it?
● When we don’t know how to make sense of the data.
● When we want to optimize resources by grouping related elements.
● When we want the computer to create the labels for us.
https://colab.research.google.com/drive/1RL3oZm6LgnEChI1aOQZoMn1WDk-DQJiV
https://colab.research.google.com/drive/1yvy1scktjcDyydG2fZz2OJfRFAer0SEO
https://colab.research.google.com/drive/1CzEf6giBXPSQI5UJOhZrZfYKAJcH68wg
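A hedged sketch, assuming this slide covers clustering such as k-means; the machine creates the group labels without supervision:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris.data)
print(km.labels_[:10])  # machine-created labels for the first samples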
69. When to use it?
● It was the most effective technique before neural networks; it can
achieve excellent results with less processing.
● Mathematically speaking, it is based on very strong principles: it
creates complex multidimensional hyperplanes that separate the classes
precisely.
● It is not a white-box technique, but it may be the best option for
problems where we want to get the most out of a machine learning
approach without dealing with neural networks.
https://colab.research.google.com/drive/13PRk-GKeSivp4R-FIdjmYBQS7xWUco9C
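A minimal SVM sketch matching the hyperplane description above; the RBF kernel and settings are illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), test_size=0.2, random_state=0)
svm = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)  # non-linear separating surface
print(svm.score(X_test, y_test))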
71. When to use it?
● When we want to optimize a regression
● When we want to binarize the output
● As a preliminary analysis before implementing neural networks
https://colab.research.google.com/drive/1PWmvsZRaj3JQ8rtj6vlwhJhJpOrIAamT
https://colab.research.google.com/drive/1p8rcrSQB-thLSakUmCHjSbqI6vd-NkCq
https://colab.research.google.com/drive/1jhrAtmPgg6Uu0WzMzV-VakWlncQAvk-D
73. When to use it?
● When we have very few features and no extra detail can be
extracted by hidden layers.
● These are in fact neural networks, but we do not always need to use
them for deep learning; they can serve as machine learning baselines
when benchmarking against other machine learning techniques.
● When we want the power of neural networks and we don’t have
much computational power.
https://colab.research.google.com/drive/10PvUh-8ZsVqQADqXSmRIDHGiCH9iypyO
79. When to use it?
● Classification problems where common machine learning algorithms
perform poorly.
● Models with many features.
● Multi-class projects.
https://colab.research.google.com/drive/1GAYf5yMNBkVrag0z2Q4MPSwuqfRN1Wz
https://colab.research.google.com/drive/12YBDQFYXN8VruxKTfzDpbPsYFAEQceQP
https://colab.research.google.com/drive/1pyRqGmMG4-Mj8Wis5XrQ_a4dUJvYln1
https://colab.research.google.com/drive/1wHjugM56k0ay5QCmRVMBfAMF96EY7A5k
https://colab.research.google.com/drive/1Ly0BtKBphUdeqMQBO8Xjweku62Vq3UAX
84. When to use it?
● When we want to process images
● When we want to process videos
● When we have high-dimensional data
https://colab.research.google.com/drive/1jN8oswBOds4XuRbnQMxxDXDssmDD_rD9
https://colab.research.google.com/drive/1iEYJs75hat_URxshmCBMGzHQo5VgdRvN
https://colab.research.google.com/drive/1YHKZgpJuriGYjEzFDNGz2Hf0widu-exx
https://colab.research.google.com/drive/1gi2_Or0rDz5Gg9FkGJjFDxgeiwt5-lXm
https://colab.research.google.com/drive/1QcnY-LOZU9c7Sp2DsDVeYxLNBx87VNhn
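A minimal CNN sketch for image inputs, assuming 28×28 grayscale images; real image and video models are deeper:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()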
87. When to use it?
● When sequences are provided
○ Text sequences
○ Image sequences (videos)
○ Time series
● When we need to provide an ordered output
https://colab.research.google.com/drive/1twc5dBjgFLFuv8p-gPfnrscTPcBlkx5q
https://colab.research.google.com/drive/10-ou-Za75bFgwArvgP3QfNJ4cWuwY-eF
https://colab.research.google.com/drive/1PEOqq8mBcmc-FMj8lpbVF93cQI4RLgVJ
https://colab.research.google.com/drive/1XUEAFxxKVmdgC7oPOzVpGInXfUeTcgIQ
https://colab.research.google.com/drive/1tfDDriSDUh_J9OHwjt-NzT8xRiEDQF7x
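A minimal recurrent sketch for sequence classification, assuming integer-encoded text sequences; the vocabulary size and dimensions are illustrative:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential([
    Embedding(input_dim=10000, output_dim=64),  # token IDs to vectors
    LSTM(64),                                   # reads the sequence in order
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.build(input_shape=(None, 50))  # sequences padded to length 50
model.summary()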
91. When to use it?
● When we want to benchmark models
● When different models are stronger evaluated together than individually
● When the individual processing is not exhaustive
https://colab.research.google.com/drive/1Kg_nHBmUGQ1zepU-wZlDwMyM-YrlMTUX
https://colab.research.google.com/drive/1U86EVD-6ulYMxTzDX8-m6nEptYq0yaej
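A hedged sketch, assuming this slide covers model combination; a soft-voting ensemble is often stronger than its individual members:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(n_estimators=100)),
                ("nb", GaussianNB())],
    voting="soft",  # average the predicted class probabilities
).fit(X, y)
print(ensemble.score(X, y))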
94. When to use it?
● On every new model
● When we have enough time to train multiple models
● When we don’t know which hyperparameters are better.
https://colab.research.google.com/drive/1gTBDfbJy9SsgbUPRhL_mrujw6HC2BjxN
https://colab.research.google.com/drive/17Ii6Nw89gZT8l_XrvSQhNWaa_VfcdLBn
https://colab.research.google.com/drive/1xe4G_dqsPMq0n3w_Mqlm-39j5TMUqHJR
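A hedged sketch, assuming this slide covers hyperparameter search; grid search trains multiple models to find which settings work best:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
params = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
grid = GridSearchCV(SVC(), params, cv=5).fit(X, y)  # one model per combination
print(grid.best_params_, grid.best_score_)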
97. When to use it?
● When a robot explores a place and needs to learn from the environment.
● When we can try as much as we can in a simulator.
● When we want to find an optimal path
https://colab.research.google.com/drive/1fgv5UWhHR7xSwZfwwltF4OFDYqtWdlQD
https://colab.research.google.com/drive/14aYmND2LKtaPTW3JWS7scKGwU9baxHeE
https://colab.research.google.com/drive/16Scl43smvcXGZFEGITs15_SN_7-EidZd
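A hedged tabular Q-learning sketch; the 1-D corridor environment is made up for illustration (the agent learns to walk right to reach the goal):

import numpy as np

n_states, n_actions = 5, 2            # positions 0..4; actions: 0=left, 1=right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2

for episode in range(500):
    s = 0
    while s != n_states - 1:          # the goal is the last position
        if np.random.rand() < epsilon:
            a = np.random.randint(n_actions)   # explore
        else:
            a = int(np.argmax(Q[s]))           # exploit
        s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(np.argmax(Q, axis=1))           # learned policy: move right (1) everywhere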
100. When to use it?
● When we have too many features and we do not know which of them
are useful.
● When we want to reduce the dimensionality of our model.
● When we want to plot our decision boundaries.
https://colab.research.google.com/drive/1CO6BACds6J8hGPYlEU2INnSTpT0EmS74
https://colab.research.google.com/drive/1VU2SO3IfklPkK1EPMnwiO7trJslt79OZ
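A hedged sketch, assuming this slide covers dimensionality reduction such as PCA; many features are projected down to two for plotting:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)  # (150, 2): ready for a 2-D decision-boundary plot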
102. When to use it?
● When we have limited data
● When we want to help our model to generalize more
● When our unseen data comes in very different formats.
https://colab.research.google.com/drive/1ANIc7tXrggPT2I9JzpBlZQ3BBhCpbJUJ
https://colab.research.google.com/drive/1cQRVdiDc9xraHZYLu3VrXxX4FKXoaS8U
https://colab.research.google.com/drive/1O5far2FC4GlAc9pkLPZqsjKreCpI4S_-
108. When to use it?
● When we want to compress data.
● When we need to change one type of input into another type of output.
● When we don’t need much variability in the generated data.
https://colab.research.google.com/drive/1QxXqnhyqIZrrGtor2tVa4jY63adS4yc0
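A hedged autoencoder sketch, assuming flattened 28×28 inputs; the bottleneck layer compresses the input to a small code:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

autoencoder = Sequential([
    Dense(64, activation="relu", input_shape=(784,)),  # encoder
    Dense(16, activation="relu"),                      # compressed code
    Dense(64, activation="relu"),                      # decoder
    Dense(784, activation="sigmoid"),                  # reconstruction
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.summary()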
110. When to use it?
● When we need to transfer a style
● When we need more variability in the generated output
● When we need to keep context in the generation.
https://colab.research.google.com/drive/1YOYH78YQAgPBRIpUPhh_e0cFLNu-BPVo
https://colab.research.google.com/drive/1POZpWN-2M5hy3D2ATWzJs2LC5sk7hpts
https://colab.research.google.com/drive/1aKywiJ5p0eCwDIIWKe8Q205rcKqmR_VX
https://colab.research.google.com/drive/1QxXqnhyqIZrrGtor2tVa4jY63adS4yc0
https://colab.research.google.com/drive/1Lw7BqKABvtiSyUHg9DeM5f90_WFGB7uz
112. When to use it?
● When we generate text
● When we generate the next sequence in a series
● When the order in the generated output matters.
https://colab.research.google.com/drive/1ZB-oueLvBgltXshb1lDV2EpqbqV6FC5x
114. When to use it?
● When context is an essential part of the generated output
● When we need to keep consistency in the frequency space.
● When we have enough computational resources.
https://colab.research.google.com/drive/1jWaRkii6xLkxxAPyfudeGJsHf_jokqXG
115. Put notebooks into production
Running code from a notebook in the cloud may seem suitable only for
testing purposes, but you can actually run it as a service from a local
Docker container.
I created a script that automatically prepares a container and executes
it whenever you need, as a command-line application.
Example:
docker run psykohack/google-colab https://colab.research.google.com/drive/133DIr7lvkuaNU_X2JN5id3XmtSXQspy9
Code: https://github.com/toxtli/google-colab-docker
116. Resources
More and more AI research is being distributed nowadays in a
redistributable format. Some valuable resources can be found at:
https://www.paperswithcode.com/
https://www.kaggle.com/
117. Conclusions
● Nowadays we can reproduce state-of-the-art AI algorithms from a web-based
platform.
● Complex tasks can be executed in notebooks structured as
frameworks.
● Our main job is to prepare the data to feed the algorithm that best
fits our needs.
● AI prototyping is drastically accelerated by using these technologies.
● Since these technologies sit between pure-code and pure-tool
approaches, they give us the flexibility to iterate faster.