2. Q-1
• What is Memory-Driven Computing?
• System: an enterprise-class 16-core server with 2TB RAM
• Design an IDS that analyses network logs and identifies
intrusion candidates, with a minimum of 256 systems to monitor
• Find a state-of-the-art approach for intrusion detection.
• Should you design the IDS as a scale-up or a scale-out application?
• Would your choice change if the system were multiple standard
workstations with 32GB RAM each?
3. • Compute nodes accessing a shared pool of Fabric-Attached Memory
• An optimised Linux-based operating system (OS) running on a
customised SoC (System on a Chip)
• Photonics/Optical communication links, including the new X1
photonics module, are online and operational
• Fabric: the communication vehicle that transfers data between
elements of a computer system (e.g., SATA, DDR, PCI)
4. Comparison Case I and Case II
Xeon Intel 2TB Enterprise
Network Servers (Case I)
• 8 x Intel 10-Core XEON E7-
8870 2.4GHz 30M Cache
Processors
• 2048GB (2TB) Memory
• 3 x 300GB SAS 10K hot-plug
drives
• Price: $46,000
HP Workstation Z240 – Tower
(Case II)
• 6th Generation Intel Core i7
8MB Cache
• Up to 64GB DDR4
• Up to 1TB SSD
• Price: $2,000
5. • Scalability of the system
• Vertical scaling: add more
processors and RAM
– Lower power consumption
– Lower licensing cost
• Horizontal scaling: add more
servers, each with fewer processors
– Easy to upgrade
6. Q-2
• Build a supervised binary classification system for detecting
spam emails.
• Basically, given an email, your system should be able to
classify it as spam or non-spam. You are given a training data
set containing 400,000 rows and 10,000 columns (features).
The 10,000 features include both numerical variables and
categorical variables.
• You need to create a model using this training data set.
Consider the two options given in Question 1 above, namely a
server with 2TB RAM or a simple workstation with 32GB RAM.
• For each of these two options, can you explain how your
approach would change in training a model?
7. Machine Learning
• Machine learning methods: Bayesian classification, k-NN, ANNs,
SVMs, artificial immune systems and rough sets
• The Naïve Bayes classifier is the best-suited method here
• Training: parse each email into its constituent tokens
• Filtering: for each message, scan the message token by token and
calculate the overall message filtering indication
• If the message filtering indication is greater than a threshold, the
message is marked as spam
• The workstation is memory-constrained, so you will need to
figure out ways to reduce your data set. Think about dimensionality
reduction, sampling, etc.
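The training and filtering steps above can be sketched as a minimal Naïve Bayes filter in pure Python. The corpus, tokeniser, and threshold below are toy illustrations, not a real spam data set:

```python
import math
from collections import Counter

def tokenize(text):
    # Training step: parse each email into its constituent tokens
    return text.lower().split()

def train(emails):
    # emails: list of (text, is_spam) pairs -- a toy corpus, not real data
    spam, ham = Counter(), Counter()
    n_spam = n_ham = 0
    for text, is_spam in emails:
        if is_spam:
            spam.update(tokenize(text)); n_spam += 1
        else:
            ham.update(tokenize(text)); n_ham += 1
    return spam, ham, n_spam, n_ham

def spam_score(text, spam, ham, n_spam, n_ham):
    # Filtering step: scan the message token by token and accumulate a
    # log-likelihood ratio as the overall filtering indication
    score = math.log(n_spam / n_ham)
    vocab = len(set(spam) | set(ham))      # for Laplace smoothing
    total_spam, total_ham = sum(spam.values()), sum(ham.values())
    for tok in tokenize(text):
        p_spam = (spam[tok] + 1) / (total_spam + vocab)
        p_ham = (ham[tok] + 1) / (total_ham + vocab)
        score += math.log(p_spam / p_ham)
    return score

emails = [("win free money now", True),
          ("free prize claim now", True),
          ("meeting agenda for monday", False),
          ("lunch on monday", False)]
model = train(emails)
THRESHOLD = 0.0   # mark as spam when the indication exceeds this
print(spam_score("free money prize", *model) > THRESHOLD)  # True (spam)
print(spam_score("monday meeting", *model) > THRESHOLD)    # False (ham)
```

A production filter would add smarter tokenisation and a tuned threshold, but the structure (per-token statistics at training time, accumulated indication at filtering time) is the same.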
8. Bonus Question
• Now, assume that you are able to successfully
create two different models — one for the
enterprise server and another for the
workstation.
• Which of these two models would do better on a
held-out test data set?
• Is it always the case that the model you built for
the enterprise class server with huge RAM would
have a much lower test error than the model you
built for the workstation?
9. Data Mining Model
• The training set is the data we have as of now.
We will remove a subset from it, and the removed
subset will be called the held-out set.
• Training (50%)
• Validation (25%)
• Testing (25%)
• The model is built on the training set, the
prediction errors are calculated using the
validation set, and the test set is used to assess
the generalization error of the final model.
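The 50/25/25 split above can be sketched in a few lines, assuming the rows fit in memory and using a fixed seed for reproducibility (both illustrative choices):

```python
import random

def split(rows, seed=0):
    # Shuffle, then carve out 50% training, 25% validation, 25% test
    rows = rows[:]                    # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)
    n = len(rows)
    train = rows[: n // 2]
    valid = rows[n // 2 : (3 * n) // 4]
    test  = rows[(3 * n) // 4 :]
    return train, valid, test

# 400,000 row indices, as in the Q2 data set
train, valid, test = split(list(range(400_000)))
print(len(train), len(valid), len(test))  # 200000 100000 100000
```

Shuffling before splitting matters: if the rows are ordered (e.g., all spam first), an unshuffled split would give the three sets very different label distributions.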
10. Q-3
• You were asked to create a binary classification system
using the supervised learning approach to analyse X-ray
images for detecting lung cancer.
• You have implemented two different approaches and
would like to evaluate both to see which performs
better.
• You find that Approach 1 has an accuracy of 97% and
Approach 2 has an accuracy of 98%.
• Given that, would you choose Approach 2 over
Approach 1? Would accuracy alone be a sufficient
criterion in your selection?
12. • Supervised: All data is labeled and the algorithms
learn to predict the output from the input data.
• Unsupervised: All data is unlabeled and the
algorithms learn the inherent structure of the
input data.
• Semi-supervised: Some data is labeled but most
of it is unlabeled and a mixture of supervised and
unsupervised techniques can be used.
13. Q-4
• In Q3, instead of getting accuracies of 98 per cent and 97 per
cent, suppose both your approaches have pretty low accuracies of
59 per cent and 65 per cent.
• While you are debating whether to implement a totally different
approach, your friend suggests that you use an ensemble classifier
wherein you can combine the two classifiers you have already built
and generate a prediction.
• Would the combined ensemble classifier be better in performance
than either of the classifiers?
• If yes, explain why. If not, explain why it may perform poorly
compared to either of the two classifiers. When is an ensemble
classifier a good choice to try out when individual classifiers are not
able to reach the desired accuracy?
14. Ensembles of classifiers
• Ensembles are used to improve accuracy.
• To improve on the performance of individual
classifiers, the combined classifiers can be
based on a variety of classification
methodologies and can achieve different
rates of correctly classified individuals.
• The goal of classification-result integration
algorithms is to generate more certain, precise
and accurate system results.
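The simplest integration scheme is a majority vote. The sketch below uses three hypothetical base classifiers (a plain majority vote over just the two classifiers from Q4 could tie) whose individual errors fall on different samples; the labels and predictions are made up for illustration:

```python
from collections import Counter

def majority_vote(predictions):
    # predictions: one list of labels per base classifier, aligned by sample
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions)]

def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

truth = [1, 0, 1, 1, 0, 1, 0, 0]
clf_a = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical outputs, 75% accurate
clf_b = [1, 1, 1, 1, 0, 0, 0, 0]  # 75% accurate, errors on other samples
clf_c = [0, 0, 1, 1, 1, 1, 0, 0]  # 75% accurate
ensemble = majority_vote([clf_a, clf_b, clf_c])
print(accuracy(ensemble, truth))  # 1.0 -- every error is outvoted
```

The vote helps here only because the three classifiers err on *different* samples. If their errors were strongly correlated (all wrong on the same inputs), the majority would repeat the shared mistakes and the ensemble could do no better, or even worse, than the best individual classifier.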
15. Q-5
• You have built a classification model for email
spam such that you are able to reach a
training error which is very close to zero.
• Your classification model was a deep neural
network with five hidden layers.
• However, you find that the validation error is
not small but around 25 per cent. What is the
problem with your model?
17. Q-6
• Regularisation techniques are widely used to avoid
overfitting of the model parameters to training data so that
the constructed network can still generalise well for unseen
test data.
• Ridge regression and Lasso regression are two of the
popular regularisation techniques.
• When would you choose to use ridge regression and when
would you use Lasso?
• Specifically, you are given a data set, in which you know
that out of 100 features, only 10 to 15 of them are
significant for prediction, based on your domain
knowledge. In that case, would you choose ridge regression
over Lasso?
18. • Ridge regression cannot shrink coefficients
all the way to zero: you effectively either keep
all the coefficients or none of them. LASSO, by
contrast, performs both parameter shrinkage and
variable selection automatically, because it can
zero out the coefficients of weak or collinear
variables. This helps select a subset of the
given n variables when performing lasso regression.
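LASSO's variable selection can be seen in a small coordinate-descent sketch. The synthetic data (only feature 0 matters) and the hand-picked penalty `lam` are illustrative assumptions, not a recipe for choosing the penalty:

```python
import random

def lasso_cd(X, y, lam, sweeps=50):
    # LASSO by coordinate descent with soft-thresholding; the
    # thresholding step is what drives weak coefficients to exactly 0.
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # residual with feature j's contribution removed
            r = [y[i] - sum(w[k] * X[i][k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            z = sum(X[i][j] ** 2 for i in range(n))
            if rho > lam:
                w[j] = (rho - lam) / z
            elif rho < -lam:
                w[j] = (rho + lam) / z
            else:
                w[j] = 0.0            # soft-threshold: exact zero
    return w

# Synthetic data: y = 3 * x0, so only feature 0 matters; features 1-3
# are pure noise.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(50)]
y = [3 * row[0] for row in X]
w = lasso_cd(X, y, lam=10.0)
print([round(v, 2) for v in w])  # feature 0 survives; the rest are 0.0
```

Ridge regression on the same data would shrink all four coefficients toward zero but leave none exactly zero, which is why LASSO is the natural choice when you believe only 10-15 of 100 features are significant.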
19. Q-7
• You are given a computer system with 16 cores, 64GB
RAM, 512KB per core L1 cache (both instruction and
data caches are separate and each is of size 512KB),
2MB per core L2 cache, and 1GB L3 cache which is
shared among all cores.
• Now you are told to run multiple MySQL server
instances on this machine, so that you can use this
machine as a common backend system for multiple
Web applications.
• Can you characterize the different types of cache
misses that would be encountered by the MySQL
instances?
20. • L1, L2 and L3 caches are different memory pools, similar to
the RAM in a computer. They are built in to decrease the
time the processor takes to access data, known as latency.
• Level 1 (L1) cache (2KB - 64KB): Instructions are first searched for
in this cache. The L1 cache is very small compared to the others,
which makes it faster than the rest.
• Level 2 (L2) cache (256KB - 512KB): If the instructions are not
present in the L1 cache, the processor looks in the L2 cache, a
slightly larger pool with correspondingly higher latency.
• Level 3 (L3) cache (1MB - 8MB): With each cache miss, the lookup
proceeds to the next level of cache. This is the largest of all the
caches; even though it is the slowest, it is still faster than
RAM.
21. Cache Miss Types
• Assume the L1 cache is direct-mapped whereas the
L2 and L3 caches are four-way set associative.
• Compulsory/Cold: the first reference to a block
of memory, starting with an empty cache.
• Capacity: the cache is not big enough to hold
every block you want to use.
• Conflict: two blocks are mapped to the same
location and there is not enough room to hold
both.
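The three miss types can be made concrete with a toy direct-mapped cache simulator. The trace and cache size below are made up; the standard trick for classifying a miss is to compare against a fully-associative LRU cache of the same size (a miss there too is a capacity miss, otherwise it is a conflict miss):

```python
from collections import OrderedDict

def classify_misses(trace, n_lines):
    direct = {}          # direct-mapped cache: line index -> resident block
    lru = OrderedDict()  # fully-associative LRU cache of the same size
    seen = set()
    stats = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}
    for block in trace:
        idx = block % n_lines            # direct mapping: block mod lines
        fa_hit = block in lru
        if fa_hit:
            lru.move_to_end(block)       # refresh LRU position
        else:
            lru[block] = None
            if len(lru) > n_lines:
                lru.popitem(last=False)  # evict least recently used
        if direct.get(idx) == block:
            stats["hit"] += 1
        elif block not in seen:
            stats["compulsory"] += 1     # first-ever reference
        elif fa_hit:
            stats["conflict"] += 1       # only the mapping caused the miss
        else:
            stats["capacity"] += 1       # cache simply too small
        direct[idx] = block
        seen.add(block)
    return stats

# Blocks 0 and 4 map to the same line of a 4-line cache: ping-ponging
# between them causes conflict misses even though both would fit.
print(classify_misses([0, 4, 0, 4, 0, 4], n_lines=4))
```

For the MySQL scenario, multiple instances sharing L3 behave much like this toy trace at scale: each instance's working set keeps evicting the others', producing conflict and capacity misses beyond each instance's own compulsory misses.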
22. Q-8
• Consider Q7 again, but now suppose you are not given the
cache sizes of L1, L2 and L3, and have been asked to figure
out the approximate sizes of the L1, L2 and L3 caches.
• You are given the average cache access times (hit and
miss penalty for each access) for L1, L2 and L3 caches.
Other system parameters are the same.
• Can you write a small code snippet which can
determine the approximate sizes of L1, L2 and L3
caches? Now, is it possible to figure out the cache sizes
if you are not aware of the average cache access times?
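One possible probe, sketched below: pointer-chase through working sets of increasing size and watch for jumps in per-access time as each cache level overflows. In CPython the interpreter overhead drowns out the actual latencies, so treat this as the measurement *structure* (you would port it to C for trustworthy numbers); the sizes probed are illustrative:

```python
import random
import time

def traversal_time(n_bytes, reps=3):
    # Pointer-chase through a random cycle covering roughly n_bytes
    # (8 bytes per slot assumed); randomisation defeats the prefetcher.
    n = n_bytes // 8
    idx = list(range(n))
    random.Random(0).shuffle(idx)
    order = [0] * n
    for a, b in zip(idx, idx[1:] + idx[:1]):
        order[a] = b                 # successor array forming one cycle
    best = float("inf")
    for _ in range(reps):
        i = 0
        t0 = time.perf_counter()
        for _ in range(n):
            i = order[i]             # each step is one dependent load
        best = min(best, time.perf_counter() - t0)
    return best / n                  # seconds per access

# Per-access time should step up each time the working set outgrows
# a cache level; the jump positions bracket the L1/L2/L3 sizes.
for kb in (16, 64, 256, 1024, 4096):
    print(f"{kb:5d} KB: {traversal_time(kb * 1024) * 1e9:6.1f} ns/access")
```

Note the probe needs no *given* access times: it measures them itself. What it does need is some timing source; without any way to measure access latency, the cache sizes cannot be inferred from software alone.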
23. Average memory access time
• Average Memory Access Time (AMAT) is a
common metric to analyze memory system
performance.
• AMAT uses hit time, miss penalty, and miss rate
to measure memory performance.
• It accounts for the fact that hits and misses affect
memory system performance differently.
• It focuses on how locality and cache misses affect
overall performance and allows for a quick
analysis of different cache design techniques.
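The AMAT recurrence for a multi-level hierarchy is short enough to compute directly. The cycle counts below are made-up illustrative numbers, not measurements of any real CPU:

```python
def amat(hit_time, miss_rate, miss_penalty):
    # AMAT = hit time + miss rate x miss penalty
    return hit_time + miss_rate * miss_penalty

# Each level's miss penalty is the AMAT of the level behind it,
# so we evaluate from memory inwards. All values are in cycles.
l3 = amat(hit_time=30, miss_rate=0.10, miss_penalty=200)  # DRAM behind L3
l2 = amat(hit_time=10, miss_rate=0.20, miss_penalty=l3)
l1 = amat(hit_time=1,  miss_rate=0.05, miss_penalty=l2)
print(l1)  # effective access time, close to 2 cycles
```

Despite a 200-cycle DRAM penalty, the effective access time stays near the L1 hit time because each level filters out most of the misses of the level above it, which is exactly the locality argument the slide makes.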
24. Q-9
• Consider that you have written two different versions of a
multi-threaded application.
• One version extensively uses linked lists and the second
version uses mainly arrays. You know that the memory
accesses performed by your application do not exhibit good
locality due to their inherent nature. You also know that your
memory accesses are around 75 per cent reads.
• Now you have the choice between two computer systems —
one with an L3 cache size of 24MB shared across all eight
cores, and another with an L3 cache size of 2MB independent
for each of the eight cores.
• Are there any specific reasons why you would choose System
1 rather than System 2 for each of your application versions?
25. Q-10
• Can you explain the stochastic gradient
descent algorithm?
• When would you prefer to use the coordinate
gradient descent algorithm instead of the
regular stochastic gradient algorithm?
• Here’s a related question. You are told that the
cost function you are trying to optimize is non-
convex. Would you still be able to use the co-
ordinate descent algorithm?
26. Stochastic Gradient Descent Algorithm
• Stochastic gradient descent (often shortened to
SGD), also known as incremental gradient
descent, is a stochastic approximation of the
gradient descent optimization method for
minimizing an objective function that is written
as a sum of differentiable functions. In other
words, SGD tries to find minima or maxima by
iteration.
• Gradient descent is a first-order iterative
optimization algorithm for finding the minimum
of a function.
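A minimal SGD sketch for least-squares linear regression: each update steps along the gradient of a single randomly chosen example rather than the full sum. The toy data, learning rate, and epoch count are illustrative assumptions:

```python
import random

def sgd(data, lr=0.01, epochs=100, seed=0):
    # Minimise the mean squared error f(w, b) = avg (w*x + b - y)^2,
    # a sum of differentiable per-example losses, one example at a time.
    rng = random.Random(seed)
    w = b = 0.0
    for _ in range(epochs):
        rng.shuffle(data)               # "stochastic": random example order
        for x, y in data:
            err = w * x + b - y         # gradient of (w*x + b - y)^2
            w -= lr * 2 * err * x       #   d/dw = 2 * err * x
            b -= lr * 2 * err           #   d/db = 2 * err
    return w, b

# Toy data from the line y = 2x + 1
data = [(x, 2 * x + 1) for x in range(-5, 6)]
w, b = sgd(data)
print(round(w, 3), round(b, 3))  # close to 2 and 1
```

Coordinate descent would instead hold all parameters fixed except one and optimise that coordinate exactly; it is preferable when each one-dimensional subproblem is cheap or has a closed form (as in LASSO), while SGD is preferable when the data set is too large to touch every example per step.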
27. Tips
• Be clear about the fundamental concepts
• Make sure that you have updated your GitHub
and SourceForge pages
• Contributing to open source and having
interesting, even if small, coding projects
online in your GitHub pages would definitely
be a big plus.
Supervised learning is where you have input variables (x) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output.
The goal is to approximate the mapping function so well that when you have new input data (x), you can predict the output variables (Y) for that data.
It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process.