Interview Questions
10 Examples of Q&A on Data Mining and
Machine Learning
Tips
Q-1
• What is Memory-Driven Computing?
• System: enterprise-class 16-core server with 2TB RAM
• Design an IDS (intrusion detection system) that analyzes network logs
and identifies intrusion candidates, with a minimum of 256 systems to monitor
• Find the state-of-the-art approach for intrusion detection.
• Would you design the IDS as a scale-up or a scale-out application?
• Would your choice change if the system were multiple standard
workstations with 32GB RAM each?
• Compute nodes accessing a shared pool of Fabric-Attached Memory
• An optimised Linux-based operating system (OS) running on a
customised SoC (System on a Chip)
• Photonics/Optical communication links, including the new X1
photonics module, are online and operational
• Fabric: the communication vehicle that transfers data between elements
of a computer system (e.g. SATA, DDR, PCI)
Comparison of Case I and Case II
Intel Xeon 2TB Enterprise
Network Server (Case I)
• 8 x Intel 10-Core Xeon E7-8870
2.4GHz 30MB Cache
Processors
• 2048GB (2TB) Memory
• 3 x 300GB SAS 10K Hot-plug
Drives
• Price: $46,000
HP Workstation Z240 Tower
(Case II)
• 6th Generation Intel Core i7
8MB Cache
• Up to 64GB DDR4
• Up to 1TB SSD
• Price: $2,000
• Scalability of the system
• Vertical scaling (scale up): add more
processors and RAM to a single machine
– Less power consumption
– Lower licensing cost
• Horizontal scaling (scale out): add more
servers, each with fewer processors
– Easy to upgrade
Q-2
• Build a supervised binary classification system for detecting
spam emails.
• Basically, given an email, your system should be able to
classify it as spam or non-spam. You are given a training data
set containing 400,000 rows and 10,000 columns (features).
The 10,000 features include both numerical variables and
categorical variables.
• You need to create a model using this training data set.
Consider the two options given in Question 1 above, namely a
server with 2TB RAM or a simple workstation with 32GB RAM.
• For each of these two options, can you explain how your
approach would change in training a model?
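A back-of-the-envelope size estimate helps show why the two machines call for different approaches. The sketch below assumes every feature is stored as a dense numeric value (8-byte or 4-byte floats); the function name `dense_size_gib` is illustrative, and real categorical features may need more or less space depending on their encoding.

```python
def dense_size_gib(rows: int, cols: int, bytes_per_value: int) -> float:
    """Size in GiB of a dense rows x cols matrix with fixed-width values."""
    return rows * cols * bytes_per_value / 2**30

# Figures taken from the question: 400,000 rows and 10,000 features.
ROWS, COLS = 400_000, 10_000

for name, width in [("float64", 8), ("float32", 4)]:
    print(f"{name}: {dense_size_gib(ROWS, COLS, width):.1f} GiB")
# float64: 29.8 GiB
# float32: 14.9 GiB
```

Under these assumptions the full matrix alone nearly fills a 32GB workstation before any working memory for training is counted, while it occupies only a small fraction of the 2TB server.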
Machine Learning
• Machine learning methods: Bayesian classification, k-NN, ANNs,
SVMs, artificial immune systems and rough sets
• Naïve Bayes classifier: the method chosen here
• Training: parse each email into its constituent tokens
• Filtering: for each message, scan the message token by token and
calculate the overall message filtering indication
• If the message filtering indication is greater than a threshold, the
message is marked as spam
• The workstation is memory constrained, hence you will need to
figure out ways to reduce your data set. Think about dimensionality
reduction, sampling, etc.
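The training and filtering steps above can be sketched as a token-count Naïve Bayes filter. This is a minimal illustration, not the deck's implementation: the whitespace tokenizer, the add-one (Laplace) smoothing, the log-odds score used as the "filtering indication", the default threshold of 0 and the toy messages are all assumptions.

```python
import math
from collections import Counter

def tokenize(text):
    """Split a message into lowercase whitespace-separated tokens."""
    return text.lower().split()

def train(messages):
    """messages: iterable of (text, is_spam). Returns token and class counts."""
    token_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for text, label in messages:
        class_counts[label] += 1
        token_counts[label].update(tokenize(text))
    return token_counts, class_counts

def spam_score(text, token_counts, class_counts):
    """Log-odds that the message is spam, using add-one smoothing."""
    vocab = set(token_counts[True]) | set(token_counts[False])
    totals = {c: sum(token_counts[c].values()) for c in (True, False)}
    score = math.log(class_counts[True] / class_counts[False])
    for tok in tokenize(text):
        p_spam = (token_counts[True][tok] + 1) / (totals[True] + len(vocab))
        p_ham = (token_counts[False][tok] + 1) / (totals[False] + len(vocab))
        score += math.log(p_spam / p_ham)
    return score

def is_spam(text, model, threshold=0.0):
    """Mark the message as spam when its score exceeds the threshold."""
    return spam_score(text, *model) > threshold

# Toy training data, made up for illustration.
training = [
    ("win cash prize now", True),
    ("cheap prize click now", True),
    ("meeting agenda for monday", False),
    ("lunch on monday", False),
]
model = train(training)
print(is_spam("claim your cash prize", model))  # True
print(is_spam("agenda for lunch", model))       # False
```

Because the model only stores token counts, training can stream over the data one message at a time, which is one way to cope with the memory-constrained workstation.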
Bonus Question
• Now, assume that you are able to successfully
create two different models — one for the
enterprise server and another for the
workstation.
• Which of these two models would do better on a
held-out test data set?
• Is it always the case that the model you built for
the enterprise class server with huge RAM would
have a much lower test error than the model you
built for the workstation?
Data Mining Model
• The training set is the data we have as of now.
We set aside a subset of it, and that
subset is called the held-out set.
• Training (50%)
• Validation (25%)
• Testing (25%)
• The model is built on the training set, the
prediction errors are calculated using the
validation set, and the test set is used to assess
the generalization error of the final model.
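The 50/25/25 split above can be sketched as a shuffle of row indices followed by slicing. The function name `split_indices`, the fixed seed and the use of the 400,000-row figure from Q-2 are assumptions made for illustration.

```python
import random

def split_indices(n_rows, train_frac=0.50, val_frac=0.25, seed=0):
    """Shuffle row indices and cut them into train/validation/test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = int(n_rows * train_frac)
    n_val = int(n_rows * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]   # whatever remains is the test set
    return train, val, test

train_idx, val_idx, test_idx = split_indices(400_000)
print(len(train_idx), len(val_idx), len(test_idx))  # 200000 100000 100000
```

Shuffling before slicing matters when the rows are stored in some meaningful order, for example by date or by class.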
Q-3
• You were asked to create a binary classification system
using the supervised learning approach, to analyse X-
ray images when detecting lung cancer.
• You have implemented two different approaches and
would like to evaluate both to see which performs
better.
• You find that Approach 1 has an accuracy of 97% and
Approach 2 has an accuracy of 98% .
• Given that, would you choose Approach 2 over
Approach 1? Would accuracy alone be a sufficient
criterion in your selection?
Supervised Learning Approach
• Supervised: All data is labeled and the algorithms
learn to predict the output from the input data.
• Unsupervised: All data is unlabeled and the
algorithms learn the inherent structure of the
input data.
• Semi-supervised: Some data is labeled but most
of it is unlabeled and a mixture of supervised and
unsupervised techniques can be used.
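On the accuracy question raised in Q-3, a small worked example shows why accuracy alone can mislead when one class is rare. The confusion-matrix counts below are hypothetical, chosen only so that the two approaches reach 97% and 98% accuracy on an imagined test set of 1,000 X-rays with 20 cancer cases.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, recall and precision from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }

# Hypothetical counts: 1,000 X-rays, 20 of which show cancer.
approach_1 = metrics(tp=18, fp=28, fn=2, tn=952)  # finds 18 of the 20 cases
approach_2 = metrics(tp=0, fp=0, fn=20, tn=980)   # predicts "healthy" for everyone

for name, m in [("Approach 1", approach_1), ("Approach 2", approach_2)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

In this made-up scenario Approach 2 has the higher accuracy but a recall of zero, so it misses every cancer case. Recall, precision and the cost of each error type would need to be weighed alongside accuracy.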
Q-4
• In Q3, instead of getting accuracies of 97 per cent and 98 per cent, both
your approaches have a pretty low accuracy of 59 per cent and 65 per
cent.
• While you are debating whether to implement a totally different
approach, your friend suggests that you use an ensemble classifier
wherein you can combine the two classifiers you have already built
and generate a prediction.
• Would the combined ensemble classifier be better in performance
than either of the classifiers?
• If yes, explain why. If not, explain why it may perform poorly
compared to either of the two classifiers. When is an ensemble
classifier a good choice to try out when individual classifiers are not
able to reach the desired accuracy?
Ensembles of classifiers
• An ensemble is used to improve accuracy.
• To improve on the performance of individual
classifiers, an ensemble combines classifiers that
may be based on a variety of classification
methodologies and that may achieve different
rates of correctly classified individuals.
• The goal of classification result integration
algorithms is to generate more certain, precise
and accurate system results.
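Whether combining classifiers helps depends on how their errors relate. As a rough sketch, the function below computes the accuracy of a majority vote over n classifiers under the strong assumption that each is correct independently with the same probability p. The independence assumption and the function name are illustrative and will rarely hold exactly in practice.

```python
from math import comb

def majority_vote_accuracy(n, p):
    """P(majority is correct) for n independent classifiers of accuracy p (n odd)."""
    if n % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    k_min = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

for n in (1, 3, 5, 11):
    print(n, round(majority_vote_accuracy(n, 0.65), 3))
```

Under these assumptions, three independent classifiers at 65% accuracy reach about 71.8% by voting, and the gain grows with more voters. Two points follow for Q-4. With only two classifiers a vote has no tie-breaker, so some weighting or a third model is needed. And if the classifiers make the same mistakes, or are individually worse than chance, the ensemble gains nothing or gets worse.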
Q-5
• You have built a classification model for email
spam such that you are able to reach a
training error which is very close to zero.
• Your classification model was a deep neural
network with five hidden layers.
• However, you find that the validation error is
not small but around 25. What is the problem
with your model?
Deep Neural Network
Q-6
• Regularisation techniques are widely used to avoid
overfitting of the model parameters to training data so that
the constructed network can still generalise well for unseen
test data.
• Ridge regression and Lasso regression are two of the
popular regularisation techniques.
• When would you choose to use ridge regression and when
would you use Lasso?
• Specifically, you are given a data set, in which you know
that out of 100 features, only 10 to 15 of them are
significant for prediction, based on your domain
knowledge. In that case, would you choose ridge regression
over Lasso?
• Ridge regression shrinks coefficients towards zero
but cannot set them exactly to zero, so it keeps
either all of the features or none of them. LASSO
does both parameter shrinkage and variable
selection automatically, because its penalty can
zero out coefficients, including those of collinear
variables. It therefore helps to select a subset of
the n given variables, which suits a data set where
only 10 to 15 of the 100 features are significant.
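The difference can be seen in the simplest setting, an orthonormal design, where both estimators have closed forms in terms of the ordinary least-squares coefficient. The sketch assumes the objectives ½‖y − Xb‖² + (λ/2)‖b‖² for ridge and ½‖y − Xb‖² + λ‖b‖₁ for Lasso; the coefficient values and λ are made up.

```python
def ridge_shrink(b_ols, lam):
    """Ridge solution for one coefficient: proportional shrinkage, never exactly 0."""
    return b_ols / (1.0 + lam)

def lasso_shrink(b_ols, lam):
    """Lasso solution for one coefficient: soft-thresholding at lam."""
    magnitude = max(abs(b_ols) - lam, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if b_ols > 0 else -magnitude

# Hypothetical least-squares coefficients: two strong, three weak.
ols = [3.0, -2.0, 0.4, -0.1, 0.05]
lam = 0.5

ridge = [ridge_shrink(b, lam) for b in ols]
lasso = [lasso_shrink(b, lam) for b in ols]
print("ridge:", [round(b, 3) for b in ridge])
print("lasso:", lasso)  # [2.5, -1.5, 0.0, 0.0, 0.0]
```

Ridge keeps all five coefficients non-zero, while Lasso drops the three weak ones entirely. That is the behaviour wanted when only a small subset of features is expected to matter.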
Q-7
• You are given a computer system with 16 cores, 64GB
RAM, 512KB per core L1 cache (both instruction and
data caches are separate and each is of size 512KB),
2MB per core L2 cache, and 1GB L3 cache which is
shared among all cores.
• Now you are told to run multiple MySQL server
instances on this machine, so that you can use this
machine as a common backend system for multiple
Web applications.
• Can you characterize the different types of cache
misses that would be encountered by the MySQL
instances?
• L1, L2 and L3 caches are separate memory pools, similar to
the RAM in a computer. They were built in to decrease the
time the processor takes to access data, called latency.
• Level 1 (L1) Cache (2KB - 64KB): Instructions are first searched for in
this cache. The L1 cache is very small in comparison to the others,
which makes it faster than the rest.
• Level 2 (L2) Cache (256KB - 512KB): If the instructions are not
present in the L1 cache, the processor looks in the L2 cache, which is
a slightly larger pool of cache and so has higher latency.
• Level 3 (L3) Cache (1MB - 8MB): With each cache miss, the lookup
proceeds to the next cache level. L3 is the largest of
all the caches; even though it is slower, it is still faster than
RAM.
Cache Miss Types
• L1 cache is direct-mapped whereas L2 and L3
caches are four-way set associative.
• Compulsory/Cold- The first reference to a block
of memory, starting with an empty cache.
• Capacity - The cache is not big enough to hold
every block you want to use.
• Conflict - Two blocks are mapped to the same
location and there is not enough room to hold
both.
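The three miss types can be told apart in a small simulation. The sketch below models only a direct-mapped cache with one block per line and uses a common convention: a miss is compulsory on first reference, a conflict miss if a fully associative LRU cache of the same capacity would have hit, and a capacity miss otherwise. The function name and the traces are illustrative, and set-associative levels are not modelled.

```python
from collections import OrderedDict

def classify_misses(block_trace, num_lines):
    """Count hits and miss types for a direct-mapped cache of num_lines blocks."""
    direct = [None] * num_lines   # direct-mapped cache: line index -> block held
    lru = OrderedDict()           # fully associative LRU cache, same capacity
    seen = set()                  # every block referenced so far
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}
    for block in block_trace:
        line = block % num_lines
        hit_lru = block in lru
        if direct[line] == block:
            counts["hit"] += 1
        elif block not in seen:
            counts["compulsory"] += 1   # first reference ever
        elif hit_lru:
            counts["conflict"] += 1     # only the mapping caused the miss
        else:
            counts["capacity"] += 1     # even full associativity would miss
        # Update both caches and the reference history.
        direct[line] = block
        seen.add(block)
        if hit_lru:
            lru.move_to_end(block)
        else:
            lru[block] = True
            if len(lru) > num_lines:
                lru.popitem(last=False)  # evict the least recently used block
    return counts

# Blocks 0 and 4 map to the same line of a 4-line cache.
print(classify_misses([0, 4, 0, 4], num_lines=4))
# Five distinct blocks do not fit in four lines.
print(classify_misses([0, 1, 2, 3, 4, 0], num_lines=4))
```

The first trace gives two compulsory and two conflict misses; the second gives five compulsory misses and one capacity miss.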
Q-8
• Consider the setup in Q7, except that you are not given the cache
sizes of L1, L2 and L3 and have been asked to figure out the
approximate sizes of the L1, L2 and L3 caches.
• You are given the average cache access times (hit and
miss penalty for each access) for L1, L2 and L3 caches.
Other system parameters are the same.
• Can you write a small code snippet which can
determine the approximate sizes of L1, L2 and L3
caches? Now, is it possible to figure out the cache sizes
if you are not aware of the average cache access times?
Average memory access time
• Average Memory Access Time (AMAT) is a
common metric to analyze memory system
performance.
• AMAT uses hit time, miss penalty, and miss rate
to measure memory performance.
• It accounts for the fact that hits and misses affect
memory system performance differently.
• It focuses on how locality and cache misses affect
overall performance and allows for a quick
analysis of different cache design techniques.
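AMAT follows the standard formula AMAT = hit time + miss rate × miss penalty, where the miss penalty of one level is the AMAT of the level below it. The sketch below applies that recursively; the hit times, local miss rates and memory latency are hypothetical numbers in cycles, not measurements.

```python
def amat(levels, memory_time):
    """Average memory access time for a cache hierarchy.

    levels: list of (hit_time, local_miss_rate) pairs, L1 first.
    memory_time: access time of main memory, in the same unit as hit_time.
    """
    penalty = memory_time
    # Work upwards from the last cache level: each level's miss penalty
    # is the average access time of everything below it.
    for hit_time, miss_rate in reversed(levels):
        penalty = hit_time + miss_rate * penalty
    return penalty

# Hypothetical three-level hierarchy (times in cycles).
levels = [(1, 0.10), (10, 0.40), (40, 0.50)]
print(round(amat(levels, memory_time=200), 2))  # 7.6
```

With these numbers the steps are 40 + 0.5 × 200 = 140 for L3, 10 + 0.4 × 140 = 66 for L2, and 1 + 0.1 × 66 = 7.6 cycles overall.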
Q-9
• Consider that you have written two different versions of a
multi-threaded application.
• One version extensively uses linked lists and the second
version uses mainly arrays. You know that the memory
accesses performed by your application do not exhibit good
locality due to their inherent nature. You also know that your
memory accesses are around 75 per cent reads.
• Now you have the choice between two computer systems —
one with an L3 cache size of 24MB shared across all eight
cores, and another with an L3 cache size of 2MB independent
for each of the eight cores.
• Are there any specific reasons why you would choose System
1 rather than System 2 for each of your application versions?
Q-10
• Can you explain the stochastic gradient
descent algorithm?
• When would you prefer to use the coordinate
gradient descent algorithm instead of the
regular stochastic gradient algorithm?
• Here’s a related question. You are told that the
cost function you are trying to optimize is non-
convex. Would you still be able to use the co-
ordinate descent algorithm?
Stochastic Gradient Descent Algorithm
• Stochastic gradient descent (often shortened to
SGD), also known as incremental gradient
descent, is a stochastic approximation of the
gradient descent optimization method for
minimizing an objective function that is written
as a sum of differentiable functions. In other
words, SGD tries to find minima or maxima by
iteration.
• Gradient descent is a first-order iterative
optimization algorithm for finding the minimum
of a function.
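A minimal sketch of SGD, assuming a one-variable linear model fitted by squared error. The objective is a sum over examples, and each update uses the gradient of a single example's term rather than the full sum. The learning rate, epoch count, seed and noise-free toy data are assumptions chosen so the run converges.

```python
import random

def sgd_linear_regression(data, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w*x + b by SGD on 0.5*(w*x + b - y)**2, one example per update."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)              # visit examples in random order
        for x, y in data:
            err = (w * x + b) - y      # prediction error on this example
            w -= lr * err * x          # gradient of the example's loss w.r.t. w
            b -= lr * err              # gradient of the example's loss w.r.t. b
    return w, b

# Noise-free toy data generated from y = 2x + 1.
points = [(x, 2.0 * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
w, b = sgd_linear_regression(points)
print(round(w, 4), round(b, 4))  # 2.0 1.0
```

Batch gradient descent would instead sum the gradients over all examples before each update. SGD trades that exact gradient for many cheap, noisy steps, which is why it suits large data sets.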
Tips
• Be clear about the fundamental concepts
• Make sure that you have updated your GitHub
and SourceForge pages
• Contributing to open source and having
interesting, even if small, coding projects
online in your GitHub pages would definitely
be a big plus.

Interview Questions on Data Mining and Machine Learning

  • 1. Interview Question 10 Example of Q&A on Data Mining, Machine Learning Tips
  • 2. Q-1 • What is Memory-Driven Computing? • System: enterprise-class 16-core server with 2TB RAM • Design an IDS that analyzes network logs and identifies intrusion candidates, with a minimum of 256 systems to monitor • Find the state-of-the-art approach for intrusion detection. • Would you design the IDS as a scale-up or a scale-out application? • Would your choice change if the system were multiple standard workstations with 32GB RAM?
  • 3. • Compute nodes accessing a shared pool of Fabric-Attached Memory • An optimised Linux-based operating system (OS) running on a customised SoC (System on a Chip) • Photonics/optical communication links, including the new X1 photonics module, are online and operational • Fabric: the communication vehicle that transfers data between elements of a computer system (SATA, DDR, PCI)
  • 4. Comparison of Case I and Case II • Xeon Intel 2TB Enterprise Network Server (Case I): 8 x Intel 10-Core Xeon E7-8870 2.4GHz 30M Cache Processors; 2048GB (2TB) Memory; 3 x 300GB SAS 10K Hot-plug Drives; Price: $46,000 • HP Workstation Z240 Tower (Case II): 6th Generation Intel Core i7, 8MB Cache; Up to 64GB DDR4; Up to 1TB SSD; Price: $2,000
  • 5. • Scalability of System • Vertical Scaling: add more processors and RAM to one machine – Less power consumption – Less licensing cost • Horizontal Scaling: add more servers, each with fewer processors – Easy to upgrade
  • 6. Q-2 • Build a supervised binary classification system for detecting spam emails. • Basically, given an email, your system should be able to classify it as spam or non-spam. You are given a training data set containing 400,000 rows and 10,000 columns (features). The 10,000 features include both numerical variables and categorical variables. • You need to create a model using this training data set. Consider the two options given in Question 1 above, namely a server with 2TB RAM or a simple workstation with 32GB RAM. • For each of these two options, can you explain how your approach would change in training a model?
  • 7. Machine Learning • Machine learning methods: Bayesian classification, k-NN, ANNs, SVMs, artificial immune systems and rough sets • Naïve Bayes classifier: the method preferred here • Training: parse each email into its constituent tokens • Filtering: for each message, scan the tokens one by one and calculate the overall filtering indication for the message • If the message filtering indication is greater than the threshold, the message is marked as spam • The workstation is memory constrained, hence you will need to figure out ways to reduce your data set. Think about dimensionality reduction, sampling, etc.
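The training and filtering steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the slide author's implementation: the whitespace tokenizer, the add-one (Laplace) smoothing, the threshold of 0 on the log-odds, and the four toy messages are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Split a message into lowercase word tokens (a deliberately simple parser)."""
    return text.lower().split()


def train(messages):
    """Count tokens per class. `messages` is an iterable of (text, label) pairs,
    where label is True for spam and False for non-spam."""
    token_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for text, label in messages:
        class_counts[label] += 1
        token_counts[label].update(tokenize(text))
    return token_counts, class_counts


def spam_score(text, token_counts, class_counts):
    """Return the log-odds that `text` is spam, using add-one smoothing."""
    vocab = set(token_counts[True]) | set(token_counts[False])
    total = {c: sum(token_counts[c].values()) for c in (True, False)}
    score = math.log(class_counts[True] / class_counts[False])
    for tok in tokenize(text):
        p_spam = (token_counts[True][tok] + 1) / (total[True] + len(vocab))
        p_ham = (token_counts[False][tok] + 1) / (total[False] + len(vocab))
        score += math.log(p_spam / p_ham)
    return score


def is_spam(text, model, threshold=0.0):
    """Mark the message as spam when its score exceeds the threshold."""
    return spam_score(text, *model) > threshold


# Toy training data, invented purely for illustration.
TRAINING = [
    ("win cash prize now", True),
    ("cheap pills win big", True),
    ("meeting agenda for monday", False),
    ("lunch on monday", False),
]
model = train(TRAINING)
```

Because the model only stores token counts, training can stream over the data one message at a time, which is one reason this family of methods suits the memory-constrained workstation case.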
  • 8. Bonus Question • Now, assume that you are able to successfully create two different models, one for the enterprise server and another for the workstation. • Which of these two models would do better on a held-out test data set? • Is it always the case that the model you built for the enterprise-class server with huge RAM would have a much lower test error than the model you built for the workstation?
  • 9. Data Mining Model • The training set is the data we have as of now. We remove a subset from it, and the removed subset is called the held-out set. • Training (50%) • Validation (25%) • Testing (25%) • The model is built on the training set, the prediction errors are calculated using the validation set, and the test set is used to assess the generalization error of the final model.
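The 50/25/25 split described on this slide could look roughly like the sketch below. The function name `split_dataset`, the fixed seed, and the use of row indices as stand-ins for the 400,000 training rows of Q-2 are assumptions for illustration.

```python
import random


def split_dataset(rows, train_frac=0.50, val_frac=0.25, seed=0):
    """Shuffle the rows once, then cut them into train / validation / test parts.
    Whatever is left after the train and validation cuts becomes the test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]
    return train, val, test


# Row indices stand in for the actual emails.
train_set, val_set, test_set = split_dataset(range(400_000))
```

Shuffling before cutting matters when the rows arrive in some order (for example by date), since otherwise the three parts would not be drawn from the same distribution.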
  • 10. Q-3 • You were asked to create a binary classification system using the supervised learning approach, to analyse X-ray images when detecting lung cancer. • You have implemented two different approaches and would like to evaluate both to see which performs better. • You find that Approach 1 has an accuracy of 97% and Approach 2 has an accuracy of 98%. • Given that, would you choose Approach 2 over Approach 1? Would accuracy alone be a sufficient criterion in your selection?
  • 12. • Supervised: All data is labeled and the algorithms learn to predict the output from the input data. • Unsupervised: All data is unlabeled and the algorithms learn the inherent structure from the input data. • Semi-supervised: Some data is labeled but most of it is unlabeled, and a mixture of supervised and unsupervised techniques can be used.
  • 13. Q-4 • In Q-3, instead of getting accuracies of 98 per cent and 97 per cent, both your approaches have a pretty low accuracy of 59 per cent and 65 per cent. • While you are debating whether to implement a totally different approach, your friend suggests that you use an ensemble classifier wherein you can combine the two classifiers you have already built and generate a prediction. • Would the combined ensemble classifier be better in performance than either of the classifiers? • If yes, explain why. If not, explain why it may perform poorly compared to either of the two classifiers. When is an ensemble classifier a good choice to try out when individual classifiers are not able to reach the desired accuracy?
  • 14. Ensembles of classifiers • An ensemble is used to improve accuracy. • To improve on the performance of individual classifiers, the ensemble combines classifiers that can be based on a variety of classification methodologies and can achieve different rates of correctly classified individuals. • The goal of classification result integration algorithms is to generate more certain, precise and accurate system results.
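One way to see why accuracy alone may be misleading in Q-3 is to compute precision and recall alongside it. The confusion-matrix counts below are hypothetical, chosen only so that the two accuracies come out at 97% and 98% on 1,000 scans of which 20 show cancer; they are not results from the slides.

```python
def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical Approach 1: finds 18 of the 20 cancers, with 28 false alarms.
approach_1 = metrics(tp=18, fp=28, fn=2, tn=952)

# Hypothetical Approach 2: labels every scan "no cancer".
approach_2 = metrics(tp=0, fp=0, fn=20, tn=980)
```

With these made-up counts, Approach 2 has the higher accuracy yet misses every cancer case (recall of zero), which is why recall and precision are usually examined when the classes are imbalanced and a false negative is costly.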
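A small sketch of one integration scheme, soft voting, which averages the spam or cancer probabilities produced by each classifier. The probabilities and labels below are invented for illustration; they are chosen so that the two classifiers make their mistakes on different items.

```python
def soft_vote(prob_lists, threshold=0.5):
    """Average the positive-class probabilities of several classifiers,
    then threshold the average to get the ensemble prediction."""
    n = len(prob_lists)
    averaged = [sum(ps) / n for ps in zip(*prob_lists)]
    return [p >= threshold for p in averaged]


def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


labels = [True, True, False, False, True, False]
clf_a = [0.9, 0.4, 0.2, 0.6, 0.7, 0.1]  # wrong on items 1 and 3
clf_b = [0.6, 0.8, 0.3, 0.2, 0.4, 0.4]  # wrong on item 4

acc_a = accuracy([p >= 0.5 for p in clf_a], labels)
acc_b = accuracy([p >= 0.5 for p in clf_b], labels)
acc_ensemble = accuracy(soft_vote([clf_a, clf_b]), labels)
```

The ensemble beats both members here only because their errors do not overlap. If both classifiers were wrong on the same items, averaging would not help, which is the usual caveat behind Q-4.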
  • 15. Q-5 • You have built a classification model for email spam such that you are able to reach a training error which is very close to zero. • Your classification model was a deep neural network with five hidden layers. • However, you find that the validation error is not small but around 25. What is the problem with your model?
  • 17. Q-6 • Regularisation techniques are widely used to avoid overfitting of the model parameters to training data so that the constructed network can still generalise well for unseen test data. • Ridge regression and Lasso regression are two of the popular regularisation techniques. • When would you choose to use ridge regression and when would you use Lasso? • Specifically, you are given a data set, in which you know that out of 100 features, only 10 to 15 of them are significant for prediction, based on your domain knowledge. In that case, would you choose ridge regression over Lasso?
  • 18. • Ridge regression shrinks coefficients towards zero but cannot set them exactly to zero, so every variable stays in the model. • Lasso does both parameter shrinkage and variable selection automatically, because its L1 penalty can zero out the coefficients of irrelevant or collinear variables. • Lasso therefore helps to select the relevant variable(s) out of the given n variables.
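The difference can be made concrete in the special case of an orthonormal design matrix, where both estimators have a closed form in terms of the ordinary least-squares coefficient b. Assuming the objectives are written as ||y - Xβ||² + λ||β||² for ridge and ½||y - Xβ||² + λ||β||₁ for Lasso, ridge gives b / (1 + λ) and Lasso gives the soft-thresholded value sign(b) · max(|b| - λ, 0). The coefficient values in the sketch are made up.

```python
def ridge_shrink(b, lam):
    """Ridge estimate for one coefficient under an orthonormal design:
    scales b down but never reaches exactly zero unless b is zero."""
    return b / (1.0 + lam)


def lasso_shrink(b, lam):
    """Lasso estimate under an orthonormal design (soft-thresholding):
    coefficients smaller than lam in magnitude become exactly zero."""
    magnitude = max(abs(b) - lam, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if b > 0 else -magnitude


# Hypothetical least-squares coefficients: two strong, two weak.
ols = [3.0, -2.0, 0.4, -0.1]
ridge = [ridge_shrink(b, 0.5) for b in ols]
lasso = [lasso_shrink(b, 0.5) for b in ols]
```

Ridge keeps all four coefficients non-zero, while Lasso drops the two weak ones, which matches the variable-selection behaviour described above.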
  • 19. Q-7 • You are given a computer system with 16 cores, 64GB RAM, 512KB per core L1 cache (both instruction and data caches are separate and each is of size 512KB), 2MB per core L2 cache, and 1GB L3 cache which is shared among all cores. • Now you are told to run multiple MySQL server instances on this machine, so that you can use this machine as a common backend system for multiple Web applications. • Can you characterize the different types of cache misses that would be encountered by the MySQL instances?
  • 20. • L1, L2 and L3 caches are memory pools that sit between the processor and the RAM. They were built in to decrease the time the processor takes to access data, called latency. • Level 1 (L1) cache (2KB - 64KB): instructions are first searched for in this cache. L1 is very small in comparison to the others, which makes it faster than the rest. • Level 2 (L2) cache (256KB - 512KB): if the instructions are not present in the L1 cache, the processor looks in the L2 cache, which is a slightly larger pool and therefore has a higher latency. • Level 3 (L3) cache (1MB - 8MB): with each cache miss, the lookup proceeds to the next cache level. L3 is the largest of the caches; even though it is the slowest of them, it is still faster than the RAM.
  • 21. Cache Miss Types • L1 cache is direct-mapped whereas L2 and L3 caches are four-way set associative. • Compulsory/Cold- The first reference to a block of memory, starting with an empty cache. • Capacity - The cache is not big enough to hold every block you want to use. • Conflict - Two blocks are mapped to the same location and there is not enough room to hold both.
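Compulsory and conflict misses can be illustrated with a toy simulation of a direct-mapped cache. The 8-line cache, the 64-byte block size and the two access patterns are assumptions chosen to keep the example small; they do not model the machine in Q-7.

```python
def count_misses(addresses, num_lines, block_size=64):
    """Simulate a direct-mapped cache and return the number of misses.
    Each memory block can live in exactly one line: block number mod num_lines."""
    lines = [None] * num_lines  # which block each cache line currently holds
    misses = 0
    for addr in addresses:
        block = addr // block_size
        index = block % num_lines
        if lines[index] != block:
            misses += 1
            lines[index] = block  # evict whatever was there
    return misses


NUM_LINES = 8

# Addresses 0 and 512 are blocks 0 and 8, which both map to line 0.
conflicting = [0, 512] * 4

# Addresses 0 and 64 are blocks 0 and 1, which map to different lines.
friendly = [0, 64] * 4

conflict_misses = count_misses(conflicting, NUM_LINES)
friendly_misses = count_misses(friendly, NUM_LINES)
```

In the friendly pattern only the first touch of each block misses (compulsory misses). In the conflicting pattern the two blocks keep evicting each other, so every access misses even though the cache is almost empty; a set-associative cache with two or more ways would hold both blocks.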
  • 22. Q-8 • Take the system in Q-7, but now you are not given the cache sizes of L1, L2 and L3, and have been asked to figure out the approximate sizes of the L1, L2 and L3 caches. • You are given the average cache access times (hit and miss penalty for each access) for L1, L2 and L3 caches. Other system parameters are the same. • Can you write a small code snippet which can determine the approximate sizes of L1, L2 and L3 caches? Now, is it possible to figure out the cache sizes if you are not aware of the average cache access times?
  • 23. Average memory access time • Average Memory Access Time (AMAT) is a common metric to analyze memory system performance. • AMAT uses hit time, miss penalty, and miss rate to measure memory performance. • It accounts for the fact that hits and misses affect memory system performance differently. • It focuses on how locality and cache misses affect overall performance and allows for a quick analysis of different cache design techniques.
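For a single cache level, AMAT = hit time + miss rate × miss penalty. For several levels, the miss penalty of one level is the AMAT of the level below it. The sketch below applies that recursion; the cycle counts and miss rates are hypothetical and the miss rates are local ones (misses at a level divided by accesses reaching that level).

```python
def amat(levels, memory_time):
    """Average memory access time for a cache hierarchy.

    `levels` is a list of (hit_time, local_miss_rate) tuples ordered from
    L1 outwards; `memory_time` is the access time of main memory.
    Works from the last level back to L1, since each level's miss penalty
    is the average access time of the next level down."""
    penalty = memory_time
    for hit_time, miss_rate in reversed(levels):
        penalty = hit_time + miss_rate * penalty
    return penalty


# Hypothetical hierarchy, times in cycles.
hierarchy = [
    (1, 0.10),   # L1: 1-cycle hit, 10% of accesses miss
    (10, 0.50),  # L2: 10-cycle hit, 50% of L1 misses also miss here
    (40, 0.50),  # L3: 40-cycle hit, 50% of L2 misses also miss here
]
average_cycles = amat(hierarchy, memory_time=200)
```

Working through the numbers by hand: L3 contributes 40 + 0.5 × 200 = 140, L2 contributes 10 + 0.5 × 140 = 80, and L1 gives 1 + 0.1 × 80 = 9 cycles on average.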
  • 24. Q-9 • Consider that you have written two different versions of a multi-threaded application. • One version extensively uses linked lists and the second version uses mainly arrays. You know that the memory accesses performed by your application do not exhibit good locality due to their inherent nature. You also know that your memory accesses are around 75 per cent reads. • Now you have the choice between two computer systems — one with an L3 cache size of 24MB shared across all eight cores, and another with an L3 cache size of 2MB independent for each of the eight cores. • Are there any specific reasons why you would choose System 1 rather than System 2 for each of your application versions?
  • 25. Q-10 • Can you explain the stochastic gradient descent algorithm? • When would you prefer to use the coordinate gradient descent algorithm instead of the regular stochastic gradient algorithm? • Here’s a related question. You are told that the cost function you are trying to optimize is non-convex. Would you still be able to use the coordinate descent algorithm?
  • 26. Stochastic Gradient Descent Algorithm • Stochastic gradient descent (often shortened to SGD), also known as incremental gradient descent, is a stochastic approximation of the gradient descent optimization method for minimizing an objective function that is written as a sum of differentiable functions. In other words, SGD tries to find minima or maxima by iteration. • Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function.
  • 27. Tips • Be clear about the fundamental concepts • Make sure that you have updated your GitHub and SourceForge pages • Contributing to open source and having interesting, even if small, coding projects online in your GitHub pages would definitely be a big plus.
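As a minimal sketch of the idea, the code below fits a line y ≈ w·x + b by SGD on the squared error, updating the parameters after every single sample instead of after a full pass over the data. The learning rate, epoch count, seed and the noise-free toy data generated from y = 2x + 1 are assumptions chosen for illustration.

```python
import random


def sgd_linear_fit(points, lr=0.05, epochs=500, seed=0):
    """Fit y ~ w*x + b by stochastic gradient descent.

    The objective is the sum over samples of 0.5 * (w*x + b - y)**2, and each
    update uses the gradient of one sample's term only."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(points)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a new random order each epoch
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x  # d/dw of 0.5 * err**2
            b -= lr * err      # d/db of 0.5 * err**2
    return w, b


# Toy data lying exactly on y = 2x + 1.
points = [(x, 2.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]
w, b = sgd_linear_fit(points)
```

Each update touches one sample, so the memory cost does not grow with the data set, which ties back to the workstation-versus-server discussion in Q-2.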

Editor's Notes

  1. https://www.ebay.co.uk/b/Xeon-Intel-2TB-Enterprise-Network-Servers/11211/bn_25808434 https://www.neweggbusiness.com/product/product.aspx?item=9b-0v5-000u-00030
  2. http://www.indjst.org/index.php/indjst/article/viewFile/89264/68348
  3. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.190.1441&rep=rep1&type=pdf
  4. Supervised learning is where you have input variables (x) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output. The goal is to approximate the mapping function so well that when you have new input data (x) that you can predict the output variables (Y) for that data. It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process.
  5. Cross Validation