SlideShare une entreprise Scribd logo
1  sur  3
Télécharger pour lire hors ligne
1. Describe an algorithm which samples roughly n
elements from a stream,
each uniformly and independently at random. More precisely, in every
point in the stream, after processing N elements, each of the elements
a1, . . . , aN should appear in your sample with probability exactly n/N.
Note that you clearly do not know N in advance and that you cannot
reread the stream.
2. Prove the correctness of your algorithm.
3. What is the complexity of processing each element in your algorithm?
4. Bound the probability that your algorithm samples more than 2n elements.
Solution
1. Before describing the algorithm, lets provide some intuition about how
it works. Imagine we choose for each element ai a random number ri
uniformly and independently at random from the range [0, 1]. We can
keep all the elements for which ri ≤ n/N we clearly do the right thing.
The question is, how do we do that when we don’t know N in advance?
The idea is to assume that the current element is the last one in the stream
at every point.
More precisely, when receiving an element ai the algorithm generates a
random number ri uniformly and independently at random from the range
[0, 1]. If ri > n/i the element is ignored. Otherwise, it is kept. Also, in
every step, we go over all the elements aj such that 1 ≤ j ≤ i − 1 and
discard those for which rj > n/i. Note that we had to have kept all the
values rj as well.
2. Clearly, doing this gives the same intuitive solution as above. In other
words, when the stream ends (i = N) we have that our sample contains
1
DATA MINING
Our online Tutors are available 24*7 to provide Help with Data Mining Homework/Assignment or a long
term Graduate/Undergraduate Data Mining Project. Our Tutors being experienced and proficient in Data
Mining ensure to provide high quality Data Mining Homework Help. Upload your Data Mining
Assignment at ‘Submit Your Assignment’ button or email it to info@assignmentpedia.com. You can use
our ‘Live Chat’ option to schedule an Online Tutoring session with our Data Mining Tutors.
all the elements aj for which rj ≤ n/N. Since rj are uniformly and inde-
pendently chosen from [0, 1], each element is in the sample independently
with probability n/N.
3. A naive implementation of this algorithm requires O(n) operations per
read element. This is because, in every step we have to go over all the
sampled elements and discard the ones for which rj > n/i. However, if we
keep the picked elements in a sorted list, we can go to the end of the list
and remove only those. This requirs only O(1) operations per removed
element. Or, discarding di elements in step i requires O(di) operations.
This comes with the price of having to insert picked elements into the
sorted list in O(si) operations. Here si denoted the number of samples
that we kept in step i − 1. We denote by zi the indicator variable for
the event that element ai is picked. Let us compute the expected running
time of this algorithm T(N, n).
E[T(N, n)] =
N
i=1
E[Cost of step i
=
N
i=1
[E[O(si)zi] + O(di
= O(
N
i=1
n2
i
+ N
= O(n2
log(N) + N
We used here that si and zi are independent so E[O(si)zi] = O(E(si))E[zi].
Also,
n
i=1 di ≤ N and
n
i=1
1
i ≈ log(N). Amortizing over the length of
the stream we get that the expected cost per element is T(N, n)/N =
O(n2
log(n)
N +1) which is O(1) if the stream is long enough, i.e. n2
log(N)
N ≤
1.
Alternatively, we can keep all the picked elements in a sorted heap accord-
ing to the r values which results in O(log(n)) operations per insertion and
deletion (amortized)
4. Bounding the probability of sampling more than 2n elements can be done
using the Chernoff bound.
Pr[X > (1 + ε)µ] ≤ e−µε2
/4
We have that µ = E[sN ] = n and ε = 1. Thus:
Pr[sN > 2n] ≤ e−n/4
.
We can also make sure that in no point in the algorithm we have si > 2n
by the union bound. Making sure that in all N steps of the algorithm this
2
happens with probability at least 1 − δ.
n
i=1
Pr[si > 2n] ≤ Ne−n/4
≤ δ .
This is true if n > 4 log(N
δ ).
3
visit us at www.assignmentpedia.com or email us at info@assignmentpedia.com or call us at +1 520 8371215

Contenu connexe

Plus de Assignmentpedia

Sequential radar tracking
Sequential radar trackingSequential radar tracking
Sequential radar trackingAssignmentpedia
 
Radar application project help
Radar application project helpRadar application project help
Radar application project helpAssignmentpedia
 
Parallel computing homework help
Parallel computing homework helpParallel computing homework help
Parallel computing homework helpAssignmentpedia
 
Network costing analysis
Network costing analysisNetwork costing analysis
Network costing analysisAssignmentpedia
 
Matlab simulation project
Matlab simulation projectMatlab simulation project
Matlab simulation projectAssignmentpedia
 
Matlab programming project
Matlab programming projectMatlab programming project
Matlab programming projectAssignmentpedia
 
Image processing project using matlab
Image processing project using matlabImage processing project using matlab
Image processing project using matlabAssignmentpedia
 
Computer Networks Homework Help
Computer Networks Homework HelpComputer Networks Homework Help
Computer Networks Homework HelpAssignmentpedia
 
Theory of computation homework help
Theory of computation homework helpTheory of computation homework help
Theory of computation homework helpAssignmentpedia
 
Help With Digital Communication Project
Help With  Digital Communication ProjectHelp With  Digital Communication Project
Help With Digital Communication ProjectAssignmentpedia
 
Filter Implementation And Evaluation Project
Filter Implementation And Evaluation ProjectFilter Implementation And Evaluation Project
Filter Implementation And Evaluation ProjectAssignmentpedia
 
Distributed Radar Tracking Simulation Project
Distributed Radar Tracking Simulation ProjectDistributed Radar Tracking Simulation Project
Distributed Radar Tracking Simulation ProjectAssignmentpedia
 
Distributed Radar Tracking Simulation Project
Distributed Radar Tracking Simulation ProjectDistributed Radar Tracking Simulation Project
Distributed Radar Tracking Simulation ProjectAssignmentpedia
 
Digital Signal Processing
Digital Signal ProcessingDigital Signal Processing
Digital Signal ProcessingAssignmentpedia
 

Plus de Assignmentpedia (20)

Sequential radar tracking
Sequential radar trackingSequential radar tracking
Sequential radar tracking
 
Resolution project
Resolution projectResolution project
Resolution project
 
Radar application project help
Radar application project helpRadar application project help
Radar application project help
 
Parallel computing homework help
Parallel computing homework helpParallel computing homework help
Parallel computing homework help
 
Network costing analysis
Network costing analysisNetwork costing analysis
Network costing analysis
 
Matlab simulation project
Matlab simulation projectMatlab simulation project
Matlab simulation project
 
Matlab programming project
Matlab programming projectMatlab programming project
Matlab programming project
 
Links design
Links designLinks design
Links design
 
Image processing project using matlab
Image processing project using matlabImage processing project using matlab
Image processing project using matlab
 
Transmitter subsystem
Transmitter subsystemTransmitter subsystem
Transmitter subsystem
 
Computer Networks Homework Help
Computer Networks Homework HelpComputer Networks Homework Help
Computer Networks Homework Help
 
Theory of computation homework help
Theory of computation homework helpTheory of computation homework help
Theory of computation homework help
 
Radar Spectral Analysis
Radar Spectral AnalysisRadar Spectral Analysis
Radar Spectral Analysis
 
Pi Controller
Pi ControllerPi Controller
Pi Controller
 
Help With Digital Communication Project
Help With  Digital Communication ProjectHelp With  Digital Communication Project
Help With Digital Communication Project
 
Filter Implementation And Evaluation Project
Filter Implementation And Evaluation ProjectFilter Implementation And Evaluation Project
Filter Implementation And Evaluation Project
 
Distributed Radar Tracking Simulation Project
Distributed Radar Tracking Simulation ProjectDistributed Radar Tracking Simulation Project
Distributed Radar Tracking Simulation Project
 
Distributed Radar Tracking Simulation Project
Distributed Radar Tracking Simulation ProjectDistributed Radar Tracking Simulation Project
Distributed Radar Tracking Simulation Project
 
Digital Signal Processing
Digital Signal ProcessingDigital Signal Processing
Digital Signal Processing
 
Design Of 10 gbps
Design Of 10 gbpsDesign Of 10 gbps
Design Of 10 gbps
 

Dernier

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfOverkill Security
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 

Dernier (20)

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 

Data mining homework help

  • 1. 1. Describe an algorithm which samples roughly n elements from a stream, each uniformly and independently at random. More precisely, in every point in the stream, after processing N elements, each of the elements a1, . . . , aN should appear in your sample with probability exactly n/N. Note that you clearly do not know N in advance and that you cannot reread the stream. 2. Prove the correctness of your algorithm. 3. What is the complexity of processing each element in your algorithm? 4. Bound the probability that your algorithm samples more than 2n elements. Solution 1. Before describing the algorithm, lets provide some intuition about how it works. Imagine we choose for each element ai a random number ri uniformly and independently at random from the range [0, 1]. We can keep all the elements for which ri ≤ n/N we clearly do the right thing. The question is, how do we do that when we don’t know N in advance? The idea is to assume that the current element is the last one in the stream at every point. More precisely, when receiving an element ai the algorithm generates a random number ri uniformly and independently at random from the range [0, 1]. If ri > n/i the element is ignored. Otherwise, it is kept. Also, in every step, we go over all the elements aj such that 1 ≤ j ≤ i − 1 and discard those for which rj > n/i. Note that we had to have kept all the values rj as well. 2. Clearly, doing this gives the same intuitive solution as above. In other words, when the stream ends (i = N) we have that our sample contains 1 DATA MINING Our online Tutors are available 24*7 to provide Help with Data Mining Homework/Assignment or a long term Graduate/Undergraduate Data Mining Project. Our Tutors being experienced and proficient in Data Mining ensure to provide high quality Data Mining Homework Help. Upload your Data Mining Assignment at ‘Submit Your Assignment’ button or email it to info@assignmentpedia.com. You can use our ‘Live Chat’ option to schedule an Online Tutoring session with our Data Mining Tutors.
  • 2. all the elements aj for which rj ≤ n/N. Since rj are uniformly and inde- pendently chosen from [0, 1], each element is in the sample independently with probability n/N. 3. A naive implementation of this algorithm requires O(n) operations per read element. This is because, in every step we have to go over all the sampled elements and discard the ones for which rj > n/i. However, if we keep the picked elements in a sorted list, we can go to the end of the list and remove only those. This requirs only O(1) operations per removed element. Or, discarding di elements in step i requires O(di) operations. This comes with the price of having to insert picked elements into the sorted list in O(si) operations. Here si denoted the number of samples that we kept in step i − 1. We denote by zi the indicator variable for the event that element ai is picked. Let us compute the expected running time of this algorithm T(N, n). E[T(N, n)] = N i=1 E[Cost of step i = N i=1 [E[O(si)zi] + O(di = O( N i=1 n2 i + N = O(n2 log(N) + N We used here that si and zi are independent so E[O(si)zi] = O(E(si))E[zi]. Also, n i=1 di ≤ N and n i=1 1 i ≈ log(N). Amortizing over the length of the stream we get that the expected cost per element is T(N, n)/N = O(n2 log(n) N +1) which is O(1) if the stream is long enough, i.e. n2 log(N) N ≤ 1. Alternatively, we can keep all the picked elements in a sorted heap accord- ing to the r values which results in O(log(n)) operations per insertion and deletion (amortized) 4. Bounding the probability of sampling more than 2n elements can be done using the Chernoff bound. Pr[X > (1 + ε)µ] ≤ e−µε2 /4 We have that µ = E[sN ] = n and ε = 1. Thus: Pr[sN > 2n] ≤ e−n/4 . We can also make sure that in no point in the algorithm we have si > 2n by the union bound. Making sure that in all N steps of the algorithm this 2
  • 3. happens with probability at least 1 − δ. n i=1 Pr[si > 2n] ≤ Ne−n/4 ≤ δ . This is true if n > 4 log(N δ ). 3 visit us at www.assignmentpedia.com or email us at info@assignmentpedia.com or call us at +1 520 8371215