SlideShare une entreprise Scribd logo
1  sur  5
Télécharger pour lire hors ligne
6.897: CSCI8980 Algorithmic Techniques for Big Data September 5, 2013
Lecture 1
Dr. Barna Saha Scribe: Vivek Mishra
Overview
We introduce data streaming model where streams of elements are coming in and main memory
space is not sufficient to hold all the data. We then look at the problem of finding frequent
items deterministically. This is a rare instance of data streaming algorithms that provides non-
trivial approximation guarantee deterministically. For most algorithms as we will see the answer
is approximate (close to optimal) and randomization is crucially used. To emphasize on the need
of randomization in designing data streaming algorithms, we next show that computing distinct
items (known as F0 computation) over a stream deterministically cannot achieve any approximation
without essentially storing the entire stream. Our next goal is to analyze the algorithm for distinct
items that have been covered in Lecture 1. Towards that goal, we study some basic concentration
inequality such as, Markov inequality, Chebyshev bound and The Chernoff bound.
1 Introduction to Data Streams
In a data streaming model sequence of elements a1 a2 a3 am arrive from a domain [1, n]. Each
element ai is a tuple (j, ν) where j ∈ [1, n] is an element from the domain and ν ∈ I. For simplicity,
we can consider ν = ±1 where +1 implies j is inserted in the stream and −1 implies j is deleted.
The goal is to process these elements using space (ideally) polylog of m and n, but definitely in
sub-linear in m and n. In the basic streaming setting, only a single pass over the data is allowed.
The time to process each update must be low.
When only insertions are allowed (all ν > 0), the model is known as cash-register model. On
the other hand, if both insertions and deletions are allowed, it is called turnstile model. For any
element, we generally do not allow the number of deletions (total negative frequency) to be more
than total number of insertions of that elements. However, if we do allow such scenarios, we will
refer it as general turnstile model.
2 Finding Frequent Items Deterministically
Here we describe an algorithm by Mishra and Gries [1] to find frequent items in the cash-register
model that is when only insertions are allowed. The precise problem is as follows.
Given a sequence of m elements from [1, n] and any k ∈ N, find all the elements with frequency
more than m
k where frequency simply implies the number of times an element occurs using space
O(k) (in words) and using only a single pass. Hence in bits, the total space usage is O(k log n).
We would like to have the following guarantees.
1
• No false negative. All items that have frequency > m
k will be reported.
There might be false positives, that is elements with frequency lower than m/k may be reported.
However, if we allow two passes, in the second pass all the elements reported in the first pass can
be checked for actual frequency and any false positive can be eliminated.
Description of the Algorithm
• Data structure: An associative array of k − 1 elements initialized to empty. An associative
array contains in each of its cell a key (the item) and a value (its count) and may be maintained
in a balanced binary search tree by its key. Whenever we see an element ai = (j, 1) we can
search in the array if j exists or not in O(log k) time.
• Procedure: Given arrival of ai = (j, 1), we search if j exists in the associative array. If the
element is already there, we increment its count. If it is not there and there is an empty cell,
we store it and make its count 1. If no space is left and the element is not there, we do not
store the element and decrement the counters of all the stored elements by 1. If a counter
reaches 0, that element is dropped from the associative array and the cell becomes empty. In
the following code, we use A to represent the associative array, and give the update process
when ai is seen in the stream.
Algorithm 1 [Process ai = (j, 1)]
Search for j ∈ A
if j is already present as key at cell A[l] then
Count[l] = Count[l] + 1 {i}ncrement its count
else if j is not present in A and ∃l such that A[l] is empty then
Insert j as key to A[l]
Count[l] = 1
else
Drop j {comment: j is not present and there is no free space}
for i = 1 to k − 1 do
Count[i] = Count[i] − 1
if Count[i] == 0 then
Drop the key associated with A[i] and mark A[i] empty
end if
end for
end if
The entire algorithm for a given k ∈ N is as follows
2.1 Analysis
We now prove the following theorem.
Theorem 1. For any given k ∈ N Algorithm 2 returns all items with frequency more than m
k .
2
Algorithm 2 Mishra-Gries Algorithm
Initialize A of size k − 1 to empty
for i = 1 to m do
Process(ai)
end for
return All the elements in A
Proof. Clearly, the number of items having frequency more than m
k is at most k −1 because stream
size is m. Let fj denote the actual frequency of item j and let ˆfj be the frequency of the item j as
observed in A at the end of the processing. If j is not in A, we assume ˆfj = 0. We therefore return
all elements with ˆfj > 0.
Note that we increment frequency for j only if we see the actual item. Hence ˆfj ≤ fj. We want to
find out how low it can be. If we never drop the element then ˆfj = fj. Otherwise, either at some
point when j occurs, array A is full and j is not already there. Or because it is dropped from array
due to its count being decremented to 0.
Note that whenever an item is dropped (due to no space or count getting decremented to 0), there
are other distinct k − 1 elements whose count also gets decremented. Here we view the count of an
element on arrival is increased to 1, but it is dropped to 0 if it cannot be stored. Therefore, there
can be at most m
k steps on which element counts are decremented, k distinct elements at one shot.
The reasoning being the stream size is m and all frequencies are non-negative.
For each of these events, the difference between the computed frequency and the actual frequency
can increase by 1 and hence altogether we have ˆfj ≥ fj − m
k for all j ∈ [1, n].
Therefore, for all j ∈ [1, n] if fj > m
k then ˆfj > 0 and hence it is stored in the array at the end and
will be reported.
In most data streaming algorithms, one cannot achieve any non-trivial approximation determinis-
tically. In the following section, we come back to counting distinct items and show no deterministic
algorithm is possible that in o(m) space can give an exact count for distinct elements.
3 Lower Bound for Deterministic Computation of Distinct Ele-
ments
We prove the following theorem here.
Theorem 2. There exists no deterministic algorithm that returns the exact count for distinct items
in stream of size m in o(m) bits.
Proof. We will prove this by the by contradiction. Suppose it is possible to have an exact estimate
of distinct elements using space o(m) bits. Let R be such an algorithm. Since R uses o(m) bits,
the number of different possible configurations that R can maintain is at-most 2o(m). We let n = m
and consider all the different streams which have exactly m
2 distinct elements. How many such
3
streams are possible ? Clearly, the number of such streams is at least
m
m/2
≈
me
m/2
m
2
= (2e)
me
2 = 2Θ(m)
where the second inequality comes from Stirling’s approximation.
Since the number of such streams is more than the number of available configurations of R, there
must exist two streams y, y , y = y such that both have the same configuration, that is, R(y) =
R(y ) and the distinct items in y are not identical to distinct items of y . We now consider two
streams Y1 = y + y and Y2 = y + y where + represents concatenation here, that is stream Y1
y is followed by y and in stream Y2, y is followed by y. Since R(y) = R(y ), we must have
R(y + y) = R(y + y ). Therefore R will return same distinct elements counts for both Y1 and Y2
which is wrong because the number of distinct elements in Y1 is m
2 where for Y2 it is > m
2 .
The above proof can be extended to show that there does not exist any deterministic algorithm
with space o(m) that returns a count of distinct items within a multiplicative factor less than 2.
Similarly, one can show that no exact randomized algorithm can exist either in o(m) space.
Surprisingly, when we allow both approximation and randomization, the space usage can be dras-
tically reduced (next lecture).
4 Basic Concentration Inequalities
Here we study three basic concentration inequalities which bound deviation from expectation.
1. Markov inequality ( The 1st moment inequality)
2. Chebyshev inequality( The 2nd moment inequality)
3. The Chernoff Bound
Theorem 3 (Markov Bound). For any positive random variable X, and for any t > 0
Pr (X ≥ t) ≤
E[x]
t
(1)
Proof.
E[x] =
x
x · Pr(X = x)
=
x<t
x · Pr(X = x) +
x≥t
x · Pr(X = x)
≥ 0 + t ·
x≥t
Pr(X = x)
= t · P(X ≥ t)
4
Theorem 4 (Chebyshev Inequality). For any random variable X and for any t > 0
Pr(|X − E[x]| ≥ t) ≤
V ar(x)
t2
(2)
Proof.
Pr(|X − E[x]| ≥ t)
= Pr([X − E[x]]2
≥ t2
)
≤
E (X − E[x])2
t2
=
V ar(X)
t2
Theorem 5 (The Chernoff Bound). Let X1, X2...Xn be n independent Bernoulli random variables
with Pr(Xi = 1) = pi. Let X = Xi. Hence,
E[X] = E Xi = E [Xi] = Pr(Xi = 1) = pi = µ(say).
Then the Chernoff Bound says for any > 0
Pr(X > (1 + )µ) ≤
e
(1 + )
µ
and
Pr(X < (1 − )µ) ≤
e−
(1 − )1−
µ
When 0 < < 1 the above expression can be further simplified to
Pr(X > (1 + )µ) ≤ e
−µ 2
3 and
Pr(X < (1 − )µ) ≤ e
−µ 2
2
Hence
Pr(|X − µ| > µ) ≤ 2e
−µ 2
3
References
[1] Jayadev Misra and David Gries. Finding repeated elements. Sci. Comput. Program.,
2(2):143152, 1982.
5

Contenu connexe

Tendances (20)

Application of calculus in everyday life
Application of calculus in everyday lifeApplication of calculus in everyday life
Application of calculus in everyday life
 
Cis435 week02
Cis435 week02Cis435 week02
Cis435 week02
 
Indexing Text with Approximate q-grams
Indexing Text with Approximate q-gramsIndexing Text with Approximate q-grams
Indexing Text with Approximate q-grams
 
MMath Paper, Canlin Zhang
MMath Paper, Canlin ZhangMMath Paper, Canlin Zhang
MMath Paper, Canlin Zhang
 
MASTER_THESIS-libre
MASTER_THESIS-libreMASTER_THESIS-libre
MASTER_THESIS-libre
 
Ms nikita greedy agorithm
Ms nikita greedy agorithmMs nikita greedy agorithm
Ms nikita greedy agorithm
 
Ph 101-9 QUANTUM MACHANICS
Ph 101-9 QUANTUM MACHANICSPh 101-9 QUANTUM MACHANICS
Ph 101-9 QUANTUM MACHANICS
 
Diffusion Homework Help
Diffusion Homework HelpDiffusion Homework Help
Diffusion Homework Help
 
Lec22
Lec22Lec22
Lec22
 
Linear sorting
Linear sortingLinear sorting
Linear sorting
 
Physics Assignment Help
Physics Assignment HelpPhysics Assignment Help
Physics Assignment Help
 
Numerical Methods
Numerical MethodsNumerical Methods
Numerical Methods
 
Lec10
Lec10Lec10
Lec10
 
220exercises2
220exercises2220exercises2
220exercises2
 
Firefly exact MCMC for Big Data
Firefly exact MCMC for Big DataFirefly exact MCMC for Big Data
Firefly exact MCMC for Big Data
 
algorithm unit 1
algorithm unit 1algorithm unit 1
algorithm unit 1
 
Lec23
Lec23Lec23
Lec23
 
Sortsearch
SortsearchSortsearch
Sortsearch
 
Application of Schrodinger Equation to particle in one Dimensional Box: Energ...
Application of Schrodinger Equation to particle in one Dimensional Box: Energ...Application of Schrodinger Equation to particle in one Dimensional Box: Energ...
Application of Schrodinger Equation to particle in one Dimensional Box: Energ...
 
Differential equations final -mams
Differential equations final -mamsDifferential equations final -mams
Differential equations final -mams
 

En vedette

бойко. 21ноября круглый стол
бойко. 21ноября круглый столбойко. 21ноября круглый стол
бойко. 21ноября круглый столAtner Yegorov
 
социокультурный портрет Чувашской Республики 2013
социокультурный портрет  Чувашской Республики 2013социокультурный портрет  Чувашской Республики 2013
социокультурный портрет Чувашской Республики 2013Atner Yegorov
 
о реализации РЦП "БДД Чувашии 2006-2012"
о реализации РЦП "БДД Чувашии 2006-2012"о реализации РЦП "БДД Чувашии 2006-2012"
о реализации РЦП "БДД Чувашии 2006-2012"Atner Yegorov
 
Algorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysisAlgorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysisAtner Yegorov
 
кку тракторные заводы
кку тракторные заводыкку тракторные заводы
кку тракторные заводыAtner Yegorov
 
презентация инвест предложения_чр_январь_2012_табаков
презентация инвест предложения_чр_январь_2012_табаковпрезентация инвест предложения_чр_январь_2012_табаков
презентация инвест предложения_чр_январь_2012_табаковAtner Yegorov
 
Efimov goroda upravlenie_razvitiem_gorodov_2010
Efimov goroda upravlenie_razvitiem_gorodov_2010Efimov goroda upravlenie_razvitiem_gorodov_2010
Efimov goroda upravlenie_razvitiem_gorodov_2010Atner Yegorov
 
Prezentazia kostroma
Prezentazia kostromaPrezentazia kostroma
Prezentazia kostromaAtner Yegorov
 
Prezentaciya rab glave_17_fevralya
Prezentaciya rab glave_17_fevralyaPrezentaciya rab glave_17_fevralya
Prezentaciya rab glave_17_fevralyaAtner Yegorov
 
птицефабрика
птицефабрикаптицефабрика
птицефабрикаAtner Yegorov
 
презентация топинамбур
презентация топинамбурпрезентация топинамбур
презентация топинамбурAtner Yegorov
 

En vedette (19)

чебоксары
чебоксарычебоксары
чебоксары
 
бойко. 21ноября круглый стол
бойко. 21ноября круглый столбойко. 21ноября круглый стол
бойко. 21ноября круглый стол
 
социокультурный портрет Чувашской Республики 2013
социокультурный портрет  Чувашской Республики 2013социокультурный портрет  Чувашской Республики 2013
социокультурный портрет Чувашской Республики 2013
 
Lecture5
Lecture5Lecture5
Lecture5
 
о реализации РЦП "БДД Чувашии 2006-2012"
о реализации РЦП "БДД Чувашии 2006-2012"о реализации РЦП "БДД Чувашии 2006-2012"
о реализации РЦП "БДД Чувашии 2006-2012"
 
Lecture4
Lecture4Lecture4
Lecture4
 
Algorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysisAlgorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysis
 
кку тракторные заводы
кку тракторные заводыкку тракторные заводы
кку тракторные заводы
 
презентация инвест предложения_чр_январь_2012_табаков
презентация инвест предложения_чр_январь_2012_табаковпрезентация инвест предложения_чр_январь_2012_табаков
презентация инвест предложения_чр_январь_2012_табаков
 
ВСМ-2
ВСМ-2ВСМ-2
ВСМ-2
 
Prezentazia ples
Prezentazia plesPrezentazia ples
Prezentazia ples
 
зао чэаз
зао чэаззао чэаз
зао чэаз
 
Efimov goroda upravlenie_razvitiem_gorodov_2010
Efimov goroda upravlenie_razvitiem_gorodov_2010Efimov goroda upravlenie_razvitiem_gorodov_2010
Efimov goroda upravlenie_razvitiem_gorodov_2010
 
It 2012 3
It 2012 3It 2012 3
It 2012 3
 
Prezentazia kostroma
Prezentazia kostromaPrezentazia kostroma
Prezentazia kostroma
 
Prezentaciya rab glave_17_fevralya
Prezentaciya rab glave_17_fevralyaPrezentaciya rab glave_17_fevralya
Prezentaciya rab glave_17_fevralya
 
птицефабрика
птицефабрикаптицефабрика
птицефабрика
 
It gl-gor2013
It gl-gor2013It gl-gor2013
It gl-gor2013
 
презентация топинамбур
презентация топинамбурпрезентация топинамбур
презентация топинамбур
 

Similaire à Lec12

In three of the exercises below there is given a code of a method na.pdf
In three of the exercises below there is given a code of a method na.pdfIn three of the exercises below there is given a code of a method na.pdf
In three of the exercises below there is given a code of a method na.pdffeelinggift
 
Lecture 1 maximum likelihood
Lecture 1 maximum likelihoodLecture 1 maximum likelihood
Lecture 1 maximum likelihoodAnant Dashpute
 
Design and analysis of ra sort
Design and analysis of ra sortDesign and analysis of ra sort
Design and analysis of ra sortijfcstjournal
 
MC0082 –Theory of Computer Science
MC0082 –Theory of Computer ScienceMC0082 –Theory of Computer Science
MC0082 –Theory of Computer ScienceAravind NC
 
Quantum algorithm for solving linear systems of equations
 Quantum algorithm for solving linear systems of equations Quantum algorithm for solving linear systems of equations
Quantum algorithm for solving linear systems of equationsXequeMateShannon
 
Chp-1 Quick Review of basic concepts.pdf
Chp-1 Quick Review of basic concepts.pdfChp-1 Quick Review of basic concepts.pdf
Chp-1 Quick Review of basic concepts.pdfSolomonMolla4
 
Skiena algorithm 2007 lecture08 quicksort
Skiena algorithm 2007 lecture08 quicksortSkiena algorithm 2007 lecture08 quicksort
Skiena algorithm 2007 lecture08 quicksortzukun
 
Beginning direct3d gameprogrammingmath02_logarithm_20160324_jintaeks
Beginning direct3d gameprogrammingmath02_logarithm_20160324_jintaeksBeginning direct3d gameprogrammingmath02_logarithm_20160324_jintaeks
Beginning direct3d gameprogrammingmath02_logarithm_20160324_jintaeksJinTaek Seo
 
不等概率随机选取
不等概率随机选取不等概率随机选取
不等概率随机选取claywong
 

Similaire à Lec12 (20)

Stochastic Processes Assignment Help
Stochastic Processes Assignment HelpStochastic Processes Assignment Help
Stochastic Processes Assignment Help
 
Ada notes
Ada notesAda notes
Ada notes
 
In three of the exercises below there is given a code of a method na.pdf
In three of the exercises below there is given a code of a method na.pdfIn three of the exercises below there is given a code of a method na.pdf
In three of the exercises below there is given a code of a method na.pdf
 
Lecture 1 maximum likelihood
Lecture 1 maximum likelihoodLecture 1 maximum likelihood
Lecture 1 maximum likelihood
 
Design and analysis of ra sort
Design and analysis of ra sortDesign and analysis of ra sort
Design and analysis of ra sort
 
Statistics Assignment Help
Statistics Assignment HelpStatistics Assignment Help
Statistics Assignment Help
 
MC0082 –Theory of Computer Science
MC0082 –Theory of Computer ScienceMC0082 –Theory of Computer Science
MC0082 –Theory of Computer Science
 
Quantum Noise and Error Correction
Quantum Noise and Error CorrectionQuantum Noise and Error Correction
Quantum Noise and Error Correction
 
Divide and Conquer
Divide and ConquerDivide and Conquer
Divide and Conquer
 
Quantum algorithm for solving linear systems of equations
 Quantum algorithm for solving linear systems of equations Quantum algorithm for solving linear systems of equations
Quantum algorithm for solving linear systems of equations
 
Analysis of algorithms
Analysis of algorithmsAnalysis of algorithms
Analysis of algorithms
 
Chp-1 Quick Review of basic concepts.pdf
Chp-1 Quick Review of basic concepts.pdfChp-1 Quick Review of basic concepts.pdf
Chp-1 Quick Review of basic concepts.pdf
 
Networking Assignment Help
Networking Assignment HelpNetworking Assignment Help
Networking Assignment Help
 
Big o
Big oBig o
Big o
 
Skiena algorithm 2007 lecture08 quicksort
Skiena algorithm 2007 lecture08 quicksortSkiena algorithm 2007 lecture08 quicksort
Skiena algorithm 2007 lecture08 quicksort
 
Beginning direct3d gameprogrammingmath02_logarithm_20160324_jintaeks
Beginning direct3d gameprogrammingmath02_logarithm_20160324_jintaeksBeginning direct3d gameprogrammingmath02_logarithm_20160324_jintaeks
Beginning direct3d gameprogrammingmath02_logarithm_20160324_jintaeks
 
Convergence
ConvergenceConvergence
Convergence
 
不等概率随机选取
不等概率随机选取不等概率随机选取
不等概率随机选取
 
L1803016468
L1803016468L1803016468
L1803016468
 
Slide2
Slide2Slide2
Slide2
 

Plus de Atner Yegorov

Здравоохранение Чувашии 2015
Здравоохранение Чувашии 2015Здравоохранение Чувашии 2015
Здравоохранение Чувашии 2015Atner Yegorov
 
Чувашия. Итоги за 5 лет. 2015
Чувашия. Итоги за 5 лет. 2015Чувашия. Итоги за 5 лет. 2015
Чувашия. Итоги за 5 лет. 2015Atner Yegorov
 
Инвестиционная политика Чувашской Республики 2015
Инвестиционная политика Чувашской Республики 2015Инвестиционная политика Чувашской Республики 2015
Инвестиционная политика Чувашской Республики 2015Atner Yegorov
 
Господдержка малого и среднего бизнеса в Чувашии 2015
Господдержка малого и среднего бизнеса в Чувашии 2015Господдержка малого и среднего бизнеса в Чувашии 2015
Господдержка малого и среднего бизнеса в Чувашии 2015Atner Yegorov
 
ИННОВАЦИИ В ПРОМЫШЛЕННОСТИ ЧУВАШИИ. 2015
ИННОВАЦИИ В ПРОМЫШЛЕННОСТИ ЧУВАШИИ. 2015ИННОВАЦИИ В ПРОМЫШЛЕННОСТИ ЧУВАШИИ. 2015
ИННОВАЦИИ В ПРОМЫШЛЕННОСТИ ЧУВАШИИ. 2015Atner Yegorov
 
Брендинг Чувашии
Брендинг ЧувашииБрендинг Чувашии
Брендинг ЧувашииAtner Yegorov
 
VIII Экономический форум Чебоксары
VIII Экономический форум ЧебоксарыVIII Экономический форум Чебоксары
VIII Экономический форум ЧебоксарыAtner Yegorov
 
Рейтин инвестклимата в регионах 2015
Рейтин инвестклимата в регионах 2015Рейтин инвестклимата в регионах 2015
Рейтин инвестклимата в регионах 2015Atner Yegorov
 
Аттракционы тематического парка, Татарстан
Аттракционы тематического парка, ТатарстанАттракционы тематического парка, Татарстан
Аттракционы тематического парка, ТатарстанAtner Yegorov
 
Тематический парк, Татарстан
Тематический парк, ТатарстанТематический парк, Татарстан
Тематический парк, ТатарстанAtner Yegorov
 
Будущее журналистики
Будущее журналистикиБудущее журналистики
Будущее журналистикиAtner Yegorov
 
Будущее нишевых СМИ
Будущее нишевых СМИБудущее нишевых СМИ
Будущее нишевых СМИAtner Yegorov
 
Будущее онлайн СМИ в регионах
Будущее онлайн СМИ в регионах Будущее онлайн СМИ в регионах
Будущее онлайн СМИ в регионах Atner Yegorov
 
Война форматов. Как люди читают новости
Война форматов. Как люди читают новости Война форматов. Как люди читают новости
Война форматов. Как люди читают новости Atner Yegorov
 
Национальная технологическая инициатива
Национальная технологическая инициативаНациональная технологическая инициатива
Национальная технологическая инициативаAtner Yegorov
 
О РАЗРАБОТКЕ И РЕАЛИЗАЦИИ НАЦИОНАЛЬНОЙ ТЕХНОЛОГИЧЕСКОЙ ИНИЦИАТИВЫ
О РАЗРАБОТКЕ И РЕАЛИЗАЦИИ НАЦИОНАЛЬНОЙ  ТЕХНОЛОГИЧЕСКОЙ ИНИЦИАТИВЫ О РАЗРАБОТКЕ И РЕАЛИЗАЦИИ НАЦИОНАЛЬНОЙ  ТЕХНОЛОГИЧЕСКОЙ ИНИЦИАТИВЫ
О РАЗРАБОТКЕ И РЕАЛИЗАЦИИ НАЦИОНАЛЬНОЙ ТЕХНОЛОГИЧЕСКОЙ ИНИЦИАТИВЫ Atner Yegorov
 
Участники заседания по модернизации
Участники заседания по модернизацииУчастники заседания по модернизации
Участники заседания по модернизацииAtner Yegorov
 
Город Innopolis
Город InnopolisГород Innopolis
Город InnopolisAtner Yegorov
 
презентация1
презентация1презентация1
презентация1Atner Yegorov
 

Plus de Atner Yegorov (20)

Здравоохранение Чувашии 2015
Здравоохранение Чувашии 2015Здравоохранение Чувашии 2015
Здравоохранение Чувашии 2015
 
Чувашия. Итоги за 5 лет. 2015
Чувашия. Итоги за 5 лет. 2015Чувашия. Итоги за 5 лет. 2015
Чувашия. Итоги за 5 лет. 2015
 
Инвестиционная политика Чувашской Республики 2015
Инвестиционная политика Чувашской Республики 2015Инвестиционная политика Чувашской Республики 2015
Инвестиционная политика Чувашской Республики 2015
 
Господдержка малого и среднего бизнеса в Чувашии 2015
Господдержка малого и среднего бизнеса в Чувашии 2015Господдержка малого и среднего бизнеса в Чувашии 2015
Господдержка малого и среднего бизнеса в Чувашии 2015
 
ИННОВАЦИИ В ПРОМЫШЛЕННОСТИ ЧУВАШИИ. 2015
ИННОВАЦИИ В ПРОМЫШЛЕННОСТИ ЧУВАШИИ. 2015ИННОВАЦИИ В ПРОМЫШЛЕННОСТИ ЧУВАШИИ. 2015
ИННОВАЦИИ В ПРОМЫШЛЕННОСТИ ЧУВАШИИ. 2015
 
Брендинг Чувашии
Брендинг ЧувашииБрендинг Чувашии
Брендинг Чувашии
 
VIII Экономический форум Чебоксары
VIII Экономический форум ЧебоксарыVIII Экономический форум Чебоксары
VIII Экономический форум Чебоксары
 
Рейтин инвестклимата в регионах 2015
Рейтин инвестклимата в регионах 2015Рейтин инвестклимата в регионах 2015
Рейтин инвестклимата в регионах 2015
 
Аттракционы тематического парка, Татарстан
Аттракционы тематического парка, ТатарстанАттракционы тематического парка, Татарстан
Аттракционы тематического парка, Татарстан
 
Тематический парк, Татарстан
Тематический парк, ТатарстанТематический парк, Татарстан
Тематический парк, Татарстан
 
Будущее журналистики
Будущее журналистикиБудущее журналистики
Будущее журналистики
 
Будущее нишевых СМИ
Будущее нишевых СМИБудущее нишевых СМИ
Будущее нишевых СМИ
 
Relap
RelapRelap
Relap
 
Будущее онлайн СМИ в регионах
Будущее онлайн СМИ в регионах Будущее онлайн СМИ в регионах
Будущее онлайн СМИ в регионах
 
Война форматов. Как люди читают новости
Война форматов. Как люди читают новости Война форматов. Как люди читают новости
Война форматов. Как люди читают новости
 
Национальная технологическая инициатива
Национальная технологическая инициативаНациональная технологическая инициатива
Национальная технологическая инициатива
 
О РАЗРАБОТКЕ И РЕАЛИЗАЦИИ НАЦИОНАЛЬНОЙ ТЕХНОЛОГИЧЕСКОЙ ИНИЦИАТИВЫ
О РАЗРАБОТКЕ И РЕАЛИЗАЦИИ НАЦИОНАЛЬНОЙ  ТЕХНОЛОГИЧЕСКОЙ ИНИЦИАТИВЫ О РАЗРАБОТКЕ И РЕАЛИЗАЦИИ НАЦИОНАЛЬНОЙ  ТЕХНОЛОГИЧЕСКОЙ ИНИЦИАТИВЫ
О РАЗРАБОТКЕ И РЕАЛИЗАЦИИ НАЦИОНАЛЬНОЙ ТЕХНОЛОГИЧЕСКОЙ ИНИЦИАТИВЫ
 
Участники заседания по модернизации
Участники заседания по модернизацииУчастники заседания по модернизации
Участники заседания по модернизации
 
Город Innopolis
Город InnopolisГород Innopolis
Город Innopolis
 
презентация1
презентация1презентация1
презентация1
 

Dernier

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 

Dernier (20)

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

Lec12

  • 1. 6.897: CSCI8980 Algorithmic Techniques for Big Data September 5, 2013 Lecture 1 Dr. Barna Saha Scribe: Vivek Mishra Overview We introduce data streaming model where streams of elements are coming in and main memory space is not sufficient to hold all the data. We then look at the problem of finding frequent items deterministically. This is a rare instance of data streaming algorithms that provides non- trivial approximation guarantee deterministically. For most algorithms as we will see the answer is approximate (close to optimal) and randomization is crucially used. To emphasize on the need of randomization in designing data streaming algorithms, we next show that computing distinct items (known as F0 computation) over a stream deterministically cannot achieve any approximation without essentially storing the entire stream. Our next goal is to analyze the algorithm for distinct items that have been covered in Lecture 1. Towards that goal, we study some basic concentration inequality such as, Markov inequality, Chebyshev bound and The Chernoff bound. 1 Introduction to Data Streams In a data streaming model sequence of elements a1 a2 a3 am arrive from a domain [1, n]. Each element ai is a tuple (j, ν) where j ∈ [1, n] is an element from the domain and ν ∈ I. For simplicity, we can consider ν = ±1 where +1 implies j is inserted in the stream and −1 implies j is deleted. The goal is to process these elements using space (ideally) polylog of m and n, but definitely in sub-linear in m and n. In the basic streaming setting, only a single pass over the data is allowed. The time to process each update must be low. When only insertions are allowed (all ν > 0), the model is known as cash-register model. On the other hand, if both insertions and deletions are allowed, it is called turnstile model. For any element, we generally do not allow the number of deletions (total negative frequency) to be more than total number of insertions of that elements. However, if we do allow such scenarios, we will refer it as general turnstile model. 2 Finding Frequent Items Deterministically Here we describe an algorithm by Mishra and Gries [1] to find frequent items in the cash-register model that is when only insertions are allowed. The precise problem is as follows. Given a sequence of m elements from [1, n] and any k ∈ N, find all the elements with frequency more than m k where frequency simply implies the number of times an element occurs using space O(k) (in words) and using only a single pass. Hence in bits, the total space usage is O(k log n). We would like to have the following guarantees. 1
  • 2. • No false negative. All items that have frequency > m k will be reported. There might be false positives, that is elements with frequency lower than m/k may be reported. However, if we allow two passes, in the second pass all the elements reported in the first pass can be checked for actual frequency and any false positive can be eliminated. Description of the Algorithm • Data structure: An associative array of k − 1 elements initialized to empty. An associative array contains in each of its cell a key (the item) and a value (its count) and may be maintained in a balanced binary search tree by its key. Whenever we see an element ai = (j, 1) we can search in the array if j exists or not in O(log k) time. • Procedure: Given arrival of ai = (j, 1), we search if j exists in the associative array. If the element is already there, we increment its count. If it is not there and there is an empty cell, we store it and make its count 1. If no space is left and the element is not there, we do not store the element and decrement the counters of all the stored elements by 1. If a counter reaches 0, that element is dropped from the associative array and the cell becomes empty. In the following code, we use A to represent the associative array, and give the update process when ai is seen in the stream. Algorithm 1 [Process ai = (j, 1)] Search for j ∈ A if j is already present as key at cell A[l] then Count[l] = Count[l] + 1 {i}ncrement its count else if j is not present in A and ∃l such that A[l] is empty then Insert j as key to A[l] Count[l] = 1 else Drop j {comment: j is not present and there is no free space} for i = 1 to k − 1 do Count[i] = Count[i] − 1 if Count[i] == 0 then Drop the key associated with A[i] and mark A[i] empty end if end for end if The entire algorithm for a given k ∈ N is as follows 2.1 Analysis We now prove the following theorem. Theorem 1. For any given k ∈ N Algorithm 2 returns all items with frequency more than m k . 2
  • 3. Algorithm 2 Mishra-Gries Algorithm Initialize A of size k − 1 to empty for i = 1 to m do Process(ai) end for return All the elements in A Proof. Clearly, the number of items having frequency more than m k is at most k −1 because stream size is m. Let fj denote the actual frequency of item j and let ˆfj be the frequency of the item j as observed in A at the end of the processing. If j is not in A, we assume ˆfj = 0. We therefore return all elements with ˆfj > 0. Note that we increment frequency for j only if we see the actual item. Hence ˆfj ≤ fj. We want to find out how low it can be. If we never drop the element then ˆfj = fj. Otherwise, either at some point when j occurs, array A is full and j is not already there. Or because it is dropped from array due to its count being decremented to 0. Note that whenever an item is dropped (due to no space or count getting decremented to 0), there are other distinct k − 1 elements whose count also gets decremented. Here we view the count of an element on arrival is increased to 1, but it is dropped to 0 if it cannot be stored. Therefore, there can be at most m k steps on which element counts are decremented, k distinct elements at one shot. The reasoning being the stream size is m and all frequencies are non-negative. For each of these events, the difference between the computed frequency and the actual frequency can increase by 1 and hence altogether we have ˆfj ≥ fj − m k for all j ∈ [1, n]. Therefore, for all j ∈ [1, n] if fj > m k then ˆfj > 0 and hence it is stored in the array at the end and will be reported. In most data streaming algorithms, one cannot achieve any non-trivial approximation determinis- tically. In the following section, we come back to counting distinct items and show no deterministic algorithm is possible that in o(m) space can give an exact count for distinct elements. 3 Lower Bound for Deterministic Computation of Distinct Ele- ments We prove the following theorem here. Theorem 2. There exists no deterministic algorithm that returns the exact count for distinct items in stream of size m in o(m) bits. Proof. We will prove this by the by contradiction. Suppose it is possible to have an exact estimate of distinct elements using space o(m) bits. Let R be such an algorithm. Since R uses o(m) bits, the number of different possible configurations that R can maintain is at-most 2o(m). We let n = m and consider all the different streams which have exactly m 2 distinct elements. How many such 3
  • 4. streams are possible ? Clearly, the number of such streams is at least m m/2 ≈ me m/2 m 2 = (2e) me 2 = 2Θ(m) where the second inequality comes from Stirling’s approximation. Since the number of such streams is more than the number of available configurations of R, there must exist two streams y, y , y = y such that both have the same configuration, that is, R(y) = R(y ) and the distinct items in y are not identical to distinct items of y . We now consider two streams Y1 = y + y and Y2 = y + y where + represents concatenation here, that is stream Y1 y is followed by y and in stream Y2, y is followed by y. Since R(y) = R(y ), we must have R(y + y) = R(y + y ). Therefore R will return same distinct elements counts for both Y1 and Y2 which is wrong because the number of distinct elements in Y1 is m 2 where for Y2 it is > m 2 . The above proof can be extended to show that there does not exist any deterministic algorithm with space o(m) that returns a count of distinct items within a multiplicative factor less than 2. Similarly, one can show that no exact randomized algorithm can exist either in o(m) space. Surprisingly, when we allow both approximation and randomization, the space usage can be dras- tically reduced (next lecture). 4 Basic Concentration Inequalities Here we study three basic concentration inequalities which bound deviation from expectation. 1. Markov inequality ( The 1st moment inequality) 2. Chebyshev inequality( The 2nd moment inequality) 3. The Chernoff Bound Theorem 3 (Markov Bound). For any positive random variable X, and for any t > 0 Pr (X ≥ t) ≤ E[x] t (1) Proof. E[x] = x x · Pr(X = x) = x<t x · Pr(X = x) + x≥t x · Pr(X = x) ≥ 0 + t · x≥t Pr(X = x) = t · P(X ≥ t) 4
  • 5. Theorem 4 (Chebyshev Inequality). For any random variable X and for any t > 0 Pr(|X − E[x]| ≥ t) ≤ V ar(x) t2 (2) Proof. Pr(|X − E[x]| ≥ t) = Pr([X − E[x]]2 ≥ t2 ) ≤ E (X − E[x])2 t2 = V ar(X) t2 Theorem 5 (The Chernoff Bound). Let X1, X2...Xn be n independent Bernoulli random variables with Pr(Xi = 1) = pi. Let X = Xi. Hence, E[X] = E Xi = E [Xi] = Pr(Xi = 1) = pi = µ(say). Then the Chernoff Bound says for any > 0 Pr(X > (1 + )µ) ≤ e (1 + ) µ and Pr(X < (1 − )µ) ≤ e− (1 − )1− µ When 0 < < 1 the above expression can be further simplified to Pr(X > (1 + )µ) ≤ e −µ 2 3 and Pr(X < (1 − )µ) ≤ e −µ 2 2 Hence Pr(|X − µ| > µ) ≤ 2e −µ 2 3 References [1] Jayadev Misra and David Gries. Finding repeated elements. Sci. Comput. Program., 2(2):143152, 1982. 5