digital signal processing
in hadoop
with scalding
Adam Laiacano
adam@tumblr.com
@adamlaiacano
Thursday, October 17, 13
Overview
• Intro to digital signals and filters
• sampling
• frequency domain
• FIR / IIR filters
• Very quick intro to Scalding
• Filtering tons of signals at once
• Application: Finding trending blogs on tumblr
[Figure: example signal at 1 sample / day, shown raw and with a 7-day average]
Some Definitions
Signal - Any series of data (volts, posts, etc.) that is measured at regular intervals.
Sampling period, Ts - Time between samples (in my example, Ts = 1 day)
Sampling frequency, fs - 1 / Ts
Nyquist frequency - Highest frequency that can be represented = fs / 2
Filter - A system to reduce or enhance certain aspects (phase, magnitude) of a signal.
Stopband - The frequency range we want to eliminate
Passband - The frequency range we want to preserve
Cutoff frequency, fc - The boundary between the stopband and the passband
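A quick sketch of how these definitions relate for the 1 sample / day example. The function name `sampling_summary` is made up for illustration.

```python
def sampling_summary(ts_days):
    """Return (fs, nyquist, shortest_period) for a sampling period in days."""
    fs = 1.0 / ts_days                 # sampling frequency, samples per day
    nyquist = fs / 2.0                 # highest representable frequency, cycles per day
    shortest_period = 1.0 / nyquist    # fastest cycle we can still see, in days
    return fs, nyquist, shortest_period

fs, nyquist, shortest_period = sampling_summary(1.0)
print(fs, nyquist, shortest_period)  # 1.0 0.5 2.0
```

With daily samples, nothing that repeats faster than once every 2 days can be represented.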
Signals
Sampling
[Figure: original analog signal, and the same signal at 10 samples/period]
Sampling
[Figure: the signal at 1 sample/period and at 2 samples/period]
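The three sampling rates can be tried out with a small sketch. It assumes a sine wave with a period of 1 and a phase offset chosen so the samples land on the peaks; `sample_sine` is a hypothetical helper.

```python
import math

def sample_sine(samples_per_period, n_periods=3, phase=0.0):
    """Sample sin(2*pi*t + phase), which has a period of 1, at a fixed rate."""
    n = samples_per_period * n_periods
    return [math.sin(2 * math.pi * i / samples_per_period + phase)
            for i in range(n)]

ten = sample_sine(10)                     # shape of the wave is clearly visible
one = sample_sine(1, phase=math.pi / 2)   # every sample hits the same point
two = sample_sine(2, phase=math.pi / 2)   # alternates between peak and trough

print([round(v, 3) for v in one])
print([round(v, 3) for v in two])
```

At 1 sample/period the signal looks constant, so the oscillation is lost. At 2 samples/period (the Nyquist limit) the oscillation is only just captured.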
Filters
[Figure: FILTER block diagram]
Low-Pass Filter
[Figure: ideal low-pass response; passband and stopband meet at fc, frequency axis runs to fn]
Low-Pass Filter
[Figure: closer to reality; passband and stopband around fc, frequency axis runs to fn]
Moving Average Filter

y[t] = 1/7 * x[t] +
       1/7 * x[t-1] +
       1/7 * x[t-2] +
       ...
       1/7 * x[t-6]
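A minimal sketch of the 7-day moving average. It assumes samples before the start of the series count as zero, which the slides do not specify.

```python
def moving_average(x, window=7):
    """Causal moving average: y[t] = (x[t] + ... + x[t-window+1]) / window.

    Samples before the start of x are treated as zero (an assumption).
    """
    y = []
    for t in range(len(x)):
        start = max(0, t - window + 1)
        y.append(sum(x[start:t + 1]) / window)
    return y

daily = [7.0] * 10
print(moving_average(daily))
```

For a constant input the output ramps up over the first 6 days, then settles at the input value once the window is full.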
FIR Digital Filter

y[t] = h[0] * x[t] +
       h[1] * x[t-1] +
       h[2] * x[t-2] +
       ...
       h[N-1] * x[t-(N-1)]

R code:

y <- filter(x, h)
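The same weighted sum written out in plain Python, as a sketch rather than a port of the R call. Inputs before t = 0 are assumed to be zero.

```python
def fir_filter(x, h):
    """Apply y[t] = sum_k h[k] * x[t-k], treating x[t] as 0 for t < 0."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, hk in enumerate(h):
            if t - k >= 0:
                acc += hk * x[t - k]
        y.append(acc)
    return y

x = [1.0, 2.0, 3.0, 4.0]
h = [0.5, 0.5]               # 2-point average, for illustration only
print(fir_filter(x, h))      # [0.5, 1.5, 2.5, 3.5]
```

Setting every h[k] to 1/7 with N = 7 gives back the moving average filter.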
Frequency Domain

x = 1.0 * sin(0.5*2*pi*t) +
0.5 * sin(250*2*pi*t) +
0.1 * sin(400*2*pi*t)
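One way to see this signal in the frequency domain is to measure how much of each frequency it contains. The sketch below assumes a 1000 Hz sampling frequency and 2 seconds of data (neither is given on the slide) and evaluates single DFT coefficients at a few probe frequencies.

```python
import cmath
import math

FS = 1000.0   # assumed sampling frequency in Hz (not given on the slide)
N = 2000      # 2 seconds of samples, so 0.5 Hz falls exactly on a DFT bin

def signal(t):
    return (1.0 * math.sin(0.5 * 2 * math.pi * t)
            + 0.5 * math.sin(250 * 2 * math.pi * t)
            + 0.1 * math.sin(400 * 2 * math.pi * t))

x = [signal(n / FS) for n in range(N)]

def amplitude_at(freq_hz):
    """Amplitude of the sinusoid at freq_hz, from a single DFT coefficient."""
    coeff = sum(xn * cmath.exp(-2j * math.pi * freq_hz * n / FS)
                for n, xn in enumerate(x))
    return 2 * abs(coeff) / N

for f in (0.5, 100.0, 250.0, 400.0):
    print(f, round(amplitude_at(f), 3))
```

The three component frequencies come back with their amplitudes (1.0, 0.5, 0.1), while a frequency that is not in the signal, such as 100 Hz, comes back as essentially zero.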
Frequency Domain

21-point low-pass filter with 250Hz cutoff
h = [-0.0201, -0.0584, -0.0612, -0.0109, 0.0513,
0.0332, -0.0566, -0.0857, 0.0634, 0.3109,
0.4344, 0.3109, 0.0634, -0.0857, -0.0566,
0.0332, 0.0513, -0.0109, -0.0612, -0.0584,
-0.0201]

http://t-filter.appspot.com/fir/index.html
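A few properties of these coefficients can be checked without knowing the sampling frequency the filter was designed for. The sketch below tests symmetry and compares the gain at DC (frequency 0) with the gain at the Nyquist frequency; `gain` is a helper written for this example.

```python
import cmath
import math

# Coefficients copied from the slide.
h = [-0.0201, -0.0584, -0.0612, -0.0109, 0.0513,
     0.0332, -0.0566, -0.0857, 0.0634, 0.3109,
     0.4344, 0.3109, 0.0634, -0.0857, -0.0566,
     0.0332, 0.0513, -0.0109, -0.0612, -0.0584,
     -0.0201]

def gain(h, omega):
    """Magnitude of the frequency response at omega radians per sample."""
    return abs(sum(hk * cmath.exp(-1j * omega * k) for k, hk in enumerate(h)))

is_symmetric = h == h[::-1]        # symmetric taps give linear phase
dc_gain = gain(h, 0.0)             # response to a constant input
nyquist_gain = gain(h, math.pi)    # response to the fastest representable input

print(is_symmetric, round(dc_gain, 3), round(nyquist_gain, 3))
```

The taps are symmetric, and the gain at DC is far larger than the gain at Nyquist, which is what a low-pass filter should do.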
FIR vs IIR

FIR:
y[t] = h[0] * x[t] +
       ...
       h[N-1] * x[t-(N-1)]

IIR:
y[t] = h[0] * x[t] +
       ...
       h[N-1] * x[t-(N-1)] +
       g[1] * y[t-1] +
       ...
       g[M] * y[t-M]
Delta Function
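The delta function is 1 at t = 0 and 0 everywhere else. Feeding it into an FIR filter returns the filter coefficients themselves, as this sketch shows with a made-up 3-tap filter.

```python
def delta(length):
    """Unit impulse: 1 at t = 0, 0 everywhere else."""
    return [1.0] + [0.0] * (length - 1)

def fir_filter(x, h):
    """y[t] = sum_k h[k] * x[t-k], with x[t] = 0 for t < 0."""
    return [sum(hk * x[t - k] for k, hk in enumerate(h) if t - k >= 0)
            for t in range(len(x))]

h = [0.25, 0.5, 0.25]            # hypothetical filter, for illustration only
response = fir_filter(delta(6), h)
print(response)                  # [0.25, 0.5, 0.25, 0.0, 0.0, 0.0]
```

The response dies out after N samples, which is the "finite" in finite impulse response.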

FIR vs IIR

FIR:
y[t] = 1/7 * x[t] +
       1/7 * x[t-1] +
       ...
       1/7 * x[t-6]

IIR:
y[t] = 1/2 * x[t] +
       1/2 * y[t-1]
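A sketch of the IIR example, fed with a unit impulse. The helper name `iir_first_order` and the starting value y[-1] = 0 are assumptions.

```python
def iir_first_order(x, b, a):
    """y[t] = b * x[t] + a * y[t-1], with y[-1] = 0."""
    y = []
    prev = 0.0
    for xt in x:
        prev = b * xt + a * prev
        y.append(prev)
    return y

impulse = [1.0] + [0.0] * 7
print(iir_first_order(impulse, 0.5, 0.5))
# [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]
```

The output halves at every step but never reaches exactly zero, while the 7-point FIR filter's response to the same impulse is zero after 7 samples.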
IIR - be careful!

y[t] = 0.5 * x[t] +
1.1 * y[t-1]
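Why the warning: with a feedback coefficient larger than 1, each output is bigger than the one before, even when the input has gone to zero. A small sketch comparing the two filters on a unit impulse (helper name is made up):

```python
def iir_first_order(x, b, a):
    """y[t] = b * x[t] + a * y[t-1], with y[-1] = 0."""
    y = []
    prev = 0.0
    for xt in x:
        prev = b * xt + a * prev
        y.append(prev)
    return y

impulse = [1.0] + [0.0] * 49
stable = iir_first_order(impulse, 0.5, 0.5)     # feedback 0.5: decays
unstable = iir_first_order(impulse, 0.5, 1.1)   # feedback 1.1: grows forever

print(stable[-1], unstable[-1])
```

After 50 samples the stable filter's output is effectively zero, while the unstable one has grown to more than 50 times the size of the original impulse and keeps growing.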

Impulse Response

Recap - FIR Filters
• FIR filters are weighted sums of the current and previous inputs.
• Can think of them as a generalized Moving Average

• Required to apply:
• Filter h of length N
• The N most recent inputs x
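Those two requirements are all the state an FIR filter needs. The sketch below keeps just the N most recent inputs in a fixed-size buffer; the class name `StreamingFir` is hypothetical.

```python
from collections import deque

class StreamingFir:
    """Keeps only the N most recent inputs needed to apply a length-N filter."""

    def __init__(self, h):
        self.h = list(h)
        # history[0] is the newest input; missing history counts as zero.
        self.history = deque([0.0] * len(self.h), maxlen=len(self.h))

    def push(self, value):
        """Feed one new sample and return the filtered output for it."""
        self.history.appendleft(value)   # the oldest sample falls off the end
        return sum(hk * xk for hk, xk in zip(self.h, self.history))

fir = StreamingFir([0.5, 0.5])
print([fir.push(v) for v in [1.0, 2.0, 3.0, 4.0]])  # [0.5, 1.5, 2.5, 3.5]
```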
Super General Overview
• DSL on top of Cascading, written in Scala
• Cascading: Workflow language for dealing with lots of data. Often in Hadoop.

• Similar to Pig or Hive, but easier to extend (no UDFs! one language!).

• Feels like “real programming” - compiler! types!

• Is awesome
Less General Overview
• Similar to split/apply/combine paradigm (plyr, pandas)
• Load data into Pipes (like data.frames)
• Each pipe has one or more Fields (columns)
• Perform row-wise operations with map (d$a+d$b)
• Perform field-wise operations in groupBy
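For readers who do not know Scalding, here is the same shape of computation in plain Python. This is only an analogy, not Scalding code: the "pipe" is a list of dicts, and the field names are invented.

```python
from collections import defaultdict

# A "pipe" here is just a list of rows; each row maps field name -> value.
pipe = [
    {"blog": "a", "likes": 3, "reblogs": 1},
    {"blog": "b", "likes": 5, "reblogs": 2},
    {"blog": "a", "likes": 4, "reblogs": 0},
]

# Row-wise operation (the "map" step): add a field computed from two others.
mapped = [dict(row, notes=row["likes"] + row["reblogs"]) for row in pipe]

# Grouped operation (the "groupBy" step): sum the new field per blog.
totals = defaultdict(int)
for row in mapped:
    totals[row["blog"]] += row["notes"]

print(dict(totals))  # {'a': 8, 'b': 7}
```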
Hello World

Scalding Resources
• The best resource is the Scalding wiki page.
https://github.com/twitter/scalding/wiki/Fields-based-API-Reference

• Edwin Chen’s post about recommendations.
http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/

• Source code is FULL of undocumented features!
https://github.com/twitter/scalding

import Matrix._

[Figure: a sliding filter moves along the data vector; each position covers a sliding subset of the input]
[Figure: the shifted, zero-padded copies of the sliding filter stacked into a filter matrix]
[Figure: (data vector)^T * (filter matrix) = (filtered output)^T]
X * H = Y
(N x M) * (M x M) = (N x M)

N = number of blogs
M = number of samples
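A small sketch of X * H = Y with nested lists. It assumes each row of X holds one blog's samples in time order and that H[s][t] = h[t - s], which makes the matrix product equal to the FIR sum. The data values are made up.

```python
def filter_matrix(h, m):
    """M x M matrix with H[s][t] = h[t - s], so (x * H)[t] = sum_k h[k] * x[t-k]."""
    return [[h[t - s] if 0 <= t - s < len(h) else 0.0 for t in range(m)]
            for s in range(m)]

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# N = 2 blogs (rows), M = 4 samples (columns).
X = [[1.0, 2.0, 3.0, 4.0],
     [2.0, 2.0, 2.0, 2.0]]
H = filter_matrix([0.5, 0.5], 4)
Y = matmul(X, H)
print(Y)  # [[0.5, 1.5, 2.5, 3.5], [1.0, 2.0, 2.0, 2.0]]
```

Every row of X is filtered by the same H in a single multiplication, and H is mostly zeros, so it suits a sparse representation.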
Matrix Filter: Square Waves

import Matrix._
• Scalding has a Matrix library!
• Stores data in a Pipe as ('row, 'col, 'val)
• Ideal for sparse matrices

• L0, L1, L2 norm, inverse, +, -, *
• QR Factorization: http://bit.ly/1hxWF17
• More!
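To show why the (row, col, val) layout suits sparse matrices, here is a plain Python sketch of multiplying two matrices stored that way. It is an illustration of the idea only and does not use the Scalding Matrix API.

```python
from collections import defaultdict

def sparse_matmul(a, b):
    """Multiply two matrices stored as (row, col, val) triples.

    Joins a's columns against b's rows, then sums the products for each
    output cell. Zero entries are never stored, so they cost nothing.
    """
    b_by_row = defaultdict(list)
    for row, col, val in b:
        b_by_row[row].append((col, val))

    cells = defaultdict(float)
    for i, k, a_val in a:
        for j, b_val in b_by_row.get(k, []):
            cells[(i, j)] += a_val * b_val
    return sorted((i, j, v) for (i, j), v in cells.items())

a = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]   # [[1, 2], [0, 3]]
b = [(0, 0, 0.5), (1, 0, 0.5), (1, 1, 0.5)]   # [[0.5, 0], [0.5, 0.5]]
print(sparse_matmul(a, b))
# [(0, 0, 1.5), (0, 1, 1.0), (1, 0, 1.5), (1, 1, 1.5)]
```

The multiplication is a join on the shared index followed by a grouped sum, which is the kind of work a MapReduce job handles well.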
Tumblr Social Graph
• 140+ Million Nodes
• 3.5 Billion Edges
• About 100GB of raw text data
• 3 columns: fromId, toId, timestamp

GOAL: Calculate followers / day for every blog,
apply 1-week moving average.
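The first half of the goal, followers / day per blog, is a grouped count. A single-machine sketch follows; it assumes toId is the blog being followed and timestamp is Unix seconds in UTC, neither of which the slide states.

```python
from collections import Counter
from datetime import datetime, timezone

def followers_per_day(edges):
    """Count new followers per (blog, day) from (fromId, toId, timestamp) rows."""
    counts = Counter()
    for from_id, to_id, ts in edges:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        counts[(to_id, day)] += 1
    return dict(counts)

# Made-up edges for illustration.
edges = [
    (1, 99, 1381968000),  # 2013-10-17 00:00:00 UTC
    (2, 99, 1381971600),  # same day, one hour later
    (3, 99, 1382054400),  # 2013-10-18 00:00:00 UTC
    (1, 42, 1381968000),
]
print(followers_per_day(edges))
```

Each blog's daily counts then form one signal, ready for the 1-week moving average.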
Apply Low-Pass filter
to 140,000,000 blogs

Find blogs that have accelerating follower counts
for the most consecutive days.

“Accelerating”:
New Followers Today > New Followers Yesterday

[Figure: plots comparing Blog A, Blog B, and Blog C]
Days of consecutive acceleration
• Binary input: 1 if more followers today than
yesterday, otherwise 0

• Filter the binary signal to produce a value
between 0 and 1

• Anything above a threshold (0.75) is
“accelerating”
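The three steps can be sketched end to end. The slides do not say which filter is applied to the binary signal, so this assumes a 5-day moving average; the follower counts are made up.

```python
def accelerating_flags(new_followers):
    """1 if today's new followers exceed yesterday's, otherwise 0."""
    return [1 if today > yesterday else 0
            for yesterday, today in zip(new_followers, new_followers[1:])]

def moving_average(x, window):
    """Causal moving average with zeros assumed before the first sample."""
    return [sum(x[max(0, t - window + 1):t + 1]) / window
            for t in range(len(x))]

def longest_run_above(values, threshold):
    """Length of the longest stretch of consecutive values above threshold."""
    best = current = 0
    for v in values:
        current = current + 1 if v > threshold else 0
        best = max(best, current)
    return best

# Daily new-follower counts with a single dip in a growth streak.
new_followers = [1, 2, 3, 4, 5, 4, 6, 7, 8, 9, 10, 3, 2, 1]
flags = accelerating_flags(new_followers)
smoothed = moving_average(flags, window=5)   # window length is an assumption

print(flags)
print(longest_run_above(flags, 0.75), longest_run_above(smoothed, 0.75))
```

On the raw binary signal the single dip breaks the streak (longest run of 5), while the filtered signal stays above 0.75 through the dip (longest run of 8). Smoothing makes the measure tolerant of one bad day.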

Days of consecutive acceleration
(filtered signal)

15 days
36 days
48 days

Thanks!
@adamlaiacano
adamlaiacano.tumblr.com
github.com/alaiacano/dsp-scalding
