SlideShare une entreprise Scribd logo
1  sur  67
100% Big Data
0% Hadoop
0% Java
Pavlo Baron, codecentric AG
Saturday, April 27, 13
pavlo.baron@codecentric.de
@pavlobaron
github.com/pavlobaron
Saturday, April 27, 13
So here is the short story...
Saturday, April 27, 13
sitting there, listening...
Saturday, April 27, 13
presented as Houdini magic...
Saturday, April 27, 13
so you telling me...
it’s smoke and mirrors?
Saturday, April 27, 13
Smells like a bunch of queues,
pipes and filters...
Saturday, April 27, 13
Looks like some NLP...
Saturday, April 27, 13
Sounds like some math...
Saturday, April 27, 13
Seems like some basic ML...
Saturday, April 27, 13
methinks: I can tinker that.
I have 2 nights in the hotel...
Saturday, April 27, 13
Fire!
Saturday, April 27, 13
Know the use cases...
Saturday, April 27, 13
Consume a feed where people
say what they think before
they think what they say...
Saturday, April 27, 13
Drink Big Data warm, straight
from the fire hose...
Saturday, April 27, 13
Then fork for immediate
notification and batch
analytics...
Saturday, April 27, 13
Some bubbles
queuefeeds filter
sentiment analysis
formalizestore
aggregatereportreact
alert
queue
queue
react
map/reduce
fork
queue
Saturday, April 27, 13
Some tech
Languages: Python, Erlang
Feeds: Tweepy, crawlers, feed readers
Queueing: RabbitMQ through Pika
Store: Riak through protobufs
Map/reduce: modified Disco to run workers on Riak-
nodes data-locally
Active event push to the browser through RabbitMQ’s
Web-STOMP
jVectorMap, jQuery, stomp and sock.js in the browser
wpcorpus built with RabbitMQ, indexed with PyTables
Saturday, April 27, 13
Some math
Analytics: NLP with NLTK
Algo training: nltk-trainer with pickle=true
Algos: naive Bayes, decision tree, binary
classification based on trigram frequencies
simple name and antiword filtering based on
public and own corpora
sentiment analysis based on public and own
corpora
troll filtering using Wikipedia as active
corpus
Saturday, April 27, 13
Some numbers...
...‘cause numbers are sexy
Saturday, April 27, 13
When numbers
become too
sexy for your
[hat|car|cat],
they mutate
into
numbers pr0n
Saturday, April 27, 13
Some numbers, revisited
I’m not into numbers pr0n
numbers need to be just good enough for
what you’re trying to solve
Saturday, April 27, 13
But it’s still the
easiest way to
impress,
especially
without
solving a
concrete
problem
Saturday, April 27, 13
So, finally, some numbers
(on my MBA)
Feed: ~10000 chaotic text msg/min
Store: ~8000 formalized msg/min, N=3,
quorum, 3 nodes
Analytics: ~7000 msg/min (filtered, pos/neg
aggregation, location based aggregation)
Demo: ~1500000 tweets, pos/neg
aggregation, stream processing in ~7min,
map/red in ~15sec
Saturday, April 27, 13
Some lessons learned
Saturday, April 27, 13
The Beliebers...
Saturday, April 27, 13
More than 60% of the
Twitter sample stream is
useless garbage...
Saturday, April 27, 13
Further 20% are trolls...
Saturday, April 27, 13
So I ended up implementing
wpcorpus - active NLP corpus
based on Wikipedia, using its
categories and their
combination as classes and
anti-classes
Saturday, April 27, 13
Real names...
Saturday, April 27, 13
Absurd profile bios...
Saturday, April 27, 13
Location...
Saturday, April 27, 13
Language...
For trigrams in NLTK, use
Spanish as “anti-class” to tell
English/German from the rest
Saturday, April 27, 13
Disco workers on Riak nodes...
PITA and a lot of tinkering, but necessary
for data locality
Extending Disco is relatively easy, but
changing it is hard...
Flooding, asynchronous, separate key/value
listing in low-level Riak goes very well with
Erlang port based Python/Erlang message
exchange in Disco. Not
Low-level vnode-data-consume needed
probabilistic correction due to N=3
Extended Disco to use RabbitMQ between
the worlds (h/t Dan North for the idea)
Saturday, April 27, 13
Mixing Python and Erlang in
one project...
Forgetting punctuation in Erlang code all the
time when quickly switching from Python
Terribly missing pattern matching in Python
Considering to embed Python in Erlang, but
it might become a double PITA then
Saturday, April 27, 13
Sentiment analysis...
Saturday, April 27, 13
Well, actually, strong
sentiment analysis...
Saturday, April 27, 13
Very unreliable given the
human nature...
Saturday, April 27, 13
In addition to the NLTK’s
movie reviews corpus, use
these for “neg” classification
Saturday, April 27, 13
FAQ
Saturday, April 27, 13
Q: Why the heck are you
doing this?
Saturday, April 27, 13
A
Because I can
Because I want
Because I want to learn
Because I want to go deep on low-level
Because I value speed over abstraction
Because it’s very interesting to combine
computer science with math
Saturday, April 27, 13
Q: Why not just use Hadoop?
Saturday, April 27, 13
A
Because I didn’t want to run this on the JVM
Because I have 2 use cases, and only one of
them is suitable for batch map/reduce
Saturday, April 27, 13
Q: Why didn’t you want to
run this on the JVM?
Saturday, April 27, 13
A: well, technically seen,
Big Data area is growing
on the JVM
Hadoop
Pig
Storm, Kafka, Esper
Mahout
OpenNLP
Saturday, April 27, 13
A: but I didn’t want this
Big Data on my drive
~/.m2
Saturday, April 27, 13
A: and I am evaluating some
alternatives to the ecosystem
Saturday, April 27, 13
Q: Why are you queueing at
all? Others do gazillions of
msg/sec without queues
Saturday, April 27, 13
A
I could, if instead of filters and batch analytics of
chaotic text, it would be just about building trivial
sums from fixed-sized ticks
with growable numbers like this, you want to
protect any sort of reliable data store from getting
flooded by writes, RDBMS or NoSQL store
Because I need to do some pipes and filters
Because I’m mixing and crossing borders of data
sources and technologies
Because (almost) all frameworks that you might
consider also do some queueing or buffering
Saturday, April 27, 13
Q: Why did you use Erlang
and Python?
Saturday, April 27, 13
A
Because reliability and distribution are built
into the Erlang VM and I don’t need separate
coordinators or to reinvent the wheel
Because both, Python and Erlang, are
“functional” enough for what I need day-by-
day
Because Python has been for many years the
platform of choice for scientists, thus there
are available clever and mature math
libraries
Because Disco is on Python and Erlang, Riak
and RabbitMQ are on Erlang
Saturday, April 27, 13
Q: isn’t Python slow like hell?
Saturday, April 27, 13
A
it’s not operating at the speed of light
yes, it is slower at some points
I’ve also been testing PyPy to improve
performance for the case I should need it,
‘cause right now it works just fast enough
without explicit bottle-necks in the given
architecture, even on one single MBA
Saturday, April 27, 13
Q: MBA is boring. Can you
make it real web scale?
Saturday, April 27, 13
A
well, to be precise, I’m operating on web
data
I can scale queues with RabbitMQ
I can scale storage with Riak
I can scale the map/reduce supported
analytics with Disco/Riak
I can scale data sources/feeds, machines,
hardware, networks, infrastructure, logins
etc. You name it
Saturday, April 27, 13
Q: what’s in the future?
Saturday, April 27, 13
A
I don’t have my crystal ball with me
I’m toying with the idea to write Pig Latin
engine in Python called “Sau” (German for
pig), to offer data scientists a comfortable
interface and to allow them to run existing
Pig scripts on this stack
I could add more data sources, improve
throughput where necessary and work on
some low level Disco modifications to change
the way it utilizes Erlang in my case
I will integrate my Disco extensions with
Disco upstream one day
Saturday, April 27, 13
Q: what do we learn about
Big Data here?
Saturday, April 27, 13
A
Big Data is about the “what”, followed by
the “how” and enabled by the “what with”
Saturday, April 27, 13
A
It’s about gathering data, filtering out most
of it as garbage, analyzing it, gaining useful
information out of it immediately, finding new
ways to gather and use information and
deriving steps for business improvements,
strategy planning, doing soft intelligence aka
enterprise level stalking or, even more
important, helping make the world a better
place - it’s up to you
Saturday, April 27, 13
A
It’s not about building SkyNet - even if this
will be built one day, it will be pretty
boring. It’s about building recommender and
decision support systems, thus letting
machines do stupid, repeated jobs fast and
human beings make high quality decisions
Saturday, April 27, 13
A
It’s not about plain numbers. It’s about
numbers that are good enough to carry the
solution. Not less, but also not more than
that
Consider the dilemma: if you want to be as
precise and as fast as possible in your “Big
Data”, you don’t crunch it blindly on tons of
machines. Instead, you go at the low-level
and optimize the stack, filter upfront
If it’s not leading to value in near realtime,
it’s useless, though probably Big
Saturday, April 27, 13
A
It’s a huge field for geeks with aspiration to
learn new things, dig into math and computer
science, play with different platforms and
tools and pick the right tool chain
Saturday, April 27, 13
Oh, and did the demo run?
Saturday, April 27, 13
Thank you!
Saturday, April 27, 13
Most images originate from
istockphoto.com
except few ones taken
from Wikipedia or Flickr (CC)
and product pages
or generated through public
online generators
Saturday, April 27, 13

Contenu connexe

En vedette

@pavlobaron Why monitoring sucks and how to improve it
@pavlobaron Why monitoring sucks and how to improve it@pavlobaron Why monitoring sucks and how to improve it
@pavlobaron Why monitoring sucks and how to improve itPavlo Baron
 
What can be done with Java, but should better be done with Erlang (@pavlobaron)
What can be done with Java, but should better be done with Erlang (@pavlobaron)What can be done with Java, but should better be done with Erlang (@pavlobaron)
What can be done with Java, but should better be done with Erlang (@pavlobaron)Pavlo Baron
 
Dynamo concepts in depth (@pavlobaron)
Dynamo concepts in depth (@pavlobaron)Dynamo concepts in depth (@pavlobaron)
Dynamo concepts in depth (@pavlobaron)Pavlo Baron
 
Diving into Erlang is a one-way ticket (@pavlobaron)
Diving into Erlang is a one-way ticket (@pavlobaron)Diving into Erlang is a one-way ticket (@pavlobaron)
Diving into Erlang is a one-way ticket (@pavlobaron)Pavlo Baron
 
Becoming reactive without overreacting (@pavlobaron)
Becoming reactive without overreacting (@pavlobaron)Becoming reactive without overreacting (@pavlobaron)
Becoming reactive without overreacting (@pavlobaron)Pavlo Baron
 
Harry Potter and Enormous Data (Pavlo Baron)
Harry Potter and Enormous Data (Pavlo Baron)Harry Potter and Enormous Data (Pavlo Baron)
Harry Potter and Enormous Data (Pavlo Baron)Pavlo Baron
 
(Functional) reactive programming (@pavlobaron)
(Functional) reactive programming (@pavlobaron)(Functional) reactive programming (@pavlobaron)
(Functional) reactive programming (@pavlobaron)Pavlo Baron
 
Big Data & NoSQL - EFS'11 (Pavlo Baron)
Big Data & NoSQL - EFS'11 (Pavlo Baron)Big Data & NoSQL - EFS'11 (Pavlo Baron)
Big Data & NoSQL - EFS'11 (Pavlo Baron)Pavlo Baron
 
Expo antiagregantes plaquetarios
Expo antiagregantes plaquetariosExpo antiagregantes plaquetarios
Expo antiagregantes plaquetariosale_magnifike
 
Data on its way to history, interrupted by analytics and silicon (@pavlobaron)
Data on its way to history, interrupted by analytics and silicon (@pavlobaron)Data on its way to history, interrupted by analytics and silicon (@pavlobaron)
Data on its way to history, interrupted by analytics and silicon (@pavlobaron)Pavlo Baron
 

En vedette (11)

@pavlobaron Why monitoring sucks and how to improve it
@pavlobaron Why monitoring sucks and how to improve it@pavlobaron Why monitoring sucks and how to improve it
@pavlobaron Why monitoring sucks and how to improve it
 
Madre feliz dia¡
Madre feliz dia¡Madre feliz dia¡
Madre feliz dia¡
 
What can be done with Java, but should better be done with Erlang (@pavlobaron)
What can be done with Java, but should better be done with Erlang (@pavlobaron)What can be done with Java, but should better be done with Erlang (@pavlobaron)
What can be done with Java, but should better be done with Erlang (@pavlobaron)
 
Dynamo concepts in depth (@pavlobaron)
Dynamo concepts in depth (@pavlobaron)Dynamo concepts in depth (@pavlobaron)
Dynamo concepts in depth (@pavlobaron)
 
Diving into Erlang is a one-way ticket (@pavlobaron)
Diving into Erlang is a one-way ticket (@pavlobaron)Diving into Erlang is a one-way ticket (@pavlobaron)
Diving into Erlang is a one-way ticket (@pavlobaron)
 
Becoming reactive without overreacting (@pavlobaron)
Becoming reactive without overreacting (@pavlobaron)Becoming reactive without overreacting (@pavlobaron)
Becoming reactive without overreacting (@pavlobaron)
 
Harry Potter and Enormous Data (Pavlo Baron)
Harry Potter and Enormous Data (Pavlo Baron)Harry Potter and Enormous Data (Pavlo Baron)
Harry Potter and Enormous Data (Pavlo Baron)
 
(Functional) reactive programming (@pavlobaron)
(Functional) reactive programming (@pavlobaron)(Functional) reactive programming (@pavlobaron)
(Functional) reactive programming (@pavlobaron)
 
Big Data & NoSQL - EFS'11 (Pavlo Baron)
Big Data & NoSQL - EFS'11 (Pavlo Baron)Big Data & NoSQL - EFS'11 (Pavlo Baron)
Big Data & NoSQL - EFS'11 (Pavlo Baron)
 
Expo antiagregantes plaquetarios
Expo antiagregantes plaquetariosExpo antiagregantes plaquetarios
Expo antiagregantes plaquetarios
 
Data on its way to history, interrupted by analytics and silicon (@pavlobaron)
Data on its way to history, interrupted by analytics and silicon (@pavlobaron)Data on its way to history, interrupted by analytics and silicon (@pavlobaron)
Data on its way to history, interrupted by analytics and silicon (@pavlobaron)
 

Plus de Pavlo Baron

data, ..., profit (@pavlobaron)
data, ..., profit (@pavlobaron)data, ..., profit (@pavlobaron)
data, ..., profit (@pavlobaron)Pavlo Baron
 
Near realtime analytics - technology choice (@pavlobaron)
Near realtime analytics - technology choice (@pavlobaron)Near realtime analytics - technology choice (@pavlobaron)
Near realtime analytics - technology choice (@pavlobaron)Pavlo Baron
 
Set this Big Data technology zoo in order (@pavlobaron)
Set this Big Data technology zoo in order (@pavlobaron)Set this Big Data technology zoo in order (@pavlobaron)
Set this Big Data technology zoo in order (@pavlobaron)Pavlo Baron
 
a Tech guy’s take on Big Data business cases (@pavlobaron)
a Tech guy’s take on Big Data business cases (@pavlobaron)a Tech guy’s take on Big Data business cases (@pavlobaron)
a Tech guy’s take on Big Data business cases (@pavlobaron)Pavlo Baron
 
From Hand To Mouth (@pavlobaron)
From Hand To Mouth (@pavlobaron)From Hand To Mouth (@pavlobaron)
From Hand To Mouth (@pavlobaron)Pavlo Baron
 
The Big Data Developer (@pavlobaron)
The Big Data Developer (@pavlobaron)The Big Data Developer (@pavlobaron)
The Big Data Developer (@pavlobaron)Pavlo Baron
 
20 reasons why we don't need architects (@pavlobaron)
20 reasons why we don't need architects (@pavlobaron)20 reasons why we don't need architects (@pavlobaron)
20 reasons why we don't need architects (@pavlobaron)Pavlo Baron
 
Theoretical aspects of distributed systems - playfully illustrated (@pavlobaron)
Theoretical aspects of distributed systems - playfully illustrated (@pavlobaron)Theoretical aspects of distributed systems - playfully illustrated (@pavlobaron)
Theoretical aspects of distributed systems - playfully illustrated (@pavlobaron)Pavlo Baron
 
Let It Crash (@pavlobaron)
Let It Crash (@pavlobaron)Let It Crash (@pavlobaron)
Let It Crash (@pavlobaron)Pavlo Baron
 
JUGS June'11 - Erlang/OTP
JUGS June'11 - Erlang/OTPJUGS June'11 - Erlang/OTP
JUGS June'11 - Erlang/OTPPavlo Baron
 
Big Data - JAX2011 (Pavlo Baron)
Big Data - JAX2011 (Pavlo Baron)Big Data - JAX2011 (Pavlo Baron)
Big Data - JAX2011 (Pavlo Baron)Pavlo Baron
 
BigData & CDN - OOP2011 (Pavlo Baron)
BigData & CDN - OOP2011 (Pavlo Baron)BigData & CDN - OOP2011 (Pavlo Baron)
BigData & CDN - OOP2011 (Pavlo Baron)Pavlo Baron
 

Plus de Pavlo Baron (12)

data, ..., profit (@pavlobaron)
data, ..., profit (@pavlobaron)data, ..., profit (@pavlobaron)
data, ..., profit (@pavlobaron)
 
Near realtime analytics - technology choice (@pavlobaron)
Near realtime analytics - technology choice (@pavlobaron)Near realtime analytics - technology choice (@pavlobaron)
Near realtime analytics - technology choice (@pavlobaron)
 
Set this Big Data technology zoo in order (@pavlobaron)
Set this Big Data technology zoo in order (@pavlobaron)Set this Big Data technology zoo in order (@pavlobaron)
Set this Big Data technology zoo in order (@pavlobaron)
 
a Tech guy’s take on Big Data business cases (@pavlobaron)
a Tech guy’s take on Big Data business cases (@pavlobaron)a Tech guy’s take on Big Data business cases (@pavlobaron)
a Tech guy’s take on Big Data business cases (@pavlobaron)
 
From Hand To Mouth (@pavlobaron)
From Hand To Mouth (@pavlobaron)From Hand To Mouth (@pavlobaron)
From Hand To Mouth (@pavlobaron)
 
The Big Data Developer (@pavlobaron)
The Big Data Developer (@pavlobaron)The Big Data Developer (@pavlobaron)
The Big Data Developer (@pavlobaron)
 
20 reasons why we don't need architects (@pavlobaron)
20 reasons why we don't need architects (@pavlobaron)20 reasons why we don't need architects (@pavlobaron)
20 reasons why we don't need architects (@pavlobaron)
 
Theoretical aspects of distributed systems - playfully illustrated (@pavlobaron)
Theoretical aspects of distributed systems - playfully illustrated (@pavlobaron)Theoretical aspects of distributed systems - playfully illustrated (@pavlobaron)
Theoretical aspects of distributed systems - playfully illustrated (@pavlobaron)
 
Let It Crash (@pavlobaron)
Let It Crash (@pavlobaron)Let It Crash (@pavlobaron)
Let It Crash (@pavlobaron)
 
JUGS June'11 - Erlang/OTP
JUGS June'11 - Erlang/OTPJUGS June'11 - Erlang/OTP
JUGS June'11 - Erlang/OTP
 
Big Data - JAX2011 (Pavlo Baron)
Big Data - JAX2011 (Pavlo Baron)Big Data - JAX2011 (Pavlo Baron)
Big Data - JAX2011 (Pavlo Baron)
 
BigData & CDN - OOP2011 (Pavlo Baron)
BigData & CDN - OOP2011 (Pavlo Baron)BigData & CDN - OOP2011 (Pavlo Baron)
BigData & CDN - OOP2011 (Pavlo Baron)
 

Dernier

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 

Dernier (20)

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 

100% Big Data, 0% Hadoop, 0% Java (@pavlobaron)

  • 1. 100% Big Data 0% Hadoop 0% Java Pavlo Baron, codecentric AG Saturday, April 27, 13
  • 3. So here is the short story... Saturday, April 27, 13
  • 5. presented as Houdini magic... Saturday, April 27, 13
  • 6. so you telling me... it’s smoke and mirrors? Saturday, April 27, 13
  • 7. Smells like a bunch of queues, pipes and filters... Saturday, April 27, 13
  • 8. Looks like some NLP... Saturday, April 27, 13
  • 9. Sounds like some math... Saturday, April 27, 13
  • 10. Seems like some basic ML... Saturday, April 27, 13
  • 11. methinks: I can tinker that. I have 2 nights in the hotel... Saturday, April 27, 13
  • 13. Know the use cases... Saturday, April 27, 13
  • 14. Consume a feed where people say what they think before they think what they say... Saturday, April 27, 13
  • 15. Drink Big Data warm, straight from the fire hose... Saturday, April 27, 13
  • 16. Then fork for immediate notification and batch analytics... Saturday, April 27, 13
  • 17. Some bubbles queuefeeds filter sentiment analysis formalizestore aggregatereportreact alert queue queue react map/reduce fork queue Saturday, April 27, 13
  • 18. Some tech Languages: Python, Erlang Feeds: Tweepy, crawlers, feed readers Queueing: RabbitMQ through Pika Store: Riak through protobufs Map/reduce: modified Disco to run workers on Riak- nodes data-locally Active event push to the browser through RabbitMQ’s Web-STOMP jVectorMap, jQuery, stomp and sock.js in the browser wpcorpus built with RabbitMQ, indexed with PyTables Saturday, April 27, 13
  • 19. Some math Analytics: NLP with NLTK Algo training: nltk-trainer with pickle=true Algos: naive Bayes, decision tree, binary classification based on trigram frequencies simple name and antiword filtering based on public and own corpora sentiment analysis based on public and own corpora troll filtering using Wikipedia as active corpus Saturday, April 27, 13
  • 20. Some numbers... ...‘cause numbers are sexy Saturday, April 27, 13
  • 21. When numbers become too sexy for your [hat|car|cat], they mutate into numbers pr0n Saturday, April 27, 13
  • 22. Some numbers, revisited I’m not into numbers pr0n numbers need to be just good enough for what you’re trying to solve Saturday, April 27, 13
  • 23. But it’s still the easiest way to impress, especially without solving a concrete problem Saturday, April 27, 13
  • 24. So, finally, some numbers (on my MBA) Feed: ~10000 chaotic text msg/min Store: ~8000 formalized msg/min, N=3, quorum, 3 nodes Analytics: ~7000 msg/min (filtered, pos/neg aggregation, location based aggregation) Demo: ~1500000 tweets, pos/neg aggregation, stream processing in ~7min, map/red in ~15sec Saturday, April 27, 13
  • 27. More than 60% of the Twitter sample stream is useless garbage... Saturday, April 27, 13
  • 28. Further 20% are trolls... Saturday, April 27, 13
  • 29. So I ended up implementing wpcorpus - active NLP corpus based on Wikipedia, using its categories and their combination as classes and anti-classes Saturday, April 27, 13
  • 33. Language... For trigrams in NLTK, use Spanish as “anti-class” to tell English/German from the rest Saturday, April 27, 13
  • 34. Disco workers on Riak nodes... PITA and a lot of tinkering, but necessary for data locality Extending Disco is relatively easy, but changing it is hard... Flooding, asynchronous, separate key/value listing in low-level Riak goes very well with Erlang port based Python/Erlang message exchange in Disco. Not Low-level vnode-data-consume needed probabilistic correction due to N=3 Extended Disco to use RabbitMQ between the worlds (h/t Dan North for the idea) Saturday, April 27, 13
  • 35. Mixing Python and Erlang in one project... Forgetting punctuation in Erlang code all the time when quickly switching from Python Terribly missing pattern matching in Python Considering to embed Python in Erlang, but it might become a double PITA then Saturday, April 27, 13
  • 37. Well, actually, strong sentiment analysis... Saturday, April 27, 13
  • 38. Very unreliable given the human nature... Saturday, April 27, 13
  • 39. In addition to the NLTK’s movie reviews corpus, use these for “neg” classification Saturday, April 27, 13
  • 41. Q: Why the heck are you doing this? Saturday, April 27, 13
  • 42. A Because I can Because I want Because I want to learn Because I want to go deep on low-level Because I value speed over abstraction Because it’s very interesting to combine computer science with math Saturday, April 27, 13
  • 43. Q: Why not just use Hadoop? Saturday, April 27, 13
  • 44. A Because I didn’t want to run this on the JVM Because I have 2 use cases, and only one of them is suitable for batch map/reduce Saturday, April 27, 13
  • 45. Q: Why didn’t you want to run this on the JVM? Saturday, April 27, 13
  • 46. A: well, technically seen, Big Data area is growing on the JVM Hadoop Pig Storm, Kafka, Esper Mahout OpenNLP Saturday, April 27, 13
  • 47. A: but I didn’t want this Big Data on my drive ~/.m2 Saturday, April 27, 13
  • 48. A: and I am evaluating some alternatives to the ecosystem Saturday, April 27, 13
  • 49. Q: Why are you queueing at all? Others do gazillions of msg/sec without queues Saturday, April 27, 13
  • 50. A I could, if instead of filters and batch analytics of chaotic text, it would be just about building trivial sums from fixed-sized ticks with growable numbers like this, you want to protect any sort of reliable data store from getting flooded by writes, RDBMS or NoSQL store Because I need to do some pipes and filters Because I’m mixing and crossing borders of data sources and technologies Because (almost) all frameworks that you might consider also do some queueing or buffering Saturday, April 27, 13
  • 51. Q: Why did you use Erlang and Python? Saturday, April 27, 13
  • 52. A Because reliability and distribution are built into the Erlang VM and I don’t need separate coordinators or to reinvent the wheel Because both, Python and Erlang, are “functional” enough for what I need day-by- day Because Python has been for many years the platform of choice for scientists, thus there are available clever and mature math libraries Because Disco is on Python and Erlang, Riak and RabbitMQ are on Erlang Saturday, April 27, 13
  • 53. Q: isn’t Python slow like hell? Saturday, April 27, 13
  • 54. A it’s not operating at the speed of light yes, it is slower at some points I’ve also been testing PyPy to improve performance for the case I should need it, ‘cause right now it works just fast enough without explicit bottle-necks in the given architecture, even on one single MBA Saturday, April 27, 13
  • 55. Q: MBA is boring. Can you make it real web scale? Saturday, April 27, 13
  • 56. A well, to be precise, I’m operating on web data I can scale queues with RabbitMQ I can scale storage with Riak I can scale the map/reduce supported analytics with Disco/Riak I can scale data sources/feeds, machines, hardware, networks, infrastructure, logins etc. You name it Saturday, April 27, 13
  • 57. Q: what’s in the future? Saturday, April 27, 13
  • 58. A I don’t have my crystal ball with me I’m toying with the idea to write Pig Latin engine in Python called “Sau” (German for pig), to offer data scientists a comfortable interface and to allow them to run existing Pig scripts on this stack I could add more data sources, improve throughput where necessary and work on some low level Disco modifications to change the way it utilizes Erlang in my case I will integrate my Disco extensions with Disco upstream one day Saturday, April 27, 13
  • 59. Q: what do we learn about Big Data here? Saturday, April 27, 13
  • 60. A Big Data is about the “what”, followed by the “how” and enabled by the “what with” Saturday, April 27, 13
  • 61. A It’s about gathering data, filtering out most of it as garbage, analyzing it, gaining useful information out of it immediately, finding new ways to gather and use information and deriving steps for business improvements, strategy planning, doing soft intelligence aka enterprise level stalking or, even more important, helping make the world a better place - it’s up to you Saturday, April 27, 13
  • 62. A It’s not about building SkyNet - even if this will be built one day, it will be pretty boring. It’s about building recommender and decision support systems, thus letting machines do stupid, repeated jobs fast and human beings make high quality decisions Saturday, April 27, 13
  • 63. A It’s not about plain numbers. It’s about numbers that are good enough to carry the solution. Not less, but also not more than that Consider the dilemma: if you want to be as precise and as fast as possible in your “Big Data”, you don’t crunch it blindly on tons of machines. Instead, you go at the low-level and optimize the stack, filter upfront If it’s not leading to value in near realtime, it’s useless, though probably Big Saturday, April 27, 13
  • 64. A It’s a huge field for geeks with aspiration to learn new things, dig into math and computer science, play with different platforms and tools and pick the right tool chain Saturday, April 27, 13
  • 65. Oh, and did the demo run? Saturday, April 27, 13
  • 67. Most images originate from istockphoto.com except few ones taken from Wikipedia or Flickr (CC) and product pages or generated through public online generators Saturday, April 27, 13