Yokai Versus the Elephant: Hadoop and the Fight Against Shape-Shifting Spam
Vishwanath Ramarao & Mark Risher
Yahoo! Mail
© SHMorgan - www.obakemono.com
Agenda
- Shape-shifting spam
- Antispam Origins
- Hadoop Algorithms
- Applications to Security
- Resources for Implementers
Yahoo! Mail antispam - Bay area Hadoop user group
http:/<!--gmail.com-->/f915fde2cf53df18<!--uc22wddprm-->.li<!--cf997b28e-->gh<!--PdNKLr-->tt<!---kxnd2itipuvd.yahoo.com-->o<!--ju1j8V-->p<!--vrgxetdcnubslgacvc-->b<!--OsLaWIv-->o<!--_qsgsnnjuf1m@vkvriskrgavzxjovbqg.net-->dy<!--in7oouvxfrg7ax-->.com]*!}v}]along especially consecutive important dmvfu<!--gmail.com-->
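The sample above hides a link behind HTML comments that a mail client never renders. As a minimal sketch, assuming the fragments on the slide form one contiguous string, the following removes the comments to show what the recipient would actually see. The helper name `strip_html_comments` is hypothetical and not part of the original deck.

```python
import re

# Matches one HTML comment, non-greedily, even if it spans line breaks.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_html_comments(markup: str) -> str:
    """Return the markup with every HTML comment removed."""
    return HTML_COMMENT.sub("", markup)


# URL portion of the slide's sample (assumed contiguous in the original message).
obfuscated = (
    "http:/<!--gmail.com-->/f915fde2cf53df18<!--uc22wddprm-->.li"
    "<!--cf997b28e-->gh<!--PdNKLr-->tt<!---kxnd2itipuvd.yahoo.com-->o"
    "<!--ju1j8V-->p<!--vrgxetdcnubslgacvc-->b<!--OsLaWIv-->o"
    "<!--_qsgsnnjuf1m@vkvriskrgavzxjovbqg.net-->dy<!--in7oouvxfrg7ax-->.com"
)

visible = strip_html_comments(obfuscated)
print(visible)
```

The comments also carry decoy strings such as gmail.com and yahoo.com, so a filter that tokenizes the raw source sees different features than one that tokenizes the rendered text.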
1,300,925,111,156,286,160,896 (http://bit.ly/cpOyLi)
Typical attack/response profile
[Chart annotation: Rule change (1/23@01:15)]
More Yokai - targeted attacks
<style>mechanic CC0066 getimage 3A00 lectroniques repertoires spiel proscribing ammonoid 10110 radiobuttontelefoons Jermaine iesaporitoroshan 3026 janatatrennungpalillos toughest ncapitolecalzado 20200 Omnimedia collective saudadedizaines 205px hardener elongating InvasionofyourprivacyPersonnalftsbedingungenMontanerprozacSerpellfcardbvh capacitate 12502 courtship kiranjiutroligt transducer tyee Delhaize clueless toffee nnioZoapochino sterns 622 Verordnung carbons waterresistant assessing footerTextperrine url0 potatoes 999933 Rightmove positively thmb closer secures Amarillo suffer 314992 32599 8849 GJ initialling cockleshell JTA Justiaguardo jibes Chubb inflammatory iteration granfaldasseoir considerations 692px treasured Allotransplantationtwoyearsappx Bowers doorgeven 1487 bigpicture repeatedly Popp MPEG4 webbsidaliefdeVoeding Elena Kernighan sternway laggardly Zwischendurch commons equis sewing f17 apadrinasareiniqueslugoquotedblbayr 3500 CI addressee optativelygazzetta 616px mingus 23238 PhotoLink desuetude tofu keychains molding redevelopment stucco deltage astrology2 thumbscrews probablemente 700g rnsfuseactionrepristaires restraint manchettestrendlineseffectuedespatchMinskyestadual doses danbrown Muenster jind7n7 smashes gourmandesashantisentants rows kyk coated Incontournablescoincidenjspa stalker CDS contienen expletives s8 eof replenishing puyalluppratosondravalidarorientale sonnets steamer Niwangoacrocentric dozens elr tempting poing jails ingredi Sep3 misdirection vested tecniciconciertos dear martini 3D35 MBR DNAME 2650 violation Egyptiin NCR sposoriss hl 12450 connectors circumcision transform CFA employeur 153 comunicazioni miner 19905 citronella PlissierHellmich Randall CaradonnaspringaregistradahauptEntran 3060 Rochin capacitor sotol 3413 smirk interditeServicePoint capabilities bouncefeeLinkov 3Dg auntie OSP CaeciliaPlatzierung wrangler pisosbanlieueDaniellaenderleisraelprofessionnellessusto 39800 Espanaplena radian antic!...........................200KB………. </style>
<center><a href="http://ivywhere.info/52210088504303.hrmj.1/285/1000/1006/1000/1237976a102c0176c7b3fb3164f83590.html">Please Click Here if You Can't See Images<br><img src="http://ivywhere.info/images/usacpm1.jpg" border="0"></a><br><a href="http://ivywhere.info/52210088504303.hrmj.1/40106/1000/1000/1000/a.html"><img src="http://ivywhere.info/images/usacpm2.jpg" border="0"></a><br><a href="http://ivywhere.info/gp.html"><img src="http://ivywhere.info/images/please2.jpg" border="0"></a><br>
[400kb…]
<center><a href="http://corfair.info/52210088504303.hrmj.1/129286/1000/1006/1000/d1c7b1fa06980b08bf9b3a9c14844623.html">Please Click Here if You Can't See Images<br><img src="http://corfair.info/images/ivblg1.jpg" border="0"></a><br><a href="http://corfair.info/52210088504303.hrmj.1/40126/1000/1000/1000/a.html"><img src="http://corfair.info/images/ivblg2.jpg" border="0"></a><br><a href="http://corfair.info/gp.html"><img src="http://corfair.info/images/please2.jpg" border="0"></a><br>
Why is the antispam problem hard?
- Scale of the problem: 25B connections, 5B deliveries, 450M mailboxes
- User feedback is often late, noisy and not always actionable
- Large, diverse stream of legitimate traffic that looks like spam
- Slow adoption of authentication technologies like DKIM and SPF
- Spammers are clever; they target and specialize their attacks
- Rapidly changing spam campaigns with a large bot-controlled IP base; large variations even within a single campaign
- A significant percentage of spam comes from large ESPs like Hotmail, Google and Yahoo
Generation 1: Manual management layer
- Heuristics, blocks, blacklists
  - Provide attack mitigation and operational flexibility; highly explainable
  - Not durable; expensive to keep pace with fast-morphing spam
- Ad hoc queries
  - Proprietary implementations, not very scalable, steep learning curve
  - Reactive and usually late
Generation 2: Machine management layer
- Online reputation models
  - Simple, mostly scoring/counter/ratio-based models
  - Highly scalable due to the absence of any state/memory
  - Generalize too broadly, lack expressive power
- Batch-trained reputation models
  - Typically digested, memory-based hashing or machine learning models
  - Difficult to implement and, because they need labeled examples, scale only moderately well
  - Slow to update and learn, lack explainability, limited operational control
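To make the "scoring/counter/ratio" idea concrete, here is a minimal sketch of a ratio-based sender score. It is an illustration only, not the models described in the talk: the class name `RatioReputation` and the smoothing prior (one spam vote out of two observations) are assumptions.

```python
from collections import defaultdict


class RatioReputation:
    """Per-sender counters; the score is a smoothed fraction of spam votes."""

    def __init__(self, prior_spam: float = 1.0, prior_total: float = 2.0):
        # Hypothetical smoothing prior: an unseen sender scores 0.5.
        self.prior_spam = prior_spam
        self.prior_total = prior_total
        self.spam = defaultdict(int)
        self.total = defaultdict(int)

    def observe(self, sender: str, is_spam: bool) -> None:
        """Record one piece of feedback for a sender."""
        self.total[sender] += 1
        if is_spam:
            self.spam[sender] += 1

    def score(self, sender: str) -> float:
        """Return a value in (0, 1); higher means more spam-like."""
        spam = self.spam.get(sender, 0) + self.prior_spam
        total = self.total.get(sender, 0) + self.prior_total
        return spam / total


reputation = RatioReputation()
for vote in (True, True, True, False):
    reputation.observe("sender-a", vote)
print(round(reputation.score("sender-a"), 3), reputation.score("never-seen"))
```

A score like this is cheap to update but, as the slide notes, it generalizes broadly: every message from a sender receives the same value regardless of content.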
Distributed computing paradigm
Map:Reduce + distributed storage:
- Expressiveness of offline analysis
- Ease of management
The map:reduce paradigm
[Diagram: three Mapper boxes emitting <k1,v1>, <k2,v2> and <k1,v3>; the pairs grouped by key as <k1,{v1,v3}> and <k2,v2> feed a Reducer box, which outputs <k1,W1>]
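The flow in the diagram can be sketched in a few lines of single-process Python. This is a toy model under stated assumptions, not Hadoop itself: `run_mapreduce`, `word_mapper` and `sum_reducer` are hypothetical names, and an in-memory dictionary stands in for the distributed shuffle. The two input lines are the contents of file01 and file02 from the word-count slide that follows.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Apply mapper to each record, group values by key, then reduce each group."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):      # map: emit <k, v> pairs
            grouped[key].append(value)         # shuffle: build <k, {v1, v3}>
    # reduce: one call per key, in sorted key order
    return {key: reducer(key, grouped[key]) for key in sorted(grouped)}


def word_mapper(line):
    """Emit <word, 1> for every whitespace-separated token."""
    for word in line.split():
        yield word, 1


def sum_reducer(key, values):
    """Collapse all values for one key into their sum."""
    return sum(values)


lines = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
counts = run_mapreduce(lines, word_mapper, sum_reducer)
for word, count in counts.items():
    print(word, count)
```

On a grid the mappers and reducers run on different machines and the grouping step moves data across the network, but the contract between the phases is the same.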
A simple map:reduce example
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World Bye World
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop
# Split up input files (MAP), iterate over chunks, reassemble results (REDUCE)
$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
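A simple map:reduce example (bit.ly/bdyi0l)
// Nested inside the WordCount driver class; uses the classic org.apache.hadoop.mapred API.
// Needs java.io.IOException, java.util.StringTokenizer, org.apache.hadoop.io.*, org.apache.hadoop.mapred.*
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    // Reused output objects: the constant count 1 and the current word.
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        // Emit <word, 1> for every token on the line.
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}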
A simple map:reduce example (bit.ly/bdyi0l)
// Nested inside the WordCount driver class.
// Needs java.io.IOException, java.util.Iterator, org.apache.hadoop.io.*, org.apache.hadoop.mapred.*
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // Sum every count emitted for this word.
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
Applications & Outcomes
Let's review our design goals again
- Classifiers are notorious for lack of explainability
- Engineers and analysts need to know what the classifier is missing
- Engineers and analysts need to know about emerging threats
- Analysts need “canned” reports along interesting dimensions
- Machines need smart feature engineering
- Develop a scalable system to provide deep insight into spammer campaigns
- Double up as a platform for standard reporting
- Also double up as a platform for ad hoc analysis and data probing
- Signal amplification and smart feature extraction platform
Our antispam analytic platform
- Hadoop: implements map reduce; written in Java but supports many other languages, including Perl and C++, using the streaming interface
- Feature engineering with small, simple Perl programs for data extraction and transformation
- SQL-like “Pig” programming language for data analysis and management
- Mahout: data mining libraries that provide shrink-wrapped, scalable, sophisticated algorithms
- Other proprietary algorithms and frameworks for specialized tasks
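The streaming interface passes records to mapper and reducer programs as lines of text, conventionally key and value separated by a tab, with the reducer's input sorted by key. The sketch below shows that contract in Python rather than the Perl used by the authors. The function names and sample addresses are hypothetical, and `sorted()` stands in for Hadoop's shuffle-and-sort step so the example runs locally.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: for each sender address, emit 'domain<TAB>1'."""
    for line in lines:
        address = line.strip()
        if "@" in address:
            domain = address.rsplit("@", 1)[1].lower()
            yield f"{domain}\t1"


def reduce_lines(sorted_lines):
    """Reducer: input is sorted by key, so equal keys arrive adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for domain, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{domain}\t{sum(int(value) for _, value in group)}"


# Made-up sample input; in a real streaming job each function would read
# sys.stdin and write sys.stdout in its own process.
sample = ["sales@example.info", "health@example.com", "sales@EXAMPLE.info"]
result = list(reduce_lines(sorted(map_lines(sample))))
print("\n".join(result))
```

Because each stage only reads and writes text lines, the same two functions can be tested on a laptop with a pipe and a sort before being submitted to a grid.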
Various aspects of a grid-driven solution
- Standard reporting
- Ad hoc querying
- Campaign discovery from spam feedback using frequent item set mining
- “Gaming” detection in notspam feedback using connected components
Top spammy domains report for 01/15/2010
key:noreply.amateurmatch.com|value:1164
key:goodmere.info|value:896
key:marketing.meredith.com|value:1078
key:verizon.net|value:822
key:reply.mb00.net|value:980
key:insideapple.apple.com|value:1094
key:facebookappmail.com|value:882
key:mydailymoment.com|value:849
key:thetwilightsaga.com|value:4671
key:adknowledgemailer6.com|value:859
key:freedollarspro.info|value:1164
key:smartreachmedia.com|value:1074
key:yahoo.es|value:877
key:ecomasher.com|value:1197
key:leasetrade-statusupdates.com|value:951
key:noreply.amateurmatch.com|value:1164
Ad hoc queries for antispam research
- Identify domains that had few spam votes in the previous time window but have a high number of spam votes today
- All IPs in the last hour that sent a particular URL pattern, or that sent any unknown URL more than 500 times
- Which domains/IPs suddenly increased their sending volume after a positive reputation change?
- Which FROM addresses exhibit low message size entropy?
- All messages that contained nothing but a URL, where the domain of the URL had low page rank
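The "low message size entropy" query relies on the Shannon entropy of a sender's message sizes: a sender whose messages are all nearly the same size scores close to zero. The sketch below is one possible formulation, assuming sizes are bucketed before counting. The function name and the 100-byte bucket width are hypothetical choices, not values from the talk.

```python
import math
from collections import Counter


def size_entropy(sizes, bucket=100):
    """Shannon entropy, in bits, of message sizes grouped into fixed-width buckets."""
    if not sizes:
        return 0.0
    counts = Counter(size // bucket for size in sizes)
    total = len(sizes)
    # H = sum over buckets of p * log2(1 / p)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


uniform_sender = [1200] * 8              # every message lands in one bucket
varied_sender = [100, 200, 300, 400]     # four equally likely buckets
print(size_entropy(uniform_sender), size_entropy(varied_sender))
```

In a grid job the sizes would first be grouped by FROM address, and the entropy computed once per group in the reduce step.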
Ad hoc queries - anatomy of a Pig query
--- This includes some basic string functions, including splitting a string on the '@' character
register /homes/jpujara/pig_scripts/string.jar;
define splitEmail string.Tokenize('2','@');
--- Load up some data - incoming messages at a date and time, and our trusted user database
MESSAGES = load '/projects/antispam/mta_feature_logs/$date*/*/*-$time*' using com.yahoo.ymail.pigfunctions.AsStorage('__record_key__,firstrcpt,mailfrom') as (mid:chararray,to:chararray,from:chararray);
USERS = load '/projects/antispam/TrustedUser.bz2' using com.yahoo.ymail.pigfunctions.AsStorage('user,t') as (user:chararray,trusted:int);
--- Split the e-mail addresses into user+domain and generate the appropriate user-id for yahoo users and partners
EXPLODED_MESSAGES = FOREACH MESSAGES GENERATE to,FLATTEN(splitEmail(to)) as (user,udomain),FLATTEN(splitEmail(from)) as (sender,sdomain);
YAHOO_MESSAGES = FOREACH EXPLODED_MESSAGES GENERATE (udomain MATCHES '.*yahoo.*' ? user : to ) as yuser,sdomain;
--- Combine the message and sender domains with the trusted user data and select only trusted messages
YAHOO_MESSAGES_TRUST = JOIN YAHOO_MESSAGES by yuser, USERS by user;
TRUSTED_MESSAGES = FILTER YAHOO_MESSAGES_TRUST by trusted > 0;
--- Group by domain, and generate a count, order by descending count
DOMAIN_GROUPS = GROUP TRUSTED_MESSAGES by sdomain;
DOMAIN_GROUPS_COUNT = FOREACH DOMAIN_GROUPS GENERATE group,COUNT(TRUSTED_MESSAGES) as count;
DOMAIN_GROUPS_ORDER = ORDER DOMAIN_GROUPS_COUNT by count DESC;
--- Output the results
STORE DOMAIN_GROUPS_ORDER into '$targetdir/topDomains';
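For readers new to Pig, the same dataflow can be traced on a handful of in-memory records. The Python below is a rough single-machine equivalent, not a translation of the authors' UDFs: it assumes the trusted-user table has one row per user, so the JOIN becomes a dictionary lookup, and all sample addresses are made up.

```python
from collections import Counter


def split_email(address):
    """Split 'user@domain' into (user, domain)."""
    user, _, domain = address.partition("@")
    return user, domain


def top_trusted_sender_domains(messages, trusted):
    """messages: (mid, to, from) tuples; trusted: user id -> trust flag."""
    counts = Counter()
    for _mid, to, sender in messages:
        user, udomain = split_email(to)
        _, sdomain = split_email(sender)
        # Yahoo recipients are keyed by user name, partners by full address.
        yuser = user if "yahoo" in udomain else to
        if trusted.get(yuser, 0) > 0:        # JOIN on user id, FILTER trusted > 0
            counts[sdomain] += 1             # GROUP by sdomain, COUNT
    return counts.most_common()              # ORDER by count DESC


messages = [
    ("m1", "alice@yahoo.com", "news@shop.example"),
    ("m2", "alice@yahoo.com", "news@shop.example"),
    ("m3", "bob@yahoo.com", "promo@deals.example"),
    ("m4", "carol@partner.example", "news@shop.example"),
    ("m5", "alice@yahoo.com", "promo@deals.example"),
]
trusted = {"alice": 1, "bob": 0, "carol@partner.example": 1}
ranking = top_trusted_sender_domains(messages, trusted)
print(ranking)
```

The Pig version expresses each step as a named relation, which lets the runtime turn the JOIN, GROUP and ORDER into map:reduce jobs without the author writing any mapper or reducer code.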
Campaign discovery in spam feedback
- Frequent Itemset Mining
  - Classical method
  - Finds interesting relationships between variables in a large database
  - Primarily applied for market basket analysis
  - Many good implementations
- APRIORI
  - Easy to implement
  - Parallelizes moderately well but bottlenecks for extremely large data sets
  - Not very efficient in the number of scans
- ECLAT
  - Parallelizes easily
  - Amenable to a good grid implementation
  - Fewer scans of the dataset
- Parallel FP GROWTH
  - Designed explicitly for systems like Hadoop
  - Implemented in Mahout 0.2
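As a reference point for the list above, here is a compact single-machine sketch of Apriori. It shows why the algorithm needs one scan of the data per itemset size. It is not the grid implementation used in the talk; the toy transactions are made up and only borrow feature names in the style of the spam reports.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(items): support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        previous = set(current)
        # Join frequent (k-1)-itemsets, keeping only candidates whose every
        # (k-1)-subset is itself frequent (the Apriori pruning rule).
        candidates = set()
        for a, b in combinations(previous, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in previous for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # One full scan of the data per level to count candidate support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Toy spam reports, one feature set per report (illustrative data only).
reports = [
    {"FROMUSER:sales", "IPTYPE:none", "ip_D:66.206.25.227"},
    {"FROMUSER:sales", "IPTYPE:none", "ip_D:66.206.25.227"},
    {"FROMUSER:sales", "IPTYPE:none", "ip_D:66.206.14.78"},
    {"FROMUSER:health", "IPTYPE:none", "ip_D:216.218.201.149"},
]
itemsets = apriori(reports, min_support=2)
for items, support in sorted(itemsets.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(support, sorted(items))
```

The repeated scans in the inner loop are the cost the slide refers to; ECLAT and FP-Growth restructure the data so that fewer passes are needed.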
Frequent item set – example dataset
Frequent itemset mining
Slide courtesy: dortmund.de
Frequent itemset mining on one day’s spam reports
9	2595 (IPTYPE:none,FROMUSER:sales,SUBJ:It's Important You Know,FROMDOM:dappercom.info,URL:dappercom.info,ip_D:66.206.14.77,)
9	2457 (IPTYPE:none,FROMUSER:sales,SUBJ:Save On Costly Repairs,FROMDOM:aftermoon.info,URL:aftermoon.info,ip_D:66.206.14.78,)
9	2447 (IPTYPE:none,FROMUSER:sales,SUBJ:Car-Dealers-Compete-On-New-Vehicles,FROMDOM:sherge.info,URL:sherge.info,ip_D:66.206.25.227,)
9	2432 (IPTYPE:none,FROMUSER:sales,SUBJ:January 18th: CreditReport Update,FROMDOM:zaninte.info,URL:zaninte.info,ip_D:66.206.25.227,)
9	2376 (IPTYPE:none,FROMUSER:health,SUBJ:Finally. Coverage for the whole family,FROMDOM:fiatchimera.com,URL:articulatedispirit.com,ip_D:216.218.201.149,)
9	2184 (IPTYPE:none,FROMUSER:health,SUBJ:Finally. Coverage for the whole family,FROMDOM:fiatchimera.com,URL:stratagemnepheligenous.com,ip_D:216.218.201.149,)
9	1990 (IPTYPE:none,FROMUSER:sales,SUBJ:Closeout 2008-2009-2010 New Cars,FROMDOM:sastlg.info,URL:sastlg.info,ip_D:66.206.25.227,)
9	1899 (IPTYPE:none,FROMUSER:sales,FROMDOM:brunhil.info,SUBJ:700-CreditScore-What-Is-Yours?,URL:brunhil.info,ip_D:66.206.25.227,)
9	1743 (IPTYPE:none,FROMUSER:sales,SUBJ:Now exercise can be fun,FROMDOM:accordpac.info,URL:accordpac.info,ip_D:66.206.14.78,)
9	1706 (IPTYPE:none,FROMUSER:sales,SUBJ:Closeout 2008-2009-2010 New Cars,FROMDOM:rionel.info,URL:rionel.info,ip_D:66.206.25.227,)
9	1693 (IPTYPE:none,FROMUSER:sales,SUBJ:January 18th: CreditReport Update,FROMDOM:astroom.info,URL:astroom.info,ip_D:66.206.25.227,)
9	1689 (IPTYPE:none,FROMUSER:sales,SUBJ:eBay: Work@Home w/Solid-Income-Strategies,FROMDOM:stamine.info,URL:stamine.info,ip_D:66.165.232.203,)
Callouts:
2432 (IPTYPE:none,FROMUSER:sales,SUBJ:January 18th: CreditReport Update,FROMDOM:zaninte.info,URL:zaninte.info,ip_D:66.206.25.227,)
2447 (IPTYPE:none,FROMUSER:sales,SUBJ:Car-Dealers-Compete-On-New-Vehicles,FROMDOM:sherge.info,URL:sherge.info,ip_D:66.206.25.227,)
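Several of the itemsets above share a sending IP while rotating the FROM domain, which is easier to see once the rows are regrouped. The sketch below parses rows in the format shown and lists the domains seen per IP. It assumes that the number just before the parenthesis is a count and that no field value contains a comma; both are readings of the slide, not documented facts. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Optional leading column, then a count, then "(field:value,field:value,...)".
ROW = re.compile(r"^\s*(?:\d+\s+)?(\d+)\s+\((.*)\)\s*$")


def parse_row(line):
    """Return (count, {field: value}) for one itemset row."""
    match = ROW.match(line)
    if match is None:
        raise ValueError(f"unrecognized row: {line!r}")
    fields = {}
    for token in match.group(2).split(","):
        name, sep, value = token.partition(":")   # split on the first colon only
        if sep:
            fields[name.strip()] = value.strip()
    return int(match.group(1)), fields


def domains_by_ip(lines):
    """Map each sending IP to the sorted FROM domains seen with it."""
    grouped = defaultdict(set)
    for line in lines:
        _count, fields = parse_row(line)
        if "ip_D" in fields and "FROMDOM" in fields:
            grouped[fields["ip_D"]].add(fields["FROMDOM"])
    return {ip: sorted(domains) for ip, domains in grouped.items()}


rows = [
    "9\t2595 (IPTYPE:none,FROMUSER:sales,SUBJ:It's Important You Know,FROMDOM:dappercom.info,URL:dappercom.info,ip_D:66.206.14.77,)",
    "9\t2447 (IPTYPE:none,FROMUSER:sales,SUBJ:Car-Dealers-Compete-On-New-Vehicles,FROMDOM:sherge.info,URL:sherge.info,ip_D:66.206.25.227,)",
    "2432 (IPTYPE:none,FROMUSER:sales,SUBJ:January 18th: CreditReport Update,FROMDOM:zaninte.info,URL:zaninte.info,ip_D:66.206.25.227,)",
]
by_ip = domains_by_ip(rows)
print(by_ip)
```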
Gaming detection in notspam feedback
- Delays classification of spamming IP addresses
- Throws off the classifiers if the feedback is not filtered well
- Model the problem as a bipartite graph
  - Well-known model for matching algorithms
  - Broadly applied in various fields like coding theory
  - A graph whose vertices form two disjoint sets U and V
  - Every edge connects a vertex in U to a vertex in V
Connected components - explained
- Y1 = Yahoo user 1, Y2 = Yahoo user 2
- IP1 = IP address of the host Y1 “voted” notspam from
[Diagram: user nodes linked through IP1 and IP2; a SQUARING step yields a direct user-to-user edge with weight = 2]
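One way to read the "squaring" step: squaring the adjacency matrix of the user-to-IP graph counts two-step paths, so two users who voted from the same two IPs end up joined by an edge of weight 2. The sketch below computes those weights directly from vote pairs. The function name is hypothetical and this interpretation of the diagram is an assumption.

```python
from collections import defaultdict
from itertools import combinations


def square_bipartite(edges):
    """edges: (user, ip) pairs. Returns {(user_a, user_b): number of shared IPs}."""
    users_by_ip = defaultdict(set)
    for user, ip in edges:
        users_by_ip[ip].add(user)
    weights = defaultdict(int)
    for users in users_by_ip.values():
        # Every pair of users behind the same IP gains one unit of weight.
        for user_a, user_b in combinations(sorted(users), 2):
            weights[(user_a, user_b)] += 1
    return dict(weights)


votes = [("y1", "IP1"), ("y2", "IP1"), ("y1", "IP2"), ("y2", "IP2")]
projected = square_bipartite(votes)
print(projected)
```

The projection removes the IP nodes, so later steps can work on a plain user-to-user graph in which heavier edges indicate more shared infrastructure.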
Connected components for “gaming” detection
[Diagram: three groups of nodes joined by vote edges]
- Set of Yahoo IDs voting notspam: y1, y2, y3
- Set of “voted from” IPs
- Set of “voted on” IPs
- IP nodes shown: IP1, IP2, IP3, IP4
- Annotation: set of IPs/YIDs used exclusively for voting notspam
- Annotation: set of (likely new) spamming IPs which are “worth” voting for
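Finding the groups in such a graph is a connected-components computation. The sketch below uses a single-machine union-find rather than the grid implementation from the talk, and the node names in the sample are made up. It returns each component as a sorted list, largest first.

```python
def connected_components(edges):
    """edges: (node_a, node_b) pairs. Returns a list of sorted components."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]   # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in edges:
        union(a, b)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    # Largest component first; ties broken by member names.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))


# Vote edges between Yahoo IDs and the IPs they voted from (toy data).
vote_edges = [("y1", "IP1"), ("y2", "IP1"), ("y2", "IP2"), ("y3", "IP9")]
components = connected_components(vote_edges)
print(components)
```

A large component made of accounts and IPs that appear nowhere else in normal traffic is the kind of structure the slide describes as used exclusively for voting notspam.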
Connected components - results
- Connected components for IPs notspam was voted from
Connected components - results
- Connected components for IPs notspam was voted on
Conclusions
- We have had success leveraging parallel, stateful algorithms on grid systems to keep pace with polymorphic spam that evades traditional analysis and algorithms
- Frequent Itemset Mining rapidly identifies cohesive campaigns in ISSPAM feedback
- Connected Components amplifies weak signals in gamed NOTSPAM feedback and helps separate signal from noise in the feedback
- Grid-system-based analysis platforms may be broadly applicable across the security domain
Apply Slide
- Download Hadoop distribution
  - http://hadoop.apache.org
- Try out Pig on a standalone, single Linux box
- Identify source data to aggregate
  - Start simple: IP patterns across web access logs
  - Begin with offline aggregation; yesterday’s attacks are still interesting
- Read Connected Components and Frequent Itemset Mining papers
- Stop looking for a single, invariant “tell” – far too costly
- Start thinking about co-occurrence of innocuous features
Resources for implementers
- Hadoop setup, documentation and resources
  - http://hadoop.apache.org/
- Pig documentation and resources
  - http://hadoop.apache.org/pig/
- Mahout documentation and resources
  - http://lucene.apache.org/mahout/
- Frequent itemset mining implementation repository
  - http://fimi.cs.helsinki.fi/src/
- Connected components description
  - [link not yet live]
- Ranger, Raghuraman, Penmetsa, Bradski, and Kozyrakis. Evaluating MapReduce for Multi-core and Multiprocessor Systems. In HPCA 2007

Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
 
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
 
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReducePublic Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
 
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache Oozie
 
January 2011 HUG: Howl Presentation
January 2011 HUG: Howl PresentationJanuary 2011 HUG: Howl Presentation
January 2011 HUG: Howl Presentation
 
January 2011 HUG: Pig Presentation
January 2011 HUG: Pig PresentationJanuary 2011 HUG: Pig Presentation
January 2011 HUG: Pig Presentation
 
January 2011 HUG: Kafka Presentation
January 2011 HUG: Kafka PresentationJanuary 2011 HUG: Kafka Presentation
January 2011 HUG: Kafka Presentation
 
Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
Nov 2010 HUG: Business Intelligence for Big Data
Nov 2010 HUG: Business Intelligence for Big DataNov 2010 HUG: Business Intelligence for Big Data
Nov 2010 HUG: Business Intelligence for Big Data
 
Nov 2010 HUG: Fuzzy Table - B.A.H
Nov 2010 HUG: Fuzzy Table - B.A.HNov 2010 HUG: Fuzzy Table - B.A.H
Nov 2010 HUG: Fuzzy Table - B.A.H
 
HUG Nov 2010: HDFS Raid - Facebook
HUG Nov 2010: HDFS Raid - FacebookHUG Nov 2010: HDFS Raid - Facebook
HUG Nov 2010: HDFS Raid - Facebook
 
Common crawlpresentation
Common crawlpresentationCommon crawlpresentation
Common crawlpresentation
 
Hdfs high availability
Hdfs high availabilityHdfs high availability
Hdfs high availability
 
Cascalog internal dsl_preso
Cascalog internal dsl_presoCascalog internal dsl_preso
Cascalog internal dsl_preso
 
Pig at Linkedin
Pig at LinkedinPig at Linkedin
Pig at Linkedin
 
Next Generation MapReduce
Next Generation MapReduceNext Generation MapReduce
Next Generation MapReduce
 

Similaire à Yahoo! Mail antispam - Bay area Hadoop user group

Beholding the giant pyramid of application development; why Ajax applications...
Beholding the giant pyramid of application development; why Ajax applications...Beholding the giant pyramid of application development; why Ajax applications...
Beholding the giant pyramid of application development; why Ajax applications...Javeline B.V.
 
Pragmatics of Declarative Ajax
Pragmatics of Declarative AjaxPragmatics of Declarative Ajax
Pragmatics of Declarative Ajaxdavejohnson
 
Building Complex GUI Apps The Right Way. With Ample SDK - SWDC2010
Building Complex GUI Apps The Right Way. With Ample SDK - SWDC2010Building Complex GUI Apps The Right Way. With Ample SDK - SWDC2010
Building Complex GUI Apps The Right Way. With Ample SDK - SWDC2010Sergey Ilinsky
 
Building Web Interface On Rails
Building Web Interface On RailsBuilding Web Interface On Rails
Building Web Interface On RailsWen-Tien Chang
 
Edge trends mizuno-template
Edge trends mizuno-templateEdge trends mizuno-template
Edge trends mizuno-templateshintaro mizuno
 
Expanding a tree node
Expanding a tree nodeExpanding a tree node
Expanding a tree nodeHemakumar.S
 
ImplementingChangeTrackingAndFlagging
ImplementingChangeTrackingAndFlaggingImplementingChangeTrackingAndFlagging
ImplementingChangeTrackingAndFlaggingSuite Solutions
 
Introduction to Java Profiling
Introduction to Java ProfilingIntroduction to Java Profiling
Introduction to Java ProfilingJerry Yoakum
 
Creating Responsive Experiences
Creating Responsive ExperiencesCreating Responsive Experiences
Creating Responsive ExperiencesTim Kadlec
 
Monitoring your electricity usage
Monitoring your electricity usageMonitoring your electricity usage
Monitoring your electricity usageDale Lane
 

Similaire à Yahoo! Mail antispam - Bay area Hadoop user group (20)

Ajax ons2
Ajax ons2Ajax ons2
Ajax ons2
 
Beholding the giant pyramid of application development; why Ajax applications...
Beholding the giant pyramid of application development; why Ajax applications...Beholding the giant pyramid of application development; why Ajax applications...
Beholding the giant pyramid of application development; why Ajax applications...
 
Pragmatics of Declarative Ajax
Pragmatics of Declarative AjaxPragmatics of Declarative Ajax
Pragmatics of Declarative Ajax
 
Building Complex GUI Apps The Right Way. With Ample SDK - SWDC2010
Building Complex GUI Apps The Right Way. With Ample SDK - SWDC2010Building Complex GUI Apps The Right Way. With Ample SDK - SWDC2010
Building Complex GUI Apps The Right Way. With Ample SDK - SWDC2010
 
&lt;img src="xss.com">
&lt;img src="xss.com">&lt;img src="xss.com">
&lt;img src="xss.com">
 
Fav
FavFav
Fav
 
Building Web Interface On Rails
Building Web Interface On RailsBuilding Web Interface On Rails
Building Web Interface On Rails
 
Odp
OdpOdp
Odp
 
Edge trends mizuno-template
Edge trends mizuno-templateEdge trends mizuno-template
Edge trends mizuno-template
 
Expanding a tree node
Expanding a tree nodeExpanding a tree node
Expanding a tree node
 
Front End on Rails
Front End on RailsFront End on Rails
Front End on Rails
 
ImplementingChangeTrackingAndFlagging
ImplementingChangeTrackingAndFlaggingImplementingChangeTrackingAndFlagging
ImplementingChangeTrackingAndFlagging
 
Introduction to Java Profiling
Introduction to Java ProfilingIntroduction to Java Profiling
Introduction to Java Profiling
 
Ocul emergency-presentation
Ocul emergency-presentationOcul emergency-presentation
Ocul emergency-presentation
 
Ocul emergency-presentation
Ocul emergency-presentationOcul emergency-presentation
Ocul emergency-presentation
 
02 create first-map
02 create first-map02 create first-map
02 create first-map
 
Tugas Pw [6] (2)
Tugas Pw [6] (2)Tugas Pw [6] (2)
Tugas Pw [6] (2)
 
Tugas Pw [6]
Tugas Pw [6]Tugas Pw [6]
Tugas Pw [6]
 
Creating Responsive Experiences
Creating Responsive ExperiencesCreating Responsive Experiences
Creating Responsive Experiences
 
Monitoring your electricity usage
Monitoring your electricity usageMonitoring your electricity usage
Monitoring your electricity usage
 

Plus de Hadoop User Group

Plus de Hadoop User Group (20)

Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 
Hdfs high availability
Hdfs high availabilityHdfs high availability
Hdfs high availability
 
HUG August 2010: Best practices
HUG August 2010: Best practicesHUG August 2010: Best practices
HUG August 2010: Best practices
 
2 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-212 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-21
 
1 content optimization-hug-2010-07-21
1 content optimization-hug-2010-07-211 content optimization-hug-2010-07-21
1 content optimization-hug-2010-07-21
 
3 avro hug-2010-07-21
3 avro hug-2010-07-213 avro hug-2010-07-21
3 avro hug-2010-07-21
 
1 hadoop security_in_details_hadoop_summit2010
1 hadoop security_in_details_hadoop_summit20101 hadoop security_in_details_hadoop_summit2010
1 hadoop security_in_details_hadoop_summit2010
 
Hadoop Security Preview
Hadoop Security PreviewHadoop Security Preview
Hadoop Security Preview
 
Flightcaster Presentation Hadoop
Flightcaster  Presentation  HadoopFlightcaster  Presentation  Hadoop
Flightcaster Presentation Hadoop
 
Map Reduce Online
Map Reduce OnlineMap Reduce Online
Map Reduce Online
 
Hadoop Security Preview
Hadoop Security PreviewHadoop Security Preview
Hadoop Security Preview
 
Hadoop Security Preview
Hadoop Security PreviewHadoop Security Preview
Hadoop Security Preview
 
Hadoop Release Plan Feb17
Hadoop Release Plan Feb17Hadoop Release Plan Feb17
Hadoop Release Plan Feb17
 
Twitter Protobufs And Hadoop Hug 021709
Twitter Protobufs And Hadoop   Hug 021709Twitter Protobufs And Hadoop   Hug 021709
Twitter Protobufs And Hadoop Hug 021709
 
Ordered Record Collection
Ordered Record CollectionOrdered Record Collection
Ordered Record Collection
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
Searching At Scale
Searching At ScaleSearching At Scale
Searching At Scale
 
Hadoop Record Reader In Python
Hadoop Record Reader In PythonHadoop Record Reader In Python
Hadoop Record Reader In Python
 
File Context
File ContextFile Context
File Context
 
Karmasphere Studio for Hadoop
Karmasphere Studio for HadoopKarmasphere Studio for Hadoop
Karmasphere Studio for Hadoop
 

Dernier

Valere | Digital Solutions & AI Transformation Portfolio | 2024
Valere | Digital Solutions & AI Transformation Portfolio | 2024Valere | Digital Solutions & AI Transformation Portfolio | 2024
Valere | Digital Solutions & AI Transformation Portfolio | 2024Alexander Turgeon
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfJamie (Taka) Wang
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 
100+ ChatGPT Prompts for SEO Optimization
100+ ChatGPT Prompts for SEO Optimization100+ ChatGPT Prompts for SEO Optimization
100+ ChatGPT Prompts for SEO Optimizationarrow10202532yuvraj
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
Governance in SharePoint Premium:What's in the box?
Governance in SharePoint Premium:What's in the box?Governance in SharePoint Premium:What's in the box?
Governance in SharePoint Premium:What's in the box?Juan Carlos Gonzalez
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?IES VE
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"
UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"
UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"DianaGray10
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 

Dernier (20)

Valere | Digital Solutions & AI Transformation Portfolio | 2024
Valere | Digital Solutions & AI Transformation Portfolio | 2024Valere | Digital Solutions & AI Transformation Portfolio | 2024
Valere | Digital Solutions & AI Transformation Portfolio | 2024
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
100+ ChatGPT Prompts for SEO Optimization
100+ ChatGPT Prompts for SEO Optimization100+ ChatGPT Prompts for SEO Optimization
100+ ChatGPT Prompts for SEO Optimization
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
Governance in SharePoint Premium:What's in the box?
Governance in SharePoint Premium:What's in the box?Governance in SharePoint Premium:What's in the box?
Governance in SharePoint Premium:What's in the box?
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"
UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"
UiPath Clipboard AI: "A TIME Magazine Best Invention of 2023 Unveiled"
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 

Editor's notes

  1. Who knows what Yokai are? <audience poll> Shape-shifters from Japanese mythology. Many other examples, e.g. Proteus, who would tell you the future, but first you had to capture him. Just like the gods, spammers change shape to avoid capture: they vary over IP, vary over content, and vary over template features (e.g. document structure, subjects, size entropy).
  2. In abuse, these are "shape shifters." They vary many aspects of the message to avoid detection: IP, subject, content. For example, these four messages are obviously built from a single template that keeps changing its shape to avoid capture. How to catch them? In the past: heuristics and regex; dictionary (URLdb); invariant metadata. Challenges: slow to write, difficult to write, easy to evade.
  3. Here is a third type of shape-shifting spam. For all of these, attackers have a distinct advantage, because they can change most aspects and still get through.
  4. 1.3 sextillion (1.3e21) variations, almost all of which a human being can recognize in milliseconds. Spammers learned they can change any variable to hide from bulk filters. http://cockeyed.com/lessons/viagra/viagra.html
  5. These bastards… the most despised doctors on the Internet. Almost all pages resolve through numerous HTML/JavaScript redirectors to this page.
  6. Daniel Geer said there are targets of CHANCE and targets of CHOICE. Small businesses are in the former camp, catching the miscellaneous attacks out there. Increasingly, larger companies are TARGETS OF CHOICE, meaning the bad guys a) specifically tailor their attacks based on known vulnerabilities, and b) use feedback loops to improve their effectiveness.
  7. This is what a targeted attack profile looks like: after you patch, they almost stop trying.
  8. One example of such a clearly targeted attack: 400KB of style gibberish embedded in a style sheet, which completely throws off our parsers. Maybe ASCII art spam, or something else that couldn't be caught by simple pattern matching. This is what our filters see: a stream of ASCII that deliberately uses multiple layers, e.g. here, a TinyURL redirector, further obfuscated with non-printing HTML, spaces, and CSS chaff. To fight this in the olden days: a hand-written regex to identify a pattern, OR a heuristic on some invariant part of the message. But what is invariant? Dozens of TinyURL clones, dozens of HTML and CSS tricks, 2^32 IP addresses, infinite FROM addresses, infinite SUBJECT lines…
  9. Sent by botnets. This is Reactor Mailer; it controlled Srizbi from the McColo datacenters until Nov 2008. This is the template for Stormbot; notice it has control variables for all the settings. While most of these came in through SMTP port 25, they are now increasingly hitting HTTP and port 80.
  10. Historically, POINT SOLUTIONS address each problem individually: regex, heuristic. Wouldn't this be better if this guy could use more than one finger at a time? Something is *almost over the limit* along one dimension and *almost over the limit* along another. Example: a message from an IP that sends 80% good mail, with a TinyURL that we don't recognize, addressed to 40 people. *PRIOR PROBABILITY* *COMPOSITE SCORE*
  11. Scale forces simplistic architectures; feedback-based architectures always lag behind the spam campaign. Feedback also has many segments: personal preference spam ("I didn't like this week's Amazon gold box deals but I liked last week's messages from Amazon"); annoyance emails from legitimate bulk mailers ("This coupon is coming far too often these days"); listserver spam ("This finance group sometimes sends me stock spam"); newsletter messages that are no longer interesting to the user ("Gosh I am so not into that band any more"). Traffic to a small enterprise domain can be restricted with firewall rules etc., but large free mail provider traffic is full of corner cases. Compounding the problem is the fact that adoption of DKIM and SPF has been slow, especially internationally and in emerging economies. But make no mistake, some of these spammers are very clever. It's more fruitful to target Yahoo or Google than to build a generic spam engine.
  12. Let's look at what is in place right now in terms of an architecture. Most large-scale systems have some components from gen1 technologies. They provide attack mitigation and operational flexibility, and are highly explainable. Not durable; expensive to keep pace with fast-morphing spam. Proprietary implementations, not very scalable, steep learning curve. Reactive and usually late.
  13. Two ways this has been solved in the past: Machine management… Both systems, because of scale, were limited to looking at small pieces of data: an IP, a URL, etc.
  14. In this talk we'll introduce Hadoop, an open-source grid computing environment with applications to fighting abuse. We'll talk about how Hadoop can be applied to polymorphic spam and abuse. About three years ago, Doug Cutting released version 0.15 of Hadoop, an open-source platform inspired by Google's proprietary MapReduce algorithm. "Supercomputer": petabytes of storage and terabytes of RAM allow "needle in the haystack" searches even at Y!Mail scale: hundreds of features, hundreds of billions of records, trends buried in global data.
  15. Hadoop is the most prevalent. "Ngrid" and "Sun's GridEngine" are other alternatives.
  16. Input data format is application-specific, specified by the user. Output is a set of <key,value> pairs. The user expresses the algorithm using two functions. Map is applied on the input data and produces a list of intermediate <key,value> pairs. Reduce is applied to all intermediate pairs with the same key; it typically performs some kind of merging operation and produces zero or more output pairs. Finally, output pairs are sorted by their key value.
  17. Toy example. Provides some insight into what a MapReduce program looks like; it looks very much like a Unix command line.
  18. Java code to highlight the mapper. The mapper simply adds each word to a set and emits a count of 1 each time the word is seen.
  19. The reducer simply sums the values for each word; draw attention to line 32. While this is a toy example, it should give a fair idea of how to structure a problem so that it is solvable by MapReduce. The key takeaway is that writing even native MapReduce programs can be quite simple, and executing them even simpler.
  20. Take the audience progressively through more and more sophisticated applications, starting from basic reporting and ending in outbound spammer analysis based on SWARM features
  21. Knowing the accuracy of your SVM/Bayes classifier puts you in no better position to ask and answer what type of spam is leaking, and we know spammers are constantly probing. 80% of the spam/content classification problem is in smart feature engineering.
  22. Let's look at what our (Yahoo's) platform looks like. Perl programs for feature engineering make it very easy and flexible. Hadoop with its Pig support is already well suited as a platform for ad hoc data analysis. For deep data mining, open-source Mahout.
  23. We will look at Hadoop in four different settings:
  24. In antispam, these basic reports combined with human review form a barrier against highly directed attacks that exploit system weaknesses. Note how easy it is to slice and dice your data and write fairly sophisticated reports using Pig/streaming. It is critical in antispam systems that the reporting platform be flexible and provide a lot of expressive power; Hadoop and Pig achieve that.
  25. Previously, such queries ran against small samples; now we can run them against the full data set and get highly accurate results in a very short amount of time. Alternate architectures such as OLAP are too expensive at this scale.
  26. Pig is a data flow specification language. It's like SQL, but unlike SQL it is better suited for data flow control. In antispam, these basic reports combined with human review form a barrier against highly directed attacks that exploit system weaknesses. Note how easy it is to slice and dice your data and write fairly sophisticated reports using Pig. It is critical in antispam systems that the reporting platform be flexible and provide a lot of expressive power; Hadoop and Pig achieve that.
  27. -- People who bought eggs also bought bread
  28. We ran frequent itemset mining on one day's spam votes; the results are striking. Notice in the above example how the same campaign [the same FROMUSER] is being managed with different templates for subjects and URLs and is also originating from different IPs. Other records in the background are also results of the frequent itemset mining algorithm and map very closely to spam campaigns.
  29. Develop a bipartite graph of users and the IPs they vote from. Squaring the graph gives rise to connected components. The weight of a connected component is measured by the number of vertices that share the component.
  30. Gaming IPs are IPs that the spammers try to whitelist in advance. We detected them by extending the connected-component view to the IPs that "notspam" is voted on.
  31. The results are quite spectacular! There is a massive amount of "gaming" going on with "notspam feedback," and there are only a handful of IPs that are doing this. There are a large number of smaller components not shown in the results above.
  32. The results are less strong; notice the two smaller, weaker clusters in rows 3 and 4. The big takeaway is that such unsupervised matching algorithms are going to be extremely powerful amplifiers of signals and can be used to rapidly separate noise from signal. Imagine this being applied to traffic with more items such as IPs, message subjects, size of messages, fuzzy signatures, etc.
  33. We encourage and invite others to try Hadoop in antispam and anti-abuse architectures and share their experiences with us.
  34. Three users known bad: same IP leads to new cookie, same cookie leads to new birthday, etc. *AMPLIFICATION OF SMALL SIGNAL*
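
Notes 16 to 19 describe the map and reduce functions for the word-count toy example. The slides use Java on Hadoop; the sketch below is only an in-process Python imitation of the same flow (map, group by key, reduce, sort by key), and the function names `map_words`, `reduce_counts`, and `run_job` are made up for illustration rather than taken from the talk or from any Hadoop API.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit an intermediate (word, 1) pair for every word seen."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: merge all values that share a key by summing them."""
    return word, sum(counts)


def run_job(lines):
    """Imitate the framework: group intermediate pairs by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_words(line):
            groups[key].append(value)
    # Output pairs are sorted by key, as note 16 describes.
    return [reduce_counts(key, groups[key]) for key in sorted(groups)]


result = run_job(["spam spam ham", "ham eggs"])
```

On a real cluster the grouping step is the shuffle performed by the framework, so only the two small functions would be user code.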
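
Notes 27 and 28 apply frequent itemset mining to one day's spam votes. The talk does not say which algorithm or parameters were used, so the following is a deliberately reduced sketch that only counts co-occurring pairs of features and keeps those above a support threshold. The feature strings and the `frequent_pairs` name are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(records, min_support):
    """Return feature pairs that co-occur in at least min_support records."""
    counts = Counter()
    for record in records:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(set(record)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical spam-vote records: one sender and URL, rotating IPs.
votes = [
    {"from:a", "url:x", "ip:1"},
    {"from:a", "url:x", "ip:2"},
    {"from:b", "url:y", "ip:3"},
]
pairs = frequent_pairs(votes, min_support=2)
```

Here the sender and URL that stay fixed across records surface as a frequent pair even though the IP changes, which is the "people who bought eggs also bought bread" effect from note 27.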
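
Note 29 builds a bipartite graph of users and the IPs they vote from, then weighs each connected component by its number of vertices. The notes mention squaring the graph; the sketch below swaps in a plain union-find over the bipartite edges, which yields the same components at small scale. It is an assumption-laden illustration, not the implementation from the talk, and counting both users and IPs toward the weight is a choice made here.

```python
from collections import Counter


def component_weights(votes):
    """Return component sizes (users plus IPs), largest first."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    for user, ip in votes:
        # Tag vertices so that a user id can never collide with an IP string.
        union(("user", user), ("ip", ip))

    sizes = Counter(find(node) for node in list(parent))
    return sorted(sizes.values(), reverse=True)


# Hypothetical (user, IP) vote pairs.
weights = component_weights([
    ("u1", "10.0.0.1"),
    ("u2", "10.0.0.1"),
    ("u2", "10.0.0.2"),
    ("u3", "10.0.0.9"),
])
```

A heavy component means many accounts and IPs are tied together through shared voting behavior, which is the kind of cluster notes 30 and 31 flag as gaming.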