Hadoop

Distributed Filesystem
✓ Files as big as you want
✓ Horizontal scalability
✓ Failover

Distributed Computing
✓ MapReduce
✓ Batch oriented
  • Input files are processed and converted into output files
✓ Horizontal scalability
Tuple MapReduce implementation for Hadoop

Easier Hadoop Java API
✓ While keeping similar efficiency

Common design patterns covered
✓ Compound records
✓ Secondary sorting
✓ Joins

Other improvements
✓ Instance-based configuration
✓ First-class multiple input/output
Tuple MapReduce
Our evolution of Google's MapReduce

Pere Ferrera, Iván de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo: "Tuple MapReduce: Beyond classic MapReduce." In ICDM 2012: Proceedings of the IEEE International Conference on Data Mining, Brussels, Belgium, December 10–13, 2012.
Tuple MapReduce
Example: sales difference between the top-selling offices for each location
Tuple MapReduce

Main constraint
✓ The group-by clause must be a subset of the sort-by clause

Indeed, Tuple MapReduce can be implemented on top of any MapReduce implementation
• Pangool -> Tuple MapReduce over Hadoop
Efficiency

Similar efficiency to Hadoop
http://pangool.net/benchmark.html
Voldemort & Hadoop

Benefits
✓ Scalability & failover
✓ Updating the database does not affect serving queries
✓ All data is replaced at each execution
  • Provides agility/flexibility
    § Big development changes are not a pain
  • Easier recovery from human errors
    § Fix the code and run again
  • Easy to set up new clusters with different topologies
Basic statistics

Easy to implement with Pangool/Hadoop
✓ One job, grouping by the dimension over which you want to calculate the statistics:
  Count, Average, Min, Max, Stdev

Computing several time periods in the same job
✓ Use the mapper to replicate each datum for each period
✓ Add a period identifier field to the tuple and include it in the group-by clause
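The replicate-per-period trick can be sketched in plain Python (a simulation of the map/reduce flow, not the actual Pangool code; the record layout, period names and `run` driver are hypothetical):

```python
import math
from collections import defaultdict

# Hypothetical period definitions, in days.
PERIODS = {"daily": 1, "weekly": 7, "monthly": 30}

def map_replicate(record):
    """Emit one tuple per period: key = (period, shop, period bucket)."""
    shop, day, amount = record
    for name, length in PERIODS.items():
        yield (name, shop, day // length), amount

def reduce_stats(values):
    """Count / average / min / max / stdev over one group."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return {"count": n, "avg": mean, "min": min(values),
            "max": max(values), "stdev": math.sqrt(var)}

def run(records):
    """In-memory stand-in for the shuffle: group, then reduce."""
    groups = defaultdict(list)
    for rec in records:
        for key, amount in map_replicate(rec):
            groups[key].append(amount)
    return {key: reduce_stats(vals) for key, vals in groups.items()}
```

Because the period identifier is part of the key, every period's statistics come out of the same single job.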
Distinct count

Possible to compute in a single job
✓ Using secondary sorting by the field you want to distinct-count on
✓ Detecting changes on that field

Example
✓ Group by shop, sort by shop and card

  Shop    Card
  Shop 1  1234   <- change: +1
  Shop 1  1234
  Shop 1  1234
  Shop 1  5678   <- change: +1
  Shop 1  5678

  -> 2 distinct buyers for Shop 1
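The change-detection idea can be illustrated with a small Python sketch (the global `sorted` call stands in for Hadoop's group-by-shop with secondary sort on card; function and field names are illustrative):

```python
def distinct_buyers(purchases):
    """purchases: (shop, card) pairs. Sort by (shop, card), then add +1
    whenever the card changes within a shop's group, exactly as the
    secondary-sort reducer would."""
    counts = {}
    prev_shop, prev_card = None, None
    for shop, card in sorted(purchases):
        if shop != prev_shop or card != prev_card:
            counts[shop] = counts.get(shop, 0) + 1
        prev_shop, prev_card = shop, card
    return counts
```

The reducer never needs to hold the set of distinct cards in memory; it only compares each tuple with the previous one.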
Histograms

Typically a two-pass algorithm
✓ First pass to detect the minimum and the maximum and determine the bin ranges
✓ Second pass to count the number of occurrences in each bin

Adaptive histogram
✓ One pass
✓ Fixed number of bins
✓ Bins adapt
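One common way to build such a one-pass, fixed-budget histogram is a streaming merge of the closest bins; this is a minimal sketch of that general technique, not necessarily the exact algorithm used in the project:

```python
def adaptive_histogram(stream, max_bins=5):
    """One-pass histogram with a fixed bin budget: each value starts as
    its own (centroid, count) bin; when over budget, the two closest
    adjacent bins are merged with a weighted-average centroid."""
    bins = []  # list of [centroid, count], kept sorted by centroid
    for x in stream:
        bins.append([float(x), 1])
        bins.sort(key=lambda b: b[0])
        if len(bins) > max_bins:
            # Merge the adjacent pair with the smallest centroid gap.
            i = min(range(len(bins) - 1),
                    key=lambda j: bins[j + 1][0] - bins[j][0])
            c1, n1 = bins[i]
            c2, n2 = bins[i + 1]
            merged = [(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]
            bins[i:i + 2] = [merged]
    return bins
```

The bin boundaries are never fixed in advance, so no first pass over the data is needed.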
Optimal histogram

Calculate the histogram that best represents the original one using a limited number of flexible-width bins
✓ Reduces storage needs
✓ More representative than fixed-width bins -> better visualization
Optimal histogram

Exact algorithm
Petri Kontkanen, Petri Myllymäki: "MDL Histogram Density Estimation"
http://eprints.pascal-network.org/archive/00002983/

Too slow for production use
Optimal histogram

Alternative: approximate algorithm (random-restart hill climbing)
✓ A solution is just a way of grouping the existing bins
✓ From a solution, you can move to some close solutions
✓ Some are better: they reduce the representation error

Algorithm
1. Iterate N times, keeping the best solution
   1. Generate a random solution
   2. Iterate until there is no improvement
      1. Move to the next better possible solution
Optimal histogram

Alternative: approximate algorithm (random-restart hill climbing)
✓ One order of magnitude faster
✓ 99% accuracy
Everything in one job

Basic statistics -> 1 job
Distinct count statistics -> 1 job
One-pass histograms -> 1 job
Several periods & shops -> 1 job

We can put it all together so that computing all statistics for all shops fits into exactly one job
Shop recommendations

Based on co-occurrences
✓ If somebody bought in shop A and in shop B, then a co-occurrence between A and B exists
✓ Only one co-occurrence is counted even if a buyer bought several times in A and B
✓ The top co-occurrences for each shop are its recommendations

Improvements
✓ The most popular shops are filtered out because almost everybody buys in them
✓ Recommendations by category, by location, and by both
✓ Different calculation periods
Shop recommendations

Implemented in Pangool
✓ Using its counting and joining capabilities
✓ Several jobs

Challenges
✓ If somebody bought in many shops, the list of co-occurrences can explode:
  • Co-occurrences = N * (N - 1), where N = # of distinct shops where the person bought
✓ Alleviated by limiting the total number of distinct shops to consider
✓ Only the top M shops where the client bought the most are used

Future
✓ Time-aware co-occurrences: the client bought in A and B within a close period of time
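The co-occurrence counting with the top-M cap can be sketched in a few lines of Python (an in-memory illustration of the logic that the Pangool jobs distribute; the record layout and "top by total spend" tie-break are assumptions):

```python
from collections import Counter
from itertools import combinations

def co_occurrences(purchases, top_m=3):
    """purchases: (buyer, shop, amount) records. For each buyer, keep
    only the top-M shops by total spend, then emit one co-occurrence
    per unordered shop pair; repeat purchases in the same shop pair
    still count once per buyer."""
    spend = Counter()
    for buyer, shop, amount in purchases:
        spend[(buyer, shop)] += amount
    by_buyer = {}
    for (buyer, shop), total in spend.items():
        by_buyer.setdefault(buyer, []).append((total, shop))
    pairs = Counter()
    for shops in by_buyer.values():
        # Cap at top_m shops per buyer to avoid the N*(N-1) explosion.
        top = sorted(s for _, s in sorted(shops, reverse=True)[:top_m])
        for a, b in combinations(top, 2):
            pairs[(a, b)] += 1
    return pairs
```

With the cap, each buyer contributes at most M*(M-1)/2 pairs regardless of how many shops they visited.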
Some numbers

Estimated resources needed with 1 year of data
• 270 GB of stats to serve
• 24 large instances
• ~11 hours of execution
• $3,500/month
✓ Optimizations are still possible
✓ Cost without the use of reserved instances
✓ Probably cheaper with an in-house Hadoop cluster
Conclusion

It was possible to develop a Big Data solution for a bank
✓ With low use of resources
✓ Quickly
✓ Thanks to the use of technologies like Hadoop, Amazon Web Services and NoSQL databases

The solution is
✓ Scalable
✓ Flexible/agile: improvements are easy to implement
✓ Prepared to withstand human errors
✓ At a reasonable cost

Main advantage: always recomputing everything
Future: Splout

Key/value datastores have limitations
✓ They only accept querying by the key
✓ Aggregations are not possible
✓ In other words, we are forced to pre-compute everything
✓ Not always possible -> data explodes
✓ For this particular case, time ranges are fixed

Splout: like Voldemort but SQL!
✓ The idea: replace Voldemort with Splout SQL
✓ Much richer queries: real-time aggregations, flexible time ranges
✓ It would allow building a kind of Google Analytics for the statistics discussed in this presentation
✓ Open sourced! https://github.com/datasalt/splout-db