Iván de Prado Alonso, CEO of Datasalt
www.datasalt.es
@ivanprado
@datasalt

www.bigdataspain.org
November 16th, 2012
ETSI Telecomunicación
Madrid, Spain
#BDSpain
  




Value extraction from BBVA credit card transactions

BIG “MAC” DATA

104,000 employees
47 million customers
  
The idea

Extract value from anonymized credit card transaction data & share it

Always:
   ✓ Impersonal
   ✓ Aggregated
   ✓ Dissociated
   ✓ Irreversible
  
Helping

Consumers
   Informed decisions
      ✓ Shop recommendations (by location and by category)
      ✓ Best time to buy
      ✓ Activity & fidelity of shop's customers

Sellers
   Learning clients' patterns
      ✓ Activity & fidelity of shop's customers
      ✓ Sex & Age & Location
      ✓ Buying patterns
  
Shop stats

For different periods
   ✓ All, year, quarter, month, week, day

… and much more
  
The applications

Internal use
Sellers
Customers
  
The challenges

Company silos
The costs
The amount of data
Security
Development flexibility/agility
Human failures
  
The platform

Data storage       S3
Data processing    Elastic MapReduce
Data serving       EC2
  
The architecture

Hadoop

Distributed Filesystem
 ✓ Files as big as you want
 ✓ Horizontal scalability
 ✓ Failover

Distributed Computing
 ✓ MapReduce
 ✓ Batch oriented
      • Input files processed and converted into output files
 ✓ Horizontal scalability
  
 	
  
Easier Hadoop Java API
    ✓ But keeping similar efficiency

Common design patterns covered
    ✓ Compound records
    ✓ Secondary sorting
    ✓ Joins

Other improvements
    ✓ Instance based configuration
    ✓ First class multiple input/output

Tuple MapReduce implementation for Hadoop
  
Tuple MapReduce

Our evolution to Google's MapReduce

Pere Ferrera, Iván de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo:
Tuple MapReduce: Beyond classic MapReduce.
In ICDM 2012: Proceedings of the IEEE International Conference on Data Mining
Brussels, Belgium | December 10-13, 2012
  
Tuple MapReduce

Sales difference between the top-selling offices for each location
  
Tuple MapReduce

Main constraint
   ✓ Group by clause must be a subset of sort by clause

Indeed, Tuple MapReduce can be implemented on top of any MapReduce implementation
   • Pangool -> Tuple MapReduce over Hadoop
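
The group by / sort by idea can be sketched in a few lines of plain Python. This is only an illustration of the concept, not Pangool's Java API: the function names, the field names and the sample sales figures are all hypothetical. The sketch also assumes the group-by fields come first in the sort order, which is what keeps every group contiguous after sorting. It computes the sales difference between the two top-selling offices of each location.

```python
from itertools import groupby
from operator import itemgetter

def tuple_mapreduce(tuples, group_by, sort_by, reducer):
    """Sort tuples by `sort_by`, then call `reducer` once per `group_by` key.

    Each entry of `sort_by` is a (field, descending) pair.
    """
    sort_fields = [field for field, _ in sort_by]
    # Assumption of this sketch: the group-by fields lead the sort-by
    # fields, so that every group is contiguous after sorting.
    if sort_fields[:len(group_by)] != list(group_by):
        raise ValueError("group-by fields must lead the sort-by fields")
    ordered = list(tuples)
    # Stable sorts applied from the least significant field to the most.
    for field, descending in reversed(sort_by):
        ordered.sort(key=itemgetter(field), reverse=descending)
    key_of = lambda t: tuple(t[f] for f in group_by)
    return [reducer(key, list(group)) for key, group in groupby(ordered, key=key_of)]

def top_two_difference(key, group):
    # Offices arrive sorted by sales (descending) within each location.
    best = group[0]["sales"]
    second = group[1]["sales"] if len(group) > 1 else 0
    return key[0], best - second

# Hypothetical sample data.
sales = [
    {"location": "Madrid", "office": "A", "sales": 120},
    {"location": "Madrid", "office": "B", "sales": 200},
    {"location": "Madrid", "office": "C", "sales": 150},
    {"location": "Bilbao", "office": "D", "sales": 80},
    {"location": "Bilbao", "office": "E", "sales": 95},
]

result = tuple_mapreduce(
    sales,
    group_by=["location"],
    sort_by=[("location", False), ("sales", True)],
    reducer=top_two_difference,
)
print(result)  # [('Bilbao', 15), ('Madrid', 50)]
```

Because the reducer receives each group already sorted, it only needs to look at the first two tuples instead of buffering and sorting the whole group itself.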
  
Efficiency

Similar efficiency to Hadoop

    http://pangool.net/benchmark.html
  
Voldemort

Distributed key/value store
  
Voldemort & Hadoop

Benefits
   ✓ Scalability & failover
   ✓ Updating the database does not affect serving queries
   ✓ All data is replaced at each execution
        • Providing agility/flexibility
             ▪ Big development changes are not a pain
        • Easier recovery from human errors
             ▪ Fix code and run again
        • Easy to set up new clusters with different topologies
  	
  
Basic statistics

Easy to implement with Pangool/Hadoop
   ✓ One job, grouping by the dimension over which you want to calculate the statistics.

Count    Average    Min    Max    Stdev
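
A minimal Python sketch of how these five statistics can come out of a single grouped pass. The record layout (a `(key, value)` pair per payment) and the function name are assumptions made for illustration; the slides do not show the actual Pangool job. The per-key accumulators are simple sums, so partial results from different mappers could be added together before the final step.

```python
import math
from collections import defaultdict

def basic_stats(records):
    """Return count, average, min, max and stdev per key in one pass."""
    acc = defaultdict(lambda: {"n": 0, "sum": 0.0, "sumsq": 0.0,
                               "min": math.inf, "max": -math.inf})
    for key, value in records:
        a = acc[key]
        a["n"] += 1
        a["sum"] += value
        a["sumsq"] += value * value
        a["min"] = min(a["min"], value)
        a["max"] = max(a["max"], value)
    stats = {}
    for key, a in acc.items():
        mean = a["sum"] / a["n"]
        # Population variance; clamp tiny negatives caused by rounding.
        variance = max(a["sumsq"] / a["n"] - mean * mean, 0.0)
        stats[key] = {"count": a["n"], "average": mean, "min": a["min"],
                      "max": a["max"], "stdev": math.sqrt(variance)}
    return stats

# Hypothetical sample data: (shop, payment amount).
payments = [("shop1", 10.0), ("shop1", 20.0), ("shop1", 30.0), ("shop2", 5.0)]
stats = basic_stats(payments)
print(stats["shop1"])
```

The sum-of-squares formula is convenient because it is mergeable, but it can lose precision when values are large and close together; a production job might prefer a more numerically stable update.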
  
Computing several time periods in the same job
   ✓ Use the mapper for replicating each datum for each period
   ✓ Add a period identifier field in the tuple and include it in the group by clause
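
The replication step can be sketched as follows in Python. The period identifier format, the function names and the sample transactions are assumptions for illustration only. Each transaction is emitted once per period it belongs to, and the period identifier becomes part of the grouping key, so one aggregation covers all periods.

```python
from collections import Counter
from datetime import date

def period_keys(day):
    """Yield one (period_type, period_id) pair per period containing `day`."""
    iso_year, iso_week, _ = day.isocalendar()
    yield ("all", "all")
    yield ("year", f"{day.year}")
    yield ("quarter", f"{day.year}-Q{(day.month - 1) // 3 + 1}")
    yield ("month", f"{day.year}-{day.month:02d}")
    yield ("week", f"{iso_year}-W{iso_week:02d}")
    yield ("day", day.isoformat())

def map_transaction(shop, day, amount):
    # Emit the same datum once per period; the period id joins the group key.
    for period_type, period_id in period_keys(day):
        yield (shop, period_type, period_id), amount

# Hypothetical sample data.
transactions = [
    ("shop1", date(2012, 11, 16), 25.0),
    ("shop1", date(2012, 11, 17), 10.0),
    ("shop1", date(2012, 12, 3), 40.0),
]

# Stand-in for the reducer: sum the amounts per (shop, period) key.
totals = Counter()
for shop, day, amount in transactions:
    for key, value in map_transaction(shop, day, amount):
        totals[key] += value

print(totals[("shop1", "month", "2012-11")])  # 35.0
```

The cost of this design is that the mapper output grows by a factor equal to the number of period types (six here), in exchange for avoiding one job per period.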
  	
  
Distinct count

Possible to compute in a single job
    ✓ Using secondary sorting by the field you want to distinct count on
    ✓ Detecting changes on that field

Example
    ✓ Group by shop, sort by shop and card
  

    Shop      Card
    Shop 1    1234
    Shop 1    1234
    Shop 1    1234    Change: +1
    Shop 1    5678
    Shop 1    5678    Change: +1

    2 distinct buyers for shop 1
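
The change-detection trick in the example above can be sketched in Python as follows. The function name is hypothetical, `sorted()` stands in for the secondary sort performed by the framework, and the "Shop 2" row is added here only to show that the counter restarts per shop.

```python
def distinct_buyers_per_shop(sorted_rows):
    """Count distinct cards per shop from rows sorted by (shop, card)."""
    counts = {}
    previous = None
    for shop, card in sorted_rows:
        # A change in (shop, card) marks a buyer not seen before in this shop.
        if (shop, card) != previous:
            counts[shop] = counts.get(shop, 0) + 1
            previous = (shop, card)
    return counts

rows = [
    ("Shop 1", "1234"), ("Shop 1", "5678"), ("Shop 1", "1234"),
    ("Shop 1", "1234"), ("Shop 1", "5678"), ("Shop 2", "1234"),
]
# sorted() stands in for the secondary sort by shop and card.
counts = distinct_buyers_per_shop(sorted(rows))
print(counts)  # {'Shop 1': 2, 'Shop 2': 1}
```

Only the previous row is kept in memory, so the reducer never needs to hold the set of cards seen for a shop.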
  
Histograms

Typically two-pass algorithm
  ✓ First pass to detect the minimum and the maximum and determine the bin ranges
  ✓ Second pass to count the number of occurrences in each bin
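
For reference, the two-pass approach looks roughly like this in Python. Equal-width bins and the function name are assumptions; the slide does not say how the bin ranges were chosen.

```python
def two_pass_histogram(values, num_bins):
    """Equal-width histogram computed with two passes over the data."""
    values = list(values)
    # First pass: find the range and derive the bin width.
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # avoid a zero width if all values are equal
    counts = [0] * num_bins
    # Second pass: count occurrences per bin; the maximum goes in the last bin.
    for v in values:
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts

edges, counts = two_pass_histogram([1, 2, 2, 3, 5, 8, 9, 10], num_bins=3)
print(edges, counts)  # [1.0, 4.0, 7.0, 10.0] [4, 1, 3]
```

In a MapReduce setting each pass would typically be its own job, which is the cost the one-pass variants try to avoid.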
  
Adaptive histogram
   ✓ One pass
   ✓ Fixed number of bins
   ✓ Bins adapt
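
The slides do not say which adaptive algorithm was used. One common way to get these three properties is a streaming histogram that keeps at most a fixed number of (centroid, count) bins and merges the two closest bins whenever the limit is exceeded. The Python sketch below follows that idea; the class name and the sample values are hypothetical, and it should be read as one possible approach rather than the authors' implementation.

```python
import bisect

class AdaptiveHistogram:
    """Streaming histogram with at most `max_bins` (centroid, count) bins."""

    def __init__(self, max_bins):
        self.max_bins = max_bins
        self.bins = []  # sorted list of [centroid, count]

    def add(self, value):
        centroids = [c for c, _ in self.bins]
        i = bisect.bisect_left(centroids, value)
        if i < len(self.bins) and self.bins[i][0] == value:
            self.bins[i][1] += 1
        else:
            self.bins.insert(i, [float(value), 1])
        if len(self.bins) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self):
        # Find the adjacent pair with the smallest gap and merge it
        # into a single bin at the weighted mean of the two centroids.
        gaps = [self.bins[j + 1][0] - self.bins[j][0]
                for j in range(len(self.bins) - 1)]
        j = gaps.index(min(gaps))
        (c1, n1), (c2, n2) = self.bins[j], self.bins[j + 1]
        merged = [(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]
        self.bins[j:j + 2] = [merged]

hist = AdaptiveHistogram(max_bins=3)
for amount in [1, 2, 10, 11, 50, 52, 51]:
    hist.add(amount)
print(hist.bins)  # [[1.5, 2], [10.5, 2], [51.0, 3]]
```

The bins end up where the data is dense, without knowing the minimum and maximum in advance.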
  	
  
Optimal histogram

Calculate the histogram that best represents the original one using a limited number of flexible width bins
      ✓ Reduce storage needs
      ✓ More representative than fixed width ones -> better visualization
  
Optimal histogram

   Exact Algorithm
   Petri Kontkanen, Petri Myllymäki
   MDL Histogram Density Estimation
   http://eprints.pascal-network.org/archive/00002983/

Too slow for production use
  
Optimal histogram

Alternative: Approximated algorithm

Random-restart hill climbing
    ✓ A solution is just a way of grouping existing bins
    ✓ From a solution, you can move to some close solutions
    ✓ Some are better: reduce the representation error

Algorithm
    1. Iterate N times, keeping best solution
        1. Generate a random solution
        2. Iterate until no improvement
             1. Move to next better possible movement
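
The algorithm above can be sketched in Python under a few assumptions that the slides leave open: a solution is represented as a tuple of cut positions between the existing fine bins, a "close solution" shifts one cut by one position, and the representation error is the squared difference between each fine bin and the mean of its group. All names and the sample counts are hypothetical.

```python
import random

def representation_error(counts, cuts):
    """Squared error of replacing each fine bin by the mean of its group."""
    error = 0.0
    bounds = [0, *cuts, len(counts)]
    for start, end in zip(bounds, bounds[1:]):
        group = counts[start:end]
        mean = sum(group) / len(group)
        error += sum((c - mean) ** 2 for c in group)
    return error

def neighbours(cuts, n):
    """Solutions reachable by shifting one cut one position left or right."""
    for i, cut in enumerate(cuts):
        for moved in (cut - 1, cut + 1):
            lower = cuts[i - 1] if i > 0 else 0
            upper = cuts[i + 1] if i + 1 < len(cuts) else n
            if lower < moved < upper:  # keep every group non-empty
                yield cuts[:i] + (moved,) + cuts[i + 1:]

def hill_climb(counts, cuts):
    error = representation_error(counts, cuts)
    improved = True
    while improved:
        improved = False
        for candidate in neighbours(cuts, len(counts)):
            candidate_error = representation_error(counts, candidate)
            if candidate_error < error:
                cuts, error, improved = candidate, candidate_error, True
                break  # take the first better move, then look again
    return cuts, error

def random_restart(counts, num_bins, restarts, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        start = tuple(sorted(rng.sample(range(1, len(counts)), num_bins - 1)))
        cuts, error = hill_climb(counts, start)
        if best is None or error < best[1]:
            best = (cuts, error)
    return best

# Hypothetical fine-grained histogram with three obvious plateaus.
fine_counts = [5, 5, 5, 5, 1, 1, 1, 1, 9, 9, 9, 9]
best_cuts, best_error = random_restart(fine_counts, num_bins=3, restarts=20)
print(best_cuts, best_error)
```

Each climb stops at a local optimum, so the restarts are what give the search a chance of landing near the global one; the result is an approximation, not a guarantee.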
  
Optimal histogram

Alternative: Approximated algorithm

Random-restart hill climbing
    ✓ One order of magnitude faster
    ✓ 99% accuracy
  	
  
Everything in one job

 Basic statistics -> 1 job
 Distinct count statistics -> 1 job
 One pass histograms -> 1 job
 Several periods & shops -> 1 job

We can put it all together so that computing all statistics for all shops fits into exactly one job
  	
  	
  
Shop recommendations

Based on co-occurrences
   ✓ If somebody bought in shop A and in shop B, then a co-occurrence between A and B exists
   ✓ Only one co-occurrence is counted even if a buyer bought several times in A and B
   ✓ Top co-occurrences for each shop are the recommendations

Improvements
   ✓ Most popular shops are filtered out because almost everybody buys in them.
   ✓ Recommendations by category, by location and by both
   ✓ Different calculation periods
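
A small in-memory Python sketch of the co-occurrence idea. The real implementation ran as several Pangool jobs; here the function name, the popularity threshold (`max_popularity`), the tie-breaking rule and the sample purchases are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations

def recommend(purchases, top_n=2, max_popularity=0.8):
    """Return the top co-occurring shops for each shop.

    `purchases` is an iterable of (buyer, shop) pairs; repeats are allowed.
    """
    shops_by_buyer = defaultdict(set)  # sets count each buyer/shop pair once
    for buyer, shop in purchases:
        shops_by_buyer[buyer].add(shop)

    # Drop shops visited by more than `max_popularity` of all buyers.
    buyers_per_shop = Counter(s for shops in shops_by_buyer.values() for s in shops)
    limit = max_popularity * len(shops_by_buyer)
    popular = {s for s, n in buyers_per_shop.items() if n > limit}

    cooccurrences = defaultdict(Counter)
    for shops in shops_by_buyer.values():
        for a, b in permutations(sorted(shops - popular), 2):
            cooccurrences[a][b] += 1

    # Rank by count (descending), breaking ties alphabetically.
    return {shop: [other for other, _ in
                   sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]]
            for shop, counts in cooccurrences.items()}

# Hypothetical sample data: (buyer, shop).
purchases = [
    ("ana", "bakery"), ("ana", "bakery"), ("ana", "bookshop"), ("ana", "supermarket"),
    ("luis", "bakery"), ("luis", "bookshop"), ("luis", "supermarket"),
    ("eva", "bakery"), ("eva", "florist"), ("eva", "supermarket"),
    ("juan", "florist"), ("juan", "supermarket"),
]
recommendations = recommend(purchases)
print(recommendations)
```

In this sample the supermarket is visited by every buyer, so it is filtered out and never appears as a recommendation.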
  
Shop recommendations

Implemented in Pangool
    ✓ Using its counting and joining capabilities
    ✓ Several jobs

Challenges
    ✓ If somebody bought in many shops, the list of co-occurrences can explode:
            • Co-occurrences = N * (N - 1), where N = # of distinct shops where the person bought
    ✓ Alleviated by limiting the total number of distinct shops to consider
            ✓ Only uses the top M shops where the client bought the most
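
A short Python sketch of the top-M limit and its effect on the number of pairs. The function names, the tie-breaking rule and the sample visits are hypothetical; the slides do not give the value of M that was used.

```python
from collections import Counter

def pairs_emitted(num_shops):
    """Ordered co-occurrence pairs produced by one buyer: N * (N - 1)."""
    return num_shops * (num_shops - 1)

def top_m_shops(shop_visits, m):
    """Keep only the M shops where the buyer purchased most often."""
    ranked = sorted(Counter(shop_visits).items(), key=lambda kv: (-kv[1], kv[0]))
    return [shop for shop, _ in ranked[:m]]

# Hypothetical buyer with purchases in five distinct shops.
visits = ["s1"] * 5 + ["s2"] * 3 + ["s3"] * 3 + ["s4"] + ["s5"]
kept = top_m_shops(visits, m=3)

print(kept)                                          # ['s1', 's2', 's3']
print(pairs_emitted(5), "->", pairs_emitted(len(kept)))  # 20 -> 6
print(pairs_emitted(1000), "->", pairs_emitted(50))      # 999000 -> 2450
```

The growth is quadratic in N, so a single very active card can dominate the job's output unless N is capped.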
  	
  
Future
    ✓ Time aware co-occurrences. The client bought in A and B and did so within a short period of time.
  
Some numbers

Estimated resources needed with 1 year of data

270 GB of stats to serve
24 large instances ~ 11 hours of execution
$3500/month
       ✓ Optimizations still possible
       ✓ Cost without the use of reserved instances
       ✓ Probably cheaper with an in-house Hadoop cluster
  
Conclusion

It was possible to develop a Big Data solution for a bank
  ✓ With low use of resources
  ✓ Quickly
  ✓ Thanks to the use of technologies like Hadoop, Amazon Web Services and NoSQL databases

The solution is
    ✓ Scalable
    ✓ Flexible/agile. Improvements easy to implement
    ✓ Prepared to withstand human failures
    ✓ At a reasonable cost

Main advantage: always recomputing everything
  
Future: Splout

Key/value datastores have limitations
  ✓ Only accept querying by the key
  ✓ Aggregations not possible
  ✓ In other words, we are forced to pre-compute everything
       ✓ Not always possible -> data explodes
       ✓ For this particular case, time ranges are fixed

Splout: like Voldemort but SQL!
  ✓ The idea: to replace Voldemort with Splout SQL
  ✓ Much richer queries: real-time aggregations, flexible time ranges
  ✓ It would allow creating some kind of Google Analytics for the statistics discussed in this presentation
  ✓ Open Sourced!!!
       https://github.com/datasalt/splout-db
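
The difference between the two serving models can be illustrated with Python's built-in `sqlite3` module standing in for a generic SQL engine. This is not Splout SQL's own API, which the slides do not show; the table layout, the function name and the sample rows are assumptions.

```python
import sqlite3

# Key/value style: every (shop, period) answer must be computed in advance.
precomputed = {("shop1", "2012-11"): 35.0, ("shop1", "2012-12"): 40.0}

# SQL style: store the daily totals once and aggregate at query time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (shop TEXT, day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?, ?)",
    [("shop1", "2012-11-16", 25.0), ("shop1", "2012-11-17", 10.0),
     ("shop1", "2012-12-03", 40.0)],
)

def sales_between(shop, first_day, last_day):
    """Total sales for a shop over an arbitrary, inclusive date range."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM daily_sales "
        "WHERE shop = ? AND day BETWEEN ? AND ?",
        (shop, first_day, last_day),
    ).fetchone()
    return row[0]

# A range that no precomputed key covers.
total = sales_between("shop1", "2012-11-17", "2012-12-31")
print(total)  # 50.0
```

With the key/value model, answering that last query would require either a precomputed entry for that exact range or client-side arithmetic over several keys.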
  	
  
Iván de Prado Alonso, CEO of Datasalt
www.datasalt.es
@ivanprado
@datasalt

Questions?
  

Contenu connexe

Similaire à Value extraction from BBVA credit card transactions. IVAN DE PRADO at Big Data Spain 2012

Quant trading with artificial intelligence
Quant trading with artificial intelligenceQuant trading with artificial intelligence
Quant trading with artificial intelligenceRoger Lee, CFA
 
Machine Learning with Hadoop Boston hug 2012
Machine Learning with Hadoop Boston hug 2012Machine Learning with Hadoop Boston hug 2012
Machine Learning with Hadoop Boston hug 2012MapR Technologies
 
Practical deep learning for computer vision
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer visionEran Shlomo
 
SAP HANA - The Foundation of Real Time, Now on the AWS Cloud Computing Platform
SAP HANA - The Foundation of Real Time, Now on the AWS Cloud Computing PlatformSAP HANA - The Foundation of Real Time, Now on the AWS Cloud Computing Platform
SAP HANA - The Foundation of Real Time, Now on the AWS Cloud Computing PlatformAmazon Web Services
 
LUISS - Deep Learning and data analyses - 09/01/19
LUISS - Deep Learning and data analyses - 09/01/19LUISS - Deep Learning and data analyses - 09/01/19
LUISS - Deep Learning and data analyses - 09/01/19Alberto Paro
 
Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...
Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...
Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...marcja
 
Predicting failures on complex machines by Ion Marqués at Big Data Spain 2015
Predicting failures on complex machines by Ion Marqués at Big Data Spain 2015Predicting failures on complex machines by Ion Marqués at Big Data Spain 2015
Predicting failures on complex machines by Ion Marqués at Big Data Spain 2015Big Data Spain
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataData Con LA
 
Introduction To Msbi By Yasir
Introduction To Msbi By YasirIntroduction To Msbi By Yasir
Introduction To Msbi By Yasiryasir873
 
Elephant grooming: quality with Hadoop
Elephant grooming: quality with HadoopElephant grooming: quality with Hadoop
Elephant grooming: quality with HadoopRoman Nikitchenko
 
Smart Grids and Big Data
Smart Grids and Big DataSmart Grids and Big Data
Smart Grids and Big DataDave Callaghan
 
Stock price prediction using Neural Net
Stock price prediction using Neural NetStock price prediction using Neural Net
Stock price prediction using Neural NetRajat Sharma
 
Run your queries 14X faster without any investment!
Run your queries 14X faster without any investment!Run your queries 14X faster without any investment!
Run your queries 14X faster without any investment!Knoldus Inc.
 
Machine Learning in the Financial Industry
Machine Learning in the Financial IndustryMachine Learning in the Financial Industry
Machine Learning in the Financial IndustrySubrat Panda, PhD
 
Building a Data Ingestion & Processing Pipeline with Spark & Airflow
Building a Data Ingestion & Processing Pipeline with Spark & AirflowBuilding a Data Ingestion & Processing Pipeline with Spark & Airflow
Building a Data Ingestion & Processing Pipeline with Spark & AirflowTom Lous
 
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...Amazon Web Services
 
Entity embeddings for categorical data
Entity embeddings for categorical dataEntity embeddings for categorical data
Entity embeddings for categorical dataPaul Skeie
 

Similaire à Value extraction from BBVA credit card transactions. IVAN DE PRADO at Big Data Spain 2012 (20)

Quant trading with artificial intelligence
Quant trading with artificial intelligenceQuant trading with artificial intelligence
Quant trading with artificial intelligence
 
Machine Learning with Hadoop Boston hug 2012
Machine Learning with Hadoop Boston hug 2012Machine Learning with Hadoop Boston hug 2012
Machine Learning with Hadoop Boston hug 2012
 
Cognos powerplay
Cognos powerplayCognos powerplay
Cognos powerplay
 
Practical deep learning for computer vision
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer vision
 
SAP HANA - The Foundation of Real Time, Now on the AWS Cloud Computing Platform
SAP HANA - The Foundation of Real Time, Now on the AWS Cloud Computing PlatformSAP HANA - The Foundation of Real Time, Now on the AWS Cloud Computing Platform
SAP HANA - The Foundation of Real Time, Now on the AWS Cloud Computing Platform
 
LUISS - Deep Learning and data analyses - 09/01/19
LUISS - Deep Learning and data analyses - 09/01/19LUISS - Deep Learning and data analyses - 09/01/19
LUISS - Deep Learning and data analyses - 09/01/19
 
Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...
Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...
Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...
 
Predicting failures on complex machines by Ion Marqués at Big Data Spain 2015
Predicting failures on complex machines by Ion Marqués at Big Data Spain 2015Predicting failures on complex machines by Ion Marqués at Big Data Spain 2015
Predicting failures on complex machines by Ion Marqués at Big Data Spain 2015
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
 
Introduction To Msbi By Yasir
Introduction To Msbi By YasirIntroduction To Msbi By Yasir
Introduction To Msbi By Yasir
 
Elephant grooming: quality with Hadoop
Elephant grooming: quality with HadoopElephant grooming: quality with Hadoop
Elephant grooming: quality with Hadoop
 
Smart Grids and Big Data
Smart Grids and Big DataSmart Grids and Big Data
Smart Grids and Big Data
 
1000 track2 Bharadwaj
1000 track2 Bharadwaj1000 track2 Bharadwaj
1000 track2 Bharadwaj
 
Stock price prediction using Neural Net
Stock price prediction using Neural NetStock price prediction using Neural Net
Stock price prediction using Neural Net
 
Run your queries 14X faster without any investment!
Run your queries 14X faster without any investment!Run your queries 14X faster without any investment!
Run your queries 14X faster without any investment!
 
Machine Learning in the Financial Industry
Machine Learning in the Financial IndustryMachine Learning in the Financial Industry
Machine Learning in the Financial Industry
 
Building a Data Ingestion & Processing Pipeline with Spark & Airflow
Building a Data Ingestion & Processing Pipeline with Spark & AirflowBuilding a Data Ingestion & Processing Pipeline with Spark & Airflow
Building a Data Ingestion & Processing Pipeline with Spark & Airflow
 
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
 
Entity embeddings for categorical data
Entity embeddings for categorical dataEntity embeddings for categorical data
Entity embeddings for categorical data
 
Big Data Usecases
Big Data UsecasesBig Data Usecases
Big Data Usecases
 

Plus de Big Data Spain

Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017Big Data Spain
 
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...Big Data Spain
 
AI: The next frontier by Amparo Alonso at Big Data Spain 2017
AI: The next frontier by Amparo Alonso at Big Data Spain 2017AI: The next frontier by Amparo Alonso at Big Data Spain 2017
AI: The next frontier by Amparo Alonso at Big Data Spain 2017Big Data Spain
 
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017Big Data Spain
 
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...Big Data Spain
 
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...Big Data Spain
 
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...Big Data Spain
 
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...Big Data Spain
 
State of the art time-series analysis with deep learning by Javier Ordóñez at...
State of the art time-series analysis with deep learning by Javier Ordóñez at...State of the art time-series analysis with deep learning by Javier Ordóñez at...
State of the art time-series analysis with deep learning by Javier Ordóñez at...Big Data Spain
 
Trading at market speed with the latest Kafka features by Iñigo González at B...
Trading at market speed with the latest Kafka features by Iñigo González at B...Trading at market speed with the latest Kafka features by Iñigo González at B...
Trading at market speed with the latest Kafka features by Iñigo González at B...Big Data Spain
 
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...Big Data Spain
 
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
 The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a... The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...Big Data Spain
 
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...Big Data Spain
 
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017Big Data Spain
 
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...Big Data Spain
 
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...Big Data Spain
 
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...Big Data Spain
 
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...Big Data Spain
 
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...Big Data Spain
 
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017Big Data Spain
 

Plus de Big Data Spain (20)

Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
 
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
 
AI: The next frontier by Amparo Alonso at Big Data Spain 2017
AI: The next frontier by Amparo Alonso at Big Data Spain 2017AI: The next frontier by Amparo Alonso at Big Data Spain 2017
AI: The next frontier by Amparo Alonso at Big Data Spain 2017
 
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
 
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
 
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
 
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
 
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
 
State of the art time-series analysis with deep learning by Javier Ordóñez at...
State of the art time-series analysis with deep learning by Javier Ordóñez at...State of the art time-series analysis with deep learning by Javier Ordóñez at...

Value extraction from BBVA credit card transactions. IVAN DE PRADO at Big Data Spain 2012

  • 1. Value extraction from BBVA credit card transactions
      Iván de Prado Alonso, CEO of Datasalt
      www.datasalt.es, @ivanprado, @datasalt
      www.bigdataspain.org, #BDSpain
      November 16th, 2012, ETSI Telecomunicación, Madrid, Spain
  • 2.
  • 4. 104,000 employees, 47 million customers
  • 5. The idea
      Extract value from anonymized credit card transaction data and share it
      Always:
      ✓ Impersonal
      ✓ Aggregated
      ✓ Dissociated
      ✓ Irreversible
  • 6. Helping
      Consumers: informed decisions
      ✓ Shop recommendations (by location and by category)
      ✓ Best time to buy
      ✓ Activity & loyalty of the shop's customers
      Sellers: learning clients' patterns
      ✓ Activity & loyalty of the shop's customers
      ✓ Sex & age & location
      ✓ Buying patterns
  • 7. Shop stats
      For different periods
      ✓ All, year, quarter, month, week, day
      … and much more
  • 8. The applications
      Internal use
      Sellers
      Customers
  • 9. The challenges
      Company silos
      The costs
      The amount of data
      Security
      Development flexibility/agility
      Human failures
  • 10. The platform
      Data storage: S3
      Data processing: Elastic MapReduce
      Data serving: EC2
  • 12. Hadoop
      Distributed filesystem
      ✓ Files as big as you want
      ✓ Horizontal scalability
      ✓ Failover
      Distributed computing
      ✓ MapReduce
      ✓ Batch oriented
          • Input files are processed and converted into output files
      ✓ Horizontal scalability
  • 13. Tuple MapReduce implementation for Hadoop
      Easier Hadoop Java API
      ✓ But keeping similar efficiency
      Common design patterns covered
      ✓ Compound records
      ✓ Secondary sorting
      ✓ Joins
      Other improvements
      ✓ Instance-based configuration
      ✓ First-class multiple input/output
  • 14. Tuple MapReduce
      Our evolution of Google's MapReduce
      Pere Ferrera, Iván de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo:
      Tuple MapReduce: Beyond classic MapReduce.
      In ICDM 2012: Proceedings of the IEEE International Conference on Data Mining
      Brussels, Belgium | December 10–13, 2012
  • 15. Tuple MapReduce
      Example: sales difference between the top-selling offices in each location
  • 16. Tuple MapReduce
      Main constraint
      ✓ The group-by clause must be a subset of the sort-by clause
      Indeed, Tuple MapReduce can be implemented on top of any MapReduce implementation
          • Pangool -> Tuple MapReduce over Hadoop
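The example named on slide 15 can be sketched in plain Python to show why grouping and sorting work together. This is only an illustration, not Pangool code: the record layout `(location, office, sales)`, the sample data and the function name are assumptions, and "sales difference" is read here as the gap between the two best-selling offices of each location.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input tuples: (location, office, sales).
records = [
    ("Madrid", "office-1", 120),
    ("Madrid", "office-2", 300),
    ("Madrid", "office-3", 260),
    ("Bilbao", "office-4", 90),
    ("Bilbao", "office-5", 140),
]

def top_two_sales_gap(tuples):
    """Return {location: sales gap between its two best-selling offices}."""
    # Sort by (location asc, sales desc). The group-by field (location) is
    # one of the sort-by fields, matching the constraint on slide 16.
    ordered = sorted(tuples, key=lambda t: (t[0], -t[2]))
    gaps = {}
    # Group by location: each group arrives already ordered by sales.
    for location, group in groupby(ordered, key=itemgetter(0)):
        top = [sales for _, _, sales in group][:2]
        gaps[location] = top[0] - top[1] if len(top) == 2 else 0
    return gaps

print(top_two_sales_gap(records))
```

Because each group is already sorted when it reaches the reducer, the reducer only has to look at the first two tuples instead of buffering and sorting the whole group.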
  • 17. Efficiency
      Similar efficiency to Hadoop
      http://pangool.net/benchmark.html
  • 19. Voldemort & Hadoop
      Benefits
      ✓ Scalability & failover
      ✓ Updating the database does not affect serving queries
      ✓ All data is replaced at each execution
          • Providing agility/flexibility
              ▪ Big development changes are not a pain
          • Easier recovery from human errors
              ▪ Fix the code and run again
          • Easy to set up new clusters with different topologies
  • 20. Basic statistics
      Easy to implement with Pangool/Hadoop
      ✓ One job, grouping by the dimension over which you want to calculate the statistics
      Count, Average, Min, Max, Stdev
      Computing several time periods in the same job
      ✓ Use the mapper to replicate each datum for each period
      ✓ Add a period identifier field to the tuple and include it in the group-by clause
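The period trick from slide 20 can be sketched as a tiny in-memory map/reduce. It is a minimal sketch, not the Pangool job: the transaction layout `(shop, date, amount)`, the period identifiers and the function names are assumptions, and a dictionary stands in for the shuffle phase.

```python
import math
from collections import defaultdict
from datetime import date

# Hypothetical transactions: (shop, purchase date, amount).
transactions = [
    ("shop-1", date(2012, 11, 16), 10.0),
    ("shop-1", date(2012, 11, 17), 30.0),
    ("shop-1", date(2012, 3, 2), 20.0),
]

def map_periods(shop, day, amount):
    """Mapper: emit the datum once per period, tagged with a period id."""
    yield (shop, "all"), amount
    yield (shop, f"year:{day.year}"), amount
    yield (shop, f"month:{day.year}-{day.month:02d}"), amount

def reduce_stats(amounts):
    """Reducer: basic statistics for one (shop, period) group."""
    n = len(amounts)
    mean = sum(amounts) / n
    variance = sum((a - mean) ** 2 for a in amounts) / n  # population variance
    return {"count": n, "avg": mean, "min": min(amounts),
            "max": max(amounts), "stdev": math.sqrt(variance)}

def run_job(rows):
    groups = defaultdict(list)  # stands in for the shuffle / group-by phase
    for shop, day, amount in rows:
        for key, value in map_periods(shop, day, amount):
            groups[key].append(value)
    return {key: reduce_stats(values) for key, values in groups.items()}

stats = run_job(transactions)
```

Since the period identifier is part of the group key, adding a new period (quarter, week, day) only means emitting one more tuple in the mapper; the reducer stays unchanged.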
  • 21. Distinct count
      Possible to compute in a single job
      ✓ Using secondary sorting by the field you want to distinct count on
      ✓ Detecting changes on that field
      Example
      ✓ Group by shop, sort by shop and card

      Shop     Card
      Shop 1   1234
      Shop 1   1234
      Shop 1   1234   Change +1
      Shop 1   5678
      Shop 1   5678   Change +1
      -> 2 distinct buyers for shop 1
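The change-detection idea from slide 21 can be sketched as follows. The `(shop, card)` pairs and the function name are assumptions made for illustration; `sorted()` stands in for the framework's sort by shop and card, and `groupby` for the group by shop.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (shop, card) pairs, one per transaction.
purchases = [
    ("shop-1", "1234"), ("shop-1", "5678"), ("shop-1", "1234"),
    ("shop-2", "1234"), ("shop-1", "1234"), ("shop-1", "5678"),
]

def distinct_buyers(pairs):
    """Count distinct cards per shop in one pass over sorted tuples."""
    counts = {}
    # Sort by (shop, card), then group by shop only.
    for shop, rows in groupby(sorted(pairs), key=itemgetter(0)):
        distinct = 0
        previous = None
        for _, card in rows:
            if card != previous:  # the card changed: a new distinct buyer
                distinct += 1
                previous = card
        counts[shop] = distinct
    return counts

print(distinct_buyers(purchases))
```

The reducer keeps only the previous card in memory, so no per-shop set of cards is needed regardless of how many buyers a shop has.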
  • 22. Histograms
      Typically a two-pass algorithm
      ✓ First pass to detect the minimum and the maximum and determine the bin ranges
      ✓ Second pass to count the number of occurrences in each bin
      Adaptive histogram
      ✓ One pass
      ✓ Fixed number of bins
      ✓ Bins adapt
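The slide does not say which adaptive algorithm was used. As an illustration only, one common way to get a one-pass histogram with a fixed number of bins is to keep bins as (centroid, count) pairs and merge the two closest bins whenever the limit is exceeded. The class and method names below are hypothetical.

```python
import bisect

class StreamingHistogram:
    """One-pass histogram with at most max_bins adaptive bins."""

    def __init__(self, max_bins):
        self.max_bins = max_bins
        self.bins = []  # sorted list of [centroid, count]

    def add(self, value):
        centroids = [c for c, _ in self.bins]
        i = bisect.bisect_left(centroids, value)
        if i < len(self.bins) and self.bins[i][0] == value:
            self.bins[i][1] += 1  # exact hit on an existing centroid
            return
        self.bins.insert(i, [value, 1])
        if len(self.bins) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self):
        # Find the pair of adjacent bins whose centroids are closest.
        gaps = [self.bins[j + 1][0] - self.bins[j][0]
                for j in range(len(self.bins) - 1)]
        j = gaps.index(min(gaps))
        (c1, n1), (c2, n2) = self.bins[j], self.bins[j + 1]
        # Replace both by one bin at their count-weighted mean.
        self.bins[j:j + 2] = [[(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]]

hist = StreamingHistogram(max_bins=3)
for amount in [1, 2, 10, 11, 50]:
    hist.add(amount)
print(hist.bins)
```

Unlike the two-pass approach, no minimum or maximum has to be known in advance: the bin positions follow the data as it streams through.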
  • 23. Optimal histogram
      Calculate the best histogram that represents the original one using a limited number of flexible-width bins
      ✓ Reduces storage needs
      ✓ More representative than fixed-width ones -> better visualization
  • 24. Optimal histogram
      Exact algorithm
      Petri Kontkanen, Petri Myllymäki: MDL Histogram Density Estimation
      http://eprints.pascal-network.org/archive/00002983/
      Too slow for production use
  • 25. Optimal histogram
      Alternative: approximate algorithm
      Random-restart hill climbing
      ✓ A solution is just a way of grouping existing bins
      ✓ From a solution, you can move to some close solutions
      ✓ Some are better: they reduce the representation error
      Algorithm
      1. Iterate N times, keeping the best solution
          1. Generate a random solution
          2. Iterate until no improvement
              1. Move to the next better possible movement
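The algorithm outlined on slide 25 can be sketched as below. Several details are assumptions, since the slide does not define them: a solution is a sorted list of cut positions that groups adjacent bins, the representation error is the squared difference between each original bin and the mean height of its group, and a "close solution" shifts one cut by one bin. All function names are hypothetical.

```python
import random

def error(counts, cuts):
    """Squared error when each group of bins is drawn at its mean height."""
    edges = [0, *cuts, len(counts)]
    total = 0.0
    for start, end in zip(edges, edges[1:]):
        group = counts[start:end]
        mean = sum(group) / len(group)
        total += sum((c - mean) ** 2 for c in group)
    return total

def neighbours(cuts, n):
    """Solutions reachable by shifting a single cut one bin left or right."""
    for i, cut in enumerate(cuts):
        for moved in (cut - 1, cut + 1):
            if 0 < moved < n and moved not in cuts:
                yield sorted(cuts[:i] + [moved] + cuts[i + 1:])

def hill_climb(counts, cuts):
    """Take the first improving move until no neighbour is better."""
    current, current_err = cuts, error(counts, cuts)
    improved = True
    while improved:
        improved = False
        for candidate in neighbours(current, len(counts)):
            candidate_err = error(counts, candidate)
            if candidate_err < current_err:
                current, current_err = candidate, candidate_err
                improved = True
                break
    return current, current_err

def optimal_histogram(counts, k, restarts=20, seed=0):
    """Random-restart hill climbing: keep the best of several local optima."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        start = sorted(rng.sample(range(1, len(counts)), k - 1))
        result = hill_climb(counts, start)
        if best is None or result[1] < best[1]:
            best = result
    return best

# Hypothetical fine-grained histogram reduced to k = 3 flexible-width bins.
counts = [5, 5, 5, 20, 20, 20, 1, 1]
print(optimal_histogram(counts, k=3))
```

The restarts matter: a single climb can get stuck in a local optimum (for this data, the cuts `[6, 7]` cannot be improved by any single move), so the result is an approximation rather than a guaranteed optimum.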
  • 26. Optimal histogram
      Alternative: approximate algorithm
      Random-restart hill climbing
      ✓ One order of magnitude faster
      ✓ 99% accuracy
  • 27. Everything in one job
      Basic statistics -> 1 job
      Distinct count statistics -> 1 job
      One-pass histograms -> 1 job
      Several periods & shops -> 1 job
      We can put it all together so that computing all statistics for all shops fits into exactly one job
  • 28. Shop recommendations
      Based on co-occurrences
      ✓ If somebody bought in shop A and in shop B, then a co-occurrence between A and B exists
      ✓ Only one co-occurrence is counted even if a buyer bought several times in A and B
      ✓ The top co-occurrences for each shop are the recommendations
      Improvements
      ✓ The most popular shops are filtered out because almost everybody buys in them
      ✓ Recommendations by category, by location and by both
      ✓ Different calculation periods
  • 29. Shop recommendations
      Implemented in Pangool
      ✓ Using its counting and joining capabilities
      ✓ Several jobs
      Challenges
      ✓ If somebody bought in many shops, the list of co-occurrences can explode:
          • Co-occurrences = N * (N - 1), where N = # of distinct shops where the person bought
      ✓ Alleviated by limiting the total number of distinct shops to consider
      ✓ Only the top M shops where the client bought the most are used
      Future
      ✓ Time-aware co-occurrences: the client bought in A and B, and did so within a short period of time
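The co-occurrence counting and the top-M limit from slides 28 and 29 can be sketched in a single process as follows. The talk describes several Pangool jobs; this is only an in-memory illustration, and the `(card, shop)` layout, the sample data and the function names are assumptions.

```python
from collections import Counter
from itertools import permutations

# Hypothetical purchases: (card, shop), one entry per transaction.
purchases = [
    ("card-1", "A"), ("card-1", "A"), ("card-1", "B"), ("card-1", "B"),
    ("card-1", "C"), ("card-2", "A"), ("card-2", "B"),
    ("card-3", "A"), ("card-3", "C"),
]

def co_occurrences(rows, top_m):
    """Count ordered shop pairs (A, B) bought by the same card."""
    per_card = {}
    for card, shop in rows:
        per_card.setdefault(card, Counter())[shop] += 1
    pairs = Counter()
    for shops in per_card.values():
        # Keep only the top M shops where this buyer bought the most,
        # which caps the N * (N - 1) pairs emitted per buyer.
        kept = [shop for shop, _ in shops.most_common(top_m)]
        # Each pair counts once per buyer, however often they bought.
        pairs.update(permutations(kept, 2))
    return pairs

def recommend(pairs, shop, k):
    """Top-k co-occurring shops for a shop, ties broken by name."""
    ranked = sorted(
        ((n, other) for (s, other), n in pairs.items() if s == shop),
        key=lambda item: (-item[0], item[1]),
    )
    return [other for _, other in ranked[:k]]

pairs = co_occurrences(purchases, top_m=2)
print(recommend(pairs, "A", k=2))
```

With `top_m=2`, `card-1` contributes only the pair between its two most visited shops, so a buyer with many distinct shops adds at most M * (M - 1) pairs.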
  • 30. Some numbers
      Estimated resources needed with 1 year of data
      270 GB of stats to serve
      24 large instances ~ 11 hours of execution
      $3500/month
      ✓ Optimizations still possible
      ✓ Cost without the use of reserved instances
      ✓ Probably cheaper with an in-house Hadoop cluster
  • 31. Conclusion
      It was possible to develop a Big Data solution for a bank
      ✓ With low use of resources
      ✓ Quickly
      ✓ Thanks to the use of technologies like Hadoop, Amazon Web Services and NoSQL databases
      The solution is
      ✓ Scalable
      ✓ Flexible/agile: improvements are easy to implement
      ✓ Prepared to withstand human failures
      ✓ At a reasonable cost
      Main advantage: always recomputing everything
  • 32. Future: Splout
      Key/value datastores have limitations
      ✓ Only accept querying by the key
      ✓ Aggregations are not possible
      ✓ In other words, we are forced to pre-compute everything
      ✓ Not always possible -> data explodes
      ✓ For this particular case, time ranges are fixed
      Splout: like Voldemort but SQL!
      ✓ The idea: to replace Voldemort with Splout SQL
      ✓ Much richer queries: real-time aggregations, flexible time ranges
      ✓ It would allow creating a kind of Google Analytics for the statistics discussed in this presentation
      ✓ Open sourced!!!
      https://github.com/datasalt/splout-db
  • 33. Questions?
      Iván de Prado Alonso, CEO of Datasalt
      www.datasalt.es
      @ivanprado
      @datasalt