MapR's Hadoop Distribution
Who am I?

     http://www.mapr.com/company/events/speaking/pdb-10-16-12
•     Keys Botzum
•     kbotzum@maprtech.com
•     Senior Principal Technologist, MapR Technologies
•     MapR Federal and Eastern Region
Agenda
  •    What’s a Hadoop?
  •    What’s MapR?
  •    Enterprise Grade Hadoop
  •    Making Hadoop More Open
Hadoop in 15 minutes
How to Scale?
Big Data has Big Problems
•  Petabytes of data
•  MTBF on 1000s of nodes is < 1 day
•  Something is always broken
•  There are limits to scaling Big Iron
•  Sequential and random access just don’t scale
Example: Update 1% of 1TB
•  Data consists of 10^10 records, each 100 bytes
•  Task: Update 1% of these records
Approach 1: Just Do It
•  Each update involves read, modify and write
    •  t = 1 seek + 2 disk rotations = 20ms
    •  1% x 10^10 x 20 ms = 2 mega-seconds = 23 days
•  Total time dominated by seek and rotation times
Approach 2: The “Hard” Way
•  Copy the entire database 1GB at a time
•  Update records on the fly
    •  t = 2 x 1GB / 100MB/s + 20ms = 20s
    •  10^3 x 20s = 20,000s = 5.6 hours
•  100x faster to do 100x more work!
•  Moral: Read data sequentially even if you only want 1%
   of it
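The arithmetic behind the two approaches can be checked with a short sketch. The constants are taken from the slides (10^10 records of 100 bytes, 20 ms per random update, 1 GB chunks at 100 MB/s); the function and constant names are only illustrative.

```python
# Back-of-the-envelope model of the two update strategies.
# All constants come from the slides; the names are illustrative.

RECORDS = 10**10                      # total records
RECORD_BYTES = 100                    # bytes per record, so 1 TB in total
UPDATE_FRACTION = 0.01                # update 1% of the records
SEEK_PLUS_ROTATIONS_S = 0.020         # 1 seek + 2 disk rotations = 20 ms
CHUNK_BYTES = 10**9                   # copy the database 1 GB at a time
TRANSFER_BYTES_PER_S = 100 * 10**6    # 100 MB/s sequential transfer


def random_update_seconds() -> float:
    """Approach 1: one 20 ms read-modify-write per updated record."""
    return RECORDS * UPDATE_FRACTION * SEEK_PLUS_ROTATIONS_S


def sequential_rewrite_seconds() -> float:
    """Approach 2: read and write every 1 GB chunk, paying one seek per chunk."""
    total_bytes = RECORDS * RECORD_BYTES
    chunks = total_bytes // CHUNK_BYTES
    per_chunk = 2 * CHUNK_BYTES / TRANSFER_BYTES_PER_S + SEEK_PLUS_ROTATIONS_S
    return chunks * per_chunk


if __name__ == "__main__":
    print(f"Approach 1: {random_update_seconds() / 86400:.1f} days")
    print(f"Approach 2: {sequential_rewrite_seconds() / 3600:.1f} hours")
```

The model ignores CPU time and caching, so it only reproduces the order-of-magnitude gap the slides describe.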
MapReduce: A Paradigm Shift
•  Distributed computing platform
    •  Large clusters
    •  Commodity hardware
•  Pioneered at Google
    •  BigTable, MapReduce and Google File System
•  Commercially available as Hadoop
Hadoop
•  Commodity hardware – thousands of nodes
•  Handles Big Data – petabytes and more
•  Sequential file access – each spindle provides data as fast as
   possible
•  Sharding
    •  Data distributed evenly across cluster
    •  More spindles and CPUs working on different parts of data set
•  Reliability – self-healing (mostly), self-balancing
•  MapReduce
    •  Parallel computing framework
    •  Function shipping
        §  Moves the computation to the data rather than the typical
            reverse
        §  Takes into account sharding
    •  Hides most of complexity from developers
Inside Map-Reduce

Word-count example, one stage per column of the original diagram:

Input
    "The time has come," the Walrus said,
    "To talk of many things:
    Of shoes—and ships—and sealing-wax

Map
    the, 1
    time, 1
    has, 1
    come, 1
    …

Shuffle and sort
    come, [3,2,1]
    has, [1,5,2]
    the, [1,2,1]
    time, [10,1,3]
    …

Reduce / Output
    come, 6
    has, 8
    the, 4
    time, 14
    …
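The word-count flow in this diagram can be mimicked on a single machine. The following is a minimal sketch of the map, shuffle-and-sort, and reduce stages in plain Python; it is not Hadoop code, and the function names are illustrative.

```python
import re
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", line.lower())
    return [(word, 1) for word in words]


def shuffle_and_sort(pairs):
    """Shuffle and sort: group the emitted counts by word, ordered by key."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped):
    """Reduce: sum the list of counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Run the three stages in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle_and_sort(pairs))


if __name__ == "__main__":
    print(word_count(['"The time has come," the Walrus said,',
                      '"To talk of many things:']))
```

In a real cluster each map call runs next to its shard of the input and the shuffle moves data between nodes; here everything is an in-memory list, which is enough to show how the intermediate lists such as come, [3,2,1] turn into totals.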
  
The MapR Distribution for Apache Hadoop

•  Commercial Hadoop Distribution
•  Open, enterprise-grade distribution
    •  Primarily leveraging open source components
    •  Carefully targeted enhancements to make Hadoop more
       open and enterprise-grade

•  Growing fast and a recognized leader
MapR in the Cloud

•  Available as a service with Amazon Elastic MapReduce
   (EMR)
    •  http://aws.amazon.com/elasticmapreduce/mapr


      	
  
•  Available as a service with Google Compute Engine
  
      	
  
MapR Partners
MapR’s Complete Distribution for Apache Hadoop

•    Integrated, tested, hardened and supported
•    Integrated with Accumulo
•    Runs on commodity hardware
•    Open source with standards-based extensions for:
      •  Security
      •  File-based access
      •  Most SQL-based access
      •  Easiest integration
•    High availability
•    Best performance

Components shown in the stack diagram, top to bottom:
•    MapR Control System: MapR Heatmap™; LDAP, NIS integration; quotas, alerts, alarms; CLI, REST API
•    Hive, Pig, Oozie, Sqoop, HBase, Whirr
•    Accumulo, Mahout, Cascading, Nagios integration, Ganglia integration, Flume, ZooKeeper
•    Direct Access NFS, Real-Time Streaming, Volumes, Mirrors, Snapshots, Data Placement
•    No NameNode Architecture, High Performance Direct Shuffle, Stateful Failover and Self Healing
•    MapR’s Storage Services™
  
Easy Management at Scale



•  Health Monitoring
•  Cluster Administration
•  Application Resource Provisioning

Same information and tasks available via command line and REST
MapR: Lights Out Data Center Ready

Reliable Compute
•  Automated stateful failover
•  Automated re-replication
•  Self-healing from HW and SW failures
•  Load balancing
•  Rolling upgrades
•  No lost jobs or data
•  99999’s of uptime

Dependable Storage
•  Business continuity with snapshots and mirrors
•  Recover to a point in time
•  End-to-end check summing
•  Strong consistency
•  Built in compression
•  Mirror across sites to meet Recovery Time Objectives
Storage Architecture

•  How does MapR manage storage and how is this different from generic Hadoop?
  
What is a Volume?

•  Like a sub-directory
    •  Related dirs/files together
•  Contains file metadata for this volume
•  Mounted to form global namespace
•  Logical unit of policy

Volumes help you manage data

©MapR Technologies - Confidential
  
Typical Volume Layout

/
    /binaries
    /hbase
    /projects
        /build
        /test
    /users
        /mjones
        /jsmith
    /var/mapr
        local...

Create lots of volumes, 100K volumes OK!
  
Volumes Let You Manage Data

•  Replication factor
•  Quotas
•  Load balancing
•  Snapshots
•  Mirrors
•  Data placement
•  Made of containers
    •  Container is sharding unit
    •  16 – 32G
  
Storage Architecture

•  Nodes
•  Disks
•  Storage Pools
•  Containers
    •  Distributed across cluster
    •  16-32 GB
•  Volumes
  
No NameNode Architecture

Other Hadoop Distributions
•  Diagram (layout lost in extraction): a NAS appliance and NameNodes holding the metadata for A B, C D and E F, above rows of DataNodes
•  HA requires specialized hardware and/or software
•  File scalability hampered by NameNode bottleneck
•  Metadata must fit in memory

MapR
•  Diagram (layout lost in extraction): no NameNodes; A through F are spread across the cluster nodes, each appearing on several nodes
•  HA w/ automatic failover and re-replication
•  Up to 1T files (> 5000x advantage)
•  Higher performance
•  100% commodity hardware
•  Metadata is persisted to disk
  
MapR Snapshots

•  Snapshots without data duplication
•  Saves space by sharing blocks
•  Lightning fast
•  Zero performance loss on writing to original
•  Scheduled, or on-demand
•  Easy recovery by user

Diagram: Hadoop/HBase applications and NFS applications READ / WRITE through MapR Storage Services to data blocks A, B, C, C’ and D. The write to C is labeled REDIRECT ON WRITE FOR SNAPSHOT and produces C’. Snapshot 1, Snapshot 2 and Snapshot 3 are shown below the data blocks.
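To make "redirect on write" concrete, here is a toy model of a snapshot that shares blocks with the live volume. It is a sketch of the general technique only, not MapR's implementation; the `Volume` class and its methods are hypothetical names invented for this illustration.

```python
class Volume:
    """Toy redirect-on-write volume: a block map over an append-only store."""

    def __init__(self):
        self.store = []        # physical blocks, never overwritten in place
        self.block_map = {}    # logical block name -> index into self.store
        self.snapshots = {}    # snapshot name -> frozen copy of a block map

    def write(self, name, data):
        # Redirect on write: new data always lands in a fresh physical block,
        # so blocks referenced by existing snapshots are left untouched.
        self.store.append(data)
        self.block_map[name] = len(self.store) - 1

    def read(self, name, snapshot=None):
        # Read through the live map, or through a snapshot's frozen map.
        block_map = self.block_map if snapshot is None else self.snapshots[snapshot]
        return self.store[block_map[name]]

    def snapshot(self, snap_name):
        # Only the small block map is copied; the data blocks are shared.
        self.snapshots[snap_name] = dict(self.block_map)


if __name__ == "__main__":
    vol = Volume()
    for block in ("A", "B", "C", "D"):
        vol.write(block, block.lower())
    vol.snapshot("snapshot-1")
    vol.write("C", "c-prime")   # C becomes C' in the live volume only
    print(vol.read("C"), vol.read("C", snapshot="snapshot-1"), len(vol.store))
```

Taking the snapshot copies four map entries and no data, and the later write adds exactly one new block, which is the space-sharing behavior the bullets describe.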
  
MapR Mirroring/COOP Requirements

Business Continuity and Efficiency

Efficient design
•  Differential deltas are updated
•  Compressed and check-summed

Easy to manage
•  Scheduled or on-demand
•  WAN, Remote Seeding
•  Consistent point-in-time

Diagram: Production mirrored over a WAN to Research (both sites are labeled Datacenter 1 in the source), and Production mirrored over a WAN to Cloud (Compute Engine).
Thought Questions
•    Consider a cluster with
      •  Petabytes of data
      •  Hundreds or thousands of jobs running each day, creating new data
      •  Many users and teams all using this cluster
•    How do I back this up?
      •  User “oops” protection
•    How do I replicate data from one cluster to another in support of disaster
     recovery?
      •  Protection from power outages, floods, fire, etc.
Designed for Performance and Scale

                                               MapR                Apache/CDH
Terasort w/ 1x replication (no compression)
  Total                                        24 min 34 sec       49 min 33 sec
  Map                                          9 min 54 sec        28 min 12 sec
  Shuffle                                      9 min 8 sec         27 min 0 sec
Terasort w/ 3x replication (no compression)
  Total                                        47 min 4 sec        73 min 42 sec
  Map                                          11 min 2 sec        30 min 8 sec
  Shuffle                                      9 min 17 sec        28 min 40 sec
DFSIO/local write
  Throughput/node                              870 MB/s            240 MB/s
YCSB (HBase benchmark, 50% read, 50% update)
  Throughput                                   33102 ops/sec       7904 ops/sec
  Latency (r/u)                                2.9-4 ms/0.4 ms     7-30 ms/0-5 ms
YCSB (HBase benchmark, 95% read, 5% update)
  Throughput                                   18K ops/sec         8500 ops/sec
  Latency (r/u)                                5.5-5.7 ms/0.6 ms   12-30 ms/1 ms

HW: 10 servers, 2 x 4 cores (2.4 GHz), 11 x 2TB, 32 GB

Example deployment
•  1.4 PB user data
•  900-1200 MapReduce jobs per day
•  16 TB/day average IO through each server
•  85-90% storage utilization (with snapshots)
•  Very low-end hardware (consumer drives)

Large Web 2.0 company
•  6B files on a single cluster (+ 3x replication)
•  2000 servers targeted
•  No degradation during hardware failures
•  Heavy read/write/delete workload
•  1.7K creates/sec/node
•  Response time (write/read/delete)
    •  Atomic workload: 7.8/4.5/8.7 ms
    •  Mixed workload: 6.6/4.9/9.1 ms
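The benchmark figures are easier to compare as ratios. The sketch below converts the Terasort totals to seconds and divides the two columns; the numbers are copied from the slide, while the dictionary keys and function names are only illustrative.

```python
def seconds(minutes, secs):
    """Convert a 'M min S sec' table entry to seconds."""
    return minutes * 60 + secs


# (MapR, Apache/CDH) pairs copied from the benchmark table.
# Lower is better for run times, higher is better for throughput.
TERASORT_TOTAL_S = {
    "1x replication": (seconds(24, 34), seconds(49, 33)),
    "3x replication": (seconds(47, 4), seconds(73, 42)),
}
THROUGHPUT = {
    "DFSIO write MB/s per node": (870, 240),
    "YCSB 50/50 ops/sec": (33102, 7904),
    "YCSB 95/5 ops/sec": (18000, 8500),
}


def time_speedup(mapr_s, other_s):
    """How many times faster the MapR run finished."""
    return other_s / mapr_s


def throughput_gain(mapr, other):
    """How many times more work per second the MapR run sustained."""
    return mapr / other


if __name__ == "__main__":
    for name, pair in TERASORT_TOTAL_S.items():
        print(f"Terasort total, {name}: {time_speedup(*pair):.2f}x")
    for name, pair in THROUGHPUT.items():
        print(f"{name}: {throughput_gain(*pair):.2f}x")
```

These ratios only restate the vendor's published numbers for the listed 10-server hardware; they say nothing about other workloads or cluster sizes.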
  
Customer Support

•    24x7x365 “Follow-The-Sun” coverage
      •  Critical customer issues are worked on
         around the clock
•    Dedicated team of Hadoop engineering
     experts
•    Contacting MapR support
      •  Email: support@mapr.com
         (automatically opens a case)
      •  Phone: 1.855.669.6277
      •  Self Service options:
           §  http://answers.mapr.com/
           §  Web Portal: http://mapr.com/support
Two MapR Editions – M3 and M5

M3
•  Control System
•  NFS Access
•  Performance
•  Unlimited Nodes
•  Free

M5
•  Control System
•  NFS Access
•  Performance
•  High Availability
•  Snapshots & Mirroring
•  24 X 7 Support
•  Annual Subscription

Also Available through: Compute Engine
Not All Applications Use the Hadoop APIs

Applications and libraries that use files and/or SQL
•  These are not legacy applications, they are valuable applications
•  30 years
•  100,000s applications
•  10,000s libraries
•  10s programming languages

Applications and libraries that use the Hadoop APIs
  
Hadoop Needs Industry-Standard Interfaces

Hadoop API
•  MapReduce and HBase applications
•  Mostly custom-built

NFS
•  File-based applications
•  Supported by most operating systems

ODBC
•  SQL-based tools
•  Supported by most BI applications and query builders
  
NFS
  
Your Data is Important

•  HDFS-based Hadoop distributions do not (cannot) properly support NFS
•  Your data is important, it drives your business – make sure you can access it
    •  Why store your data in a system which cannot be accessed by 95% of the world’s applications and libraries?
  
Direct Access NFS™

•  File browsers: access directly, “Drag & Drop”
•  Standard Linux commands & tools: grep, sed, sort, tar
•  Applications: random read, random write, log directly
  
The NFS Protocol

•  RFC 1813
•  Very simple protocol
•  Random reads/writes
    •  Read count bytes from offset offset of file file
    •  Write buffer data to offset offset of a file file
•  HDFS does not support random writes so it cannot support NFS

    WRITE3res NFSPROC3_WRITE(WRITE3args) = 7;

    struct WRITE3args {
        nfs_fh3     file;
        offset3     offset;
        count3      count;
        stable_how  stable;
        opaque      data<>;
    };

    READ3res NFSPROC3_READ(READ3args) = 6;

    struct READ3args {
        nfs_fh3     file;
        offset3     offset;
        count3      count;
    };
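The READ and WRITE procedures on this slide amount to "read count bytes at an offset" and "write a buffer at an offset". A rough local analogue, using ordinary file calls rather than the NFS wire protocol, might look like the following; `read_at`, `write_at`, and `demo` are hypothetical helper names, and the temporary file stands in for a file on an NFS mount.

```python
import os
import tempfile


def read_at(path, offset, count):
    """Return up to `count` bytes starting at byte `offset` of `path`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(count)


def write_at(path, offset, data):
    """Overwrite bytes of an existing file in place, starting at `offset`."""
    with open(path, "r+b") as f:
        f.seek(offset)
        return f.write(data)


def demo():
    """Create a scratch file, patch three bytes in the middle, read them back."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"0123456789")
        write_at(path, 3, b"abc")      # random write: touches bytes 3..5 only
        return read_at(path, 2, 5)     # random read: 5 bytes from offset 2
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

The in-place `write_at` is the operation the slide says HDFS lacks: a file system that cannot overwrite at an arbitrary offset cannot serve this call.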
  
Hadoop Was Designed to Support Multiple Storage Layers

MapReduce
    Hadoop FileSystem API
        o.a.h.fs.FileSystem Interface
            •  S3: o.a.h.fs.s3native.NativeS3FileSystem
            •  HDFS: o.a.h.hdfs.DistributedFileSystem
            •  Local File System: o.a.h.fs.LocalFileSystem
            •  FTP: o.a.h.fs.ftp.FTPFileSystem
            •  MapR storage layer: com.mapr.fs.MapRFileSystem (the diagram also shows an NFS interface on this layer)
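One way to picture the pluggable FileSystem interface is as a lookup from a path's URI scheme to an implementation class. The sketch below is an illustration of that idea in Python, not Hadoop's actual resolution code; the class names come from the slide, while the URI schemes and the `filesystem_class_for` helper are assumptions made for the example.

```python
from urllib.parse import urlparse

# Implementation classes as named on the slide. The URI schemes used as
# keys are an assumption for this illustration and do not appear on the slide.
FILESYSTEM_CLASSES = {
    "s3n": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
    "hdfs": "org.apache.hadoop.hdfs.DistributedFileSystem",
    "file": "org.apache.hadoop.fs.LocalFileSystem",
    "ftp": "org.apache.hadoop.fs.ftp.FTPFileSystem",
    "maprfs": "com.mapr.fs.MapRFileSystem",
}


def filesystem_class_for(uri, default_scheme="file"):
    """Pick the storage-layer class for a path based on its URI scheme."""
    scheme = urlparse(uri).scheme or default_scheme
    try:
        return FILESYSTEM_CLASSES[scheme]
    except KeyError:
        raise ValueError(f"no file system registered for scheme {scheme!r}")


if __name__ == "__main__":
    for uri in ("maprfs:///users/jsmith/data.csv", "hdfs://namenode/logs", "/tmp/x"):
        print(uri, "->", filesystem_class_for(uri))
```

Because MapReduce jobs are written against the common interface, swapping the storage layer changes the class that is selected, not the job code.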
  
One NFS Gateway

What about scalability and high availability?

Multiple NFS Gateways

Multiple NFS Gateways with Load Balancing

Multiple NFS Gateways with NFS HA (VIPs)
  
Customer Examples: Import/Export Data
•    Network security vendor
      •  Network packet captures from switches are streamed into the cluster
      •  New pattern definitions are loaded into online IPS via NFS

•    Online measurement company
      •  Clickstreams from application servers are streamed into the cluster

•    SaaS company
      •  Exporting a database to Hadoop over NFS

•    Ad exchange
      •  Bids and transactions are streamed into the cluster
Customer Examples: Productivity and Operations

•    Retailer
      •  Operational scripts are easier with NFS than HDFS + MapReduce
           §  chmod/chown, file system searches/greps, perl, awk, tab-complete
      •  Consolidate object store with analytics

•    Credit card company
      •  User and project home directories on Linux gateways
           §  Local files, scripts, source code, …
           §  Administrators manage quotas, snapshots/backups, …

•    Large Internet company recommendation system
      •  Web servers serve MapReduce results (item relationships) directly from
         the cluster

•    Email marketing company
      •  Object store with HBase and NFS
Apache Drill
 Interactive Analysis of Large-Scale Datasets
Latency Matters

•    Ad-hoc analysis with interactive tools

•    Real-time dashboards

•    Event/trend detection and analysis
      •  Network intrusion analysis on the fly
      •  Fraud
      •  Failure detection and analysis
Big Data Processing

                 Batch processing   Interactive analysis   Stream processing
Query runtime    Minutes to hours   Milliseconds to        Never-ending
                                    minutes
Data volume      TBs to PBs         GBs to PBs             Continuous stream
Programming      MapReduce          Queries                DAG
model
Users            Developers         Analysts and           Developers
                                    developers
Google project   MapReduce          Dremel
Open source      Hadoop                                    Storm and S4
project          MapReduce




          Introducing Apache Drill…
Innovations
•  MapReduce
    •    Scalable IO and compute trumps efficiency with today's commodity hardware
    •    With large datasets, schemas and indexes are too limiting
    •    Flexibility is more important than efficiency
    •    An easy-to-use, scalable, fault-tolerant execution framework is key for large
         clusters
•  Dremel
    •    Columnar storage provides significant performance benefits at scale
    •    Columnar storage with nesting preserves structure and can be very efficient
    •    Avoiding final record assembly as long as possible improves efficiency
    •    Optimizing for the query use case can avoid the full generality of MR and thus
         significantly reduce latency. No need to start JVMs, just push compact queries to
         running agents.
•  Apache Drill
    •  Open source project based upon Dremel’s ideas
    •  More flexibility and openness
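The columnar-storage point in the Dremel bullets can be illustrated with a toy comparison of row-oriented and column-oriented layouts. This is a simplified sketch with flat records and made-up sample data; Dremel's nested columnar format is considerably more elaborate.

```python
# Made-up sample records, stored row by row.
ROWS = [
    {"user": "mjones", "country": "US", "bytes": 120},
    {"user": "jsmith", "country": "DE", "bytes": 300},
    {"user": "mjones", "country": "US", "bytes": 80},
]


def to_columns(rows):
    """Pivot a list of records into one list of values per field."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def sum_row_store(rows, field):
    """Row layout: every whole record is visited to read a single field."""
    return sum(row[field] for row in rows)


def sum_column_store(columns, field):
    """Column layout: only the one requested column is scanned."""
    return sum(columns[field])


if __name__ == "__main__":
    columns = to_columns(ROWS)
    print(sum_row_store(ROWS, "bytes"), sum_column_store(columns, "bytes"))
```

Both functions return the same total; the difference is that the columnar version never touches the `user` or `country` values, which is where the savings come from when records are wide and queries read few fields.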
More Reading on Apache Drill
•    MapR and Apache Drill
      •  http://www.mapr.com/drill
•    Apache Drill project page
      •  http://incubator.apache.org/projects/drill.html
•    Google’s Dremel
      •  http://research.google.com/pubs/pub36632.html
•    Google’s BigQuery
      •  https://developers.google.com/bigquery/docs/query-reference
•    MIT’s C-Store – a columnar database
      •  http://db.csail.mit.edu/projects/cstore/
•    Microsoft’s Dryad
      •  Distributed execution engine
      •  http://research.microsoft.com/en-us/projects/dryad/
•    Google’s Protobufs
      •  https://developers.google.com/protocol-buffers/docs/proto

Philly DB MapR Overview

  • 2. Who am I? http://www.mapr.com/company/events/ speaking/pdb-10-16-12 •  Keys Botzum •  kbotzum@maprtech.com •  Senior Principal Technologist, MapR Technologies •  MapR Federal and Eastern Region
  • 3. Agenda •  What’s a Hadoop? •  What’s MapR? •  Enterprise Grade Hadoop •  Making Hadoop More Open
  • 4. Hadoop in 15 minutes
  • 5. How to Scale? Big Data has Big Problems •  Petabytes of data •  MTBF on 1000s of nodes is < 1 day •  Something is always broken •  There are limits to scaling Big Iron •  Sequential and random access just don’t scale
  • 6. Example: Update 1% of 1TB •  Data consists of 1010 records, each 100 bytes •  Task: Update 1% of these records
  • 7. Approach 1: Just Do It •  Each update involves read, modify and write •  t = 1 seek + 2 disk rotations = 20ms •  1% x 1010 x 20 ms = 2 mega-seconds = 23 days •  Total time dominated by seek and rotation times
  • 8. Approach 2: The “Hard” Way •  Copy the entire database 1GB at a time •  Update records on the fly •  t = 2 x 1GB / 100MB/s + 20ms = 20s •  103 x 20s = 20,000s = 5.6 hours •  100x faster to do 100x more work! •  Moral: Read data sequentially even if you only want 1% of it
  • 9. MapReduce: A Paradigm Shift •  Distributed computing platform •  Large clusters •  Commodity hardware •  Pioneered at Google •  BigTable, MapReduce and Google File System •  Commercially available as Hadoop
  • 10. Hadoop •  Commodity hardware – thousands of nodes •  Handles Big Data – petabytes and more •  Sequential file access – each spindle provides data as fast as possible •  Sharding •  Data distributed evenly across cluster •  More spindles and CPUs working on different parts of data set •  Reliability – self-healing (mostly), self-balancing •  MapReduce •  Parallel computing framework •  Function shipping §  Moves the computation to the data rather than the typical reverse §  Takes into account sharding •  Hides most of complexity from developers
  • 11. Inside Map-Reduce the,  1   "The  6me  has  come,"  the  Walrus  said,   6me,  1   "To  talk  of  many  things:   come,  [3,2,1]   has,  1   Of  shoes—and  ships—and  shas,  [1,5,2]   ealing-­‐wax   come,  1   come,  6   the,  [1,2,1]   has,  8   …   6me,  [10,1,3]   the,  4   …   6me,  14   Input   Map   Shuffle   Reduce   …   Output   and  sort  
  • 12. Agenda •  What’s a Hadoop? •  What’s MapR? •  Enterprise Grade Hadoop •  Making Hadoop More Open
  • 13. The MapR Distribution for Apache Hadoop •  Commercial Hadoop Distribution •  Open, enterprise-grade distribution •  Primarily leveraging open source components •  Carefully targeted enhancements to make Hadoop more open and enterprise-grade •  Growing fast and a recognized leader
  • 14. MapR in the Cloud •  Available as a service with Amazon Elastic MapReduce (EMR) •  http://aws.amazon.com/elasticmapreduce/mapr   §  Available  as  a  service  with  Google  Compute  Engine    
  • 16. Agenda •  What’s a Hadoop? •  What’s MapR? •  Enterprise Grade Hadoop •  Making Hadoop More Open
  • 17. MapR’s Complete Distribution for Apache Hadoop MapR Control System •  Integrated, tested, hardened and supported MapR Heatmap™ LDAP, NIS Integration Quotas, CLI, REST APT Alerts, Alarms •  Integrated with Accumulo Hive Pig Oozle Sqoop HBase Whirr •  Runs on commodity hardware •  Open source with Accumulo Mahout Cascading Naglos Ganglia Flume Zoo- Integration Integration keeper standards-based extensions for: •  Security •  File-based access Direct Snap- •  Most SQL-based Access Real- Volumes Mirrors Data Time shots Placemen access NFS Streamin t •  Easiest integration g No NameNode High Performance Stateful Failover •  High availability Architecture Direct Shuffle and Self Healing •  Best performance MapR’s Storage Services™ 2.7  
  • 18. Easy Management at Scale •  Health Monitoring •  Cluster Administration •  Application Resource Provisioning Same information and tasks available via command line and REST
  • 19. MapR: Lights Out Data Center Ready Dependable Reliable Compute Storage •  Automated  stateful  failover   §  Business  con6nuity  with     snapshots    and  mirrors   •  Automated  re-­‐replica6on   §  Recover  to  a  point  in  6me   •  Self-­‐healing  from  HW     §  End-­‐to-­‐end  check   and  SW  failures   summing     •  Load  balancing   §  Strong  consistency   §  Built  in  compression   •  Rolling  upgrades   §  Mirror  across  sites  to   •  No  lost  jobs  or  data   meet   •  99999’s  of  up6me   Recovery  Time  Objec6ves
  • 20. Storage Architecture §  How  does  MapR  manage  storage  and  how  is  this  different   from  generic  Hadoop?  
  • 21. What  is  a  Volume?   §  Like  a  sub-­‐directory   §  related  dirs/files  together   §  Contains  file  metadata  for  this   volume   §  Mounted  to  form  global  name-­‐ space   §  Logical  unit  of  policy   Volumes  help  you  manage  data   ©MapR  Technologies  -­‐  Confiden6al   21  
  • 22. Typical  Volume  Layout   /   /binaries   /hbase   /projects   /users   /var/mapr   /build   /test   /mjones   /jsmith   local...   Create  lots  of  volumes,  100K  volumes  OK!   ©MapR  Technologies  -­‐  Confiden6al   22  
  • 23. Volumes Let You Manage Data §  Replication factor §  Quotas §  Load balancing §  Snapshots §  Mirrors §  Data placement §  Made of containers §  Container is the sharding unit §  16-32 GB
  • 24. Storage Architecture §  Nodes §  Disks §  Storage Pools §  Containers –  Distributed across cluster –  16-32 GB §  Volumes
  • 25. No NameNode Architecture Other Hadoop Distributions §  HA requires specialized hardware and/or software §  File scalability hampered by NameNode bottleneck §  Metadata must fit in memory MapR §  HA w/ automatic failover and re-replication §  Up to 1T files (> 5000x advantage) §  Higher performance §  100% commodity hardware §  Metadata is persisted to disk [Diagram: NAS appliance, NameNodes, and DataNodes holding blocks A-F]
  • 26. MapR Snapshots §  Snapshots without data duplication §  Saves space by sharing blocks §  Lightning fast §  Zero performance loss on writing to original §  Scheduled, or on-demand §  Easy recovery by user [Diagram: Hadoop / HBase applications and NFS applications READ / WRITE through MapR Storage Services to data blocks A, B, C, C’, D, with REDIRECT ON WRITE FOR SNAPSHOT; Snapshot 1, Snapshot 2, Snapshot 3]
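The redirect-on-write idea on this slide can be sketched in a few lines of Python. This is a toy model under stated assumptions, not MapR's implementation: the `SnapshotVolume` class, its block map, and the in-memory block store are all hypothetical names chosen for illustration. It shows why a snapshot needs no data duplication: only the block map is copied, and a later write to block C lands in a new physical block (C’) while the snapshot keeps pointing at the old one.

```python
class SnapshotVolume:
    """Toy volume: a block map pointing into a shared, append-only block store.

    Hypothetical illustration only; not how MapR stores data on disk.
    """

    def __init__(self):
        self.store = []      # physical blocks, never overwritten in place
        self.live = {}       # logical block name -> index into self.store
        self.snapshots = []  # each snapshot is a frozen copy of the block map

    def write(self, name, data):
        # Redirect on write: new data goes to a fresh physical block, so a
        # snapshot that still points at the old block is unaffected.
        self.store.append(data)
        self.live[name] = len(self.store) - 1

    def snapshot(self):
        # Only the (small) block map is copied, not the data blocks.
        self.snapshots.append(dict(self.live))
        return len(self.snapshots) - 1

    def read(self, name, snap=None):
        block_map = self.live if snap is None else self.snapshots[snap]
        return self.store[block_map[name]]


vol = SnapshotVolume()
for block in "ABCD":
    vol.write(block, f"{block}-v1")

snap1 = vol.snapshot()
vol.write("C", "C-v2")  # corresponds to C' in the slide's diagram

print(vol.read("C"))         # current contents of C
print(vol.read("C", snap1))  # contents of C as of the snapshot
print(len(vol.store))        # physical blocks in use
```

Blocks A, B and D are shared between the live volume and the snapshot; only the rewritten block consumes extra space.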
  • 27. MapR Mirroring/COOP Requirements Business Continuity and Efficiency Efficient design §  Differential deltas are updated §  Compressed and check-summed Easy to manage §  Scheduled or on-demand §  WAN, Remote Seeding §  Consistent point-in-time [Diagram labels: Production, WAN, Research, Datacenter 1, Datacenter 1; Production, WAN, Cloud, Compute Engine]
  • 28. Thought Questions •  Consider a cluster with •  Petabytes of data •  Hundreds or thousands of jobs running each day, creating new data •  Many users and teams all using this cluster •  How do I back this up? •  User “oops” protection •  How do I replicate data from one cluster to another in support of disaster recovery? •  Protection from power outages, floods, fire, etc.
  • 29. Designed for Performance and Scale
      Benchmark                                        Metric            MapR                 Apache/CDH
      Terasort w/ 1x replication (no compression)      Total             24 min 34 sec        49 min 33 sec
                                                       Map               9 min 54 sec         28 min 12 sec
                                                       Shuffle           9 min 8 sec          27 min 0 sec
      Terasort w/ 3x replication (no compression)      Total             47 min 4 sec         73 min 42 sec
                                                       Map               11 min 2 sec         30 min 8 sec
                                                       Shuffle           9 min 17 sec         28 min 40 sec
      DFSIO/local write                                Throughput/node   870 MB/s             240 MB/s
      YCSB (HBase benchmark, 50% read, 50% update)     Throughput        33102 ops/sec        7904 ops/sec
                                                       Latency (r/u)     2.9-4 ms/0.4 ms      7-30 ms/0-5 ms
      YCSB (HBase benchmark, 95% read, 5% update)      Throughput        18K ops/sec          8500 ops/sec
                                                       Latency (r/u)     5.5-5.7 ms/0.6 ms    12-30 ms/1 ms
      HW: 10 servers, 2 x 4 cores (2.4 GHz), 11 x 2TB, 32 GB
      §  1.4 PB user data §  900-1200 MapReduce jobs per day §  16 TB/day average IO through each server §  85-90% storage utilization (with snapshots) §  Very low-end hardware (consumer drives)
      Large Web 2.0 company §  6B files on a single cluster (+ 3x replication) §  2000 servers targeted §  No degradation during hardware failures §  Heavy read/write/delete workload §  1.7K creates/sec/node
      Response time (write/read/delete): Atomic workload 7.8/4.5/8.7 ms; Mixed workload 6.6/4.9/9.1 ms
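The ratios implied by these benchmark figures are easy to work out. The short Python sketch below does only that arithmetic on numbers taken from the slide; the helper names (`seconds`, `time_speedup`, `throughput_speedup`) are made up for illustration. For run times, lower is better, so the ratio is Apache/CDH time over MapR time; for throughput, higher is better, so it is MapR over Apache/CDH.

```python
def seconds(minutes, secs):
    """Convert an 'M min S sec' reading from the slide to seconds."""
    return minutes * 60 + secs


def time_speedup(mapr_s, other_s):
    """Lower run time is better: ratio of the other system's time to MapR's."""
    return other_s / mapr_s


def throughput_speedup(mapr, other):
    """Higher throughput is better: ratio of MapR's figure to the other's."""
    return mapr / other


# All inputs below are copied from the benchmark slide.
results = {
    "Terasort 1x total": time_speedup(seconds(24, 34), seconds(49, 33)),
    "Terasort 3x total": time_speedup(seconds(47, 4), seconds(73, 42)),
    "DFSIO write/node": throughput_speedup(870, 240),
    "YCSB 50/50 ops/sec": throughput_speedup(33102, 7904),
}

for name, ratio in results.items():
    print(f"{name}: {ratio:.2f}x")
```

On these figures the Terasort totals differ by roughly 2x at 1x replication and roughly 1.6x at 3x replication, while the per-node write and YCSB throughput gaps are larger.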
  • 30. Customer Support •  24x7x365 “Follow-The-Sun” coverage •  Critical customer issues are worked on around the clock •  Dedicated team of Hadoop engineering experts •  Contacting MapR support •  Email: support@mapr.com (automatically opens a case) •  Phone: 1.855.669.6277 •  Self Service options: §  http://answers.mapr.com/ §  Web Portal: http://mapr.com/support
  • 31. Two MapR Editions – M3 and M5 M3 §  Control System §  NFS Access §  Performance §  Unlimited Nodes §  Free M5 §  Control System §  NFS Access §  Performance §  High Availability §  Snapshots & Mirroring §  24 X 7 Support §  Annual Subscription Also Available through: Compute Engine
  • 32. Agenda •  What’s a Hadoop? •  What’s MapR? •  Enterprise Grade Hadoop •  Making Hadoop More Open
  • 33. Not All Applications Use the Hadoop APIs Applications and libraries that use files and/or SQL: 30 years, 100,000s applications, 10,000s libraries, 10s programming languages •  These are not legacy applications, they are valuable applications Applications and libraries that use the Hadoop APIs
  • 34. Hadoop Needs Industry-Standard Interfaces Hadoop API •  MapReduce and HBase applications •  Mostly custom-built NFS •  File-based applications •  Supported by most operating systems ODBC •  SQL-based tools •  Supported by most BI applications and query builders
  • 36. Your Data is Important §  HDFS-based Hadoop distributions do not (cannot) properly support NFS §  Your data is important, it drives your business – make sure you can access it –  Why store your data in a system which cannot be accessed by 95% of the world’s applications and libraries?
  • 37. Direct Access NFS™ File Browsers: Access Directly, “Drag & Drop” Standard Linux Commands & Tools: grep, sed, sort, tar Applications: Random Read, Random Write, Log directly
  • 38. The NFS Protocol §  RFC 1813 §  Very simple protocol §  Random reads/writes –  Read count bytes from offset offset of file file –  Write buffer data to offset offset of a file file §  HDFS does not support random writes so it cannot support NFS
      WRITE3res NFSPROC3_WRITE(WRITE3args) = 7;
      struct WRITE3args {
          nfs_fh3     file;
          offset3     offset;
          count3      count;
          stable_how  stable;
          opaque      data<>;
      };
      READ3res NFSPROC3_READ(READ3args) = 6;
      struct READ3args {
          nfs_fh3  file;
          offset3  offset;
          count3   count;
      };
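To make the READ3/WRITE3 semantics concrete, here is a minimal Python sketch of "read count bytes at offset" and "write data at offset" against an ordinary file. It is an illustration under assumptions, not an NFS client: it uses the POSIX-only `os.pread` and `os.pwrite` calls on a local temporary file, and the helper names `read_at` and `write_at` are hypothetical. On a cluster exported over NFS, the same calls would simply target a path under the mount point.

```python
import os
import tempfile


def write_at(path, offset, data):
    """Write `data` at byte `offset` without touching the rest of the file."""
    fd = os.open(path, os.O_WRONLY)
    try:
        return os.pwrite(fd, data, offset)  # returns number of bytes written
    finally:
        os.close(fd)


def read_at(path, offset, count):
    """Read up to `count` bytes starting at byte `offset`."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, count, offset)
    finally:
        os.close(fd)


# Demo on a local temporary file (stand-in for a file on an NFS mount).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"0123456789")
    demo_path = f.name

written = write_at(demo_path, 3, b"XYZ")  # random write in the middle
whole = read_at(demo_path, 0, 10)         # read the full file back
middle = read_at(demo_path, 3, 3)         # random read of the patched bytes
os.remove(demo_path)

print(written, whole, middle)
```

The in-place update in the middle of the file is exactly the operation the slide says an append-only store cannot offer.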
  • 39. Hadoop Was Designed to Support Multiple Storage Layers MapReduce Hadoop FileSystem API: o.a.h.fs.FileSystem Interface S3: o.a.h.fs.s3native.NativeS3FileSystem HDFS: o.a.h.hdfs.DistributedFileSystem Local File System: o.a.h.fs.LocalFileSystem FTP: o.a.h.fs.ftp.FTPFileSystem MapR storage layer: com.mapr.fs.MapRFileSystem NFS interface
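The layering on this slide, one abstract FileSystem interface with several interchangeable storage back ends chosen by URI scheme, can be sketched as follows. This is a Python analogy rather than Hadoop's Java API: `FileSystem`, `InMemoryFileSystem`, `register`, and `read` are hypothetical names, and the in-memory back ends merely stand in for real storage layers.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class FileSystem(ABC):
    """Abstract interface that every storage layer implements (illustrative)."""

    @abstractmethod
    def open(self, path):
        """Return the contents stored at `path`."""


class InMemoryFileSystem(FileSystem):
    """Stand-in back end: a dict of path -> bytes."""

    def __init__(self, files):
        self.files = files

    def open(self, path):
        return self.files[path]


# Scheme -> implementation, loosely analogous to per-scheme configuration.
REGISTRY = {}


def register(scheme, fs):
    REGISTRY[scheme] = fs


def read(uri):
    """Dispatch on the URI scheme; caller code is identical for all back ends."""
    parsed = urlparse(uri)
    return REGISTRY[parsed.scheme].open(parsed.path)


register("hdfs", InMemoryFileSystem({"/data/a.txt": b"from hdfs"}))
register("maprfs", InMemoryFileSystem({"/data/a.txt": b"from maprfs"}))

print(read("hdfs:///data/a.txt"))
print(read("maprfs:///data/a.txt"))
```

Because callers depend only on the interface, swapping the storage layer changes the URI scheme and nothing else in the application.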
  • 40. One NFS Gateway What about scalability and high availability?
  • 41. Multiple NFS Gateways
  • 42. Multiple NFS Gateways with Load Balancing
  • 43. Multiple NFS Gateways with NFS HA (VIPs)
  • 44. Customer Examples: Import/Export Data •  Network security vendor •  Network packet captures from switches are streamed into the cluster •  New pattern definitions are loaded into online IPS via NFS •  Online measurement company •  Clickstreams from application servers are streamed into the cluster •  SaaS company •  Exporting a database to Hadoop over NFS •  Ad exchange •  Bids and transactions are streamed into the cluster
  • 45. Customer Examples: Productivity and Operations •  Retailer •  Operational scripts are easier with NFS than HDFS + MapReduce §  chmod/chown, file system searches/greps, perl, awk, tab-complete •  Consolidate object store with analytics •  Credit card company •  User and project home directories on Linux gateways §  Local files, scripts, source code, … §  Administrators manage quotas, snapshots/backups, … •  Large Internet company recommendation system •  Web servers serve MapReduce results (item relationships) directly from cluster •  Email marketing company •  Object store with HBase and NFS
  • 46. Apache Drill Interactive Analysis of Large-Scale Datasets
  • 47. Latency Matters •  Ad-hoc analysis with interactive tools •  Real-time dashboards •  Event/trend detection and analysis •  Network intrusion analysis on the fly •  Fraud •  Failure detection and analysis
  • 48. Big Data Processing
                              Batch processing     Interactive analysis       Stream processing
      Query runtime           Minutes to hours     Milliseconds to minutes    Never-ending
      Data volume             TBs to PBs           GBs to PBs                 Continuous stream
      Programming model       MapReduce            Queries                    DAG
      Users                   Developers           Analysts and developers    Developers
      Google project          MapReduce            Dremel
      Open source project     Hadoop MapReduce                                Storm and S4
      Introducing Apache Drill…
  • 49. Innovations •  MapReduce •  Scalable IO and compute trumps efficiency with today's commodity hardware •  With large datasets, schemas and indexes are too limiting •  Flexibility is more important than efficiency •  An easy-to-use, scalable, fault-tolerant execution framework is key for large clusters •  Dremel •  Columnar storage provides significant performance benefits at scale •  Columnar storage with nesting preserves structure and can be very efficient •  Avoiding final record assembly as long as possible improves efficiency •  Optimizing for the query use case can avoid the full generality of MR and thus significantly reduce latency. No need to start JVMs, just push compact queries to running agents. •  Apache Drill •  Open source project based upon Dremel’s ideas •  More flexibility and openness
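A small Python sketch can illustrate the columnar-storage point. It is a deliberately flat toy, assuming made-up sample records and hypothetical helper names (`to_columns`, `sum_column`); it does not model Dremel's nested encoding or Drill's execution engine. The idea shown is only that an aggregate over one field reads one column and never reassembles full records.

```python
# Made-up sample records, stored row by row.
rows = [
    {"user": "a", "country": "US", "clicks": 3},
    {"user": "b", "country": "DE", "clicks": 5},
    {"user": "c", "country": "US", "clicks": 2},
]


def to_columns(records):
    """Pivot row-oriented records into one list per field."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def sum_column(columns, field):
    """Aggregate a single column; the other columns are never touched."""
    return sum(columns[field])


columns = to_columns(rows)
total_clicks = sum_column(columns, "clicks")

# Record assembly is deferred: rebuild a row only if a query needs it.
first_row = {field: values[0] for field, values in columns.items()}

print(total_clicks)
print(first_row)
```

With wide records and queries that touch a few fields, skipping the unused columns is where the I/O saving comes from.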
  • 50. More Reading on Apache Drill •  MapR and Apache Drill •  http://www.mapr.com/drill •  Apache Drill project page •  http://incubator.apache.org/projects/drill.html •  Google’s Dremel •  http://research.google.com/pubs/pub36632.html •  Google’s BigQuery •  https://developers.google.com/bigquery/docs/query-reference •  MIT’s C-Store – a columnar database •  http://db.csail.mit.edu/projects/cstore/ •  Microsoft’s Dryad •  Distributed execution engine •  http://research.microsoft.com/en-us/projects/dryad/ •  Google’s Protobufs •  https://developers.google.com/protocol-buffers/docs/proto