Big Data Security
    Joey Echeverria | Principal Solutions Architect
    joey@cloudera.com | @fwiffo




    ©2013 Cloudera, Inc.
Big Data Security




     EARLY DAYS




Hadoop File Permissions

    •   Added in HADOOP-1298
        •   Hadoop 0.16
        •   Early 2008
    • Authorization without authentication
    • POSIX-like RWX bits
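
A minimal sketch of a POSIX-style permission check, to illustrate what "authorization without authentication" means in practice. The function name `can_access` and the sample users and groups are hypothetical, not Hadoop's API. The check itself is sound, but the user name is whatever the client claims, so the decision is only as trustworthy as the caller.

```python
def can_access(mode, owner, group, user, user_groups, want):
    """Return True if `user` may perform `want` ('r', 'w' or 'x').

    `mode` is a POSIX-style permission value such as 0o640.
    Hypothetical helper for illustration; not part of Hadoop.
    """
    bit = {"r": 4, "w": 2, "x": 1}[want]
    if user == owner:
        shift = 6   # owner triplet
    elif group in user_groups:
        shift = 3   # group triplet
    else:
        shift = 0   # "other" triplet
    return bool((mode >> shift) & bit)


# The caller supplies `user`; nothing in this check verifies the claim.
print(can_access(0o640, "alice", "analysts", "alice", {"analysts"}, "w"))  # True
print(can_access(0o640, "alice", "analysts", "bob", {"analysts"}, "w"))    # False
```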




MapReduce ACLs

    •   Added in HADOOP-3698
        •   Hadoop 0.19
        •   Late 2008
    • ACLs per job queue
    • Set a list of allowed users or groups per operation
        •   Job submission
        •   Job administration
    •   No authentication
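
A rough sketch of a per-queue, per-operation ACL lookup. The queue name, operation names, and the `QUEUE_ACLS` table are made up for illustration and do not reflect Hadoop's configuration format.

```python
# Hypothetical ACL table: queue -> operation -> (allowed users, allowed groups).
QUEUE_ACLS = {
    "production": {
        "submit": ({"etl"}, {"data-eng"}),
        "administer": ({"hadoop-admin"}, set()),
    },
}


def is_allowed(queue, operation, user, groups):
    """Allow the operation if the user or one of their groups is listed."""
    users, allowed_groups = QUEUE_ACLS.get(queue, {}).get(
        operation, (set(), set())
    )
    return user in users or bool(allowed_groups & set(groups))
```

Unknown queues and operations fall through to empty sets, so the sketch denies by default. As with file permissions, the user and group names are still unauthenticated claims.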



Securing a Cluster Through a Gateway

    • Hadoop cluster runs on a private network
    • Gateway server dual-homed (Hadoop network and
      public network)
    • Users SSH onto gateway
        •   Optionally can create an SSH proxy for jobs to be
            submitted from the client machine
    •   Provides minimum level of protection




Big Data Security




     WHY SECURITY MATTERS




Prevent Accidental Access

    • Don’t let users shoot themselves in the foot
    • Main driver for early features
    • Not security per se, but a critical first step
    • Doesn’t require strong authentication




Stop Malicious Users

    • Early features were necessary, but not sufficient
    • Security has to get real
    • Hadoop runs arbitrary code
    • Implicit trust doesn’t prevent the insider threat




Co-mingle All Your Data

    • Often overlooked
    • Big data means getting rid of stovepipes
        •   Scalability and flexibility are only 50% of the problem
        •   Trust your data in a multi-tenant environment
    •   Most critical driver




Big Data Security




      AN EVOLVING STORY




Authorization

     • Files
     • MapReduce/YARN job queues
     • Service-level authorization
         •   Whitelists and blacklists of hosts and users
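
One way to combine whitelists and blacklists of hosts and users, sketched under the assumption that a blacklist entry always overrides a whitelist entry and that `"*"` means "anyone". The function and parameter names are hypothetical.

```python
def service_allows(user, host, allow_users, deny_users, allow_hosts, deny_hosts):
    """Decide whether `user` connecting from `host` may call a service.

    Assumption for this sketch: blacklists win over whitelists, and a
    whitelist containing "*" matches every user or host.
    """
    if user in deny_users or host in deny_hosts:
        return False
    user_ok = "*" in allow_users or user in allow_users
    host_ok = "*" in allow_hosts or host in allow_hosts
    return user_ok and host_ok
```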




Authentication
     •   HADOOP-4487
         •   Hadoop 0.22 and 0.20.205
         •   Late 2010
     •   Based on Kerberos and internal delegation tokens
     •   Provides strong user authentication
     •   Also used for service-to-service authentication

     [Slide background: excerpt from section 2.2, "High Level Use Cases":
     "1. Applications accessing files on HDFS clusters. Non-MapReduce
     applications, including hadoop fs, access files stored on one or more
     HDFS clusters. The application should only be able to access files and
     services they are authorized to access. See figure 1. Variations:
     (a) Access HDFS directly using HDFS protocol. (b) Access HDFS indirectly
     through HDFS proxy servers via the HFTP FileSystem or HTTP get."]

     [Figure 1: HDFS High-level Dataflow. Nodes: Application, NameNode,
     DataNode, MapReduce Task. Edge labels: kerb(joe), delg(joe), kerb(hdfs),
     block token.]
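
A simplified sketch of the delegation-token idea: after a user authenticates once with Kerberos, the service hands back an identifier plus a keyed MAC over it, and later requests (for example from tasks) present the token instead of Kerberos credentials. `MASTER_KEY`, the field layout, and the function names are assumptions for illustration; Hadoop's real token format and key management differ.

```python
import hashlib
import hmac

MASTER_KEY = b"demo-master-key"  # hypothetical secret known only to the service


def issue_token(owner, renewer, expires_at):
    """Issue a token for a caller who has already been authenticated."""
    identifier = f"{owner}|{renewer}|{expires_at}".encode()
    authenticator = hmac.new(MASTER_KEY, identifier, hashlib.sha256).digest()
    return identifier, authenticator


def verify_token(identifier, authenticator, now):
    """Return the token owner if the token is genuine and unexpired, else None."""
    expected = hmac.new(MASTER_KEY, identifier, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, authenticator):
        return None  # forged or tampered identifier
    owner, _renewer, expires_at = identifier.decode().split("|")
    if now >= int(expires_at):
        return None  # expired
    return owner
```

Because verification only needs the service's own secret, a task can authenticate as the user without a Kerberos ticket and without a round trip to the KDC.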
Encryption

     •   Over-the-wire encryption for some socket
         connections
     •   RPC encryption added soon after Kerberos
     •   Shuffle encryption (HTTPS) added in Hadoop
         2.0.2-alpha, backported to CDH4 MR1
     •   HDFS block streamer encryption added in Hadoop
         2.0.2-alpha
     •   Volume-level encryption for data at rest



Big Data Security




      SECURITY FOR KEY VALUE STORES




Apache Accumulo

     •   Robust, scalable, high-performance data storage and
         retrieval system
     •   Built by NSA, now an Apache project
     •   Based on Google’s BigTable
     •   Built on top of HDFS, ZooKeeper and Thrift
     •   Iterators for server-side extensions
     •   Cell labels for flexible security models




Data Model

     • Multi-dimensional, persistent, sorted map
     • Key/Value store with a twist
     • A single primary key (Row ID)
     • Secondary key (Column) internal to a row
         •   Family
         •   Qualifier
     •   Per-cell timestamp
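
A toy in-memory model of the sorted map described above, keyed by (row, family, qualifier, timestamp). The class `SortedTable` is hypothetical and only illustrates the ordering; it assumes newer timestamps sort first within a column, which is how Accumulo orders versions.

```python
from bisect import insort


class SortedTable:
    """Toy sorted key/value map; not a real Accumulo client."""

    def __init__(self):
        self._entries = []  # list of (sort_key, value), kept sorted

    def put(self, row, family, qualifier, timestamp, value):
        # Negate the timestamp so newer versions of a cell sort first.
        insort(self._entries, ((row, family, qualifier, -timestamp), value))

    def scan(self, start_row, end_row):
        """Yield entries whose row ID lies in [start_row, end_row], in key order."""
        for (row, family, qualifier, neg_ts), value in self._entries:
            if start_row <= row <= end_row:
                yield (row, family, qualifier, -neg_ts), value
```

Because entries are sorted by row ID first, everything belonging to one row is contiguous, which is what makes range scans over the primary key cheap.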




Cell-Level Security

     • Labels stored per cell
     • Labels consist of Boolean expressions
       (AND, OR, nesting)
     • Labels associated with each user
     • Cell labels checked against user’s labels with a
       built-in iterator
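
A small evaluator for Boolean cell labels, sketched to show how a label such as `admin|(audit&finance)` can be checked against the set of labels a user holds. The function names are hypothetical and this is not Accumulo's implementation. It assumes `&` for AND, `|` for OR, that mixing the two at one nesting level requires parentheses, and that an empty label is visible to everyone.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(expr):
    tokens, pos = [], 0
    expr = expr.strip()
    while pos < len(expr):
        match = TOKEN.match(expr, pos)
        if not match:
            raise ValueError(f"bad label expression at {pos}: {expr!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def visible(expr, auths):
    """Return True if a cell labeled `expr` is visible to a user holding `auths`."""
    tokens = tokenize(expr)
    if not tokens:
        return True  # unlabeled cell
    pos = 0

    def term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            result = expression()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
            return result
        if tok in ("&", "|", ")"):
            raise ValueError(f"unexpected {tok!r}")
        return tok in auths

    def expression():
        nonlocal pos
        result = term()
        op = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if op is None:
                op = tokens[pos]
            elif tokens[pos] != op:
                raise ValueError("mixing & and | requires parentheses")
            pos += 1
            rhs = term()
            result = (result and rhs) if op == "&" else (result or rhs)
        return result

    result = expression()
    if pos != len(tokens):
        raise ValueError("trailing tokens in label expression")
    return result
```

A scan-time filter would call something like `visible(label, user_auths)` for each cell and drop the cells for which it returns False.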




Pluggable Authentication

     • Currently supports username/password
       authentication backed by ZooKeeper
     • ACCUMULO-259
         •   Targeted for Accumulo 1.5.0
     • Authentication info replaced with generic tokens
     • Supports multiple implementations (e.g. Kerberos)




Application Level

     • Accumulo often paired with application-level
       authentication/authorization
     • Accumulo users created per application
     • Each application granted access level of most
       permitted user
     • Application authenticates users, grabs user
       authorizations, passes user labels with requests
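
A sketch of the application-level pattern: the application holds one Accumulo account with the broadest set of labels, looks up each end user's labels itself, and passes only those along with a request. `APP_AUTHS`, `USER_AUTHS`, and `scan_authorizations` are hypothetical stand-ins for the application's account and user directory.

```python
# Hypothetical: labels granted to the application's own Accumulo account.
APP_AUTHS = {"public", "finance", "audit"}

# Hypothetical: per-user labels from the application's own directory.
USER_AUTHS = {
    "alice": {"public", "finance"},
    "bob": {"public", "hr"},
}


def scan_authorizations(user):
    """Return the labels to attach to a scan made on behalf of `user`."""
    user_auths = USER_AUTHS.get(user)
    if user_auths is None:
        raise PermissionError(f"unknown user: {user}")
    # Never pass more than the application account itself holds.
    return user_auths & APP_AUTHS
```

The design consequence is that the application becomes part of the trusted base: Accumulo enforces the labels it is given, but it is the application that decides which labels belong to which user.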




Apache HBase

     •   Also based on Google’s BigTable
     •   Started as a Hadoop contrib project
     •   Supports column-level ACLs
     •   Kerberos for authentication
     •   Discussion and early prototypes of cell-level security
         ongoing




Big Data Security




      FUTURE




Encryption for Data at Rest

     • Need multiple levels of granularity
     • Encryption keys tied to authorization labels (like
       Accumulo labels or HBase ACLs)
     • APIs for file-level, block-level, or record-level
       encryption




Hive Security

     • Column-level ACLs
     • Kerberos authentication
     • AccessServer



