Lessons Learned on How to Secure Petabytes of Data
- 1. © Copyright 2014 Booz Allen Hamilton
Lessons Learned Securing Data at Scale
Drew Farris
Peter Guerra
Hadoop Summit 2014
- 3. Photo: CC BY 2.0: https://www.flickr.com/photos/atoach/5015711744
- 4. Photo CC BY 2.0: https://www.flickr.com/photos/dutchamsterdam/
- 5.
Who we are
Founded and run DC Hadoop Users Group Meetup – http://www.meetup.com/Hadoop-DC
Technical talks at multiple conferences
– Strata, Data Science Summit, IDGA Gov Cloud Conference, Cloudera Hadoop Summit, Yahoo! Hadoop Summit, IEEE Cloud Conference, CSA Congress, Black Hat
Multiple client engagements over the last 7 years
– Defense
– Civil and Commercial Health
– Civil and Commercial Financial Services
– Commercial and International
+ Booz Allen Big Data and Data Science Points-of-View
+ http://www.boozallen.com/cloud
+ http://www.boozallen.com/datascience
+ Advancing the Art of Analytics & Big Data
+ http://www.boozallen.com/insights/expertvoices/big-data
+ http://www.federalnewsradio.com/?nid=154&sid=2080808
+ Tackling Large Scale Data in Government
+ http://www.cloudera.com/blog/2010/11/tackling-large-scale-data-in-government/
+ IT Architectures for Complex Search and Information Retrieval
+ http://www.slideshare.net/cloudera/fuzzy-table-final
+ http://www.slideshare.net/ydn/3-biometric-hadoopsummit2010
- 6.
Agenda
+ Securing Data in Hadoop
+ Architectural Case Study
+ What we did
+ How we did it
+ What tools we used
+ Smart Data
+ Emerging Security Capabilities
- 7.
Securing Data in Hadoop
- 8.
What are the security challenges with these architectures?
+ Data is growing exponentially, and our ability to securely store and process it is falling behind
+ Security policies haven't kept up with the technology
+ Most security policies and tools were not written for Big Data systems, so mapping can be difficult
+ Clients are often not prepared for the security challenges when integrating multiple data sources
- 9.
Our approach to data security has made adoption more difficult
+ For the last 20 years we have built systems in silos – isolated data containers (databases, applications, and so forth)
+ Most organizations secure each silo individually and protect access by database
+ Most certification and accreditation programs (FISMA), PCI, HIPAA, and the SANS Top 20 define security controls around each data silo
+ Most security controls implemented are there to protect the servers, users, or network access to data
- 10.
Example: SANS 20 – Control 15: Controlled Access Based on the Need to Know
Deploy data protection such as IDS, firewalls, anti-virus, HIPS, DLP, GRC…
Wrap those around a number of Big Data technologies, most of which are based on Apache Hadoop or integrate with it:
+ Hortonworks / Cloudera stack
+ NoSQL: MongoDB / CouchDB / Cassandra
+ BigTable: Apache Accumulo / Apache HBase
Distributed systems by nature have different security challenges because of their architecture
SANS Control 15:
… the data classification system and permission baseline is the blueprint for how authentication and access of data is controlled…
+ Step 1: An appropriate data classification system and permissions baseline applied to production data systems
+ Step 2: Access appropriately logged to a log management system
+ Step 3: Proper access control applied to portable media/USB drives
+ Step 4: Active scanner validates, checks access, and checks data classification
+ Step 5: Host-based encryption and data-loss prevention validates and checks all access requests
- 11.
Overview of Security Architecture Components
+ Infrastructure & Network
+ Encryption (at Rest & in Transit)
+ Authentication (User Principal and Device)
+ Authorization (Privileged Access Management)
+ Access Controls (Data Visibility)
+ Auditing & Monitoring of Data Access
+ Policy & Compliance
Driving Principles
+ Start with People, Process, and Culture
+ Understand the Data and the Threat
+ Start small and build
+ Never finished
- 12.
Apache Hadoop Security Challenges
Scale
+ The large number of tasks presents problems with direct authentication
HDFS / File System
+ NameNodes have ACLs, while DataNodes don’t
Job Execution
+ Propagation of credentials to executing nodes
Job Data
+ Task Parameters / Intermediate output accessible via HTTP
Multi-Tenancy
+ Access to Intermediate Output & Local Block Storage
Trust of Auxiliary Services (Oozie, Hadoop clients, Hadoop Pipes/Streaming)
- 13.
First Hadoop release with Kerberos in 2008
A better solution was available, but not always implemented:
+ Tokens: Delegation Token, Block Access Token, Job Token
+ Symmetric encryption == shared keys
+ Large cluster = thousands of copies of shared keys
+ Performance goals (less than 3% impact) led to weak SASL QoP
+ Pluggable authentication left to the end user
+ HDFS proxies for bulk transfer expose data
Often not implemented in favor of putting Hadoop into an enclave, which still doesn't fully regulate access to data
Alternatives?
+ Tahoe-LAFS: cool, but significant performance impact
- 14.
Apache Hadoop 2.x Security
Hadoop RPC
+ Clients, MapReduce jobs, Hadoop daemons
+ SASL with varying levels of protection (QoP): authentication, integrity protection, and confidentiality
Direct TCP/IP
+ HDFS data transfer between clients and DataNodes
+ Tunnel existing protocol over SASL (HDFS-3637)
HTTP
+ Web UI, FSImage operations between NN / SNN
+ HTTPS, reloadable Java keystore, others
+ MAPREDUCE-4417, HADOOP-8581
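The three QoP levels above surface in configuration as the hadoop.rpc.protection property, whose values map onto standard SASL QoP tokens. A minimal sketch of that mapping (the helper function is illustrative, not part of Hadoop's API):

```python
# Sketch: how Hadoop's hadoop.rpc.protection setting maps onto SASL QoP
# values. The dictionary reflects the documented Hadoop 2.x settings;
# the helper function itself is illustrative, not a real Hadoop API.

SASL_QOP = {
    "authentication": "auth",       # authentication only
    "integrity": "auth-int",        # adds integrity protection (checksums)
    "privacy": "auth-conf",         # adds confidentiality (wire encryption)
}

def rpc_protection_to_qop(setting: str) -> str:
    """Translate a hadoop.rpc.protection value into its SASL QoP token."""
    try:
        return SASL_QOP[setting.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown hadoop.rpc.protection value: {setting!r}")
```

The weak-QoP concern on the previous slide corresponds to clusters that stop at "authentication" for performance reasons instead of enabling "privacy".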
- 15.
Architectural Case Study
Commercial Client
- 16.
Challenges
+ Client is a multi-national Fortune 500 company with over 100,000 employees
+ Client had multiple data sources for each business unit – R&D, Manufacturing, Sales and Marketing, Corporate
+ Client wanted to combine data, but there were many sensitive issues around new product development and access to data by third-party contractors and others within its network boundaries
+ Previous efforts to integrate data had failed because of political and technical issues
+ Could not get the CISO to sign off on combining data!
- 17.
Securing the Enterprise Ecosystem
Design Goals
+ Build a fully realized "Data Lake" combining information from many different sources
+ Protect from unauthorized release or modification of information
+ Focus primarily on full-text retrieval but enable a variety of analytic functions
+ Enable the use of a variety of components from the Hadoop ecosystem
+ Implement in a series of phases based on client requirements
- 18.
Our Common Reference Architecture for Big Data (diagram summary)
+ Human Insights and Actions – Visualization, Reporting, Dashboards, and Query Interface; enabled by customizable interfaces and visualizations of the data
+ Analytics and Services – Services (SOA) plus Analytics and Discovery; your tools for analysis, modeling, testing, and simulations: Machine Learning, Free-Computation, Alerting, Geographic, Language Translation, Entity Relationship, Event Grab, Dense/Sparse, Streaming Analytics
+ Data Management – the single, secure repository for all of your valuable data: the Data Lake, Views and Indexes (including streaming indexes), and Metadata Tagging over Structured, Unstructured, and Streaming data sources
+ Infrastructure – the technology platform for storing and managing your data: Provisioning, Deployment, Monitoring, Workflow
- 19.
Data Lake Platform Components & Search App Architecture (diagram summary; open-source components shown in green)
AWS Virtual Private Cloud (EC2):
+ Extract/ingest: static relational databases and static data feed the data lake via Sqoop and custom ingest logic (periodic updates); streaming data, user-uploaded data sets, and relational database triggers feed Kafka for low-latency updates
+ Distributed storage: Hadoop HDFS
+ Distributed analytics & indexing: Hadoop MapReduce, Hive, and a Storm+Lucene processing layer producing index files; index persistence & metadata management in Accumulo or other non-relational stores, depending on use case; information model / Hive metastore; ZooKeeper
+ Applications & services layer: Jetty app server hosting search & BI logic, serving interactive search and batch reporting through a view/UI model
+ Security & infrastructure services: Kerberos SSO connector, directory services, Knox Gateway & audit logging, security groups (FW), network ACLs, standard AWS machine images, encrypted data volumes, antivirus & system monitoring; DNS, DHCP, NTP, SMTP, and proxy (package update) services
On-Premise Network:
+ Browser app front-end clients (on-network users) reach the platform through the on-premise firewall and AWS Direct Connect
+ Analytic app & BI users (on-network) connect via Spotfire & other BI tools
+ Privileged users / data scientists have direct access via a remote-access certificate (2-way SSL)
Enterprise security, monitoring, and governance controls (including data governance & stewardship) span the whole platform
- 20.
tl;dr
+ Data loading via Sqoop / custom transport
+ Ingest / indexing via MapReduce
+ Distributed query via Storm+Lucene
+ Batch / reporting via MR / Hive
+ Authentication via Kerberos
+ Access via web application & Knox
+ Currently 100 TB (50% used), 150 TB by EOY
- 21.
Infrastructure and Network Security
+ Amazon Web Services provided:
+ Virtual Private Cloud / Security Groups
+ Time to deployment in early phases
+ Physical access to data centers, network isolation, etc.
+ Future transition to on-premise infrastructure:
+ Concerned with procurement time
+ Other clients we've worked with have seen 3-6 month turnarounds for infrastructure prep
+ Instance-level malware detection tuned to co-exist with cluster workloads
- 22.
Encryption
At Rest:
+ LUKS (Linux Unified Key Setup) for Ephemeral Storage Volumes
+ “Lock it up and throw away the key”
In Transit:
+ SSL to Web App Endpoints and Knox Gateway
+ Internal network isolation – VPC controls prevent traffic interception & MITM attacks
- 23.
Authentication and Authorization
+ Authentication via Kerberos
+ Authorization via LDAP
+ Future transition to enterprise authentication services: Oracle IAM.
+ Multi-factor Authentication for both Users and Devices via PKI
+ Authorization performed at both the User and Device Level
- 24.
Operating system user accounts and groups for users, projects, and teams are reflected in HDFS permissions
Privileged access via a Knox Gateway extension that provides SSH access, with auditing, monitoring, and control of administrative connections into the cluster (KNOX-250)
Privileged Access Management (diagram summary)
+ External sources connect to the Knox Gateway via REST/SSL and SSH
+ The Knox Gateway authenticates against an Identity Provider
+ Knox forwards traffic to the Hadoop cluster (Master, Oozie, Hive2 Server) over HTTP with SPNEGO
- 25.
Putting it All Together
+ The search UI is a web application accessed via SSL
+ Knox is the primary cluster access mechanism for users who need access to the cluster. Knox provides access to the following services:
+ WebHDFS, WebHCat, Hive, Oozie
+ Knox also handles administrative access, via a custom SSH plugin
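From a client's perspective, routing through Knox means every request targets the gateway rather than the NameNode. As a sketch (the hostname, port 8443, and the "default" topology name are illustrative placeholders; the /gateway/{topology}/webhdfs/v1 path shape follows Knox's documented URL convention and the WebHDFS REST API):

```python
# Sketch: building a WebHDFS request URL that is routed through the Knox
# gateway instead of hitting the NameNode directly. Hostname, port, and
# topology are hypothetical examples, not values from this deployment.
from urllib.parse import urlencode

def knox_webhdfs_url(gateway_host, topology, path, op, **params):
    """Build a WebHDFS REST URL behind a Knox gateway topology."""
    query = urlencode({"op": op, **params})
    return (f"https://{gateway_host}:8443/gateway/{topology}"
            f"/webhdfs/v1{path}?{query}")

# e.g. list a home directory over SSL, authenticating at the gateway:
url = knox_webhdfs_url("knox.example.com", "default", "/user/alice", "LISTSTATUS")
```

The actual HTTPS call would carry the user's credentials to Knox, which performs authentication and auditing before proxying into the cluster.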
- 26.
Future Directions
+ Role-Based Access Control is an emerging client need. This will require:
+ Integration with enterprise role management
+ Passing roles through the web app & Knox to the backend
+ Role-based access in Accumulo and Lucene indexes
+ Smart Data tagging strategy …
- 27.
Smart Data
- 28.
Smart Data
+ How many organizations have data security requirements?
+ A structured, verifiable representation of security tags bound to the data is required in order for the enterprise to become inherently "smarter" about the information flowing in and around it – Smart Data
+ Overview of design principles:
+ PKI
+ Implement ABAC controls in IdAM
+ Define a trusted data format based on data security
+ Tag all your data
+ Deploy a Hadoop platform that leverages tags to track access
+ Log, monitor, and audit everything
- 29.
Overview of Smart Data (diagram summary)
+ A user authenticates through IdAM and is assigned authorization attributes (e.g. red, orange, blue)
+ Each data element carries visibility tags (e.g. red | blue | green)
+ Apache Accumulo matches the user's authorization attributes against each element's visibility tags to control access
+ The surrounding platform layers are those of the common reference architecture (Machine Learning, Free-Computation, Alerting, Geographic, Language Translation, Entity Relationship, Event Grab, Dense/Sparse, Streaming Analytics; Structured, Unstructured, and Streaming sources; Provisioning, Deployment, Monitoring, Workflow; streaming indexes)
- 30.
Allow access to resource MedicalJournal with attribute patientID=x
if Subject match DesignatedDoctorOfPatient
and action is read
with obligation
on Permit: doLog_Inform(patientID, Subject, time)
on Deny: doLog_UnauthorizedLogin(patientID, Subject, time)
Smart Data Security Controls
+ Trusted Client – authorization and authentication using PKI
+ Trusted Data Format – data visibility is controlled using Boolean expressions
+ e.g. “((red|blue|green) & (white|yellow))”
+ Clients present authorizations (red, blue, green, yellow) to Apache Accumulo
+ Corresponding tags are bound to data stored in Apache Accumulo
+ Trusted Log – all data interactions are logged and audited
Identity and Access Management
+ Attribute-Based Access Control – users are all assigned a series of attributes
+ Attributes and authorization bound by XACML, SAML
+ Policy Decision Point (PDP)
+ Policy Enforcement Point (PEP)
+ Policy Retrieval Point (PRP)
+ Policy Information Point (PIP)
+ Policy Administration Point (PAP)
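The Boolean visibility expressions above can be evaluated mechanically against a reader's authorizations. A minimal sketch of the semantics (illustrative only, not Accumulo's actual ColumnVisibility implementation; here '&' binds tighter than '|', whereas Accumulo itself requires parentheses when mixing the two operators):

```python
# Sketch: evaluating an Accumulo-style visibility expression such as
# "((red|blue|green)&(white|yellow))" against a set of authorizations.
# '&' means AND, '|' means OR, and a bare token is satisfied iff the
# reader holds that authorization. Not Accumulo's real implementation.
import re

def evaluate(expr: str, auths: set) -> bool:
    """Return True if the given authorizations satisfy the expression."""
    tokens = [t for t in re.split(r"([&|()])", expr.replace(" ", "")) if t]
    pos = 0

    def parse_or():                       # '|' has the lowest precedence
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            rhs = parse_and()             # always consume, then combine
            result = result or rhs
        return result

    def parse_and():                      # '&' binds tighter than '|'
        nonlocal pos
        result = parse_atom()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            rhs = parse_atom()
            result = result and rhs
        return result

    def parse_atom():
        nonlocal pos
        tok = tokens[pos]
        if tok == "(":
            pos += 1
            result = parse_or()
            pos += 1                      # skip the matching ')'
            return result
        pos += 1
        return tok in auths               # bare authorization token

    return parse_or()
```

So a cell tagged “((red|blue|green) & (white|yellow))” is visible to a client presenting {red, white}, but not to one presenting only {red, blue}.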
- 31.
Tagging Smart Data
Formulate the tags used to control data from multiple perspectives:
+ Data origin
+ Level of access required
+ Information governance policy
+ Data owners
+ Intended recipients
Use fine-grained tags and assign users many roles
+ Tag at the field level so that existence can be verified without revealing the full data record
In Accumulo:
+ Capitalize on the richness of Boolean expressions in visibility tags
+ Differential compression eliminates the impact of repetition of data
+ Visibility tags are bound to the data, so changing visibilities is not trivial: it means a delete and a re-add
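The delete-and-re-add point follows from the tag being part of the stored cell's key. A toy model of that behavior (not the Accumulo API; the row and column names are made up for illustration):

```python
# Toy model of why changing a visibility "means a delete and a re-add":
# the visibility tag is part of the cell's key, so a tag cannot be
# updated in place -- the old cell is removed and a new one is written.
cells = {}  # (row, column, visibility) -> value

def put(row, col, visibility, value):
    cells[(row, col, visibility)] = value

def change_visibility(row, col, old_vis, new_vis):
    """Re-tagging = remove the old cell, write a new cell under the new tag."""
    value = cells.pop((row, col, old_vis))   # the delete...
    put(row, col, new_vis, value)            # ...and the re-add

put("patient-1", "diagnosis", "PII&RESEARCH", "example-value")
change_visibility("patient-1", "diagnosis", "PII&RESEARCH", "HIPAA&ACCOUNTING")
```

At scale this matters: re-tagging a data set is a rewrite of every affected cell, not a metadata flip, which is why the tagging scheme deserves up-front design.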
- 32.
Representational versus Referential Tags
Representational tags encode the specific visibilities they represent, including all alternate controls for a specific document:
User has roles of ACCOUNTING, RESEARCH, and PII
+ If data has tag PII&RESEARCH, the user can access the data
+ If data has tag HIPAA&ACCOUNTING, the user can't access the data
Referential tags are a code that relies on external translation between assigned access controls and visibility markings:
Data has a marking of 03DECAF00D
+ User has roles of ACCOUNTING, RESEARCH, and PII
+ At lookup, user roles are translated into the possible referential tags
The choice depends on your security posture: what are the consequences of getting it wrong versus the ease of shifting policy or data?
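The contrast can be sketched in a few lines. This is illustrative only: the translation table echoes the slide's 03DECAF00D example, and a simple all-of-'&' check stands in for a full Boolean-expression evaluator.

```python
# Sketch of the two tagging schemes. Representational tags carry the
# access expression themselves; referential tags are opaque codes
# resolved through an external translation table.

def can_access_representational(tag: str, roles: set) -> bool:
    # e.g. "PII&RESEARCH": the reader needs every AND'ed role
    return all(part in roles for part in tag.split("&"))

# External translation, managed outside the data store:
REFERENCE_TABLE = {"03DECAF00D": "HIPAA&ACCOUNTING"}

def can_access_referential(code: str, roles: set) -> bool:
    expression = REFERENCE_TABLE.get(code)   # shifting policy = editing the table
    return expression is not None and can_access_representational(expression, roles)

roles = {"ACCOUNTING", "RESEARCH", "PII"}
```

Note the trade-off the slide raises: with referential tags, policy can shift by editing the table without rewriting data, but a compromised or wrong table silently changes who can see what.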
- 33.
Emerging Security Capabilities
- 34.
The ecosystem of security capabilities for Hadoop is growing rapidly
Cloudera (with Intel Rhino)
+ Sentry (ACLs for Hive / Impala)
+ Gazzang (Filesystem Encryption)
+ Intel Rhino
+ Encryption Codec Support HADOOP-9331
+ Key Distribution & Management MAPREDUCE-5025
+ Token Based Authentication HADOOP-9392
+ Unified Authorization Framework HADOOP-9466
+ Transparent Encryption for HBase/ZooKeeper
+ Others, see https://github.com/intel-hadoop/project-rhino/
Hortonworks
+ Production Ready Apache Knox
+ XA Secure
+ Central Administration
+ Authorization for HDFS / Hive / HBase
+ Compliance Controls
Lots of talks at this Hadoop Summit on data security:
+ The Future of Hadoop Security – Joey Echeverria
+ Hadoop REST API Security with the Apache Knox Gateway – Kevin Minder, Larry McCay
+ Securing Big Data: Lock it Down, or Liberate? – Jeff Graham, Mark Tomallo
+ Improvements in Hadoop Security – Sanjay Radia, Chris Nauroth
- 35.
Summary
+ Security for Hadoop has come a long way and is changing rapidly, but it is still maturing
+ Securing data in Hadoop means thinking differently about the architecture when combining multiple data sources
+ Your Hadoop architecture should provide consistent security mechanisms across all of the data
+ A more complete way to secure data is to implement Smart Data (ABAC and fine-grained access controls), but this hasn't been embraced consistently across the Hadoop ecosystem yet
+ The next 6 months will be interesting …
- 36.
Just Released!
The Field Guide to Data Science
120-page e-book of data science geekery
Download for free:
http://www.boozallen.com/datascience
Thanks!
Drew (@drewfarris)
Peter (@petrguerra)