This document discusses protecting big data with Intel technologies. It summarizes Intel's Distribution for Apache Hadoop software, which includes encryption and role-based access control features. The software provides an encryption framework that extends Hadoop's compression codec and establishes a common encryption API. It also allows different key storage systems to integrate for key management. Performance tests show Intel AES-NI instructions accelerate encryption and decryption, providing up to 19.8x faster decryption compared to non-AES-NI.
1. Protect Your Big Data with Intel® Xeon®
Processors and Intel® Software Products
for Apache* Hadoop*
Bing Wang, Product Manager, Intel
Tianyou Li, System Architect & Engineering Manager, Intel
Haidong Xia, Cloud Security Designer, Intel
BIGS003
2. Agenda
• Big Data Security Trend
• Intel® Distribution for Apache Hadoop*
• Intel Distribution for Apache Hadoop Encryption
• Intel Distribution for Apache Hadoop Role Based
Access Control
• Summary/Call to Action
The PDF for this Session presentation is available from our
Technical Session Catalog at the end of the day at:
intel.com/go/idfsessionsBJ
URL is on top of Session Agenda Pages in Pocket Guide
4. Big Data Insights … New Frontier for Innovation
[Figure: data volume rising over time with the arrival of massive data. Headline labels: billions of connected users and devices; >3000 exabytes of cloud traffic; 690% storage growth. Data types shown: structured, unstructured, network, corporate, social, scientific, and sensed data. Services and devices labeled: Skype*, Facebook*, Hotmail*, Yahoo*, and cell phones, with user counts of 629m, 663m, 5.3 bn, 364m, and 273m.]
[Figure: data processing costs per terabyte (Traditional MPP - $50K) set against new analytics tools and products, business information processing and insights, and dramatic ROI.]
The 690 percent growth in storage capacity is based on Intel analysis and IDC data:
from 26,066 petabytes in 2010 to 179,327 petabytes in 2015, which is ~690%
5. Big Data Security Concerns
Data Protection
• How to protect sensitive data:
− PII, customer info, IP, credit card, …
• Regulatory and compliance requirements
• Encryption is the method of choice for data protection
• Encryption was infeasible due to performance overhead
Access Control
• Who can access the data?
− Need granular control for data access
• No built-in access control in current Big Data framework
7. Intel® Distribution for Apache
Hadoop* Software
Distribution highlights:
• Automatic tuning of Hadoop* cluster configuration
• Multi-site scalability and adaptive replication in HBase*
• Industry’s 1st hardware-assisted encryption (this session's focus)
• Role-based access control & granular ACLs in HBase* (this session's focus)
Intel® Manager for Apache Hadoop* software: Deployment, Configuration, Monitoring, Alerts, and Security
Components:
• Mahout* 0.7: Machine Learning
• Sqoop* 1.4.1: Data Exchange
• Oozie* 3.3.0: Workflow
• Pig* 0.9.2: Scripting
• R connectors: Statistics
• Hive* 0.9.0: SQL Query
• HBase 0.94.1: Columnar Store
• ZooKeeper* 3.4.5: Coordination
• YARN (MRv2): Distributed Processing Framework
• Flume* 1.3.0: Log Collector
• HDFS 2.0.3: Hadoop Distributed File System
Legend: Intel proprietary; Intel enhancements contributed back to open source; open source components included without change
8. Hadoop* Encryption: Protect Data from
“Disk Leak”
[Illustration: two reactions to a leaked disk]
• Without the key: "&$!@... Data was encrypted, how can I crack it?"
• With the key: "I have the key and passphrase, I can recover the data via Intel tool"
10. Data Protection with Intel® AES-NI
Intel® AES-NI:
• 7 instructions that expose special math functions built into the processor to accelerate AES
• Makes enabled encryption software faster and stronger
Efficient Ways to Use Encryption for Data Protection:
• Data at Rest: full disk encryption software that protects data while saving to disk
• Data in Motion: secure transactions used pervasively in ecommerce, banking, etc. (Internet, Intranet)
• Data in Process: most enterprise and cloud applications offer encryption options to secure information and protect confidentiality
Intel® AES-NI = Intel® Advanced Encryption Standard New Instructions
11. Intel® Distribution for Apache Hadoop*
Software: Encryption Framework
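[Diagram: where encryption and decryption occur between HDFS, the client, and a MapReduce job]
• Input: a derivative RecordReader decrypts data read from HDFS before Map
• Map side: Map, Combiner, Partitioner; output written to local storage is encrypted
• Reduce side: data is decrypted for Merge & Sort, then Reduce
• Output: a derivative RecordWriter encrypts data written back to HDFS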
12. Crypto Codec Framework
• Extends the compression codec and establishes a common abstraction at the API level that can be shared by all crypto codec implementations and by users of the API
// Instantiate the configured codec class reflectively, as with a compression codec
CryptoCodec cryptoCodec = (CryptoCodec) ReflectionUtils.newInstance(codecClass, conf);
CryptoContext cryptoContext = new CryptoContext();
...
// Attach the crypto context before creating streams
cryptoCodec.setCryptoContext(cryptoContext);
CompressionInputStream input = cryptoCodec.createInputStream(inputStream);
…
• Provides a foundation for other components in Hadoop*, such as MapReduce or HBase*, to support encryption features
14. Crypto Codec File Format
[Diagram: crypto codec file layout]
Stream: Block | Block | Block | Block | …
Block: Sync Mark (16 byte) | Block header | Algorithm header | Original Size (4 byte) | Encrypted Size (4 byte) | Encryption data …
Header fields (left to right; their nesting within the block is not recoverable here): Stream header length (4 byte) | Version (4 byte) | Key profile header | Stream header | Extension header | IV (16 byte)
Encryption data: Compressed Size (4 byte) | Compressed data | Compressed Size (4 byte) | Compressed data | …
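The length-prefixed layout of the encryption data row (a 4-byte size followed by the compressed data, repeated) can be sketched in plain Java. This is only an illustration of the framing idea, not the codec's actual reader or writer: the class name ChunkFraming is hypothetical, and big-endian byte order is an assumption that the slide does not state.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: frames chunks as [size (4 byte)][data], repeated. */
final class ChunkFraming {
    // Writes each chunk as a 4-byte length followed by the chunk bytes.
    // Assumption: big-endian sizes, which is what DataOutputStream.writeInt emits.
    static byte[] frame(List<byte[]> chunks) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (byte[] chunk : chunks) {
                out.writeInt(chunk.length);
                out.write(chunk);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads size-prefixed chunks back until the input is exhausted.
    static List<byte[]> unframe(byte[] framed) {
        List<byte[]> chunks = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(framed))) {
            while (in.available() > 0) {
                int size = in.readInt();
                byte[] chunk = new byte[size];
                in.readFully(chunk);
                chunks.add(chunk);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chunks;
    }
}
```

Carrying an explicit size in front of each chunk lets a reader skip or buffer a chunk without scanning its contents.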
15. Crypto Codec: API Example
Usage mirrors the compression codec API, with the addition of a crypto context.
Configuration conf = new Configuration();
CryptoCodec cryptoCodec =
    (CryptoCodec) ReflectionUtils.newInstance(AESCodec.class, conf);
// The crypto context carries the key, here derived from a password
CryptoContext cryptoContext = new CryptoContext();
cryptoContext.setKey(Key.derive(password));
cryptoCodec.setCryptoContext(cryptoContext);
// inputFile and outputFile are paths defined elsewhere
DataInputStream input = inputFile.getFileSystem(conf).open(inputFile);
DataOutputStream outputStream = outputFile.getFileSystem(conf).create(outputFile);
CompressionOutputStream output = cryptoCodec.createOutputStream(outputStream);
// encrypt the stream (writeStream is a copy helper not shown on the slide)
writeStream(input, output);
input.close();
output.close();
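For readers without the Intel distribution at hand, the same pattern (derive a key from a password, wrap the output stream, copy the input through it) can be tried with the standard Java Cryptography Extension. This is a stand-in sketch and not the CryptoCodec API: the class JceStreamCrypto, the PBKDF2 parameters, and the AES/CTR mode are all assumptions chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.SecretKey;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical stand-in built on the JDK's JCE, not on the Intel crypto codec. */
final class JceStreamCrypto {
    // Derives a 128-bit AES key from a password (iteration count is an assumption).
    static SecretKey deriveKey(char[] password, byte[] salt) {
        try {
            SecretKeyFactory factory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256");
            PBEKeySpec spec = new PBEKeySpec(password, salt, 65536, 128);
            return new SecretKeySpec(factory.generateSecret(spec).getEncoded(), "AES");
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Copies everything from in to out, like the writeStream helper on the slide.
    static void writeStream(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
    }

    private static Cipher newCipher(int mode, SecretKey key, byte[] iv)
            throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(mode, key, new IvParameterSpec(iv));
        return cipher;
    }

    // Encrypts by copying the plaintext through an encrypting output stream.
    static byte[] encrypt(byte[] plain, SecretKey key, byte[] iv) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out =
                new CipherOutputStream(sink, newCipher(Cipher.ENCRYPT_MODE, key, iv))) {
            writeStream(new ByteArrayInputStream(plain), out);
        } catch (GeneralSecurityException | IOException e) {
            throw new IllegalStateException(e);
        }
        return sink.toByteArray();
    }

    // Decrypts by reading the ciphertext through a decrypting input stream.
    static byte[] decrypt(byte[] encrypted, SecretKey key, byte[] iv) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream in = new CipherInputStream(
                new ByteArrayInputStream(encrypted), newCipher(Cipher.DECRYPT_MODE, key, iv))) {
            writeStream(in, sink);
        } catch (GeneralSecurityException | IOException e) {
            throw new IllegalStateException(e);
        }
        return sink.toByteArray();
    }
}
```

In real use the salt and IV would be random and stored alongside the data; an IV must never be reused with the same key in CTR mode.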
16. Crypto Codec: A Simple MapReduce
Example
Usage mirrors compression codec usage in a MapReduce job, with the addition of key context resolution.
Job job = Job.getInstance(conf, "example");
JobConf jobConf = (JobConf)job.getConfiguration();
// Reference key KEY00, and key KEY01 for paths matching the pattern
FileMatches fileMatches = new FileMatches(
    KeyContext.refer("KEY00", Key.KeyType.SYMMETRIC_KEY, "AES", 128));
fileMatches.addMatch("^.*/input1.intelaes$",
    KeyContext.refer("KEY01", Key.KeyType.SYMMETRIC_KEY, "AES", 128));
// Keys are provided by a JCEKS keystore under secureDir
String keyStoreFile = "file:///" + secureDir + "/my.keystore";
String keyStorePasswordFile = "file:///" + secureDir + "/my.keystore.passwords";
KeyProviderConfig keyProviderConfig =
    KeyProviderCryptoContextProvider.getKeyStoreKeyProviderConfig(
        keyStoreFile, "JCEKS", null, keyStorePasswordFile, true);
// Register the crypto context provider for the job's input
KeyProviderCryptoContextProvider.setInputCryptoContextProvider(
    jobConf, fileMatches, true, keyProviderConfig);
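The idea of mapping file paths to key names by pattern can be sketched independently of the Intel classes. KeyNameMatcher below is a hypothetical class, and two behaviors are assumptions rather than documented FileMatches semantics: the first matching rule wins, and the constructor argument acts as the fallback.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of per-file key selection; not the FileMatches class. */
final class KeyNameMatcher {
    private static final class Rule {
        final Pattern pattern;
        final String keyName;

        Rule(Pattern pattern, String keyName) {
            this.pattern = pattern;
            this.keyName = keyName;
        }
    }

    private final String defaultKeyName;
    private final List<Rule> rules = new ArrayList<>();

    KeyNameMatcher(String defaultKeyName) {
        this.defaultKeyName = defaultKeyName;
    }

    // Registers a regex that selects keyName for every path it fully matches.
    KeyNameMatcher addMatch(String regex, String keyName) {
        rules.add(new Rule(Pattern.compile(regex), keyName));
        return this;
    }

    // Assumption: the first matching rule wins; otherwise the default key applies.
    String resolve(String path) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(path).matches()) {
                return rule.keyName;
            }
        }
        return defaultKeyName;
    }
}
```

Note that the pattern on the slide leaves the dot before "intelaes" unescaped, so it matches any character there; writing `\\.` in the Java string restricts it to a literal dot.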
17. Key Distribution and Protection for
MapReduce
• Targets
– A framework on the MapReduce side for enabling the crypto codec in MapReduce jobs, covering key context resolution, distribution, and protection
– Enabling different key storage or management systems to plug in to provide keys
– Satisfying the common requirement that different stages and files of a single job may use different keys
• A complete key management system is not part of Intel® Distribution for Apache Hadoop* Software
– An API to integrate with an external key management system is included
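One way to picture a pluggable key source is a small interface with a keystore-backed implementation. The sketch below uses only the JDK's KeyStore API with the JCEKS type named on the previous slide; the KeySource interface and KeyStoreKeySource class are hypothetical and are not the integration API shipped with the distribution.

```java
import java.io.IOException;
import java.security.GeneralSecurityException;
import java.security.Key;
import java.security.KeyStore;
import javax.crypto.SecretKey;

/** Hypothetical plug-in point: any key storage system could implement this. */
interface KeySource {
    SecretKey getKey(String keyName);
}

/** Hypothetical implementation backed by an in-memory JCEKS keystore. */
final class KeyStoreKeySource implements KeySource {
    private final KeyStore keyStore;
    private final char[] password;

    KeyStoreKeySource(char[] password) {
        this.keyStore = emptyStore();
        this.password = password.clone();
    }

    // Creates an empty JCEKS keystore; load(null, null) initializes it without a file.
    private static KeyStore emptyStore() {
        try {
            KeyStore store = KeyStore.getInstance("JCEKS");
            store.load(null, null);
            return store;
        } catch (GeneralSecurityException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Stores a secret key under the given name, protected by the store password.
    void putKey(String keyName, SecretKey key) {
        try {
            keyStore.setEntry(keyName, new KeyStore.SecretKeyEntry(key),
                    new KeyStore.PasswordProtection(password));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    @Override
    public SecretKey getKey(String keyName) {
        try {
            Key key = keyStore.getKey(keyName, password);
            if (!(key instanceof SecretKey)) {
                throw new IllegalArgumentException("No secret key named " + keyName);
            }
            return (SecretKey) key;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because callers depend only on KeySource, a different backend (for example an external key management service) could be swapped in without touching job code.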
18. Test Environment
Processor: Intel® Xeon® processor E5-2690 @ 2.90GHz (32 cores, only 1 core is used)
Software: Intel® Distribution for Apache Hadoop* version 2.3
Memory: 32GB
Operating System: CentOS* 6.3
Encryption Software: OpenSSL* 1.0.1c (Intel® AES-NI enabled)
File System: Apache Hadoop Distributed File System (HDFS*); namenode, datanode, and the test program were all run on the same server
Storage: 240 GB Intel® Solid-State Drive (SSD) 320 Series
Test Input: 1 GB text file
Input Buffer Size: 64K, 4K, 1K (data size passed to the encryption/decryption interface on each call)
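The role of the input buffer size can be illustrated with a small in-memory measurement that feeds data to a cipher in fixed-size calls. This is not the test program used for the results in this deck: it uses the JDK's JCE rather than OpenSSL*, the class BufferSizeBenchmark is hypothetical, and a single untimed-warm-up-free run like this gives only a rough number.

```java
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical micro-benchmark sketch; not the harness behind the slides. */
final class BufferSizeBenchmark {
    // Encrypts data with AES/CTR, passing bufferSize bytes to the cipher per call.
    static byte[] encryptInChunks(byte[] data, int bufferSize, SecretKey key, byte[] iv) {
        try {
            Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
            byte[] out = new byte[data.length];
            int written = 0;
            for (int offset = 0; offset < data.length; offset += bufferSize) {
                int length = Math.min(bufferSize, data.length - offset);
                written += cipher.update(data, offset, length, out, written);
            }
            // Flush anything the cipher may still hold (normally empty for CTR).
            byte[] tail = cipher.doFinal();
            System.arraycopy(tail, 0, out, written, tail.length);
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns throughput in MB/s for one pass over the data.
    static double throughputMBps(byte[] data, int bufferSize, SecretKey key, byte[] iv) {
        long start = System.nanoTime();
        encryptInChunks(data, bufferSize, key, iv);
        long elapsed = Math.max(1, System.nanoTime() - start);
        return (data.length / (1024.0 * 1024.0)) / (elapsed / 1e9);
    }

    public static void main(String[] args) {
        byte[] data = new byte[16 * 1024 * 1024];           // 16 MB of zeros as test input
        SecretKey key = new SecretKeySpec(new byte[16], "AES"); // all-zero demo key only
        byte[] iv = new byte[16];
        for (int bufferSize : new int[] {64 * 1024, 4 * 1024, 1024}) {
            System.out.printf("%dK buffer: %.0f MB/s%n",
                    bufferSize / 1024, throughputMBps(data, bufferSize, key, iv));
        }
    }
}
```

Smaller buffers mean more calls into the cipher for the same amount of data, which is why buffer size is reported as a test parameter.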
19. Encryption in Memory
AES Encryption speed (MB/s), higher is better
Buffer size    64k    4k     1k
AES-NI         460    457    454
No AES-NI      87     87     86
Up to 5.3x
AES-NI = Intel® Advanced Encryption Standard New Instructions
4/10/2013
Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.
20. Decryption in Memory
AES Decryption speed (MB/s), higher is better
Buffer size    64k     4k      1k
AES-NI         1266    1259    1253
No AES-NI      64      63      63
Up to 19.8x
AES-NI = Intel® Advanced Encryption Standard New Instructions
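The headline multipliers follow from dividing the AES-NI speed by the non-AES-NI speed at the same buffer size. The sketch below redoes that arithmetic from the rounded values in the two in-memory tables; the class AesNiSpeedup is hypothetical. With these rounded inputs the encryption ratios come out near 5.3, and the decryption ratios fall between roughly 19.8 (64k) and 20.0 (4k), so small differences from the slide's figure are presumably rounding in the published numbers.

```java
/** Hypothetical helper that recomputes speedups from the table values. */
final class AesNiSpeedup {
    // Speedup is throughput with AES-NI divided by throughput without it.
    static double speedup(double withAesNi, double withoutAesNi) {
        return withAesNi / withoutAesNi;
    }

    // Largest speedup across buffer sizes (arrays are index-aligned).
    static double maxSpeedup(double[] withAesNi, double[] withoutAesNi) {
        double best = 0;
        for (int i = 0; i < withAesNi.length; i++) {
            best = Math.max(best, speedup(withAesNi[i], withoutAesNi[i]));
        }
        return best;
    }

    public static void main(String[] args) {
        // Values in MB/s for 64k, 4k, 1k buffers, copied from the tables.
        double[] encWith = {460, 457, 454};
        double[] encWithout = {87, 87, 86};
        double[] decWith = {1266, 1259, 1253};
        double[] decWithout = {64, 63, 63};
        System.out.printf("encryption, best case: %.2fx%n", maxSpeedup(encWith, encWithout));
        for (int i = 0; i < decWith.length; i++) {
            System.out.printf("decryption, column %d: %.2fx%n",
                    i, speedup(decWith[i], decWithout[i]));
        }
    }
}
```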
21. Combining Encryption with Compression
(Memory-to-HDFS Transfer)
[Chart: throughput (MB/s) by input buffer size (64k, 4k, 1k), higher is better. Series: hdfs io write; snappy + hdfs io; gzip + hdfs io; zlib + hdfs io; aes w/ AES-NI; aes + snappy w/ AES-NI; aes + gzip w/ AES-NI; aes + zlib w/ AES-NI; aes w/o AES-NI; aes + snappy w/o AES-NI; aes + gzip w/o AES-NI; aes + zlib w/o AES-NI. Bar labels range from 51 to 489 MB/s.]
Up to 1.5X faster with Intel® AES-NI
AES-NI = Intel® Advanced Encryption Standard New Instructions, HDFS = Hadoop* Distributed File System
22. Combining Decryption with Decompression
(HDFS-to-Memory File Transfer)
[Chart: throughput (MB/s) by input buffer size (64k, 4k, 1k), higher is better. Series: hdfs io read; snappy + hdfs io; gzip + hdfs io; zlib + hdfs io; aes w/ AES-NI; aes + snappy w/ AES-NI; aes + gzip w/ AES-NI; aes + zlib w/ AES-NI; aes w/o AES-NI; aes + snappy w/o AES-NI; aes + gzip w/o AES-NI; aes + zlib w/o AES-NI. Bar labels range from 56 to 1287 MB/s.]
Up to 3.3X faster with Intel® AES-NI
AES-NI = Intel® Advanced Encryption Standard New Instructions, HDFS = Hadoop* Distributed File System
23. Where to Find the Source Code…
• The patch and design document have already been submitted to
HADOOP-9331
• A working fork of Hadoop* with the encryption
framework can be found in the GitHub project
25. Role Based Access Control (RBAC):
Overview
[Diagram: Intel Manager maps users and groups from Active Directory to roles; each role carries permissions for HDFS, HBase*, Hive*, and MapReduce]
• Users/groups and roles will be translated into configuration files
• ACL configurations will be pushed to every required node
HDFS = Hadoop* Distributed File System
26. RBAC: Role Definition
• A role is a collection of permissions
• A permission can have resource parameters
• A role can be associated with users/groups
• One user/group can have multiple roles
• Role nesting is not currently supported
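The role definition above maps naturally onto a small data model. The following is a minimal sketch under stated assumptions, not the Intel Manager implementation: the class names, the action strings, and the resource naming scheme are all hypothetical, and a permission check here is a simple exact match on action and resource.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** A permission with a resource parameter (hypothetical naming). */
final class ResourcePermission {
    final String action;    // e.g. "READ" or "WRITE"
    final String resource;  // resource parameter, e.g. a table or a path

    ResourcePermission(String action, String resource) {
        this.action = action;
        this.resource = resource;
    }

    boolean allows(String action, String resource) {
        return this.action.equals(action) && this.resource.equals(resource);
    }
}

/** A role is a flat collection of permissions; roles do not contain other roles. */
final class Role {
    final String name;
    final List<ResourcePermission> permissions;

    Role(String name, ResourcePermission... permissions) {
        this.name = name;
        this.permissions = new ArrayList<>(Arrays.asList(permissions));
    }
}

/** Associates users or groups (principals) with any number of roles. */
final class RoleAssignments {
    private final Map<String, List<Role>> rolesByPrincipal = new HashMap<>();

    void assign(String principal, Role role) {
        rolesByPrincipal.computeIfAbsent(principal, p -> new ArrayList<>()).add(role);
    }

    // Allowed if any role of any supplied principal grants the action on the resource.
    boolean isAllowed(Collection<String> principals, String action, String resource) {
        for (String principal : principals) {
            List<Role> roles =
                    rolesByPrincipal.getOrDefault(principal, Collections.<Role>emptyList());
            for (Role role : roles) {
                for (ResourcePermission permission : role.permissions) {
                    if (permission.allows(action, resource)) {
                        return true;
                    }
                }
            }
        }
        return false;
    }
}
```

A caller would pass the user name together with the user's group names as the principals, so that group-level role assignments apply as well.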
29. Beyond This…Project Rhino!
• A common authorization framework for the Hadoop*
ecosystem
• Token based authentication and single sign on
• Extend HBase* support for ACLs to the cell level
• Improve audit logging
Please visit:
https://github.com/intel-hadoop/project-rhino/
29
31. Summary/Call to Action
• Intel® Xeon® processor based servers
provide a strong foundation for big data
workloads
• Intel® Distribution for Apache Hadoop* with
Intel Xeon processors provides breakthrough
data security and access control for big data
analytics
• Develop applications to leverage Intel
Distribution for Apache Hadoop Security
capabilities
• Deploy big data solutions with Intel
Distribution for Apache Hadoop on Intel
Xeon processor-based servers
31
32. Additional Resources
• Intel® AES-NI Website
• Intel® Distribution for Apache Hadoop* Website
• Intel AES-NI animation
• Secure Cloud with High Performing Intel® Data
Protection Technologies animation
• “The Rijndael Cipher” - an AES tutorial animation
• Shay Gueron, “Advanced Encryption Standard (AES)
Instruction Set rev 2”, Intel whitepaper, June 2009.
• Shay Gueron, Michael Kounavis, “Carry-less
multiplication and its usage for computing the GCM
Mode”, Intel whitepaper, August 2009
• Intel AES-NI use with IBM DB2 database white paper
Intel® AES-NI = Intel® Advanced Encryption Standard New Instructions
34. Legal Disclaimer
• Intel® AES-NI requires a computer system with an AES-NI enabled processor, as well as non-Intel software to execute
the instructions in the correct sequence. AES-NI is available on select Intel® processors. For availability, consult your
reseller or system manufacturer. For more information, see Intel® Advanced Encryption Standard Instructions (AES-NI)
• Intel® Trusted Execution Technology (Intel® TXT): No computer system can provide absolute security under all
conditions. Intel® TXT requires a computer with Intel® Virtualization Technology, an Intel TXT enabled processor,
chipset, BIOS, Authenticated Code Modules and an Intel TXT compatible measured launched environment (MLE). Intel
TXT also requires the system to contain a TPM v1.s. For more information, visit
http://www.intel.com/technology/security.
• Intel® Virtualization Technology (Intel® VT) requires a computer system with an enabled Intel® processor, BIOS, and
virtual machine monitor (VMM). Functionality, performance or other benefits will vary depending on hardware and
software configurations. Software applications may not be compatible with all operating systems. Consult your PC
manufacturer. For more information, visit http://www.intel.com/go/virtualization.
• Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer
systems, components, software, operations and functions. Any change to any of those factors may cause the results to
vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated
purchases, including the performance of that product when combined with other products. For more information go to
http://www.intel.com/performance.
• Any software source code reprinted in this document is furnished under a software license and may only be used or
copied in accordance with the terms of that license.
• Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to
whom the Software is furnished to do so, subject to the following conditions:
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT
NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT
OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
34
37. Pillars & Challenges of Big Data
• Volume: massive scale and growth of unstructured data
– 80%~90% of total data
– Growing 10x~50x faster than structured (relational) data
– 10x~100x of traditional data warehousing
• Variety: heterogeneity and variable nature of Big Data
– Many different forms (text, document, image, video...)
– No schema or weak schema
– Inconsistent syntax and semantics
• Velocity: real-time rather than batch-style analysis
– Data streamed in, tortured, and discarded
– Making impact on the spot rather than after-the-fact
• Value: predictive analytics for future trends and patterns
– Deep, complex analysis (machine learning, statistical modeling, graph algorithms…) versus traditional business intelligence (querying, reporting…)
38. HDFS File Encryption: Architecture
Overview
[Diagram: an input data stream passes through the Encryption Codec (encrypt/decrypt, with a buffer) to an output data stream; the diagram also shows Key Management and a Native Crypto Lib alongside the codec]
HDFS = Hadoop* Distributed File System