3. What is sqrrl?
Based on Accumulo
A proven, secure, multi tenant, data platform for building real-time
applications
Scales elastically to tens of petabytes of data and enables
organizations to eliminate their internal data silos
Seamless integration with Hadoop, and most of its variants
Providing a supply for a much needed security demand (groundup security)
Already deployed and utilized by defense and government
industries
6. What is Accumulo?
Development began at the NSA in 2008
Base foundation for sqrrl
Cell-level security reduces the cost of app development,
circumnavigating complex, sometimes impossible, legal or
policy restrictions
Provides the ability to scale to >PB levels
Highly adaptive schema and sorted key/value paradigm
Stores key/value pairs in parsed, sorted, secure controls
8. How does Accumulo provide security?
Security Labels are applied to keys
Cell-level security is implemented to allow for security policy
enforcement, using data labeler tags
These policies are applied when data is ingested
Tablets contain data, are controlled using security policies
Stores key/value pairs in parsed, sorted, secure controls, a 5-tuple
key system
9. Accumulo Security (cont.)
Why Cell-Level Security Is Important:
Many databases insufficiently implement security through row-and columnlevel restrictions. Column-level security is only sufficient when the data
schema is static, well known, and aligned with security concerns. Row-level
security breaks down when a single record conveys multiple levels of
information. The flexible, fine-grained cell-level security within Sqrrl
Enterprise (or its root Accumulo) supports flexible schemas, new indexing
patterns, and greater analytic adaptability at scale.
10. Accumulo Security (cont.)
An Accumulo key is a 5-tuple key, consisting of:
Row: Controls Atomicity
Column Family: Controls Locality
Column Qualifier: Controls Uniqueness
Visibility Label: Controls Access
Timestamp: Controls Versioning
(Values are byte arrays)
Keys are sorted:
Hierarchically: Row first, then column family, and so on
Lexicographically: Compare first byte, then second, and so on
12. Accumulo Architecture
Accumulo servers (tablets) utilize a multitude of big data
technologies, but their layout is different than Map/Reduce, HDFS,
MongoDB, Cassandra, etc. used alone.
Data is stored in HDFS
Zookeeper is utilized for configuration management
SSH, password-less, node configuration
An emphasis, more of an imperative, on data model and data
model design
14. Accumulo Architecture (cont.)
Tablet Servers
Receive writes, responds to reads, from
clients
Writes to a write-ahead log, sorting new
key/value pairs in memory, while
periodically flushing sorted key/value
pairs to new files in HDFS
Managed by Master
Responsible for detecting and
responding to Tablet Server failure,
load balancing
Coordinates startup, graceful
shutdown, and recovery of writeahead logs
Zookeeper
An apache project, open source
Utilized for distributed locking
mechanism, with no single point of
failure
15. Integration with users/access
The visibility labels are a feature that is unique to Accumulo. No other
database can apply access controls at such a fine-grained level.
Labels are generated
by translating an
organization’s existing
data security and
information sharing
policies into Boolean
expressions
1.
Gather an organization’s information security policies and dissecting them into data--‐centric
and user--‐centric components
2.
As data is ingested into Accumulo, a data labeler tags individual key/value pairs with the
appropriate data--‐centric visibility labels based on these policies.
3.
Data is then stored in Accumulo where it is available for real--‐time queries by operational
applications. End users are authenticated through these applications and authorized to
access underlying data
4.
As an end user performs an operation via the app (e.g., performs a search request), the
visibility label on each candidate key/value pair is checked against his or her attributes, and
only the data that he or she is authorized to see is returned.
16. Making sqrrl work
sqrrl’s extensibility of Accumulo allows it to process
millions of records per second, as either static or
streaming objects
These records are converted into hierarchical JSON
documents, giving document store capabilities
Passing this data to the analytics layer is designed to
make integration and development of real-time
analytics possible, and accessible
Combining access at the cell level, with Accumulo,
sqrrl integrates Identity and Access Management
(IAM) systems (LDAP, RADIUS, etc.)
17. Making sqrrl work
Apache thrift:
(cont.)
Sqrrl process
Apache Lucene:
Enables development in
diverse language choices
Custom iterators, providing developers with real-time
capabilities, such as full-text search, graph analysis, and
statistics, for analytical applications and dashboards.
HDFS:
File Storage system,
compatible with
both open source
(OSS) and
commercial
versions
Data Ingest:
JSON, or graph
format
Apache Accumulo:
The core of transactional and online
analytical data processing in sqrrl
18. Who is sqrrl for?
CTOs/CIOs: Unlock the value in fractured and unstructured datasets
across your organization
Developers: More easily create apps on top of Big Data and
distributed databases
Infrastructure Managers: Simplify administration of Big Data through
highly scalable and multitenant distributed systems
Data Analysts: Dig deeper into your data using advanced analytical
techniques, such as graph analysis
Business Users: Use Big Data seamlessly via apps developed on top
of sqrrl enterprise
19. sqrrl/Accumulo wrap-up
Accumulo
sqrrl
bridges the gap for security perspectives that restrict
a large swath of industries
combines the best of available technologies,
develops and contributes their own, and designs
big apps for big data.
Accumulo Setup:
1.
2.
3.
Installation of HDFS and ZooKeeper must installed and configured
Password-less SSH should be configured between all nodes (emphasized master <> tablet)
Installation of Accumulo (from http://accumulo.apache.org/downloads/ using
http://accumulo.apache.org/1.4/user_manual/Administration.html#Installation
Or get started using their AMI (http://www.sqrrl.com/downloads#getting-started)
Editor's Notes
Welcome everyone to the meetup.Thank Michael Walker for being such a generous and kind host.Give a brief overview of the topic: Security is now paramount
We’ve all seen decision trees, this one is a little different.Notice how Accumulo has positioning in almost every aspect of big data analysis.We’ll get into the greater explanation of the interactions and operations of both sqrrl and Accumulo.
Big data has been brewing for a while, and even though Accumulo hasn’t been widely known; it’s had a rich history.Most of you will know about Google’s Bigtable white paper, which Accumulo is based on.In 2008, the NSA began developmentIn 2011 NSA open sources the solution and Accumulo becomes an incubation project at ApacheIn 2012 Accumulo becomes a top-level projectAnd this year the first release of sqrrl was unleashed.
The bulk of sqrrl’s design architecture is held within Accumulo, hence why we’ll be diving so deep into its inner-workings.The majority of value from sqrrl is extracted by augmenting and adapting other technologies to provide real-time analytics, security, SQL interactions and analysis, such as graph search.
As the history described, development began in 2008.Sqrrl uses Accumulo as a basis for most data workflow and storage access.Cell-level security becomes the key element to differentiate Accumulo and allow for lowering of costs and implementing key security measures and compliances.Scalability is, of course, accommodated with Accumulo.Its schema/tablet/table design complements an adaptive approach.Accomplishes the cell-level security by parsing and sorting specific parts of key/value pairs.
This is an overview that we’ll return to in a bit.Notice the “keystone” is Accumulo, but also the diverse technologies associated.
Labels are used to apply security to keysPolicy driven enforcement allows for cell-level securityData is marked and labeled at ingest, requiring a well thought-out approachTablets contain tables, or data, and are controlled by those, hopefully, well thought-out security policiesKey/Value pairs contain inherent data at ingestion, using a 5-tuple system
I’d like to take a minute, just sit right there, and explain how cell security became the most sought-after feature for real-time analysis.And this is why sqrrl and Accumulo have, natively, risen to the challenge, and why it’s important.
An Accumulo key is a pretty complicated column, and is actually a set of nested columns.Explain each individually:Row, provides controlled atomicityCrucial for large data setsColumn Family, controlling locality, fast access systemsColumn Qualifier, sorts third and provides uniqueness to the keyVisibility Label, runs fourth and controls user access to individual key/value pairs. They are Boolean, and require parenthesis to determine orderingVisibility Label, runs fourth and controls user access to individual key/value pairs. They are Boolean, and require parenthesis to determine orderingTimestamp, controls which version of the key is which.Sorting of keys:Hierarchically, row first, column family second, qualifier third, etc.Lexicographically, compare the first by first, second, second, etc.
This is an example of how one might setup a key configuration.Row = NameCol. Fam. = Sorting Locality used as a diverse descriptorCol. Qual. = Additional descriptor for granularityVisibility = User access (the syntax is described in documentation, but parenthesis can be used to “nest” and order access restrictions)Timestamp = The calendar dateValue = Diverse result
Explain above.Make sure to spend some extra time on data modeling and design.
Mention and focus that the sqrrl enterprise white paper is much more verbose and granular when explaining the inner-workings of a Tablet server, etc.
Mention and focus that the sqrrl enterprise white paper is much more verbose and granular when explaining the inner-workings of a Tablet server, etc.
The visibility labels are a feature that is unique to Accumulo. No other database can apply access controls at this fine-‐grained level. The labels are generated by translating an organization’s existing data security and information sharing policies into Boolean expressions over data attributes and user attributes.The process for applying these security labels to the data and connecting the labels to user’s designated authorizations is depicted in this figure. The first step is gathering an organization’s information security policies and dissecting or dissemenatingthem into data-‐centric and user-‐centric components. As data is ingested into Accumulo, a data labeler tags individual key/value pairs with the appropriate data-‐centric visibility labels based on these policies. Data is then stored in Accumulo where it is available for real-‐time queries by operational applications. End users are authenticated through these applications and authorized to access underlying data based on their defined attributes (e.g., sworn law enforcement officer, current employee, etc.). As an end user performs an operation via the app (e.g., performs a search request), the visibility label on each candidate key/value pair is checked against his or her attributes, and only the data that he or she is authorized to see is returned.
Real-time processing of millions of records per second, give sqrrl an edge.The workflow processes the ingestion through conversion to hierarchical JSON documents, or can, to utilize a document store’s capabilities.Egress out of Accumulo is where sqrrl shines, and passing this data to the analytics layer gives integration and real-time analytics development a lot more accessibility.Integration with existing Identity and Access Management systems is crucial to a direct, seamless, implementation.
Remember this graphic? There’s quite a lot more to sqrrl than just AccumuloStarting with ingest, sqrrl facilitates delineationPushed to Accumulo, processing and results are utilized, with HDFS as a storage systemThrift enables development from almost any developers favorite flavor of languageAnd finally Lucene, one of many post processing analytics platforms, takes the churned data and presents it to a number of layers.
Explain each usage case
Explain the above.I’d like to thank everyone for their time, and would love to answer any questions.