3. Search Based Applications
Search Based Applications are software applications in which a search engine
platform is used as the core infrastructure for accessing and reporting
information.
E-commerce web applications and content management systems are typical
examples of search based applications.
4. Overview of Search Based System
[Diagram: User Authentication System, Search Based Application Server, Unified Data Layer, and data sources (Archives, Documents, Emails, File Server)]
Authentication
• The user is authenticated before being given access to the application.
Application
• Presents a full-fledged user interface.
• Performs user operations such as uploading documents, sending emails, searching, etc.
Unified Data Layer
• Search server
• Indexes content across the sources.
• Retrieves data at very high speed.
Data Storage
• Large volumes of data from different repositories.
6. Common Access to Unified Data Layer
[Diagram: User Authentication System, Search Based Application, Unified Data Layer, and data sources (Archives, Documents, Emails, File Servers)]
How is this a threat?
7. Consider a Sample Use Case
User A:
- Logs in to the application.
- Performs a search operation with keywords such as ‘Pay Slips’, ‘Personal’ or ‘appraisal’.
Sample results demonstrated for “appraisal”.
9. Observations
Relevant search results: [Correct]
- User A received results relevant to his search query, such as exact matches,
more-like-this keywords, synonym keywords, etc.
Unauthorized search results: [Wrong]
- A few of the retrieved results were documents that he was not authorized to
view.
How are we doing with this?
Threats:
• Exposure of other users’ confidential documents
• Access to unauthorized information
10. Problem Definition
• To develop a search platform where every user has access only to those
documents which he/she is authorized to view.
• To ensure that confidential data uploaded to the system is not globally
searchable unless it is intended to be globally accessible.
How can we achieve this?
12. Access Control List
• Access controls are security features that control how users [subjects] and
documents [objects] communicate and interact with one another.
• Subject: an active entity [User] that requests access to an object [Document].
• Object: a passive entity [Document] that contains information.
[Diagram: Subject (User) interacts with Object (Document)]
13. Data Model
Let’s first understand the data model of a search engine.
How are documents stored in a search engine?
Document-oriented approach. Example document:
Alec_1167
{_id:"1167",
Name:"Ale C",
Agent:"Miller",
Place:"NY, NJ, CA",
Units:570}
[Diagram: document 1167 (Alec, Miller, 570) stored alongside documents 3424 (Kiwi, reds, 340) and 5612 (Reh, Mo’s, 664), with the Place values NY, NJ and CA pointing back to document 1167]
14. Indexing and Storing Document Object
• User A uploads a document into the system
• Text extraction
• Convert it to a flat structure
• Input it to the search engine
[Diagram: Document → Text Extract → Search Engine → Document Saved]
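The "convert it to a flat structure" step above can be sketched as follows. This is a minimal illustration, assuming the extracted document arrives as a nested dictionary; the field names (`id`, `text`, `meta`) and the dotted key convention are hypothetical and not taken from the slides.

```python
def flatten(doc, parent_key="", sep="."):
    """Collapse nested dictionaries into a single-level dictionary."""
    flat = {}
    for key, value in doc.items():
        # Prefix nested keys with the parent key, e.g. "meta.title".
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical output of the text-extraction step.
extracted = {
    "id": "doc-1",
    "text": "Annual appraisal for 2013",
    "meta": {"title": "Appraisal", "type": "pdf"},
}
flat_doc = flatten(extracted)
```

The flat dictionary is the shape a search engine's indexing API would typically accept as one document.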
15. We missed capturing something!
• What did we miss?
– Capturing user information for each document!
• Who uploaded the document?
• With whom did the user share it?
• How do we maintain this information?
– Attach an access control list to each document object.
[Diagram: Document → Text Extract → Search Engine → Document Saved]
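One way to capture the missing user information is to add it to the document before it is indexed. The sketch below assumes the flat document is a dictionary; the field names `owner` and `acl_users` are hypothetical choices for illustration only.

```python
def attach_acl(flat_doc, owner, shared_with=()):
    """Return a copy of the document carrying its uploader and its ACL."""
    doc = dict(flat_doc)  # copy, so the caller's document is left untouched
    doc["owner"] = owner
    # The owner always has access; duplicates are removed by the set.
    doc["acl_users"] = sorted({owner, *shared_with})
    return doc


indexed_doc = attach_acl({"id": "doc-1"}, "userA", ["userC", "userB"])
```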
16. Conventional Solution
• Access Control Lists for each user.
• At the time of search,
– Retrieve search results,
– Perform a check on each document for the user’s authorization, and
– Finally return the results.
[Diagram: Search Engine → Security Filter Each Document → Return Results to User]
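The conventional flow above (retrieve, check each document, return) can be sketched as a plain post-check. This assumes an in-memory ACL keyed by document id and a stand-in search function; `fake_search`, `ACL` and the ids are hypothetical.

```python
def secure_search(search, acl, user_id, query):
    """Run the search, then keep only documents the user may view."""
    hits = search(query)
    # One authorization check per retrieved document.
    return [doc_id for doc_id in hits if user_id in acl.get(doc_id, ())]


def fake_search(query):
    """Stand-in for the search engine: returns matching document ids."""
    index = {"appraisal": ["doc-1", "doc-2", "doc-3"]}
    return index.get(query, [])


# Hypothetical ACL: document id -> users allowed to view it.
ACL = {"doc-1": {"userA"}, "doc-2": {"userB"}, "doc-3": {"userA", "userB"}}
results = secure_search(fake_search, ACL, "userA", "appraisal")
```

The cost of this approach grows with the number of hits, because every retrieved document is checked individually.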
18. Access Control Models
Solutions are dependent on the Access Control Model we choose.
Two important types of Access Control Models:
1. Non-Discretionary Access Control (Role Based)
2. Discretionary Access Control (DAC)
19. 1. Non-Discretionary (Role Based)
Definition:
• A non-discretionary ACL uses an administered set of rules to determine how
users and documents interact.
• It is referred to as non-discretionary because assigning a user to a role is
unavoidable.
[Diagram: roles (Sales, Manager, Super User) mapped to document sets (Sales Documents, Marketing Documents, Engineering Documents, Admin Documents)]
20. Solution For Role Based ACL - Type 1
Systems that have,
• Roles defined at design time and a static ACL set on each document.
• We choose ‘Early Binding with ACL bound to Document Objects’.
In such systems,
• Document objects include a multi-valued role-Ids field that contains the list
of role-Ids which have access to the document.
Index Time
Document 1
role-Ids: [“1”, “2”, “3”]
Document 2
role-Ids: [“2”, “3”]
Documents with ACLs
21. Continued…
At the time of search,
• The user’s search query should be appended with the user’s Role Id.
• Solr’s Filter Query feature and its caching techniques give an efficient
solution for such ACL techniques. This approach is called the ‘Early Binding’
approach.
[Diagram: Query Request with User Role-Id → SolrJ Client (Early Binding) → Query Response]
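Appending the role ids at search time might look like the sketch below. It only builds request parameters and does not call Solr; `q` and `fq` are Solr's standard query and filter-query parameters, while the field name `role_ids` and the helper names are assumptions made for illustration.

```python
def role_filter(role_ids):
    """Build a filter-query string restricting hits to the given roles."""
    if not role_ids:
        raise ValueError("a user must have at least one role")
    terms = " OR ".join(f'"{role_id}"' for role_id in role_ids)
    return f"role_ids:({terms})"


def build_params(user_query, role_ids):
    """Combine the user's query with the role filter."""
    # The filter is kept separate from q so it can be cached and reused.
    return {"q": user_query, "fq": role_filter(role_ids)}


params = build_params("appraisal", ["2", "3"])
```

Keeping the role restriction in `fq` rather than in `q` leaves relevance scoring untouched and lets the filter be cached independently of the query text.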
22. Solution For Role Based ACL - Type 2
Systems that have,
• Roles which often change; data is normalized by segregating access control
information into different tables.
• This approach is called ‘Early Binding with Externalized ACL’.
In such systems:
• Role-Ids are not attached to the document object.
• Instead they are stored in different tables with a foreign key relation.
• Use pseudo joins at the time of search.
[Diagram: Document1 (D1) linked to its row in the ACL table]
Doc ID   Role-Ids
D1       1, 2, 3, N
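The externalized ACL and the search-time join can be emulated in memory as below. This is a sketch of the idea only, not of Solr's join support: the ACL table is a dictionary from document id to role ids, and all names and values are hypothetical.

```python
def allowed_doc_ids(acl_table, role_ids):
    """Return ids of documents readable by at least one of the roles."""
    wanted = set(role_ids)
    return {doc_id for doc_id, roles in acl_table.items() if wanted & set(roles)}


def search_with_join(hits, acl_table, role_ids):
    """Join search hits against the external ACL table on document id."""
    allowed = allowed_doc_ids(acl_table, role_ids)
    return [doc_id for doc_id in hits if doc_id in allowed]


# Hypothetical ACL table stored apart from the documents.
acl_table = {"D1": [1, 2, 3], "D2": [2, 3], "D3": [4]}
visible = search_with_join(["D1", "D2", "D3"], acl_table, [1])
```

Because the ACL lives outside the documents, a role change only touches `acl_table`; the documents themselves do not need to be re-indexed.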
23. 2. Discretionary Access Control
Definition:
• Discretionary: the document owner has the authority to control access to the
document.
• A system that enables the document owner to specify the set of users with
access to a set of documents.
[Diagram: Owner specifies the users/groups who can access the Object]
24. Solution for Discretionary ACL - Type 1
Systems that have
• Frequent changes in the ACL
• An ACL defined for each user and document
• We choose ‘Late Binding Approach with Externalized ACL’
In such systems,
• The ACL is a 2D matrix with users along its rows and documents along its
columns
• Encoded values: 0 = no access, 1 = access
• M: number of users, N: number of documents

Users    Doc1  Doc2  Doc N
User A   1     1     1
User B   0     1     1
User M
25. Continued…
For implementation, the ACL matrix can be represented as an array of bits.

Users    Doc1  Doc2  Doc N
UserA    1     1     1
UserB    0     1     1

[1] 111
[2] 011
This compact representation improves search efficiency and reduces memory overhead.
26. Example
Consider,
• The maximum number of documents in the search system is 5, with document ids {1, 2, 3, 4, 5}
• The maximum number of users is 2, with ids {1, 2}
• User 1 has access to documents {1, 2, 3}: 1 1 1 0 0
• User 2 has access to documents {1, 2, 3, 4, 5}: 1 1 1 1 1
• ACL matrix and array representation:

User   1  2  3  4  5
1      1  1  1  0  0
2      1  1  1  1  1

[1] 11100
[2] 11111
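The example above can be reproduced with a short sketch that encodes each user's row as a string of bits. The helper names are hypothetical, and a production system would more likely use a packed bit set than a string.

```python
def encode_acl(allowed_docs, max_docs):
    """Encode a user's row of the ACL matrix, one bit per document id."""
    return "".join(
        "1" if doc_id in allowed_docs else "0"
        for doc_id in range(1, max_docs + 1)
    )


def has_access(user_bits, doc_id):
    """Look up the bit for a 1-based document id."""
    return user_bits[doc_id - 1] == "1"


# The two users from the example, keyed by user id.
acl_array = {
    1: encode_acl({1, 2, 3}, 5),
    2: encode_acl({1, 2, 3, 4, 5}, 5),
}
```

With this layout an access check is a single index lookup into the user's row.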
27. Solr Implementation
Solution 1
• Solr has a PostFilter interface that can be implemented to develop a custom plugin.
• The filter supplies a collector with a method called ‘collect()’.
• collect() is called for each document matched by the user’s search query.
– For each document, get the document-Id from the Field Cache and apply the
ACL using the bit array (for example 1 1 1 0 0).
• Code Snippets: https://gist.github.com/rajanim/7197154
28. Other Implementation Solution
Solution 2
• Using BitSet utilities
• Get the bitset of documents matched by the search query from the search engine
• Get the user’s ACL bitset instance
• Obtain the intersection of the two bitsets [intersect(bitset other)]
[Diagram: bitsets 1 1 1 0 0, 1 1 1 1 1 and 0 0 1 0 0]
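The intersection step can be sketched with Python integers standing in for the bitsets, where bit position d-1 represents document id d. The helper names and the sample ids are hypothetical.

```python
def to_bitset(doc_ids):
    """Pack 1-based document ids into an integer bitset."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << (doc_id - 1)
    return bits


def from_bitset(bits):
    """Unpack an integer bitset into a sorted list of document ids."""
    doc_ids = []
    position = 1
    while bits:
        if bits & 1:
            doc_ids.append(position)
        bits >>= 1
        position += 1
    return doc_ids


def authorized_hits(query_hits, user_acl):
    """Intersect the query's matches with the user's ACL."""
    return from_bitset(to_bitset(query_hits) & to_bitset(user_acl))


visible = authorized_hits({3, 4, 5}, {1, 2, 3})
```

A single bitwise AND replaces the per-document check, which is why this variant suits large result sets.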
29. Solution for Discretionary ACL - Type 2
• Systems in which the discretionary ACL is static
• We choose ‘Early Binding with ACL bound to Document Objects’
In such systems,
• Document objects include a multi-valued user-id field that contains the list
of user-ids with access to the document.
• The user-id field has to be indexed.
30. Continued…
• This solution requires the ACL and document data to be de-normalized into a
flat structure.
[Diagram: Index Time: Parse Document, Add List of Users Who Have Access. Search Time: Query Request With User ID → SolrJ Client → Query Response]
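De-normalizing the ACL into the documents at index time could look like the sketch below. It assumes the ACL is available as a mapping from user id to the document ids that user may access; the field name `user_ids` and the sample data are hypothetical.

```python
def denormalize(documents, acl):
    """Attach to each document the list of users allowed to access it."""
    users_by_doc = {}
    # Invert the user -> documents mapping into document -> users.
    for user_id, doc_ids in acl.items():
        for doc_id in doc_ids:
            users_by_doc.setdefault(doc_id, []).append(user_id)
    return [
        {**doc, "user_ids": sorted(users_by_doc.get(doc["id"], []))}
        for doc in documents
    ]


docs = [{"id": 1, "text": "appraisal"}, {"id": 2, "text": "pay slip"}]
acl = {"userA": {1, 2}, "userB": {2}}
flat_docs = denormalize(docs, acl)
```

At search time the query then only needs a filter on `user_ids` for the requesting user, at the price of re-indexing a document whenever its ACL changes.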
32. Summary
• Discretionary ACL with late binding is a complex model and requires
extensive verification.
• Leverage Solr’s caching capability.
• Since ACL checks always add overhead, they have to be optimized to keep the
added delay to a minimum.
By maintaining an ACL mapped to each document object. ACLs are mappings from user space to document space; a mapping indicates the list of documents a user has access to. Let’s define ACL.
This is a role based type of model. In such models, every user is assigned to an administered set of roles, and ACLs are developed based on the role a user is assigned to. It is called non-discretionary because in such systems a user must be assigned to a role; a user cannot exist without one. Users and documents therefore interact with each other based on role-ids. If a user uploads a document, his role-id is captured with the document so that, while searching, the document is accessible only to users who belong to that role.
Each row can be represented as a stream of bits, and the entire matrix can be represented as an array of such bit streams.