
Machine Learning & Artificial Intelligence - Machine Controlled Data Dispensation


by Muthuvenkatesh Sivakadatcham, Principal Test Consultant & Karthikeyan Mani, Technology Test Lead, Infosys at STeP-IN SUMMIT 2018 - 15th International Conference on Software Testing on August 30, 2018 at Taj, MG Road, Bengaluru

Published in: Technology


  1. Machine Learning & Artificial Intelligence - Machine Controlled Data Dispensation
     Muthu Venkatesh - Principal Test Consultant
     Karthikeyan Mani - Technology Test Lead
     © 2016 Infosys Ltd.
  2. Abstract
     • Handling sensitive customer data has always been a challenge for organizations in the Banking, Insurance and Healthcare domains.
     • Building and validating data analytics models has become hugely important for staying afloat amid stiff competition from peers, but it is just as important that data dispensation rules are adhered to.
     • There are well-established traditional methods for identifying sensitive information so that it can be masked, preventing malicious interpretation by intruders or by any third party involved in validating the data.
     • The pain point is that identifying sensitive information, whether manually or through coded automation, demands a lot of effort and continuous updating of scripts.
     • This paper sheds light on how a machine can be deployed for the same task with minimal intervention from the user.
     • Testers, Test Leads, Test Managers, Business Analysts and Data Scientists will benefit from this idea.
  3. Table of Contents
     • Introduction / Background
     • In-Scope / Out of Scope
     • Step 1 – Training Set Creation
     • Step 2 – Test Set Creation
     • Step 3 – Fine Tuning Algorithm
     • Step 4 – Validation
     • Demo
  4. Introduction
     • Increased focus on data security across various domains
     • Identification of sensitive data across systems is most challenging for many organisations today
     Machine Learning – Easy 4-Step Process:
     • Step 1: Training Set Creation
     • Step 2: Test Set Creation
     • Step 3: Algorithm Tuning
     • Step 4: Validation
  5. In-Scope / Out of Scope
     In-Scope
     • Identifying sensitive data stored in database schemas
     • Identification based on selected attributes of each column in the schema
     Out of Scope
     • The current solution does not handle other aspects of data regulation compliance, such as GDPR.
     • Columns with free text (blogs, chat histories, etc.) are not analyzed in the current solution; they are treated as a distinct value at a high level.
     • Data from sources other than datastores is not handled as part of this exercise.
  6. Step 1 – Training Set Creation
     The parser program runs on the schema tables, creates a training set with non-null values, and stores it in the training set file. The fields/columns needed for sensitive data determination are pulled out: column name, max column length and non-null value.
     Flow: Database → Parser program → Training set with non-null values → File 1 (Training_Set)
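The parser step above could look roughly like the following. This is only a minimal sketch, assuming a SQLite database and a CSV output; the slides do not say which database or file format was used, and the names `build_column_rows`, `write_training_set` and `sample_value` are hypothetical.

```python
import csv
import io
import sqlite3

FIELDS = ["table", "column_name", "max_length", "sample_value"]


def build_column_rows(conn: sqlite3.Connection) -> list[dict]:
    """Collect column name, max column length and one non-null value per column."""
    rows = []
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        columns = [r[1] for r in conn.execute(f'PRAGMA table_info("{table}")')]
        for column in columns:
            values = [str(r[0]) for r in conn.execute(
                f'SELECT "{column}" FROM "{table}" '
                f'WHERE "{column}" IS NOT NULL ORDER BY rowid')]
            if not values:
                continue  # only non-null values go into the set
            rows.append({
                "table": table,
                "column_name": column,
                "max_length": max(len(v) for v in values),
                "sample_value": values[0],
            })
    return rows


def write_training_set(rows: list[dict], out) -> None:
    """Persist the extracted rows (the slide's File 1) as CSV; the format is an assumption."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Tiny in-memory demo with made-up data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, email TEXT, notes TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                 [(1, "a@example.com", None), (2, "bob@example.com", None)])
training_rows = build_column_rows(conn)
training_file = io.StringIO()
write_training_set(training_rows, training_file)
```

In this demo the all-NULL `notes` column is skipped, so only `id` and `email` reach the training set.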
  9. Step 2 – Test Set Creation
     The parser program runs on the schema tables, creates a test set with non-null values, and stores it in the test set file. The same attributes are extracted: column name, max column length and non-null value.
     Flow: Database → Parser program → Test set with non-null values → File 2 (Test_Set)
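One way to decide which tables feed the training set and which feed the test set is a seeded random split. The sketch below assumes the split is made per table and uses a 20% training share, in line with the 15-20% range quoted on the benefits slide; the slides do not describe how the split was actually made, and `split_tables` is a hypothetical helper.

```python
import random


def split_tables(table_names, train_fraction=0.2, seed=42):
    """Split table names into (training, test) lists, reproducibly."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = sorted(table_names)          # sort first so the seed fully fixes the order
    random.Random(seed).shuffle(shuffled)
    cut = max(1, round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Roughly the size of the case-study database (around 800 tables); names are made up.
tables = [f"table_{i:03d}" for i in range(800)]
train_tables, test_tables = split_tables(tables)
```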
  11. Step 3 – Fine Tuning Algorithm
      The algorithm runs iteratively on the data in the training set file, and iterations are repeated until 100% accuracy is achieved.
      Flow:
      • Run the algorithm on File 1 (Training_Set)
      • Validate the accuracy of the algorithm
      • Is accuracy = 100%?
        – Y: Freeze the algorithm for the test set
        – N: Fine-tune the algorithm (adjust number of rows, columns, neighbors, etc.) and run it again
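The loop on this slide can be sketched as follows. The example is an assumption-heavy illustration: it uses a tiny nearest-neighbour classifier because the slide mentions adjusting "neighbors", the two features (max column length, share of digits in a value) are invented, and the helper names are hypothetical. Note that scoring on the training data itself, as the slide describes, makes k = 1 trivially perfect.

```python
from collections import Counter


def knn_predict(train, features, k):
    """Majority label among the k training points closest to `features`."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], features)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def training_accuracy(train, k):
    """Share of training points the classifier labels correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in train)
    return hits / len(train)


def tune(train, candidate_ks, target=1.0):
    """Try each k in turn; stop (freeze) as soon as the target accuracy is met."""
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        acc = training_accuracy(train, k)
        if acc > best_acc:
            best_k, best_acc = k, acc
        if acc >= target:
            break  # "Y" branch: freeze the algorithm
    return best_k, best_acc


# Made-up training rows: (max column length, share of digits) -> label.
train = [
    ((16, 1.0), "sensitive"),
    ((15, 0.9), "sensitive"),
    ((11, 1.0), "sensitive"),
    ((3, 0.0), "non-sensitive"),
    ((40, 0.1), "non-sensitive"),
]
frozen_k, frozen_acc = tune(train, candidate_ks=[5, 3, 1])
```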
  12. Step 4 – Validation
      The frozen algorithm runs on the test set file, applies what it learned from the training set file, and stores its predictions in another file (Recommendations_File).
      Flow: File 2 (Test_Set) → Run frozen algorithm on test sets → Validate output of algorithm → Persist output to File 3 (Recommendations from Algorithm)
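Persisting the recommendations might look like the sketch below. The CSV layout, the `write_recommendations` helper and the keyword-based `frozen_model` stand-in are all assumptions made for illustration; in the slides the predictions come from the tuned algorithm of Step 3.

```python
import csv
import io


def write_recommendations(test_rows, predict, out):
    """Run `predict` on every test row and write one recommendation per column."""
    writer = csv.writer(out)
    writer.writerow(["table", "column_name", "recommendation"])
    for row in test_rows:
        writer.writerow([row["table"], row["column_name"], predict(row)])


def frozen_model(row):
    """Stand-in for the frozen algorithm (hypothetical rule, not the real model)."""
    return "sensitive" if "email" in row["column_name"].lower() else "non-sensitive"


test_rows = [
    {"table": "customer", "column_name": "email"},
    {"table": "customer", "column_name": "id"},
]
recommendations_file = io.StringIO()
write_recommendations(test_rows, frozen_model, recommendations_file)
lines = recommendations_file.getvalue().splitlines()
```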
  14. Case Study
  15. Context
      Objective: Sensitive Data Discovery, i.e. identifying sensitive fields in a target database containing sensitive as well as non-sensitive information.
      Scope:
      • The target database has around 800 tables.
      • Implementation of a Machine Learning algorithm for identifying sensitive fields.
      • The output of the Machine Learning algorithm needs to be compared with the manual analysis to arrive at the accuracy.
      • The ML PoC will provide human-readable output.
      Key Considerations:
      • Usage of Java to implement the Machine Learning algorithm
      • Reduction in time consumption and human effort
      • Better accuracy in identification of sensitive fields
      • The Machine Learning algorithm will not be written from scratch; instead, it will be acquired from an open source platform.
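The comparison against manual analysis mentioned in the scope can be sketched as a simple agreement rate. This is an illustrative assumption about how accuracy might be computed; the slides do not give the formula, and `compare_with_manual` and the sample labels are hypothetical.

```python
def compare_with_manual(predicted, manual):
    """Return (accuracy, mismatches) over the columns present in both labelings."""
    shared = sorted(predicted.keys() & manual.keys())
    if not shared:
        raise ValueError("no columns in common to compare")
    mismatches = [key for key in shared if predicted[key] != manual[key]]
    return 1 - len(mismatches) / len(shared), mismatches


# Keys are (table, column); labels are made up for the example.
predicted = {
    ("customer", "email"): "sensitive",
    ("customer", "id"): "non-sensitive",
    ("orders", "card_number"): "non-sensitive",
    ("orders", "status"): "non-sensitive",
}
manual = {
    ("customer", "email"): "sensitive",
    ("customer", "id"): "non-sensitive",
    ("orders", "card_number"): "sensitive",
    ("orders", "status"): "non-sensitive",
}
accuracy, mismatches = compare_with_manual(predicted, manual)
```

Listing the mismatched columns alongside the score keeps the output human-readable, which is one of the stated goals of the PoC.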
  16. Overall Benefits
      • Algorithm Accuracy
        – Our Machine Learning algorithm is able to train efficiently and reaches ~96% accuracy on the datasets when executed in the test environment.
        – Percentage of training data: 15-20%
        – Percentage of test data: 80-85%
        – Algorithm used: Naïve Bayes
      • Performance
        – We also conducted a performance evaluation of the run time of the algorithms.
        – The timings for code execution on some of the test data sets are below:

        Table size   Training set creation   Training set run   Test set creation   Algorithm run   Total time
        5598 MB      59 s                    20 s               57 s                50 s            3 m 6 s
        600 MB       45 s                    14 s               45 s                38 s            2 m 22 s
        22 MB        32 s                    11 s               31 s                29 s            1 m 43 s
        1732 MB      49 s                    17 s               48 s                41 s            2 m 35 s
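For readers unfamiliar with Naïve Bayes, here is a minimal sketch of the technique applied to column names. It is not the implementation from the case study (which used Java and an open source library); the tokenization of column names, the Laplace smoothing and the training examples are all assumptions chosen to keep the example small.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(column_name):
    """Split a column name such as 'customer_email' into lowercase tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", column_name.lower()) if t]


class NaiveBayes:
    """Multinomial Naive Bayes over column-name tokens with add-one smoothing."""

    def fit(self, names, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for name, label in zip(names, labels):
            self.token_counts[label].update(tokenize(name))
        self.vocab = {t for counts in self.token_counts.values() for t in counts}
        return self

    def predict(self, name):
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for token in tokenize(name):
                score += math.log((counts[token] + 1) / denom)  # smoothed likelihood
            scores[label] = score
        return max(scores, key=scores.get)


# Made-up training data for illustration only.
names = ["customer_email", "ssn_number", "card_number",
         "order_status", "created_date", "product_code"]
labels = ["sensitive"] * 3 + ["non-sensitive"] * 3
model = NaiveBayes().fit(names, labels)
```

With this toy data, `model.predict("email_address")` leans towards "sensitive" because the token `email` was only seen in sensitive columns, while `model.predict("order_date")` leans towards "non-sensitive".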
  17. ML Based Data Dispensation – Features and Scalability
      • Simple Plug & Play – for any kind of database(s) and files
      • Custom data types – creating customized search patterns
      • Accuracy and Reporting – automated custom reporting
      • Easy to use – implementation of a front-end-based solution
      • Low-cost solution
      • Easily maintainable and scalable
      • Lower tool configuration effort – conventional TDM tools need a pattern configured for each new sensitive field type, whereas an ML-based solution will learn on its own. It is therefore a more efficient way of approaching the problem of sensitive data discovery.
      • Continuous improvement – the output of the ML-based solution will improve over time; the conventional approach does not provide any such benefit.
  18. Thank You
