Privacy in AI/ML Systems: Practical Challenges and Lessons Learned
How do we protect the privacy of users when building large-scale AI-based systems? How do we develop machine learning models and systems taking fairness, accuracy, explainability, and transparency into account? Model fairness and explainability and protection of user privacy are considered prerequisites for building trust and adoption of AI systems in high-stakes domains. We will first motivate the need for adopting a “fairness, explainability, and privacy by design” approach when developing AI/ML models and systems for different consumer and enterprise applications from the societal, regulatory, customer, end-user, and model developer perspectives. We will then focus on the application of privacy-preserving AI techniques in practice through industry case studies. We will discuss the sociotechnical dimensions and practical challenges, and conclude with the key takeaways and open challenges.

  1. 1. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Krishnaram Kenthapadi, Principal Scientist, Amazon AWS AI. Privacy in AI/ML Systems: Practical Challenges & Lessons Learned. EMNLP PrivateNLP Workshop, Nov 2020
  2. 2. What is Privacy? • Right of/to privacy • “Right to be let alone” [L. Brandeis & S. Warren, 1890] • “No one shall be subjected to arbitrary interference with [their] privacy, family, home or correspondence, nor to attacks upon [their] honor and reputation.” [The United Nations Universal Declaration of Human Rights] • “The right of a person to be free from intrusion into or publicity concerning matters of a personal nature” [Merriam-Webster] • “The right not to have one's personal matters disclosed or publicized; the right to be left alone” [Nolo’s Plain-English Law Dictionary]
  3. 3. Data Privacy (or Information Privacy) • “The right to have some control over how your personal information is collected and used” [IAPP] • “Privacy has fast-emerged as perhaps the most significant consumer protection issue—if not citizen protection issue—in the global information economy” [IAPP]
  4. 4. Data Privacy vs. Security • Data privacy: use & governance of personal data • Data security: protecting data from malicious attacks & the exploitation of stolen data for profit • Security is necessary, but not sufficient for addressing privacy.
  5. 5. Data Privacy: Technical Problem. Given a dataset with sensitive personal information, how can we compute and release functions of the dataset while protecting individual privacy? Credit: Kobbi Nissim
  6. 6. Massachusetts Group Insurance Commission (1997): anonymized medical history of state employees. William Weld vs. Latanya Sweeney: Sweeney (then an MIT grad student) bought the Cambridge voter roll for $20 and re-identified Governor Weld's record by linking on his birth date (July 31, 1945) and ZIP code (02138).
  7. 7. 64% uniquely identifiable with ZIP + birth date + gender (in the US population). Golle, “Revisiting the Uniqueness of Simple Demographics in the US Population”, WPES 2006
  8. 8. A History of Privacy Failures … Credit: Kobbi Nissim, Or Sheffet
  9. 9. Lessons Learned … • Attacker’s advantage: Auxiliary information; high dimensionality; enough to succeed on a small fraction of inputs; active; observant … • Unanticipated privacy failures from new attack methods • Need for rigorous privacy notions & techniques
  10. 10. Algorithmic Bias • Ethical challenges posed by AI systems • Inherent biases present in society • Reflected in training data • AI/ML models prone to amplifying such biases
  11. 11. Laws against Discrimination: Immigration Reform and Control Act (citizenship); Rehabilitation Act of 1973 and Americans with Disabilities Act of 1990 (disability status); Civil Rights Act of 1964 (race); Age Discrimination in Employment Act of 1967 (age); Equal Pay Act of 1963 and Civil Rights Act of 1964 (sex); and more...
  12. 12. Fairness Privacy Transparency Explainability
  13. 13. Motivation & Business Opportunities • Regulatory. We need to understand why the ML model made a given decision and also whether the decision it made was free from bias, both in training and at inference • Business. Providing explanations to internal teams (loan officers, customer service rep, forecasting teams) and end users/customers • Data Science. Improving models, understanding whether a model is making inferences based on irrelevant data, etc.
  14. 14. Scaling Fairness, Explainability & Privacy across the AWS ML Stack. [Diagram of the AWS ML stack: AI services for vision, speech, text, search, chatbots, personalization, forecasting, fraud, development, and contact centers (Amazon Rekognition, Polly, Transcribe +Medical, Comprehend +Medical, Translate, Lex, Personalize, Forecast, Fraud Detector, CodeGuru, Textract, Kendra, Contact Lens for Amazon Connect); ML services (Amazon SageMaker: Ground Truth, Augmented AI, Neo, built-in algorithms, Notebooks, Experiments, model tuning, Debugger, Autopilot, model hosting, Model Monitor, Studio IDE); ML frameworks & infrastructure (Deep Learning AMIs & containers, GPUs & CPUs, Elastic Inference, Inferentia, FPGA).]
  15. 15. LinkedIn operates the largest professional network on the Internet Tell your story 645M+ members 30M+ companies are represented on LinkedIn 90K+ schools listed (high school & college) 35K+ skills listed 20M+ open jobs on LinkedIn Jobs 280B Feed updates
  16. 16. Threat Models
  17. 17. Threat Models
  18. 18. Threat Models
  19. 19. Threat Models. User access only: users store their data; only noisy data or analytics are transmitted. Trusted curator: data stored by the organization; managed only by a trusted curator/admin; access only to noisy analytics or synthetic data. External threat: data stored by the organization; the organization has access; only privacy-enabled models are deployed.
  20. 20. Privacy in AI @ LinkedIn PriPeARL: Framework to compute robust, privacy-preserving analytics
  21. 21. Analytics & Reporting Products at LinkedIn: Profile View Analytics, Content Analytics, Ad Campaign Analytics. All showing demographics of members engaging with the product.
  22. 22. Admit only a small # of predetermined query types Querying for the number of member actions, for a specified time period, together with the top demographic breakdowns Analytics & Reporting Products at LinkedIn
  23. 23. Admit only a small # of predetermined query types Querying for the number of member actions, for a specified time period, together with the top demographic breakdowns Analytics & Reporting Products at LinkedIn E.g., Title = “Senior Director” E.g., Clicks on a given ad
  24. 24. Privacy Requirements Attacker cannot infer whether a member performed an action E.g., click on an article or an ad Attacker may use auxiliary knowledge E.g., knowledge of attributes associated with the target member (say, obtained from this member’s LinkedIn profile) E.g., knowledge of all other members that performed similar action (say, by creating fake accounts)
  25. 25. Possible Privacy Attacks. Targeting: senior directors in the US who studied at Cornell matches ~16k LinkedIn members → over the minimum targeting threshold. Demographic breakdown: Company = X may match exactly one person → the attacker can determine whether that person clicks on the ad or not. Require a minimum reporting threshold? The attacker could create fake profiles, e.g., if the threshold is 10, create 9 fake profiles that all click. Rounding mechanism, e.g., report in increments of 10? Still amenable to attacks, e.g., using incremental counts over time to infer individuals' actions. Need rigorous techniques to preserve member privacy (not reveal exact aggregate counts). (A toy differencing example follows.)
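As a toy illustration of that last point (hypothetical numbers): with exact counts, an attacker who knows that only one targeted member works at Company X can read that member's action off the incremental breakdown.

```python
# Hypothetical exact (non-noisy) breakdown counts for an ad campaign,
# reported at two points in time.
clicks_by_company_monday  = {"Company X": 0, "Company Y": 131}
clicks_by_company_tuesday = {"Company X": 1, "Company Y": 138}

# If the attacker knows that only one targeted member works at Company X,
# the +1 increment pins the click on that specific member.
delta = clicks_by_company_tuesday["Company X"] - clicks_by_company_monday["Company X"]
target_member_clicked = (delta == 1)  # True: exact counts reveal the individual's action
```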
  26. 26. Problem Statement Compute robust, reliable analytics in a privacy- preserving manner, while addressing the product needs.
  27. 27. Differential Privacy
  28. 28. Curator Defining Privacy
  29. 29. Defining Privacy. [Diagram: two curators, one computing on a database with your data (+ your data), one on the same database without it (− your data).]
  30. 30. Differential Privacy. Databases D and D′ are neighbors if they differ in one person's data. Differential Privacy: the distribution of the curator's output M(D) on database D is (nearly) the same as M(D′). Dwork, McSherry, Nissim, Smith [TCC 2006]
  31. 31. Differential Privacy. (ε, δ)-Differential Privacy: the distribution of the curator's output M(D) on database D is (nearly) the same as M(D′); formally, for all neighboring databases D, D′ and all output sets S: Pr[M(D) ∈ S] ≤ exp(ε) · Pr[M(D′) ∈ S] + δ. Parameter ε quantifies information leakage; parameter δ gives some slack. Dwork, McSherry, Nissim, Smith [TCC 2006]; Dwork, Kenthapadi, McSherry, Mironov, Naor [EUROCRYPT 2006]
  32. 32. Differential Privacy: Random Noise Addition. If the ℓ1-sensitivity of f : D → ℝⁿ is s, i.e., max over neighboring D, D′ of ||f(D) − f(D′)||₁ = s, then adding Laplace noise to the true output, releasing f(D) + Laplaceⁿ(s/ε), offers (ε, 0)-differential privacy. Dwork, McSherry, Nissim, Smith [TCC 2006]
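A minimal sketch of the Laplace mechanism in Python; the counting-query example and parameter values are illustrative:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Return f(D) + Laplace(sensitivity / epsilon) noise, which gives
    (epsilon, 0)-differential privacy when `sensitivity` is the
    l1-sensitivity of f."""
    scale = sensitivity / epsilon
    return true_value + np.random.laplace(loc=0.0, scale=scale, size=np.shape(true_value))

# Counting queries have sensitivity 1: adding or removing one person
# changes the count by at most 1.
noisy_count = laplace_mechanism(true_value=42, sensitivity=1.0, epsilon=0.5)
```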
  33. 33. PriPeARL: A Framework for Privacy-Preserving Analytics. K. Kenthapadi, T. T. L. Tran, ACM CIKM 2018. Pseudo-random noise generation, inspired by differential privacy: ● the entity id (e.g., ad creative/campaign/account), demographic dimension, stat type (impressions, clicks), time range, and a fixed secret seed are hashed cryptographically and normalized to (0, 1), yielding a uniformly random fraction; ● that fraction is converted into Laplace noise (with fixed ε) and added to the true count to produce the noisy count. To satisfy consistency requirements: ● pseudo-random noise → the same query gets the same result over time, avoiding averaging attacks; ● for non-canonical queries (e.g., arbitrary time ranges, aggregates over multiple entities), use the hierarchy to partition into canonical queries, compute noise for each canonical query, and sum up the noisy counts. (A rough sketch of the noise generation follows.)
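A rough Python sketch of the pseudo-random noise idea (hash the query attributes with a fixed secret seed, normalize to (0, 1), and invert the Laplace CDF); the hashing scheme and parameter names are illustrative, not the production implementation:

```python
import hashlib
import math

def pseudo_random_laplace_noise(entity_id, dimension, stat_type, time_range,
                                secret_seed, epsilon, sensitivity=1.0):
    """Deterministic 'noise' for a canonical query: the same query always
    gets the same noise, which defeats averaging over repeated queries."""
    key = f"{secret_seed}|{entity_id}|{dimension}|{stat_type}|{time_range}".encode()
    digest = hashlib.sha256(key).digest()
    u = (int.from_bytes(digest[:8], "big") + 0.5) / 2**64   # uniform value in (0, 1)
    b = sensitivity / epsilon                                # Laplace scale
    # Inverse-CDF transform: Uniform(0, 1) -> Laplace(0, b).
    return -b * math.copysign(1.0, u - 0.5) * math.log(1.0 - 2.0 * abs(u - 0.5))

def noisy_canonical_count(true_count, **query_attrs):
    return true_count + pseudo_random_laplace_noise(**query_attrs)

# Non-canonical queries (e.g., a multi-day time range) would be partitioned into
# canonical sub-queries, and their noisy counts summed.
```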
  34. 34. PriPeARL System Architecture
  35. 35. Lessons Learned from Deployment (> 1 year): semantic consistency vs. unbiased, unrounded noise; suppression of small counts; online computation and performance requirements; scaling across analytics applications. Tools for ease of adoption (code/API library, hands-on how-to tutorial) help! Having a few entry points (all analytics apps built over Pinot) → wider adoption. (A minimal post-processing sketch follows.)
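One way to reconcile consistency and small-count suppression is to post-process the noisy count before display; post-processing cannot weaken the differential privacy guarantee, though rounding biases the reported value. A minimal sketch with an illustrative threshold:

```python
def display_count(noisy_count, suppression_threshold=5):
    """Round the noisy count to a non-negative integer for semantic consistency
    (reported counts should not be negative or fractional) and suppress small
    values. Both steps are post-processing of a differentially private output,
    so the privacy guarantee is preserved; the cost is a small bias versus the
    unbiased, unrounded noise."""
    rounded = max(0, round(noisy_count))
    return None if rounded < suppression_threshold else rounded
```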
  36. 36. Summary Framework to compute robust, privacy-preserving analytics Addressing challenges such as preserving member privacy, product coverage, utility, and data consistency Future Utility maximization problem given constraints on the ‘privacy loss budget’ per user E.g., noise with larger variance to impressions but less noise to clicks (or conversions) E.g., more noise to broader time range sub-queries and less noise to granular time range sub-queries Reference: K. Kenthapadi, T. Tran, PriPeARL: A Framework for Privacy- Preserving Analytics and Reporting at LinkedIn, ACM CIKM 2018.
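One way to read the 'privacy loss budget' idea above is as a per-user ε budget split across stat types and query granularities, with smaller ε meaning larger noise variance; a hypothetical allocation sketch (the numbers are illustrative, not LinkedIn's settings):

```python
# Hypothetical split of a per-user privacy-loss budget across stat types and
# query granularities; smaller epsilon => larger Laplace scale => more noise.
TOTAL_EPSILON = 1.0

EPSILON_ALLOCATION = {
    ("impressions", "broad_time_range"):    0.1,  # most noise
    ("impressions", "granular_time_range"): 0.2,
    ("clicks", "broad_time_range"):         0.3,
    ("clicks", "granular_time_range"):      0.4,  # least noise for the most sensitive-to-utility stat
}
assert abs(sum(EPSILON_ALLOCATION.values()) - TOTAL_EPSILON) < 1e-9
```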
  37. 37. Acknowledgements Team: AI/ML: Krishnaram Kenthapadi, Thanh T. L. Tran Ad Analytics Product & Engineering: Mark Dietz, Taylor Greason, Ian Koeppe Legal / Security: Sara Harrington, Sharon Lee, Rohit Pitke Acknowledgements Deepak Agarwal, Igor Perisic, Arun Swami
  38. 38. LinkedIn Salary
  39. 39. LinkedIn Salary (launched in Nov, 2016)
  40. 40. Data Privacy Challenges Minimize the risk of inferring any one individual’s compensation data Protection against data breach No single point of failure
  41. 41. Problem Statement How do we design LinkedIn Salary system taking into account the unique privacy and security challenges, while addressing the product requirements? K. Kenthapadi, A. Chudhary, and S. Ambler, LinkedIn Salary: A System for Secure Collection and Presentation of Structured Compensation Insights to Job Seekers, IEEE PAC 2017 (arxiv.org/abs/1705.06976)
  42. 42. De-identification Example. Cohort table keyed by (Title, Region, $$): (User Exp Designer, SF Bay Area, 100K), (User Exp Designer, SF Bay Area, 115K), ... A full submission (Title: User Exp Designer; Region: SF Bay Area; Company: Google; Industry: Internet; Years of exp: 12; Degree: BS; FoS: Interactive Media; Skills: UX, Graphics, ...; $$: 100K) is mapped to several de-identified cohorts, e.g., (Title, Region): User Exp Designer, SF Bay Area, 100K; (Title, Region, Industry): User Exp Designer, SF Bay Area, Internet, 100K; (Title, Region, Years of exp): User Exp Designer, SF Bay Area, 10+, 100K; (Title, Region, Company, Years of exp): User Exp Designer, SF Bay Area, Google, 10+, 100K. #data points > threshold? Yes ⇒ copy to Hadoop (HDFS). Note: original submissions are stored as encrypted objects. (A simplified sketch of this thresholding step follows.)
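A simplified sketch of the cohort-level thresholding step; the field names and the threshold value are illustrative, not the production configuration:

```python
from collections import defaultdict

MIN_COHORT_SIZE = 10  # illustrative threshold; the real value is a privacy/product decision

def release_cohorts(submissions, cohort_keys=("title", "region")):
    """Group de-identified submissions into cohorts defined by `cohort_keys`
    and release only cohorts with at least MIN_COHORT_SIZE data points;
    smaller cohorts are withheld. Original submissions remain stored only
    as encrypted objects and are never released directly."""
    cohorts = defaultdict(list)
    for s in submissions:
        key = tuple(s[k] for k in cohort_keys)
        cohorts[key].append(s["compensation"])
    return {k: v for k, v in cohorts.items() if len(v) >= MIN_COHORT_SIZE}
```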
  43. 43. System Architecture
  44. 44. Acknowledgements Team: AI/ML: Krishnaram Kenthapadi, Stuart Ambler, Xi Chen, Yiqun Liu, Parul Jain, Liang Zhang, Ganesh Venkataraman, Tim Converse, Deepak Agarwal Application Engineering: Ahsan Chudhary, Alan Yang, Alex Navasardyan, Brandyn Bennett, Hrishikesh S, Jim Tao, Juan Pablo Lomeli Diaz, Patrick Schutz, Ricky Yan, Lu Zheng, Stephanie Chou, Joseph Florencio, Santosh Kumar Kancha, Anthony Duerr Product: Ryan Sandler, Keren Baruch Other teams (UED, Marketing, BizOps, Analytics, Testing, Voice of Members, Security, …): Julie Kuang, Phil Bunge, Prateek Janardhan, Fiona Li, Bharath Shetty, Sunil Mahadeshwar, Cory Scott, Tushar Dalvi, and team Acknowledgements David Freeman, Ashish Gupta, David Hardtke, Rong Rong, Ram
  45. 45. Privacy Research @ Amazon - Sampler. Work done by Oluwaseyi Feyisetan, Tom Diethe, Thomas Drake, Borja Balle
  46. 46. Simple but effective, privacy-preserving mechanism. Task: subsample from a dataset using additional information in a privacy-preserving way. Building on existing exponential analysis of k-anonymity, amplified by sampling… Mechanism M is (β, ε, δ)-differentially private. Model uncertainty via a Bayesian NN. “Privacy-preserving Active Learning on Sensitive Data for User Intent Classification” [Feyisetan, Balle, Diethe, Drake; PAL 2019]
  47. 47. Differentially-private text redaction. Task: automatically redact sensitive text for privatizing various ML models. Perturb sentences but maintain meaning, e.g., “goalie wore a hockey helmet” → “keeper wear the nhl hat”. Apply metric DP and analysis of word embeddings to scramble sentences. Mechanism M is d_χ-differentially private. Establish plausible deniability statistics: N_w := Pr[M(w) = w]; S_w := expected number of words output by M(w). “Privacy- and Utility-Preserving Textual Analysis via Calibrated Multivariate Perturbations” [Feyisetan, Drake, Diethe, Balle; WSDM 2020] (A rough sketch of the redaction mechanism follows.)
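A rough sketch of the redaction idea: perturb a word's embedding with noise whose scale is governed by ε, then decode to the nearest vocabulary word. The specific noise-sampling scheme and helper names here are assumptions sketched from the paper's description, not Amazon's implementation:

```python
import numpy as np

def redact_word(word, embeddings, epsilon):
    """Perturb the word's embedding and decode the noisy point back to the
    nearest vocabulary word. Smaller epsilon means more noise, so the output
    is less often the original word (plausible deniability)."""
    vec = embeddings[word]
    d = vec.shape[0]
    # Noise with density proportional to exp(-epsilon * ||z||): a uniform
    # direction with a Gamma-distributed magnitude (assumed sampling scheme).
    direction = np.random.normal(size=d)
    direction /= np.linalg.norm(direction)
    magnitude = np.random.gamma(shape=d, scale=1.0 / epsilon)
    noisy = vec + magnitude * direction
    # Nearest-neighbor decode into the vocabulary.
    words = list(embeddings)
    distances = [np.linalg.norm(noisy - embeddings[w]) for w in words]
    return words[int(np.argmin(distances))]

# Toy vocabulary with random vectors, just to make the sketch executable.
vocab = ["goalie", "keeper", "wore", "wear", "a", "the", "hockey", "nhl", "helmet", "hat"]
embeddings = {w: np.random.normal(size=50) for w in vocab}
print([redact_word(w, embeddings, epsilon=5.0) for w in ["goalie", "wore", "a", "hockey", "helmet"]])
```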
  48. 48. Analysis of DP redaction. Show plausible deniability via the distributions of N_w and S_w as ε varies: as ε → 0, N_w decreases and S_w increases; as ε → ∞, N_w increases and S_w decreases. [Plots: impact of ε on accuracy for multi-class classification and question answering tasks, respectively.]
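The plausible deniability statistics can be estimated empirically by running the mechanism many times on the same word; a small sketch (the trial count is arbitrary, and `mechanism` can be any redaction function, e.g., the redact_word sketch above):

```python
def plausible_deniability_stats(mechanism, word, trials=1000):
    """Monte Carlo estimates of N_w = Pr[M(w) = w] and of the number of
    distinct outputs observed for M(w), for any redaction mechanism M."""
    outputs = [mechanism(word) for _ in range(trials)]
    n_w = outputs.count(word) / trials
    s_w = len(set(outputs))
    return n_w, s_w

# Expected trend per the slide: as epsilon -> 0, n_w falls and s_w grows;
# as epsilon -> infinity, n_w rises and s_w shrinks.
```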
  49. 49. Improving data utility of DP text redaction. Task: redact text, but use additional structured information to better preserve utility. Can we improve redaction for models that fail on extraneous words (~recall-sensitive)? Extend d_χ privacy to hyperbolic embeddings [Tifrea 2018]: utilize high-dimensional geometry to infuse embeddings with graph structure, e.g., uni- or bi-directional syllogisms from WebIsADb. New privacy analysis of the Poincaré model and sampling procedure. The mechanism takes advantage of density in the data to apply perturbations more precisely. “Leveraging Hierarchical Representations for Preserving Privacy and Utility in Text” [Feyisetan, Drake, Diethe; ICDM 2019]. [Figures: tiling in the Poincaré disk; hyperbolic GloVe embeddings projected into the B2 Poincaré disk.]
  50. 50. Analysis of Hyperbolic redaction. The new method improves both privacy and utility because of its ability to encode meaningful structure in the embeddings. Accuracy scores on classification tasks: * indicates results better than 1 baseline, ** better than 2 baselines. The plausible deniability statistic N_w (Pr[M(w) = w]) is improved.
  51. 51. Beyond Accuracy: Performance and Cost; Fairness and Bias; Transparency and Explainability; Privacy; Security; Safety; Robustness
  52. 52. Fairness, Explainability & Privacy: Opportunities
  53. 53. Fairness in ML Application specific challenges Conversational AI systems: Unique bias/fairness/ethics considerations E.g., Hate speech, Complex failure modes Beyond protected categories, e.g., accent, dialect Entire ecosystem (e.g., including apps such as Alexa skills) Two-sided markets: e.g., fairness to buyers and to sellers, or to content consumers and producers Fairness in advertising (externalities) Tools for ensuring fairness (measuring & mitigating bias) in AI lifecycle Pre-processing (representative datasets; modifying features/labels) ML model training with fairness constraints Post-processing Experimentation & Post-deployment
  54. 54. Explainability in ML Actionable explanations Balance between explanations & model secrecy Robustness of explanations to failure modes (Interaction between ML components) Application-specific challenges Conversational AI systems: contextual explanations Gradation of explanations Tools for explanations across AI lifecycle Pre & post-deployment for ML models Model developer vs. End user focused
  55. 55. Privacy in ML. Privacy for highly sensitive data: model training & analytics using secure enclaves, homomorphic encryption, federated learning / on-device learning, or a hybrid. Privacy-preserving model training, robust against adversarial membership inference attacks (dynamic settings + complex data/model pipelines). Privacy-preserving mechanisms for data marketplaces.
  56. 56. Reflections “Fairness, Explainability, and Privacy by Design” when building AI products Collaboration/consensus across key stakeholders NYT / WSJ / ProPublica test :)
  57. 57. Acknowledgements Amazon AWS AI team Special thanks to Sergul Aydore, Satadal Bhattacharjee, William Brown, Sanjiv Das, Jason Gelman, Kevin Haas, Tyler Hill, Michael Kearns, Jalaja Kurubarahalli, Andrea Olgiati, Luca Melis, Aaron Roth, Sudipta Sengupta, Ankit Siva
  58. 58. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark. Thank You
