Bringing salary transparency to the world: Computing robust compensation insights via LinkedIn Salary

The recently launched LinkedIn Salary product has been designed with the goal of providing compensation insights to the world's professionals and thereby helping them optimize their earning potential. We describe the overall design and architecture of the statistical modeling system underlying this product. We focus on the unique data mining challenges while designing and implementing the system, and describe the modeling components such as Bayesian hierarchical smoothing that help to compute and present robust compensation insights to users. We report on extensive evaluation with nearly one year of de-identified compensation data collected from over one million LinkedIn users, thereby demonstrating the efficacy of the statistical models. We also highlight the lessons learned through the deployment of our system at LinkedIn.

Presented at ACM International Conference on Information and Knowledge Management (ACM CIKM), 2017.

Recipient of Best Case Studies Paper Award at ACM CIKM, 2017.

Corresponding paper: Bringing Salary Transparency to the World: Computing Robust Compensation Insights via LinkedIn Salary, ACM CIKM, 2017 (available at https://arxiv.org/abs/1703.09845).

  1. Computing Robust Compensation Insights via LinkedIn Salary. Krishnaram Kenthapadi, AI @ LinkedIn (joint work with Stuart Ambler, Liang Zhang, Deepak Agarwal)
  2. Outline
     ▪ LinkedIn Salary Overview
     ▪ Challenges: Privacy, Modeling
     ▪ Bayesian Hierarchical Smoothing
     ▪ Outlier Detection
  3. LinkedIn Salary (launched in November 2016)
  4. Salary Collection Flow via Email Targeting
  5. Current Reach (November 2017)
     ▪ A few million responses out of several million members targeted
       – Targeted via emails since early 2016
     ▪ Countries: US, CA, UK, DE
     ▪ Insights available for a large fraction of US monthly active users
  6. Data Privacy Challenges
     ▪ Minimize the risk of inferring any one individual's compensation data
     ▪ Protection against data breach
       – No single point of failure
     ▪ Achieved by a combination of techniques: encryption, access control, de-identification, aggregation, thresholding
     ▪ Reference: K. Kenthapadi, A. Chudhary, and S. Ambler, LinkedIn Salary: A System for Secure Collection and Presentation of Structured Compensation Insights to Job Seekers, IEEE PAC 2017 (arxiv.org/abs/1705.06976)
  7. De-identification Example
     ▪ Original submission (stored as encrypted objects): Title = User Exp Designer; Region = SF Bay Area; Company = Google; Industry = Internet; Years of exp = 12; Degree = BS; Field of Study = Interactive Media; Skills = UX, Graphics, ...; $$ = 100K
     ▪ De-identified slices derived from the submission:
       – Title, Region, $$: User Exp Designer, SF Bay Area, 100K
       – Title, Region, Industry, $$: User Exp Designer, SF Bay Area, Internet, 100K
       – Title, Region, Years of exp, $$: User Exp Designer, SF Bay Area, 10+, 100K
       – Title, Region, Company, Years of exp, $$: User Exp Designer, SF Bay Area, Google, 10+, 100K
     ▪ Cohort of slices sharing the same attributes (Title, Region, $$): User Exp Designer, SF Bay Area, 100K; User Exp Designer, SF Bay Area, 115K; ...
     ▪ #data points > threshold? Yes ⇒ Copy to Hadoop (HDFS)
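
The de-identification and thresholding flow on this slide can be sketched roughly as follows. This is a minimal illustration, not the production pipeline: the field names, the `COHORT_TYPES` list, the experience bucket edges, and the functions `bucket_years`, `deidentify`, and `release_cohorts` are all hypothetical, and encryption and access control are left out entirely.

```python
from collections import defaultdict

# Hypothetical attribute combinations kept after de-identification,
# modeled on the four slices shown on the slide.
COHORT_TYPES = [
    ("title", "region"),
    ("title", "region", "industry"),
    ("title", "region", "years_bucket"),
    ("title", "region", "company", "years_bucket"),
]


def bucket_years(years):
    """Coarsen exact years of experience into a range (bucket edges are assumed)."""
    if years >= 10:
        return "10+"
    if years >= 5:
        return "5-9"
    return "0-4"


def deidentify(submission):
    """Yield (cohort_key, salary) slices that each keep only a few attributes."""
    record = dict(submission, years_bucket=bucket_years(submission["years"]))
    for attrs in COHORT_TYPES:
        key = tuple((attr, record[attr]) for attr in attrs)
        yield key, record["salary"]


def release_cohorts(submissions, threshold):
    """Group slices by cohort; keep cohorts with more than `threshold` entries."""
    cohorts = defaultdict(list)
    for submission in submissions:
        for key, salary in deidentify(submission):
            cohorts[key].append(salary)
    return {key: vals for key, vals in cohorts.items() if len(vals) > threshold}
```

With three submissions that share a title, region, and industry but differ in company and experience, only the two coarsest cohorts exceed a threshold of 2, so the company-level slices are never released.
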
  8. Modeling Challenges
     ▪ Evaluation
     ▪ Modeling on de-identified data
     ▪ Robustness and stability
     ▪ Outlier detection
  9. Problem Statement
     ▪ How do we compute robust, reliable compensation insights based on de-identified compensation data, while addressing product requirements such as coverage?
  10. Salary Insights Architecture
  11. Bayesian Hierarchical Smoothing
      ▪ Modeling on de-identified data
      ▪ Robustness and stability
  12. Coverage vs Data Quality / Data Privacy Tradeoff
      ▪ Can we achieve them simultaneously?
      ▪ [Figure: a threshold axis (minimum number of data points for returning insights); a lower threshold gives better coverage, a higher threshold gives better data quality and data privacy.]
  13. Bayesian Smoothing
      ▪ Large sample size cohorts (>= 20): Report empirical percentiles
      ▪ Small sample size cohorts (< 20):
        – Empirical percentiles unreliable & unstable
        – Idea:
          ▪ Exploit hierarchical structure
          ▪ “Borrow strength” from the ancestral cohort that has enough data and best fit
        – Bayesian hierarchical smoothing
          ▪ “Combine” cohort estimates with actual observed entries
          ▪ Greater weighting for observed data as #observed entries increases
      ▪ Cohorts with no data:
        – Build regression models to predict the salary insights
  14. [Figure: lattice of ancestor cohorts for the cohort of interest “UX Designer, SF Bay Area, Internet Industry, 10+ yrs”]
      ▪ Three-attribute ancestors: UX Designer, SF Bay Area, Internet Industry; UX Designer, Internet Industry, 10+ yrs; UX Designer, SF Bay Area, 10+ yrs; SF Bay Area, Internet Industry, 10+ yrs
      ▪ Two-attribute ancestors: UX Designer, SF Bay Area; UX Designer, 10+ yrs; UX Designer, Internet Industry; Internet Industry, 10+ yrs; ...
      ▪ Single-attribute ancestors: UX Designer; SF Bay Area; Internet Industry; 10+ yrs
      ▪ Root: All data
      ▪ The cohort of interest is marked “?”
  15. [Figure: same lattice as slide 14] Available ancestors with enough data
  16. [Figure: same lattice as slide 14]
      ▪ Best ancestor: ancestor that results in max (log) likelihood for the observed entries
      ▪ Prior distribution
      ▪ Compute posterior distribution based on observed entries
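
One way to read the “best ancestor” rule is sketched below, assuming each candidate ancestor is summarized by a normal distribution fitted to its log-salaries. The function names, the dictionary layout, and the `min_size` default of 20 (borrowed from the large-cohort cutoff on slide 13) are assumptions for illustration; the slides do not spell out how “enough data” is defined here.

```python
import math
from statistics import fmean, pstdev


def normal_loglik(xs, mu, sigma):
    """Log-likelihood of values xs under a Normal(mu, sigma^2) distribution."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
        for x in xs
    )


def best_ancestor(observed, ancestors, min_size=20):
    """Return the name of the ancestor whose fitted distribution best explains
    the observed entries, or None if no ancestor has enough data.

    observed:  salaries in the small cohort of interest
    ancestors: dict mapping ancestor name -> list of salaries
    """
    log_obs = [math.log(s) for s in observed]
    best_name, best_ll = None, -math.inf
    for name, salaries in ancestors.items():
        if len(salaries) < min_size:
            continue  # not enough data to serve as a prior
        logs = [math.log(s) for s in salaries]
        mu, sigma = fmean(logs), pstdev(logs)
        if sigma == 0:
            continue  # degenerate fit, likelihood undefined
        ll = normal_loglik(log_obs, mu, sigma)
        if ll > best_ll:
            best_name, best_ll = name, ll
    return best_name
```
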
  17. Bayesian Smoothing
      ▪ Assumption: the compensation data follows a log-normal distribution
        – Use Gaussian-gamma distribution as the conjugate prior
      ▪ Steps
        – Apply logarithmic transformation to all data entries
        – Find the “best” ancestral cohort
        – Obtain prior log-normal distribution from this cohort
        – Compute posterior log-normal distribution based on observed entries for the cohort of interest
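
The steps above can be sketched with the standard normal-gamma conjugate update on log-salaries. How the prior hyperparameters are derived from the ancestor is not given on the slides, so the `prior_strength` pseudo-count and the choice `beta0 = alpha0 * prior_var` are assumptions, as are the function names. Reporting percentiles from plug-in point estimates rather than the full posterior predictive distribution is also a simplification.

```python
import math
from statistics import NormalDist, fmean


def smooth_cohort(observed, prior_mean, prior_var, prior_strength=5.0):
    """Normal-gamma posterior update on log-salaries.

    prior_mean, prior_var: mean and variance of the ancestor's log-salaries
    prior_strength: assumed pseudo-count controlling how much the prior counts
    Returns (posterior mean, variance estimate), both in log space.
    Requires at least one observed entry.
    """
    logs = [math.log(s) for s in observed]
    n = len(logs)
    xbar = fmean(logs)
    ss = sum((x - xbar) ** 2 for x in logs)

    # Prior hyperparameters (assumed mapping from the ancestor's statistics).
    mu0, kappa0 = prior_mean, prior_strength
    alpha0 = prior_strength / 2.0
    beta0 = alpha0 * prior_var  # prior expected precision equals 1 / prior_var

    # Standard conjugate update for a normal likelihood with unknown mean and precision.
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    alpha_n = alpha0 + n / 2.0
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n)
    return mu_n, beta_n / alpha_n


def percentiles(mu, var, probs=(0.1, 0.5, 0.9)):
    """Percentiles of the log-normal distribution with log-space mean and variance."""
    sigma = math.sqrt(var)
    return [math.exp(mu + NormalDist().inv_cdf(p) * sigma) for p in probs]
```

In this sketch the posterior mean is a weighted average of the ancestor's mean and the cohort's sample mean, with weights `prior_strength` and `n`, so observed data dominates as the number of entries grows.
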
  18. Validation of Log-normal Assumption
  19. Regression Model for Title-Region Cohorts
      ▪ Smoothing motivation:
        – Only 30% of title-region cohorts have 30+ data points
        – Title-only or region-only parent cohorts not ideal
      ▪ Regression model used to obtain the prior distribution
        – instead of falling back to the parent cohort
      ▪ Inference motivation:
        – Infer salary insights for title-region cohorts with no data
  20. Offline Evaluation of Smoothing
      ▪ Observed entries => training (90%) + test (10%)
      ▪ Goodness-of-fit analysis using log-likelihood of test data
      ▪ Quantile coverage test for statistical consistency (non-parametric)
        – Fraction of the test data that lies between 10th and 90th percentiles of predicted insights
        – Ideally: 80%
        – Cohorts with >=5 samples: Smoothing (83%), Empirical (71%)
        – Cohorts with 3-4 samples: Smoothing (86%), Empirical (39%)
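
The quantile coverage test described on this slide reduces to a simple fraction. The sketch below assumes the held-out entries arrive as `(cohort, salary)` pairs and the predicted insights as a mapping from cohort to its 10th and 90th percentiles; both layouts and the function name are hypothetical.

```python
def quantile_coverage(test_entries, insights):
    """Fraction of held-out entries lying between the predicted p10 and p90.

    test_entries: iterable of (cohort, salary) pairs from the test split
    insights:     dict mapping cohort -> (p10, p90) predicted from training data
    Entries whose cohort has no prediction are skipped.
    """
    covered = 0
    total = 0
    for cohort, salary in test_entries:
        if cohort not in insights:
            continue
        p10, p90 = insights[cohort]
        total += 1
        if p10 <= salary <= p90:
            covered += 1
    return covered / total if total else float("nan")
```

A value close to 0.8 indicates that the predicted 10th to 90th percentile band is statistically consistent with the held-out data; values well below it suggest the band is too narrow.
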
  21. Outlier Detection
  22. Outlier Detection using BLS OES Dataset ⇒ Need to map to LinkedIn taxonomy
  23. Mapping BLS OES Dataset to LinkedIn Taxonomy
      ▪ BLS occupations coarser
        – 805 Standard Occupational Classification (SOC) codes vs 25K standardized titles
      ▪ Title mapping:
        – BLS SOC code --> O*Net alternate titles --> LinkedIn standardized titles
      ▪ BLS regions finer
      ▪ Region mapping:
        – BLS regions --> Zip codes --> LinkedIn regions
      ▪ Coverage for 6.5K standardized titles (and nearly all (~285) US region codes), 1.5M <titleId, region code> pairs
  24. Outlier Detection: Box and Whisker Method
      ▪ Cold-start: mapped BLS data
      ▪ Then, with member-submitted compensation entries
      ▪ Floored at federal minimum wage
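
A box-and-whisker check with a minimum-wage floor might look like the sketch below. The slides do not give the whisker multiplier or how the floor is annualized, so the conventional 1.5 × IQR whiskers and a 2,080-hour work year applied to the $7.25 federal hourly minimum are assumptions, as are the function names.

```python
from statistics import quantiles

# Assumed annualization of the US federal minimum wage: $7.25/hour * 2,080 hours.
MIN_ANNUAL_WAGE = 7.25 * 2080


def whisker_bounds(reference, k=1.5):
    """Lower and upper bounds from reference salaries for one cohort.

    reference: salaries used to calibrate the bounds (mapped BLS data at cold
               start, member-submitted entries later)
    k:         whisker multiplier (1.5 is the usual convention, assumed here)
    """
    q1, _, q3 = quantiles(reference, n=4)
    iqr = q3 - q1
    lower = max(q1 - k * iqr, MIN_ANNUAL_WAGE)  # floor at minimum wage
    upper = q3 + k * iqr
    return lower, upper


def is_outlier(salary, reference, k=1.5):
    """True if the salary falls outside the whiskers for its cohort."""
    lower, upper = whisker_bounds(reference, k)
    return salary < lower or salary > upper
```
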
  25. Deployment Challenges & Lessons Learned
      ▪ Extensible APIs to support evolving product needs
      ▪ Lack of good public “ground truth” datasets
      ▪ Coverage vs. robustness tradeoffs via simulations
      ▪ Choosing smoothing threshold
  26. Summary & Reflections
      ▪ LinkedIn Salary: a new internet application
        – Robust, reliable compensation insights via statistical modeling techniques
          ▪ Bayesian Hierarchical Smoothing
          ▪ Outlier Detection
        – Empirical evaluation & deployment lessons
      ▪ Future Directions
        – Career marketplace efficiency using compensation insights
        – Detecting inconsistencies in the insights across cohorts
          ▪ Position transition graphs, salaries from job postings, …
        – Addressing sample selection bias & response bias
  27. Thanks & Pointers
      ▪ Related paper: K. Kenthapadi, A. Chudhary, and S. Ambler, LinkedIn Salary: A System for Secure Collection and Presentation of Structured Compensation Insights to Job Seekers, IEEE PAC 2017 (arxiv.org/abs/1705.06976)
      ▪ Team:
        – Careers Engineering: Ahsan Chudhary, Alan Yang, Alex Navasardyan, Brandyn Bennett, Hrishikesh S, Jim Tao, Juan Pablo Lomeli Diaz, Lu Zheng, Patrick Schutz, Ricky Yan, Stephanie Chou, Joseph Florencio, Santosh Kumar Kancha, Anthony Duerr
        – Data Relevance Engineering: Krishnaram Kenthapadi, Stuart Ambler, Xi Chen, Yiqun Liu, Parul Jain, Liang Zhang, Ganesh Venkataraman, Tim Converse, Deepak Agarwal
        – Product Managers: Ryan Sandler, Keren Baruch
        – UED: Julie Kuang
        – Marketing: Phil Bunge
        – Business Operations: Prateek Janardhan
        – BA: Fiona Li
        – Testing: Bharath Shetty
        – ProdOps/VOM: Sunil Mahadeshwar
        – Security: Cory Scott, Tushar Dalvi, and team
      ▪ linkedin.com/salary
