Mayur Thakur, Managing Director, Goldman Sachs, at MLconf NYC 2017

Mayur is head of the Data Analytics Group in the Global Compliance Division. He joined Goldman Sachs as a managing director in 2014.
Prior to joining the firm, Mayur worked at Google, where he designed search algorithms for more than seven years. Previously, he was an assistant professor of computer science at the University of Missouri.
Mayur earned a PhD in Computer Science from the University of Rochester in 2004 and a BTech in Computer Science and Engineering from the Indian Institute of Technology, Delhi, in 1999.

Abstract Summary:

Surveillance platforms for bank compliance
Bank compliance uses models to look for outlier events such as insider trading, spoofing, front running, etc. With the exponential increase in the size of the data and a growing need to use such models, a key question is: How do we scale these models so they run efficiently and at the same time detect outlier events with good precision and recall?

In this talk, we will describe our experience building, from scratch, a Hadoop-based platform for surveillance.


  1. Surveillance Platform for Bank Compliance. Mayur Thakur, Goldman Sachs
  2. Stakes Are High
     • “Banks pay out £166bn over six years: a history of banking misdeeds and fines” – The Guardian
     • “Banks 'pay 60%' of profits in fines and customer payments” – BBC News
     • “Deutsche Bank to Pay $2.5 Billion to Settle Libor Investigation” – The Wall Street Journal
     • “$1.2 Billion Fine for Hedge Fund SAC Capital in Insider Case” – NY Times
  3. Key Technical Challenges
     • Diverse data sets and formats (SQL, flat files, proprietary, etc.)
     • Size of data, updated frequently
       – ~1B* pieces of text per year
       – ~1B edges in a graph
       – 100s of millions of trading events in a day
     • Data from the past can change (e.g., manual trade correction)
       – Causes a cascade of changes
     • Surveillance decisions need to be debuggable
       – Why was trade X on Oct 25, 2015 not flagged?
     • Not real time; often need time guarantees (say, T+1)
     * All numbers are orders of magnitude.
  4. Surveillance Architecture
     [Diagram labels: data sources SQL 1 … SQL n, Flatfile 1 … Flatfile m, Prop 1 … Prop k, HDFS 1 … HDFS q; Preprocessing pipeline 1 … n; Flattened 1 … n; Surv. 1 …; Alerts; Bookkeeping]
  5. Spoofing Illustration
  6. A Real World Spoofing Case
     • Navinder Singh Sarao was accused of spoofing
       – …and even contributing to the flash crash of 2010
     • Sarao pled guilty to spoofing in Nov 2016
     • He allegedly made $40M in illegal profit over years.
  7. Review of Regulatory Cases
     • Analyzed six regulatory enforcement cases related to spoofing
     • Identified common factors indicative of spoofing behavior
       – Creating a false impression of demand by placing spoof orders on the opposite side to trigger a price movement (“order imbalance”)
       – Cancellation of spoof orders within a short time after the pivot execution (“time to cancel post execution”)

     Case factors:

     Case                  Order Imbalance (> 2.5 times)   Time to Cancel Post Execution (< 1 sec)
     Sarao / Flash Crash   ✓                               ✓
     Hold Brothers         ✓                               ✓
     Coscia / Panther      ✓                               ✓
     Visionary Trading     NA                              ✓
     Swift                 ✓                               5 secs
     3 Red                 ✓                               ✓
  8. Transactions Data Pipeline
     • Spoofing implementation has 2 parts: data preprocessing and surveillance logic
     • Data preprocessing pipeline is reused for multiple surveillances
     • ~100M orders, 1B market data points, 100K products, multiple order management systems
     [Diagram labels: inputs Order 1 … Orders n, Exec. 1 … Exec m, Market 1 … Market k, Account, Product; Order, Exec., and Mkt. Processing Pipelines; Flattened Order, Flattened Exec, Flattened Market; Related Transactions; surveillances Spoofing, Front running, … Surv. n; Alerts]
  9. Related Transactions Table
     One row of the related transactions table contains information about one pivot execution and all the activity around the time of that execution.
     [Figure: one Related Transactions row with columns Pivot Exec, Orders, Execs/Cancels, MktData (216.8, 216.9, …)]
  10. Search Problem
     Given a semi-structured corpus of about 1B documents in a Hadoop cluster, design a search engine over YARN that is fast and satisfies the investigative needs of a variety of users.
     Unique Challenges
     • Cannot move data outside of an already existing Hadoop cluster
     • Support deep scoring algorithms for GS-specific signals (colloquial language, trades, etc.)
     • Unstructured and structured signals
  11. Search Workflow
     [Diagram labels: Web Client; Search Master; Ranker; Fast Index Servers; Slow Index Servers; HBase; YARN containers; HDFS]
     • Implemented as YARN apps
     • Auth enabled
     • Slow index servers can scale as much as HBase
     Current Index Size
     • # indexed documents > 1 billion
     • # indexed tokens > 500 billion
     • Runs in several TBs (memory and disk)