SlideShare une entreprise Scribd logo
1  sur  38
A day in the life of 
a functional data 
scientist 
Richard Minerich, Director of R&D at Bayard Rock 
@Rickasaurus
Watch the video with slide 
synchronization on InfoQ.com! 
http://www.infoq.com/presentations 
/functional-data-scientist 
InfoQ.com: News & Community Site 
• 750,000 unique visitors/month 
• Published in 4 languages (English, Chinese, Japanese and Brazilian 
Portuguese) 
• Post content from our QCon conferences 
• News 15-20 / week 
• Articles 3-4 / week 
• Presentations (videos) 12-15 / week 
• Interviews 2-3 / week 
• Books 1 / month
Presented at QCon New York 
www.qconnewyork.com 
Purpose of QCon 
- to empower software development by facilitating the spread of 
knowledge and innovation 
Strategy 
- practitioner-driven conference designed for YOU: influencers of 
change and innovation in your teams 
- speakers and topics driving the evolution and innovation 
- connecting and catalyzing the influencers and innovators 
Highlights 
- attended by more than 12,000 delegates since 2007 
- held in 9 cities worldwide
Projecting onto a 2D Plane
The Pairwise Entity Resolution Process 
Blocking 
• Two Datasets (Customer Data and Sanctions) 
• Pairs of Somehow Similar Records 
Scoring 
• Pairs of Records 
• Probability of Representing Same Entity 
Review 
• Records, Probability, Similarity Features 
• True/False Labels (Mostly by Hand)
Blocking
Scoring: 
Risk vs Probability 
Likely to (The Ideal) 
Launder Money 
Probably the 
Same Person
The Reality (Dominated by Garbage) 
Tiny Bump 
Upper 937 
Threshold 
161 
161,358
Let’s dig into a single point 
Jimmy Cournoyer 
El: 95/ SI:16
Citation Network (Safe View)
Relationship Network (Safe View)
British 
Columbia 
Rizzuto Crime Family 
Jimmy “Cosmo” 
“Superman” 
Cournoyer 
Quebec 
New York/NYC 
Bonanno Crime Family 
John “Big Man” 
Venizelos 
Reinvested in Cocaine 
California 
Flow of Drugs 
Hells Angels 
El Chapo 
Sinaloa Cartel
Jorge HankRhon 
Family & Friends 
$100s Millions 
Citibank, CH 
Brother Murdered
0.6 
0.5 
0.4 
0.3 
0.2 
0.1 
0 
Munging Data Redoing Work / 
Investigating Problems 
Fun Algorithms 
% Time Spent
Disgustingly Bad but Fairly Large Datasets 
▪ Both Wide (many fields) and Tall (many records) 
▪ From different systems (different encodings) 
▪ Missing data 
▪ Poorly merged data 
▪ Extra data 
▪ Non-unique IDs 
Every client is awful in a 
completely different way. 
NAME LARRY O BRIAN 
STATE CANADA 
CITY 121 Buffalo Drive, Montreal, 
Quebec H3G 1Z2 
ADDRESS NULL 
ZIP 00000 
DOB 10/24/80; 1/1/1979
SAM – Building for Bad Data 
▪ Lazy Pure Functional Core 
▪ Programmable Data Cleaning 
▪ Programmable ETL 
▪ Ad-Hoc Behaviors 
All with an F# Core and Barb 
for scripting. 
UI (C#) & 
Analysis (C#) 
Glue (F# and Barb) 
Data & 
Config 
In 
Data 
Out 
Algorithms (F#)
Other Kinds of Problems 
(sometimes even my fault) 
▪ Extra / Missing Data (e.g. incorrect subset or incorrect joins) 
▪ Wrong version of data (e.g. bad sync in SQL) 
▪ Bad configuration of dependencies 
The data lives in a locked down environment and so feedback cycles 
are slow. 
Lesson: Be Paranoid
F# Tools From Bayard Rock 
http://github.com/BayardRock 
Tokens Classification 
Pegasus Airlines ORGANIZATION 
Istanbul LOCATION 
Sochi LOCATION 
Russia LOCATION 
Turkey LOCATION 
Transportation 
Ministry 
ORGANIZATION
FSharpWebIntellisense 
https://github.com/BayardRock/FSharpWebIntellisense
iFSharp Notebook 
https://github.com/BayardRock/IfSharp
Barb, a simple .net record query language 
Name.Contains "John“ and (Age > 20 or Weight > 200) 
https://github.com/Rickasaurus/Barb
MITIE Dot Net (a wrapper for MIT’s MITIE) 
A Pegasus Airlines plane landed at 
an Istanbul airport Friday after a 
passenger "said that there was a 
bomb on board" and wanted the 
plane to land in Sochi, Russia, the 
site of the Winter Olympics, said 
officials with Turkey's 
Transportation Ministry. 
https://github.com/BayardRock/MITIE-Dot-Net 
Tokens Classification 
Pegasus Airlines ORGANIZATION 
Istanbul LOCATION 
Sochi LOCATION 
Russia LOCATION 
Turkey LOCATION 
Transportation 
ORGANIZATION 
Ministry
Other F# Community Tools (Not by Us) 
▪ Data Type Providers (SQL, OData, CSV, etc..) 
▪ Language Type Providers (R, Matlab, Python soon) 
▪ Deedle (like Pandas but for F#) 
▪ F# Charting
The Magic of Type Providers 
type Netflix = ODataService<"http://odata.netflix.com"> 
let avatarTitles = 
query { for t in netflix.Titles do 
where (t.Name.Contains "Avatar") 
sortBy t.Name 
take 100 }
How it works! 
Compiler Type Provider 
Types 
Erased Types 
The 
World! 
Type Providers! 
Libraries For Free!
Deedle (Like Python’s pandas but for F#) 
▪ Designed with Data Type Providers in Mind 
▪ Interops with the R Type Provider
But what about algorithmic code?
Ranking vs Regression 
▪ Regression - you’re trying to guess a number, only distance matters 
▪ May do a very bad job at ordering 
▪ In Ranking you’re trying to figure out some order, only order matters 
▪ May do a very bad job at providing a meaningful number 
Example: You’re a doctor with 20 spots open and 100 patents who 
want to see you today, which method would be the best for selecting 
20?
Regression 
푦 = 푋훽 + 휀 
y is labels 
X is features 
훽 is weights 
휀 is errors
“OLS” Regression via Gradient Descent in F#
Simple Ranking? You Can Use Regression. 
▪ The features are the difference in would-be regression features 
▪ The value to predict is the difference in rank 
Select 2 labeled samples randomly => (x1,y1) (x2,y2) 
x = x1 – x2 
y = y1 – y2 
Sample 1 Sample 2 Result 
Names? 1 1 0 
Addresses? 1 0 1 
DOB? 0 1 -1 
Same Person? 0 0 0
Simple Ranking in F#
Combined Ranking and 
Regression – D. Sculley 
You can improve your regression with 
ranking, and your ranking with regression. 
The best of both worlds!
Combined Ranking and Regression – 
D. Sculley @ Google, Inc
Thank You! 
Check out the NYC F# User Group: 
http://www.meetup.com/nyc-fsharp 
Find out more about F#: 
http://fsharp.org 
Contact me on twitter: 
@Rickasaurus
Watch the video with slide synchronization on 
InfoQ.com! 
http://www.infoq.com/presentations/functional 
-data-scientist

Contenu connexe

Similaire à A Day in the Life of a Functional Data Scientist

How We Use Functional Programming to Find the Bad Guys
How We Use Functional Programming to Find the Bad GuysHow We Use Functional Programming to Find the Bad Guys
How We Use Functional Programming to Find the Bad Guys
New York City College of Technology Computer Systems Technology Colloquium
 

Similaire à A Day in the Life of a Functional Data Scientist (20)

[Conference] Cognitive Graph Analytics on Company Data and News
[Conference] Cognitive Graph Analytics on Company Data and News[Conference] Cognitive Graph Analytics on Company Data and News
[Conference] Cognitive Graph Analytics on Company Data and News
 
IRE "Better Watchdog" workshop presentation "Data: Now I've got it, what do I...
IRE "Better Watchdog" workshop presentation "Data: Now I've got it, what do I...IRE "Better Watchdog" workshop presentation "Data: Now I've got it, what do I...
IRE "Better Watchdog" workshop presentation "Data: Now I've got it, what do I...
 
How We Use Functional Programming to Find the Bad Guys
How We Use Functional Programming to Find the Bad GuysHow We Use Functional Programming to Find the Bad Guys
How We Use Functional Programming to Find the Bad Guys
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
 
OrientDB & Node.js Overview - JS.Everywhere() KW
OrientDB & Node.js Overview - JS.Everywhere() KWOrientDB & Node.js Overview - JS.Everywhere() KW
OrientDB & Node.js Overview - JS.Everywhere() KW
 
diadem-vldb-2015
diadem-vldb-2015diadem-vldb-2015
diadem-vldb-2015
 
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
 
SEOktoberfest 2022 - Blending SEO, Discover, & Entity Extraction to Analyze D...
SEOktoberfest 2022 - Blending SEO, Discover, & Entity Extraction to Analyze D...SEOktoberfest 2022 - Blending SEO, Discover, & Entity Extraction to Analyze D...
SEOktoberfest 2022 - Blending SEO, Discover, & Entity Extraction to Analyze D...
 
New Era of Software with modern Application Security v1.0
New Era of Software with modern Application Security v1.0New Era of Software with modern Application Security v1.0
New Era of Software with modern Application Security v1.0
 
Talent Sleuthing in the Intelligence Community - Jo Weech; recruitDC Spring 2018
Talent Sleuthing in the Intelligence Community - Jo Weech; recruitDC Spring 2018Talent Sleuthing in the Intelligence Community - Jo Weech; recruitDC Spring 2018
Talent Sleuthing in the Intelligence Community - Jo Weech; recruitDC Spring 2018
 
Terry Reese - The world beyond MARC: let's focus on asking the right questions
Terry Reese - The world beyond MARC: let's focus on asking the right questionsTerry Reese - The world beyond MARC: let's focus on asking the right questions
Terry Reese - The world beyond MARC: let's focus on asking the right questions
 
Knowledge Integration in Practice
Knowledge Integration in PracticeKnowledge Integration in Practice
Knowledge Integration in Practice
 
Rapid Data Exploration With Hadoop
Rapid Data Exploration With HadoopRapid Data Exploration With Hadoop
Rapid Data Exploration With Hadoop
 
Foreign Direct Investment 201. April 19 2018
Foreign Direct Investment 201. April 19 2018Foreign Direct Investment 201. April 19 2018
Foreign Direct Investment 201. April 19 2018
 
Metadata and the Power of Pattern-Finding
Metadata and the Power of Pattern-FindingMetadata and the Power of Pattern-Finding
Metadata and the Power of Pattern-Finding
 
Getting to Know Your Data with R
Getting to Know Your Data with RGetting to Know Your Data with R
Getting to Know Your Data with R
 
Sourcing from unconventional sources - sosu v 281020
Sourcing from unconventional sources - sosu v 281020Sourcing from unconventional sources - sosu v 281020
Sourcing from unconventional sources - sosu v 281020
 
(Re-) Discovering Lost Web Pages
(Re-) Discovering Lost Web Pages(Re-) Discovering Lost Web Pages
(Re-) Discovering Lost Web Pages
 
Linked data at globo.com
Linked data at globo.comLinked data at globo.com
Linked data at globo.com
 
Entity Search: The Last Decade and the Next
Entity Search: The Last Decade and the NextEntity Search: The Last Decade and the Next
Entity Search: The Last Decade and the Next
 

Plus de C4Media

Plus de C4Media (20)

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy Mobile
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 

Dernier

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Dernier (20)

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 

A Day in the Life of a Functional Data Scientist

  • 1. A day in the life of a functional data scientist Richard Minerich, Director of R&D at Bayard Rock @Rickasaurus
  • 2. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /functional-data-scientist InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month
  • 3. Presented at QCon New York www.qconnewyork.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4.
  • 5. Projecting onto a 2D Plane
  • 6. The Pairwise Entity Resolution Process Blocking • Two Datasets (Customer Data and Sanctions) • Pairs of Somehow Similar Records Scoring • Pairs of Records • Probability of Representing Same Entity Review • Records, Probability, Similarity Features • True/False Labels (Mostly by Hand)
  • 8. Scoring: Risk vs Probability Likely to (The Ideal) Launder Money Probably the Same Person
  • 9. The Reality (Dominated by Garbage) Tiny Bump Upper 937 Threshold 161 161,358
  • 10. Let’s dig into a single point Jimmy Cournoyer El: 95/ SI:16
  • 11.
  • 14. British Columbia Rizzuto Crime Family Jimmy “Cosmo” “Superman” Cournoyer Quebec New York/NYC Bonanno Crime Family John “Big Man” Venizelos Reinvested in Cocaine California Flow of Drugs Hells Angels El Chapo Sinaloa Cartel
  • 15. Jorge HankRhon Family & Friends $100s Millions Citibank, CH Brother Murdered
  • 16. 0.6 0.5 0.4 0.3 0.2 0.1 0 Munging Data Redoing Work / Investigating Problems Fun Algorithms % Time Spent
  • 17. Disgustingly Bad but Fairly Large Datasets ▪ Both Wide (many fields) and Tall (many records) ▪ From different systems (different encodings) ▪ Missing data ▪ Poorly merged data ▪ Extra data ▪ Non-unique IDs Every client is awful in a completely different way. NAME LARRY O BRIAN STATE CANADA CITY 121 Buffalo Drive, Montreal, Quebec H3G 1Z2 ADDRESS NULL ZIP 00000 DOB 10/24/80; 1/1/1979
  • 18. SAM – Building for Bad Data ▪ Lazy Pure Functional Core ▪ Programmable Data Cleaning ▪ Programmable ETL ▪ Ad-Hoc Behaviors All with an F# Core and Barb for scripting. UI (C#) & Analysis (C#) Glue (F# and Barb) Data & Config In Data Out Algorithms (F#)
  • 19. Other Kinds of Problems (sometimes even my fault) ▪ Extra / Missing Data (e.g. incorrect subset or incorrect joins) ▪ Wrong version of data (e.g. bad sync in SQL) ▪ Bad configuration of dependencies The data lives in a locked down environment and so feedback cycles are slow. Lesson: Be Paranoid
  • 20. F# Tools From Bayard Rock http://github.com/BayardRock Tokens Classification Pegasus Airlines ORGANIZATION Istanbul LOCATION Sochi LOCATION Russia LOCATION Turkey LOCATION Transportation Ministry ORGANIZATION
  • 23. Barb, a simple .net record query language Name.Contains "John“ and (Age > 20 or Weight > 200) https://github.com/Rickasaurus/Barb
  • 24. MITIE Dot Net (a wrapper for MIT’s MITIE) A Pegasus Airlines plane landed at an Istanbul airport Friday after a passenger "said that there was a bomb on board" and wanted the plane to land in Sochi, Russia, the site of the Winter Olympics, said officials with Turkey's Transportation Ministry. https://github.com/BayardRock/MITIE-Dot-Net Tokens Classification Pegasus Airlines ORGANIZATION Istanbul LOCATION Sochi LOCATION Russia LOCATION Turkey LOCATION Transportation ORGANIZATION Ministry
  • 25. Other F# Community Tools (Not by Us) ▪ Data Type Providers (SQL, OData, CSV, etc..) ▪ Language Type Providers (R, Matlab, Python soon) ▪ Deedle (like Pandas but for F#) ▪ F# Charting
  • 26. The Magic of Type Providers type Netflix = ODataService<"http://odata.netflix.com"> let avatarTitles = query { for t in netflix.Titles do where (t.Name.Contains "Avatar") sortBy t.Name take 100 }
  • 27. How it works! Compiler Type Provider Types Erased Types The World! Type Providers! Libraries For Free!
  • 28. Deedle (Like Python’s pandas but for F#) ▪ Designed with Data Type Providers in Mind ▪ Interops with the R Type Provider
  • 29. But what about algorithmic code?
  • 30. Ranking vs Regression ▪ Regression - you’re trying to guess a number, only distance matters ▪ May do a very bad job at ordering ▪ In Ranking you’re trying to figure out some order, only order matters ▪ May do a very bad job at providing a meaningful number Example: You’re a doctor with 20 spots open and 100 patents who want to see you today, which method would be the best for selecting 20?
  • 31. Regression 푦 = 푋훽 + 휀 y is labels X is features 훽 is weights 휀 is errors
  • 32. “OLS” Regression via Gradient Descent in F#
  • 33. Simple Ranking? You Can Use Regression. ▪ The features are the difference in would-be regression features ▪ The value to predict is the difference in rank Select 2 labeled samples randomly => (x1,y1) (x2,y2) x = x1 – x2 y = y1 – y2 Sample 1 Sample 2 Result Names? 1 1 0 Addresses? 1 0 1 DOB? 0 1 -1 Same Person? 0 0 0
  • 35. Combined Ranking and Regression – D. Sculley You can improve your regression with ranking, and your ranking with regression. The best of both worlds!
  • 36. Combined Ranking and Regression – D. Sculley @ Google, Inc
  • 37. Thank You! Check out the NYC F# User Group: http://www.meetup.com/nyc-fsharp Find out more about F#: http://fsharp.org Contact me on twitter: @Rickasaurus
  • 38. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/functional -data-scientist