Big Data/DIG
Domain-Specific Insight Graphs
Pedro Szekely
University of Southern California
www.isi.edu/~szekely
Connecting The Dots
Using the Web To Solve Hard Problems
Hard Problems
State of the Art
Our Solution
Impact
Hard Problems
Healthcare
Research investment
Human trafficking
…
Human Trafficking
Illegal drugs
Arms trafficking
Human trafficking
Illegal Industries
$32 billion
profit per year
14
Average Age of Entry To Prostitution in the US
$150,000
Pimp’s Profit Per Child Per Year
$45,000,000
Advertising Budget On the Web
Human Trafficking on the Web
Thousands of Web sites
Millions of pages
State of the Art
Google Finds “DOTS”
Recipe
“Dot”
Nutrition
“Dot”
Google finds dots
User finds connections
System Objectives
1.  find all the dots
2.  find all the connections
Our Solution
1.  Downloads all relevant pages
2.  Extracts & cleans the data
3.  Discovers connections
4.  Builds unified database
5.  Creates query & analysis portal
1.  Go to Web site
2.  Download page
3.  Follow links
4.  Wait, then repeat
24/7
Web Crawling Software
2,000 pages/hour
50,000,000 pages total
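The four crawling steps above can be sketched as a small breadth-first loop. This is only an illustration using the Python standard library, not the project's crawling software. The names `crawl`, `fetch_over_http`, `max_pages` and `delay_seconds` are hypothetical, and a production crawler would also need robots.txt handling, per-site scoping and retry logic, all omitted here.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_over_http(url):
    """Default fetcher: download a page and decode it as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed_url, fetch=fetch_over_http, max_pages=100, delay_seconds=1.0):
    """Breadth-first crawl from seed_url; returns {url: html}."""
    queue = deque([seed_url])          # step 1: go to Web site
    seen = {seed_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)          # step 2: download page
        except Exception:
            continue                   # skip pages that fail to load
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:      # step 3: follow links
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(delay_seconds)      # step 4: wait, then repeat
    return pages
```

Passing the fetcher as a parameter keeps the loop testable without network access: any function that maps a URL to an HTML string can stand in for `fetch_over_http`.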
Data Extraction
“YOU don't wanna miss out on ME :)
Perfect lil booty Green eyes Long
curly black hair Im a Irish,Armenian
and Filipino mixed princess :) ❤ Kim
❤ 7○7~7two7~7four77 ❤ HH 80
roses ❤ Hour 120 roses ❤ 15 mins
60 roses”
name: Kim
eye-color: green
hair-color: black
phone: 707-727-7477
rate: $60/15min
$80/30min
$120/60min
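The phone field in the example shows why extraction is hard: the number is written with look-alike symbols and spelled-out digits. The deck's extractors are learned from annotations; the sketch below is a much simpler rule-based illustration of the same normalization, assuming the candidate span has already been located. `normalize_phone` and both lookup tables are hypothetical names, and the 10-digit check assumes US numbers.

```python
import re

# Spelled-out digits that may be mixed into an obfuscated number.
WORD_DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
# Characters that visually stand in for a digit (a small, assumed list).
LOOKALIKES = {"○": "0", "o": "0"}

WORD_PATTERN = re.compile("|".join(WORD_DIGITS))


def normalize_phone(span):
    """Return a NNN-NNN-NNNN string for a candidate span, or None."""
    text = span.lower()
    # Replace number words first so the "o" in "two" is not read as a zero.
    text = WORD_PATTERN.sub(lambda m: WORD_DIGITS[m.group(0)], text)
    for symbol, digit in LOOKALIKES.items():
        text = text.replace(symbol, digit)
    digits = "".join(ch for ch in text if ch.isdigit())
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
```

Applied to the span from the ad above, `normalize_phone("7○7~7two7~7four77")` returns `"707-727-7477"`, matching the extracted record.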
Crowd-Sourced Annotations
“YOU don't wanna miss out on ME :) Perfect lil booty Green eyes
Long curly black hair Im a Irish,Armenian and Filipino mixed princess :)
Green O eye color O hair color
black O eye color O hair color
2 cents/sentence
Automatic Construction of Extractors
5,000 annotations
Machine
Learning
Ready-to-use
Extraction
Software
$100, 1 day
Technology: Conditional Random Fields
Data Cleaning
Ad  Weight (raw)  Weight (kg)
1   130           59
2   480           (none)
3   133lbs        60
4   BBW           (none)
5   52 kg         52
6   110 pounds    50
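The cleaning step shown in the table can be sketched as a small normalization function. The rules here are inferred from the six example rows and are assumptions, not the project's actual cleaning code: a bare number is read as pounds, non-numeric values are dropped, and values outside a plausible range are dropped. The range in `PLAUSIBLE_KG` is a hypothetical choice made only so that row 2 is rejected.

```python
import re

LB_TO_KG = 0.45359237
PLAUSIBLE_KG = (30.0, 150.0)  # assumed bounds, not taken from the slides

KG_UNITS = {"kg", "kgs", "kilo", "kilos"}
LB_UNITS = {"", "lb", "lbs", "pound", "pounds"}  # "" = bare number, assumed pounds

WEIGHT_PATTERN = re.compile(r"\s*(\d+(?:\.\d+)?)\s*([a-z]*)\.?\s*")


def weight_to_kg(raw):
    """Return the weight in whole kilograms, or None if it cannot be trusted."""
    match = WEIGHT_PATTERN.fullmatch(raw.lower())
    if match is None:
        return None                      # e.g. "BBW": not a number at all
    value, unit = float(match.group(1)), match.group(2)
    if unit in KG_UNITS:
        kg = value
    elif unit in LB_UNITS:
        kg = value * LB_TO_KG
    else:
        return None                      # unknown unit
    low, high = PLAUSIBLE_KG
    return round(kg) if low <= kg <= high else None
```

Run over the raw column above, this gives 59, None, 60, None, 52 and 50, which agrees with the cleaned column.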
Using Extracted Data to Connect the Dots
Mary: 222-0000
Lucy: 777-0000
Police Database
Bad Guy: 777-0000
Technology: Karma Information Integration Toolkit
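Connecting the two sources on this slide amounts to a join on the phone number. The deck does this with Karma; the sketch below is a plain-Python illustration of the join only, with a hypothetical record layout (`name`, `phone`, `label`) and hypothetical function names.

```python
from collections import defaultdict


def digits_only(phone):
    """Strip punctuation so differently formatted numbers compare equal."""
    return "".join(ch for ch in phone if ch.isdigit())


def link_by_phone(ads, watchlist):
    """Return (ad name, watchlist label, phone) for every shared phone number."""
    names_by_phone = defaultdict(list)
    for ad in ads:
        names_by_phone[digits_only(ad["phone"])].append(ad["name"])
    links = []
    for entry in watchlist:
        for name in names_by_phone.get(digits_only(entry["phone"]), []):
            links.append((name, entry["label"], entry["phone"]))
    return links
```

With the slide's data (ads for Mary at 222-0000 and Lucy at 777-0000, and a police record for 777-0000), the only link returned is the one between Lucy and the police record.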
Using Text Similarity to Connect the Dots
E M I LY SEXY. ** wHiTe/lATin girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S
LAY LA SEXY. ** wHiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O____U____T____C___A___L____L____S
L I LA SEXY. ** WhiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S
Technology: MinHash/LSH
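The three ads above differ in name, casing and spacing but are otherwise near-duplicates. A minimal MinHash/LSH sketch of how such ads can be grouped follows, using only the standard library. The parameter choices (4-character shingles, 128 hash functions, 4 rows per band) and all function names are assumptions for illustration, not the project's configuration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 128


def shingles(text, k=4):
    """Set of k-character shingles after dropping case, spaces and punctuation."""
    normalized = "".join(ch for ch in text.lower() if ch.isalnum())
    return {normalized[i:i + k] for i in range(max(len(normalized) - k + 1, 1))}


def _hash(shingle, seed):
    """One member of a family of hash functions, selected by seed."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(text, num_hashes=NUM_HASHES):
    """For each hash function, keep the minimum hash over the shingle set."""
    grams = shingles(text)
    return [min(_hash(g, seed) for g in grams) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, rows_per_band=4):
    """Pairs of document indexes that share at least one identical band."""
    buckets = defaultdict(set)
    for doc_id, sig in enumerate(signatures):
        for start in range(0, len(sig), rows_per_band):
            band = tuple(sig[start:start + rows_per_band])
            buckets[(start, band)].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

The normalization in `shingles` is what makes the obfuscated casing and underscore padding irrelevant. The banding step means only ads that land in the same bucket need to be compared, rather than every pair among millions of ads.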
Using Image Similarity to Connect the Dots
20 Million Images
Technology: Deep Learning
Create Unified Database
50 Million Ads
Technologies: Karma, Hadoop, Hive, Elastic-Search
20 Computers, 2 Hours
4 Billion Records
Impact
Deployed to Law Enforcement and NGOs
Organizations
University of Southern California
Columbia University
InferLink
NASA JPL
Next Century
Researchers
Pedro Szekely (PI)
Shih-Fu Chang
Tao Chen
Kevin Knight
Craig Knoblock
Daniel Marcu
Chris Mattmann
Steve Minton
Prem Natarajan
Andrew Philpot
Mike Tamayo
Engineers
Brian Amanatullah
Rachel Artiss
David Flynt
Dipsy Kapoor
Students
Jason Slepicka
Amandeep Singh
Chengye Yin
Subessware Karunamoorthy

Big Data/DIG: Domain-Specific Insight Graphs by Pedro Szekely of ISI/USC