Mass Declassification: What If? Jeff Jonas, IBM Distinguished Engineer, Chief Scientist, IBM Entity Analytics. September 23, 2010
The Ask
The Problem at Hand
Background
In Today’s Session
Context Accumulating Systems
From Pixels to Pictures to Insight (diagram labels: Observations, Context, Relevance, Consumer (an analyst, a system, the sensor itself, etc.), Contextualization)
Without Context
Consequences
Context Accumulation (diagram labels: Trusted Supplier, Job Applicant, Stolen Identity, Known Terrorist)
Puzzle Metaphor Primer
How Context Accumulates
False Negatives Overstate The Universe (chart labels: Observations, Unique Identities, True Population)
Counting Is Difficult
File 1: Mark Smith, 6/12/1978, 443-43-0000
File 2: Mark R Smith, (707) 433-0000, DL: 00001234
The Rise and Fall of a Population (chart labels: Observations, Unique Identities, True Population)
Data Triangulation
File 1: Mark Smith, 6/12/1978, 443-43-0000
File 2: Mark R Smith, (707) 433-0000, DL: 00001234
New Record: Mark Randy Smith, 443-43-0000, DL: 00001234
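The two slides above ("Counting Is Difficult" and "Data Triangulation") can be illustrated with a minimal sketch. It is not the method used in the talk: it assumes records are merged whenever they share an exact SSN or driver's license number, and it assumes the field-to-file assignment read off the flattened slide. The names `Record`, `shares_identifier`, and `count_identities` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Record:
    """One observation; field assignment per file is an assumption from the slide."""
    source: str
    name: str
    dob: Optional[str] = None
    ssn: Optional[str] = None
    phone: Optional[str] = None
    dl: Optional[str] = None


def shares_identifier(a: Record, b: Record) -> bool:
    """True when both records carry the same non-empty SSN or driver's license."""
    return any(
        getattr(a, f) is not None and getattr(a, f) == getattr(b, f)
        for f in ("ssn", "dl")
    )


def count_identities(records: List[Record]) -> int:
    """Count unique identities by merging linked records with union-find."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if shares_identifier(records[i], records[j]):
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(records))})


file1 = Record("File 1", "Mark Smith", dob="6/12/1978", ssn="443-43-0000")
file2 = Record("File 2", "Mark R Smith", phone="(707) 433-0000", dl="00001234")
new_record = Record("New Record", "Mark Randy Smith", ssn="443-43-0000", dl="00001234")

# Without the third record, nothing links File 1 to File 2.
before = count_identities([file1, file2])
# The new record shares an SSN with File 1 and a DL with File 2, bridging them.
after = count_identities([file1, file2, new_record])
print(before, after)
```

Under these assumptions the identity count drops from two to one once the bridging record arrives, which is the "rise and fall" effect the chart slides describe: more observations can reduce the number of unique identities.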
Increasing Accuracy and Performance (chart labels: Observations, Unique Identities, True Population)
“Expert Counting” is Fundamental to Prediction
Mass Declassification Predictions
Using What Data Points?
Open Source Discovery/Scoring
Context Accumulation (diagram labels: FOIA March 2010, Open Source Reference, Dirty Word, Classified – Asserted, Mufasa 7 Warhead)
Context Accumulation + Statistics
Declassification dispositions … becoming a force multiplier. The more human dispositions, the more automated dispositions.

Human Triage    Auto Triage
5,000           20
10,000          4,000
100,000         65,000
1,000,000       17,000,000
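To make the "force multiplier" claim concrete, the ratio of automated to human dispositions can be computed for each row. This is a small worked sketch that assumes the flattened slide numbers alternate Human Triage, Auto Triage; that pairing is an inference from the slide layout, and the names `TRIAGE_ROWS` and `auto_per_human` are hypothetical.

```python
# (human dispositions, automated dispositions) as read from the slide.
# Assumption: the flattened numbers alternate Human Triage, Auto Triage.
TRIAGE_ROWS = [
    (5_000, 20),
    (10_000, 4_000),
    (100_000, 65_000),
    (1_000_000, 17_000_000),
]


def auto_per_human(human: int, auto: int) -> float:
    """Automated dispositions obtained per human disposition."""
    return auto / human


ratios = [auto_per_human(human, auto) for human, auto in TRIAGE_ROWS]

for (human, auto), ratio in zip(TRIAGE_ROWS, ratios):
    print(f"{human:>9,} human -> {auto:>10,} auto ({ratio:g} per human disposition)")
```

Read this way, the ratio grows from well under one automated disposition per human disposition in the first row to seventeen in the last.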
Policy Questions
Strawman Architecture (diagram components, in slide order: 450M Docs, Historical Dispositions, DirtyWords, Etc.; Feature Extraction & Classification; Context Accumulation; Predictions(*); Workflow System; Dispositions. (*) Recommendations: Equity of, Disposition, Priority)
Another Idea: Crowd Sourcing
Another Idea: Better Classification
Challenges
Closing Thoughts
Worst Case Scenario
Related Blog Posts
Blogging At: www.JeffJonas.TypePad.com. Information Management, Privacy, National Security, and Triathlons. Questions?

Editor's Notes

  1. Here is a look at the DeepQA architecture. This is like looking inside the brain of the Watson system from about 30,000 feet. Remember, natural language is ambiguous, polysemous, and tacit, and its meaning is often highly contextual. Bottom line: the computer needs to consider many possible meanings, attempting to find the inference paths that are most confidently supported by the data.

     The primary computational principle supported by the DeepQA architecture is to assume and maintain multiple interpretations of the question, to generate many plausible answers or hypotheses, and to collect and process many different evidence paths that might support or refute those hypotheses. Each component in the system adds assumptions about what the question means, what the content means, what the answer might be, or why it might be correct.

     DeepQA is implemented as an extensible architecture and was designed from the outset to support interoperability across independently developed analytics. For this reason it was implemented using UIMA, a framework and OASIS standard for interoperable text and multi-modal analysis contributed by IBM to the open-source community and now an Apache project (http://uima.apache.org). Over 100 different algorithms, implemented as UIMA components, were developed, advanced and integrated into this architecture to build Watson.

     In the first step, Question and Category Analysis, parsing algorithms decompose the question into its grammatical or syntactic components. Other algorithms here identify and tag specific semantic entities such as names, places or dates. In particular, the type of thing being asked for, if it is indicated at all, is identified. We call this the LAT, or Lexical Answer Type, like this “FISH”, this “CHARACTER” or “COUNTRY”.

     In Query Decomposition, different assumptions are made about if and how the question might be decomposed into sub-questions. The original and each identified sub-part follow parallel paths through the system.

     In Hypothesis Generation, DeepQA does a variety of very broad searches for each of several interpretations of the question. These searches are performed over a combination of unstructured data (natural language documents) and structured data (available knowledge bases). The goal of this step is to generate possible answers to the question and/or its sub-parts. At this point there is not a lot of confidence in these possible answers, since little intelligence has been applied to understanding the content that might relate to the question. The focus is on generating a broad set of hypotheses, or for this application what we call “Candidate Answers”. To implement this step for Watson we used multiple open-source text and KB search components.

     DeepQA acknowledges that resources are ultimately limited, and some parameterized judgment about which candidate answers are worth pursuing further must be made given constraints on time and available hardware. Based on a trained threshold for optimizing the tradeoff between accuracy and latency, DeepQA uses soft filtering: it uses different lightweight algorithms to judge which candidates are worth gathering evidence for and which should get less attention and continue through the computation as-is. In contrast, if this were a hard filter, those candidates falling below the filter would be eliminated from consideration entirely at this point.

     In Hypothesis & Evidence Scoring, the candidate answers are scored independently of any additional evidence by deeper analysis algorithms. This may, for example, include Typing Algorithms. These are algorithms that produce a score indicating how likely it is that a candidate answer is an instance of the Lexical Answer Type determined in the first step, for example Country, Agent, Character, City, Slogan, Book, etc. Many of these algorithms may fire, using different resources and techniques to come up with a score. What is the likelihood that “Washington”, for example, refers to a “General” or a “Capital” or a “State” or a “Mountain” or a “Father” or a “Founder”?

     Evidence, in this case more documents, passages and structured facts, is collected for the many candidate answers. Each of these pieces of evidence is subjected to many independently developed algorithms that deeply analyze the evidentiary passages, for example, and score the likelihood that the passage supports or refutes the correctness of the candidate answer.

     In the Synthesis step, if the question had been decomposed into sub-parts, one or more synthesis algorithms will fire, with varying levels of certainty. They will apply methods for inferring a coherent final answer from the constituent elements derived from the question's sub-parts.

     Finally, arriving at the last step, Final Merging and Ranking, are many possible answers, each paired with many pieces of evidence and each of these scored by many algorithms to produce hundreds of feature scores, all giving some evidence for the correctness of each candidate answer. Trained models are applied to weigh the relative importance of these feature scores. These models are trained with ML methods to predict, based on past performance, how best to combine all these scores to produce a final, single confidence number for each candidate answer and to produce the final ranking of all candidates. The answer with the strongest confidence would be Watson’s final answer, and Watson would try to buzz in provided that top answer’s confidence was above a certain threshold.

     The DeepQA system defers commitments and carries possibilities through the entire process while searching for increasingly broader contextual evidence and more credible inferences to support the most likely candidate answers. All the algorithms used to interpret questions, generate candidate answers, score answers, collect evidence and score evidence are loosely coupled but work holistically by virtue of DeepQA’s pervasive machine learning infrastructure. No one component could realize its impact on end-to-end performance without being integrated and trained with the other components, and they are all evolving simultaneously. In fact, what had a 10% impact on some metric one day might, one month later, contribute only 2% to overall performance due to evolving component algorithms and interactions. This is why the system, as it develops, is regularly trained, evaluated and retrained.

     DeepQA is a complex system architecture designed to be incrementally extended in both data and algorithms, to deal with the challenges of natural language processing applications and to adapt to new domains of knowledge. The Jeopardy! Challenge has greatly inspired its design and implementation for the Watson system.

     -David A. Ferrucci
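The soft-filtering and final-ranking steps described in the note above can be sketched as follows. This is an illustration under stated assumptions, not DeepQA's implementation: the logistic combination of feature scores, the hand-picked weights (the note says the real models are trained), and every name here (`Candidate`, `soft_filter`, `hard_filter`, `rank`, `final_answer`) are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Candidate:
    answer: str
    light_score: float  # cheap score used only by the filter
    feature_scores: Dict[str, float] = field(default_factory=dict)
    gather_evidence: bool = False
    confidence: float = 0.0


def soft_filter(candidates: List[Candidate], threshold: float) -> List[Candidate]:
    """Flag candidates worth gathering evidence for; none are removed."""
    for c in candidates:
        c.gather_evidence = c.light_score >= threshold
    return candidates


def hard_filter(candidates: List[Candidate], threshold: float) -> List[Candidate]:
    """For contrast: candidates below the threshold are dropped entirely."""
    return [c for c in candidates if c.light_score >= threshold]


def rank(candidates: List[Candidate], weights: Dict[str, float]) -> List[Candidate]:
    """Combine feature scores into one confidence per candidate and sort.

    Assumption: a weighted sum passed through a logistic function stands in
    for the trained models mentioned in the note.
    """
    for c in candidates:
        z = sum(weights.get(name, 0.0) * value for name, value in c.feature_scores.items())
        c.confidence = 1.0 / (1.0 + math.exp(-z))
    return sorted(candidates, key=lambda c: c.confidence, reverse=True)


def final_answer(ranked: List[Candidate], buzz_threshold: float) -> Optional[str]:
    """Return the top answer only if its confidence clears the threshold."""
    top = ranked[0]
    return top.answer if top.confidence >= buzz_threshold else None


# Illustrative values only; weights would normally be learned from past performance.
candidates = [
    Candidate("A", 0.9, {"type_match": 2.0, "passage_support": 1.5}),
    Candidate("B", 0.6, {"type_match": 0.5, "passage_support": 1.0}),
    Candidate("C", 0.1, {"type_match": -1.0, "passage_support": 0.2}),
]
weights = {"type_match": 1.0, "passage_support": 1.0}

kept_soft = soft_filter(candidates, threshold=0.5)
kept_hard = hard_filter(candidates, threshold=0.5)
ranked = rank(kept_soft, weights)
print(len(kept_soft), len(kept_hard), [c.answer for c in ranked])
```

The contrast to notice is that the soft filter keeps all three candidates and only marks which ones merit further evidence gathering, while the hard filter would have discarded the lowest-scoring one before ranking.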