Analyzing Unstructured Data for Stories
eugenewu@mit.edu
What Am I Talking About?
• Example

• Structured Data 101

• Structured Data Continuum

• More Examples
http://projects.propublica.org/drywall/
http://www.propublica.org/documents/item/drywall-plaintiffs-omnibus-class-action-complaint
1056. Plaintiffs - Intervenors, Robert and Tasha Lambert
are citizens of Alabama and together own real property
located at 541 Lynn Hurst Court, Montgomery, Alabama
36117. Plaintiffs are participating as class representatives
in the class and subclasses as set forth in the schedules
accompanying this complaint which are incorporated
herein by reference. 1057. Plaintiff-Intervenor, Brenda
Owens, is a citizen of Alabama and owns real property
located at 2105 Lane Avenue, Birmingham, Alabama
35217. Plaintiff is participating as a class representative in
the class and subclasses as set forth in the schedules
accompanying this complaint which are incorporated
herein by reference. 1058. Plaintiffs-Intervenors, Daniel
and Nicole Smith are citizens of Alabama and together
own real property located at 766 Tabernacle Road,
Monroeville, Alabama

http://www.propublica.org/documents/item/drywall-plaintiffs-omnibus-class-action-complaint
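A minimal sketch of how addresses might be pulled out of a text blob like the one above. The sample text is abridged from the excerpt, and the "located at street, city, state zip" pattern is an assumption about how these paragraphs are worded, not the method ProPublica used.

```python
import re

# Abridged from the complaint excerpt above.
TEXT = (
    "1056. Plaintiffs - Intervenors, Robert and Tasha Lambert are citizens of "
    "Alabama and together own real property located at 541 Lynn Hurst Court, "
    "Montgomery, Alabama 36117. 1057. Plaintiff-Intervenor, Brenda Owens, is a "
    "citizen of Alabama and owns real property located at 2105 Lane Avenue, "
    "Birmingham, Alabama 35217."
)

# Assumed wording: "located at <street>, <city>, <state> <5-digit zip>".
ADDRESS = re.compile(
    r"located at (?P<street>[^,]+), (?P<city>[^,]+), "
    r"(?P<state>[A-Za-z ]+?) (?P<zip>\d{5})"
)

def extract_addresses(text):
    """Return one dict (street, city, state, zip) per address found."""
    return [m.groupdict() for m in ADDRESS.finditer(text)]

for row in extract_addresses(TEXT):
    print(row)
```

Paragraphs that do not follow the assumed wording (for example, an address with no zip code) are silently skipped, so the results still need a manual spot check.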
541 Lynn Hurst Court,
Montgomery, Alabama 36117



37.0625, -95.677068



Jefferson County
http://projects.propublica.org/drywall/
Scanned Documents   Unstructured Information
Addresses           Structured Data
Google Maps         Visualization



       Who cares?
       What is it?
Who Cares?
Software
   Store
      Databases
      PANDA
   Analyze
      Fusion tables
      Excel
      Databases
      R/Python/Ruby
Visualization
Mashups
Who Cares?
Software
Visualization
Mashups
   Tainted House Data
     + Economic Data
     + Health Stats
     + Crime Stats
     + Corruption Data
Structured Data
Attribute
   Name
   Data type

Consistent
Structured Data
Attribute
   Name
   Data type

Consistent

Florida’s Lee County has 1518 addresses
Structured Data
Attribute
   Name
   Data type
      Numeric (integers, dollars, …)
      Date/Time
      Lat, Lon
      Structured strings (Florida)

Consistent
Structured Data
Attribute
   Name
   Data type

Consistent

FLORIDA        5
FL             10
Flroida        1
FloridaState   1
Florida’s      1
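One way to make the spellings above consistent, as a rough sketch: map every variant to a single canonical name, then add up the counts. The lookup table is hypothetical and built by hand from the variants on the slide.

```python
from collections import Counter

# Spellings and counts as shown on the slide above.
raw_counts = {"FLORIDA": 5, "FL": 10, "Flroida": 1, "FloridaState": 1, "Florida's": 1}

# Hypothetical hand-built lookup: lowercase variant -> canonical name.
CANONICAL = {
    "florida": "Florida",
    "fl": "Florida",
    "flroida": "Florida",
    "floridastate": "Florida",
    "florida's": "Florida",
}

def normalize(value):
    """Map a raw spelling to its canonical form; leave unknown values alone."""
    key = value.strip().lower().replace("\u2019", "'")  # treat curly apostrophes as straight
    return CANONICAL.get(key, value)

totals = Counter()
for spelling, count in raw_counts.items():
    totals[normalize(spelling)] += count

print(totals)  # Counter({'Florida': 18})
```

Once the values agree, the five separate rows collapse into one row of 18.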
What Am I Talking About?

• Structured Data 101

• Structured Data Continuum

• More Examples
unstructured               structured




               Continuum
Images

unstructured               structured




http://www.whatisstephenharperreading.ca/2010/03/01/book-number-76-one-day-in-the-life-of-ivan-denisovich-by-alexander-solzhenitsyn/
Images      Text Blob


unstructured                                           structured


  1056. Plaintiffs - Intervenors, Robert and Tasha Lambert
  are citizens of Alabama and together own real property
  located at 541 Lynn Hurst Court, Montgomery, Alabama
  36117. Plaintiffs are participating as class representatives
  in the class and subclasses as set forth in the schedules
  accompanying this complaint which are incorporated herein
  by reference. 1057. Plaintiff-Intervenor, Brenda Owens, is a
  citizen of Alabama and owns real property located at 2105
  Lane Avenue, Birmingham, Alabama 35217. Plaintiff is
  participating as a class representative in the
Images      Text Blob      Email


unstructured                                            structured




               Subject   Re: IRE conference in Boston


                 Date    June 1, 3:08PM


                 From    jaimi@ire.org
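Email is partly structured already: the headers are named attributes. A small sketch using Python's standard `email` module; the raw message is made up from the fields on the slide, and the To address and body are invented for illustration.

```python
from email import message_from_string

# Made-up raw message. Subject, Date and From come from the slide;
# the To address and the body are invented for illustration.
RAW = """\
From: jaimi@ire.org
To: someone@example.com
Subject: Re: IRE conference in Boston
Date: June 1, 3:08PM

See you there!
"""

msg = message_from_string(RAW)

# Each header becomes a named attribute; the body is still a text blob.
row = {name: msg[name] for name in ("Subject", "Date", "From")}
print(row)
```

The Date value comes back as a plain string here; turning it into a real date/time type is a separate step.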
Images      Text Blob      Email      Excel

unstructured                          structured

“It’s sunny in texas”

Tweet                 Weather   Location
It’s sunny in texas   Sunny     Texas
Images      Text Blob      Email      Excel

unstructured                          structured

“It’s sunny in texas”

Tweet                 Weather   Location
It’s sunny in texas   Sunny     (37.06, -95.67)
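A rough sketch of turning the tweet into a row with Weather and Location attributes. The two word lists are hypothetical and tiny; converting the place name into coordinates (geocoding) is left out.

```python
import re

# Hypothetical, tiny vocabularies for illustration only.
WEATHER_WORDS = {"sunny": "Sunny", "rainy": "Rainy", "snowing": "Snowing"}
PLACES = {"texas": "Texas", "florida": "Florida"}

def tweet_to_row(tweet):
    """Pick out the first weather word and the first place name, if any."""
    words = re.findall(r"[a-z']+", tweet.lower())
    weather = next((WEATHER_WORDS[w] for w in words if w in WEATHER_WORDS), None)
    location = next((PLACES[w] for w in words if w in PLACES), None)
    return {"tweet": tweet, "weather": weather, "location": location}

print(tweet_to_row("It's sunny in texas"))
```

A tweet with no known weather word or place gets `None` in that column, which is easier to spot and count than a missing row.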
When       You have unstructured data
 Ask       What structure do I need?
Find       Attributes with simple types
What Am I Talking About?

• Structured Data 101

• Structured data continuum

• More Examples
2011 State of the Union




http://www.boston.com/news/politics/specials/obama_state_of_the_union_word_cloud/
Name   Type/Meaning

Word   String
Mr. Speaker, Mr. Vice President,
members of Congress,
distinguished guests, and fellow
Americans:

Tonight I want to begin by
congratulating the men and
women of the 112th Congress, as
well as your new Speaker, John
Boehner. And as we mark this
occasion, we're also mindful of
the empty chair in this chamber,
and we pray for the health of our
colleague -- and our friend --
Gabby Giffords.

It's no secret that those of us here
tonight have had our differences
over the last two years. The
debates have been contentious;
we have fought fiercely for our
beliefs. And that's a good thing.
Word
Mr
Speaker
Vice
President
Members
Congress
Distinguished
Guests
Americans
People
Jobs
New
years
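A word cloud only needs one attribute, Word, plus a count per word. A minimal sketch on the opening line of the speech; the stopword list is a hypothetical, very short one, and this is not necessarily how the linked graphic was built.

```python
import re
from collections import Counter

# Opening line of the speech excerpt above.
TEXT = ("Mr. Speaker, Mr. Vice President, members of Congress, "
        "distinguished guests, and fellow Americans:")

# Hypothetical, very short stopword list.
STOPWORDS = {"and", "of", "the", "a", "to", "in"}

def word_counts(text):
    """Lowercase the text, split it into words, and count the non-stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

counts = word_counts(TEXT)
print(counts.most_common(3))
```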
Bin Laden Tweets/Sec




http://www.flickr.com/photos/twitteroffice/5681263084/
Name   Type/Meaning

Time   Time
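With a single Time attribute, tweets per second is just a count per timestamp. A small sketch; the timestamps are invented for illustration and are not real tweet data.

```python
from collections import Counter
from datetime import datetime

# Invented timestamps for illustration only; not real tweet data.
STAMPS = [
    "2011-05-01 22:45:01",
    "2011-05-01 22:45:01",
    "2011-05-01 22:45:01",
    "2011-05-01 22:45:02",
    "2011-05-01 22:45:04",
    "2011-05-01 22:45:04",
]

def tweets_per_second(stamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse each timestamp string and count how many fall in each second."""
    return Counter(datetime.strptime(s, fmt) for s in stamps)

for second, count in sorted(tweets_per_second(STAMPS).items()):
    print(second.time(), count)
```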
Deadly Day in Baghdad




http://www.nytimes.com/interactive/2010/10/24/world/1024-surge-graphic.html?pagewanted=all
Name         Type/Meaning

Location     Lat, Lon

Body Count   Number
http://www.nytimes.com/interactive/world/iraq-war-logs.html?pagewanted=all
14, 12
Killed in Action
Lat Lon
Sentiment of NZ Earthquake




http://twitinfo.csail.mit.edu/detail/4/
Name        Type/Meaning

Happiness   -1 to 1
Pattern Matching
           Great, 7AM meeting

           7:00AM
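The time in that message can be found with a pattern. A sketch, assuming times are written like "7AM" or "3:08PM"; other formats (24-hour clock, "noon") would need more patterns.

```python
import re

# Assumed format: hour 1-12, optional ":minutes", then AM or PM.
TIME = re.compile(
    r"\b(?P<hour>1[0-2]|0?[1-9])(?::(?P<minute>[0-5]\d))?\s?(?P<ampm>AM|PM)\b",
    re.IGNORECASE,
)

def find_time(text):
    """Return the first time in the text as 'H:MMAM' / 'H:MMPM', or None."""
    m = TIME.search(text)
    if m is None:
        return None
    hour = int(m.group("hour"))
    minute = m.group("minute") or "00"  # no minutes given means on the hour
    return f"{hour}:{minute}{m.group('ampm').upper()}"

print(find_time("Great, 7AM meeting"))  # 7:00AM
```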
Interpret Meaning
            Great, 7AM meeting

            Not Happy
Interpret Meaning
  It’s still new
               Great, 7AM meeting

               Happy!
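To see why a computer might answer "Happy!", here is a deliberately naive scorer on the -1 to 1 scale. It is a simple word-list approach with hypothetical word lists, used only to illustrate the problem; it is not the method behind any tool linked in these slides.

```python
import re

# Hypothetical, tiny word lists for illustration only.
POSITIVE = {"great", "good", "happy", "love"}
NEGATIVE = {"bad", "sad", "hate", "awful"}

def sentiment(text):
    """Score from -1 (all negative words) to 1 (all positive words); 0 if neither."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg == 0:
        return 0.0
    return (pos - neg) / (pos + neg)

print(sentiment("Great, 7AM meeting"))  # 1.0
```

The scorer sees "great" and nothing else, so it misses the sarcasm a human reader catches at once.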
Interpret Meaning
               Earthquakes
  It’s still new
  Lack of context
Extracting meaning is
by far the most difficult
What if it’s just unstructured?
CrowdSourcing

Lots of humans do tasks computers suck at

Training

Quality Issues
Dealing with Forms
Entity Information
Pattern Matching
• Regex
  – Describe and find patterns
  – Killed in action


     (?P<n>\d{1,3})(\s[A-Z]{1,3})?\sKIA
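A short sketch of the pattern above in use, written with the `\d` and `\s` escapes it appears to intend. The report fragments are invented for illustration and are not taken from the war logs.

```python
import re

# A number, an optional short uppercase code, then "KIA".
KIA = re.compile(r"(?P<n>\d{1,3})(\s[A-Z]{1,3})?\sKIA")

# Invented report fragments for illustration only.
REPORTS = [
    "PATROL REPORTED 14 KIA AND 12 WIA",
    "3 CIV KIA NEAR CHECKPOINT",
    "NO CASUALTIES",
]

def killed_in_action(text):
    """Return the killed-in-action count as an integer, or None if not found."""
    m = KIA.search(text)
    return int(m.group("n")) if m else None

for report in REPORTS:
    print(report, "->", killed_in_action(report))
```

The named group `n` gives the count a name and the `int()` call gives it a simple type, which is the attribute the map needs.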
DBTruck demo?
Structure = Super Valuable


When       You have unstructured data
 Ask       What structure do I need?
Find       Attributes with simple types


        tinyurl.com/iredatatipsheet

              eugenewu@mit.edu
                   @sirrice

IRE 2012 Unstructured Data Talk

  • 1. Analyzing Unstructured Data for Stories eugenewu@mit.edu
  • 2. What Am I Talking About? • Example • Structured Data 101 • Structured Data Continuum • More Examples
  • 5. 1056. Plaintiffs - Intervenors, Robert and Tasha Lambert are citizens of Alabama and together own real property located at 541 Lynn Hurst Court, Montgomery, Alabama 36117. Plaintiffs are participating as class representatives in the class and subclasses as set forth in the schedules accompanying this complaint which are incorporated herein by reference. 1057. Plaintiff-Intervenor, Brenda Owens, is a citizen of Alabama and owns real property located at 2105 Lane Avenue, Birmingham, Alabama 35217. Plaintiff is participating as a class representative in the class and subclasses as set forth in the schedules accompanying this complaint which are incorporated herein by reference. 1058. Plaintiffs-Intervenors, Daniel and Nicole Smith are citizens of Alabama and together own real property located at 766 Tabernacle Road, Monroeville, Alabama http://www.propublica.org/documents/item/drywall-plaintiffs-omnibus-class-action-complaint
  • 6. 1056. Plaintiffs - Intervenors, Robert and Tasha Lambert are citizens of Alabama and together own real property located at 541 Lynn Hurst Court, Montgomery, Alabama 36117. Plaintiffs are participating as class representatives in the class and subclasses as set forth in the schedules accompanying this complaint which are incorporated herein by reference. 1057. Plaintiff-Intervenor, Brenda Owens, is a citizen of Alabama and owns real property located at 2105 Lane Avenue, Birmingham, Alabama 35217. Plaintiff is participating as a class representative in the class and subclasses as set forth in the schedules accompanying this complaint which are incorporated herein by reference. 1058. Plaintiffs-Intervenors, Daniel and Nicole Smith are citizens of Alabama and together own real property located at 766 Tabernacle Road, Monroeville, Alabama http://www.propublica.org/documents/item/drywall-plaintiffs-omnibus-class-action-complaint
  • 7. 541 Lynn Hurst Court, Montgomery, Alabama 36117 37.0625, -95.677068
  • 8. 541 Lynn Hurst Court, Montgomery, Alabama 36117 37.0625, -95.677068 Jefferson County
  • 12. Scanned Unstructured Documents Information Addresses Structured Data Google Maps Visualization
  • 13. Scanned Unstructured Documents Information Addresses Structured Data Google Maps Visualization
  • 14. Scanned Unstructured Documents Information Addresses Structured Data Google Maps Visualization
  • 15. Scanned Unstructured Documents Information Addresses Structured Data Google Maps Visualization Who cares? What is it?
  • 16. Who Cares? Software Store Databases PANDA Visualization Analyze Fusion tables Excel Databases Mashups R/Python/Ruby
  • 18. Who Cares? Software Tainted House Data + Economic Data + Health Stats Visualization + Crime Stats + Corruption Data Mashups
  • 20. Structured Data Attribute Name Data type Consistent
  • 21. Structured Data Attribute Name Data type Consistent
  • 22. Structured Data Attribute Name Data type Consistent
  • 23. Structured Data Attribute Name Data type Consistent Florida’s Lee County has 1518 addresses
  • 24. Structured Data Attribute Name Data type Consistent
  • 25. Structured Data Attribute Name Data type Consistent
  • 26. Structured Data Attribute Name Data type Consistent Data types: Numeric (integers, dollars,…); Date/Time; Lat, Lon
  • 27. Structured Data Attribute Name Data type Consistent Data types: Numeric (integers, dollars,…); Date/Time; Lat, Lon; Structured strings (Florida)
  • 28. Structured Data Attribute Name Data type Consistent Spellings: FLORIDA, FL, Flroida, FloridaState, Florida’s
  • 29. Structured Data Attribute Name Data type Consistent Spellings and counts: FLORIDA 5, FL 10, Flroida 1, FloridaState 1, Florida’s 1
  • 30. Structured Data Attribute Name Data type Consistent
  • 31. What Am I Talking About? • Structured Data 101 • Structured Data Continuum • More Examples
  • 32. unstructured structured Continuum
  • 33. Images Images unstructured structured http://www.whatisstephenharperreading.ca/2010/03/01/book-number-76-one-day-in-the-life-of-ivan-denisovich-by-alexander-solzhenitsyn/
  • 34. Images Images Text Blob unstructured structured 1056. Plaintiffs - Intervenors, Robert and Tasha Lambert are citizens of Alabama and together own real property located at 541 Lynn Hurst Court, Montgomery, Alabama 36117. Plaintiffs are participating as class representatives in the class and subclasses as set forth in the schedules accompanying this complaint which are incorporated herein by reference. 1057. Plaintiff-Intervenor, Brenda Owens, is a citizen of Alabama and owns real property located at 2105 Lane Avenue, Birmingham, Alabama 35217. Plaintiff is participating as a class representative in the
  • 35. Images Images Text Blob Email unstructured structured
  • 36. Images Images Text Blob Email unstructured structured Subject Re: IRE conference in Boston Date June 1, 3:08PM From jaimi@ire.org
  • 37. Images Images Text Blob Email Excel unstructured structured
  • 38. Images Text Blob Email Excel unstructured structured
  • 39. Images Text Blob Email Excel unstructured structured “It’s sunny in texas”
  • 40. Images Text Blob Email Excel unstructured structured “It’s sunny in texas” Tweet: It’s sunny in texas; Weather: Sunny; Location: Texas
  • 41. Images Text Blob Email Excel unstructured structured “It’s sunny in texas” Tweet: It’s sunny in texas; Weather: Sunny; Location: (37.06, -95.67)
  • 42. When You have unstructured data Ask What structure do I need? Find Attributes with simple types
  • 43. What Am I Talking About? • Structured Data 101 • Structured data continuum • More Examples
  • 44. 2011 State of the Union http://www.boston.com/news/politics/specials/obama_state_of_the_union_word_cloud/
  • 45. Name Type/Meaning Word String
  • 46. Mr. Speaker, Mr. Vice President, members of Congress, distinguished guests, and fellow Americans: Tonight I want to begin by congratulating the men and women of the 112th Congress, as well as your new Speaker, John Boehner. And as we mark this occasion, we're also mindful of the empty chair in this chamber, and we pray for the health of our colleague -- and our friend -- Gabby Giffords. It's no secret that those of us here tonight have had our differences over the last two years. The debates have been contentious; we have fought fiercely for our beliefs. And that's a good thing.
  • 47. Mr. Speaker, Mr. Vice President, members of Congress, distinguished guests, and fellow Americans: Tonight I want to begin by congratulating the men and women of the 112th Congress, as well as your new Speaker, John Boehner. And as we mark this occasion, we're also mindful of the empty chair in this chamber, and we pray for the health of our colleague -- and our friend -- Gabby Giffords. It's no secret that those of us here tonight have had our differences over the last two years. The debates have been contentious; we have fought fiercely for our beliefs. And that's a good thing. Word: Mr, Speaker, Vice, President, Members, Congress, Distinguished, Guests, Americans, People, Jobs, New, years
  • 49. Name Type/Meaning Time Time
  • 50. Deadly Day in Baghdad http://www.nytimes.com/interactive/2010/10/24/world/1024-surge-graphic.html?pagewanted=all
  • 51. Name Type/Meaning Location Lat, Lon Body Count Number
  • 53. 14, 12 Killed in Action Lat Lon http://www.nytimes.com/interactive/world/iraq-war-logs.html?pagewanted=all
  • 54. Sentiment of NZ Earthquake http://twitinfo.csail.mit.edu/detail/4/
  • 55. Name Type/Meaning Happiness -1 to 1
  • 56.
  • 57. Pattern Matching Great, 7AM meeting 7:00AM
  • 58. Interpret Meaning Great, 7AM meeting Not Happy
  • 59. Interpret Meaning Great, 7AM meeting It’s still new Happy!
  • 61. Interpret Meaning Earthquakes It’s still new Lack of context
  • 62. Extracting meaning is by far the most difficult
  • 63. What if it’s just unstructured?
  • 64. CrowdSourcing Lots of humans do tasks computers suck at Training Quality Issues
  • 68. Pattern Matching • Regex – Describe and find patterns – Killed in action (?P<n>\d{1,3})(\s[A-Z]{1,3})?\sKIA
  • 70. Structure = Super Valuable
  • 71. Structure = Super Valuable When You have unstructured data Ask What structure do I need? Find Attributes with simple types
  • 72. Structure = Super Valuable When You have unstructured data Ask What structure do I need? Find Attributes with simple types tinyurl.com/iredatatipsheet eugenewu@mit.edu @sirrice
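
The pattern-matching idea on slide 68 can be sketched in Python. This is only an illustration under assumptions: it assumes the slide's regular expression lost its backslashes (`\d`, `\s`) when the slides were converted to text, the sample report line is invented and is not a real war-log document, and `count_kia` is a hypothetical helper name.

```python
import re

# Slide 68's pattern with the backslashes restored (an assumption):
# a 1-3 digit count, an optional 1-3 letter uppercase tag, then "KIA".
KIA_PATTERN = re.compile(r"(?P<n>\d{1,3})(\s[A-Z]{1,3})?\sKIA")


def count_kia(report: str) -> int:
    """Sum every 'N [TAG] KIA' count found in a report's text."""
    return sum(int(m.group("n")) for m in KIA_PATTERN.finditer(report))


# Invented sample text, for illustration only.
sample = "REPORT: 14 CIV KIA, 12 ISF KIA, 3 WIA. WHERE: BAGHDAD"
print(count_kia(sample))  # 26
```

The named group `n` turns a text blob into a single numeric attribute (a body count), which is the kind of simple type a map or chart can use directly. A pattern like this only works when the documents are consistent; a report that spells out "fourteen killed" would be missed.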

Editor's notes

  1. Hi, I’m Eugene Wu. I was asked to talk about unstructured data, and after some thought, I figured I’ll..
  2. Actually talk about structured data. In particular, I want you to walk away with three things. First, what structured data is and why you should care. Second, how to think about structured data in contrast to unstructured data, specifically that data isn’t just … Finally, a bunch of visualizations and quick stories of how the author went from unstructured to structured data. Let me start with an example before talking about what structure means.
  3. Jeff Larson and Joaquin Sapien of ProPublica and Aaron Kessler of the Sarasota Herald-Tribune did a really nice data journalism piece on the impact of tainted drywall on homeowners. A lot of homes built using drywall from China emitted foul odors and frequently caused mysterious electronics failures and health problems in residents. They produced a really nice visualization of the counties affected by the tainted drywall. Darker blue = more tainted homes. Let’s walk through how they went from unstructured data to this visualization.
  4. They started with court documents from class action lawsuits and tax forms
  5. And extracted the plain text. For example, this is a partial list of plaintiffs. There were about 2000 in this document.
  6. And they manually extracted the state and address information from the text.
  7. They then geocoded the addresses to get latitude and longitude information,
  8. and finally the county that each house belongs in. Doing this process for nearly 7000 addresses
  9. reveals the number of tainted homes in each of the 150 counties. This table is imported into a visualization tool to construct…
  10. The map that is shown on the ProPublica page. That was a fairly large number of steps.
  11. If we take a quick look at their process, we can grossly simplify it down to the following steps: take text from the docs, specifically address information, and plot it on Google Maps.
  12. And stepping back, to bring this back into the context of this talk, they start with unstructured information, extract specific structured data, and visualize it.
  13. What we’ll talk about in this talk is how to go from unstructured information to structured data.
  14. But the first thing to do is to describe…
  15. why the heck we care, and what structured data is.
  16. Who cares? Structured data makes your life easier in a number of ways. There’s lots of software, such as databases and PANDA, to help you store and analyze structured data.
  17. In a similar vein, practically all visualization tools expect your data in some kind of structured format.
  18. It can easily take a long time to extract structured data from your documents. But now that you’ve got structured data about tainted homes in each county, it can be easier to create mashups with other data. In contrast, there are not a lot of tools that work with unstructured data.
  19. The canonical example of structured data is a table like this, that I’m sure you’ve seen either on the web in the wild, or on sites like Google Fusion Tables. What makes structured data structured?
  20. For practical purposes, think of structured
  21. as a bunch of attributes, for example each of the 3 columns. Each attribute has a name and a data type.
  22. Why are names important? Let’s say you want to create that ProPublica map of each county.
  23. If I just stored the data from the table in a text file like on the right, Google Maps has no idea what it’s trying to plot. I can’t point a map at that text.
  24. What I can say is “create a map and use county”. Since the attribute has a name, the map can easily get the county names.
  25. The data type embodies the “meaning” of the attribute. It says “what does this attribute represent?” The more specific you can be, the better.
  26. If the data type is a number, then we can sort it, or take the sum or average. If we know it’s a specific type of number (date/time), then we can use the hour or the month. Lat, lon can be plotted on a map. Non-numeric but still important are structured strings.
  27. Non-numeric but still important are structured strings. These are special because for any given thing like Florida, there’s only one way to spell it.
  28. This is important because something like Florida could be spelled in numerous ways. The computer doesn’t know how to reconcile the differences. If we wanted the total number of tainted walls in Florida, we would end up with
  29. Getting a program to extract Florida in a single unambiguous way is generally pretty hard, but it’s important.
  30. Finally, they should be consistent, in the sense that each row in your table, or each document in your dataset, contains these attributes. Sometimes your structured data may not be in this kind of tabular format, but rather data attached to individual documents.
  31. Hopefully I’ve convinced you that structured data is a good idea. Now I want to describe how structured data relates to unstructured data…
  32. Specifically, that data isn’t simply unstructured or structured. It all lies on a continuum. I want to give you examples that span this spectrum and what data we may want out of them.
  33. The name of the is moving towards the right,
  34. Concretely, let’s say we have a bunch of tweets and we want to understand how the weather reported by the twitterverse differs across geographic areas.
  35. We want to extract two pieces of structured data. Weather is a string containing “sunny”. Location is a string corresponding to the location. Or we could extract an even more specific data type
  36. By using a geocoding app to turn the string “texas” into latitude and longitude coordinates.
  37. I’ve summarized the process into something that helps calm my nerves, which is: when I have …, to ask what structure? Is it dollars? Addresses? That helps target my search for finding…
  38. I figure it would be nice to end with more examples.
  39. http://www.wordle.net/create Last year, the Globe produced a word cloud of Obama’s State of the Union speech.
  40. An attribute that represents a single word in the speech. Perhaps with the punctuation removed
  41. So we would start from the speech text and
  42. Construct this single attribute table
  43. Twitter released this graphic of the number of tweets per second referencing bin Laden when he was killed last year.
  44. In this case tweets already contain the information we want – time.
  45. Per capita availability of boneless, trimmed meat
  46. We need to extract two pieces of info. Similar to the Iraq map, we need location information, but this time shapes of regions rather than single lat/lon coordinates. The nice part of this data is that it is often considered important, and can be found in a consistent location in the documents.
  47. Another example is the Deadly Day in Baghdad visualization produced by Jacob Harris and others at the NYTimes, which depicts the distribution of deaths in Baghdad for a single day. The location of each circle is the lat/lon of where it happened. Size is how many people
  48. This is an example of a wikileaks document the NYTimes had to work with.
  49. KIA = killed in action. In this case, the NYTimes extracted the data by hand, and sometimes this may be the case. But if the documents all looked like this (KIA at the top, WHERE:), it _may_ be possible to use pattern matching to extract this data.
  50. Since much data about our lives is inexorably tied to where we live, we are often concerned with the regions that we live in. This visualization shows the number of households per 1000 in regions throughout MA that have lived there for 3+ generations, as an indicator of commitment to the region.
  51. We need to extract two pieces of info. Similar to the Iraq map, we need location information, but this time shapes of regions rather than single lat/lon coordinates.
  52. In this case, we are starting with what looks like structured data, and further extracting info.
  53. Person’s name. Extracting this type of information is called entity extraction, where an entity may be a business name, famous person, etc. This is typically quite difficult, and requires an existing dataset of “important entities”.
  54. Finally, a popular analysis is to classify the unstructured documents, categorizing by topic or emotion. TwitInfo is a tool by marcua to analyze tweets about particular topics. One of its features is analyzing the sentiment of the tweets. Here are 4 example tweets from last year talking about the Christchurch earthquake. Blue = +, Red = -. The pie chart shows that the tweets are overwhelmingly positive.
  55. The structured data would then be happiness, and its type is a number between -1 and 1. There exist tools for specific types of analysis like sentiment or topic. However
  56. Be really careful with these types of automatic categorization tools
  57. In all of the examples until that last one, what we’ve talked about amounted to pattern matching. This is really good. Tons of tools do a good job.
  58. For example, the extracted sentiment of tweets about the New Zealand earthquake was really positive! This is surprising because earthquakes are generally considered not so good. It happens because the tweets are all wishing the survivors the best, but these extractors don’t understand that.
  59. You can give your pile of documents to a thousand people who will extract the data you want quickly and cheaply. MTurk and CrowdFlower have more of an “anonymous workers” approach, where someone will do your work but you don’t know who. oDesk is more like directly hiring a contractor. In both cases, you’ll need to train the worker and deal with quality issues.
  60. If you have a bunch of the same forms, handwritten or not, Captricity is a new startup that will take your forms, extract the parts you care about and return a nice, structured table containing the data.
  61. If you have a bunch of the same forms, handwritten or not, Captricity is a new startup that will take your forms, extract the parts you care about and return a nice, structured table containing the data.
  62. If you are looking for people or places, OpenCalais is a tool that automatically finds entities. Mario Monti is prime minister of Italy
  63. But I’m going to give you a tip sheet later that also contains this and the other tools.
  64. Just say the text!
  65. Number of users, number of posts per day. Major posts that have been censored
  66. ----- Meeting Notes (6/12/12 00:16) -----put chi chu here instead
  67. Thankfully the journalism and media studies program ----- Meeting Notes (6/12/12 00:39) -----change tweet to post
  68. Shorter. Bo xilai falls from power.
  69. Shorter. Bo xilai falls from power.
  70. Shorter. Bo xilai falls from power.
  71. We extract information such as the ip address of the post, the post contents, the post date, the deletion date, the poster, and other information.
  72. We extract information such as the ip address of the post, the post contents, the post date, the deletion date, the poster, and other information.
  73. The most difficult is completely unstructured data. For example, handwritten letters, where we want the sender and recipient names
  74. Or a scanned typewritten letter, where we want company and date information
  75. Or text files like the ProPublica example, where we want state and address data
  76. A non-text example would be scanned forms. In this case, Federal election contribution reports, where we want the committee name and donation amounts and dates.
  77. Going towards the structured end, there is data that smells unstructured, but actually contains some structured data. For example, a tweet I wrote about trends in the database community contains more than just the text.
  78. In addition to the tweet text, which is unstructured, the Twitter API provides structured information: the timestamp of when the tweet was posted, my username, the number of retweets, etc. These are all valuable to analyze without needing to process the actual text.
  79. Similarly, emails contain structured data in the form of….
  80. Subject, date, sender and tons more information. Later, Sudheendra will describe his email analysis tool that extracts specific pieces of structured data and visualizes them.
  81. Working directly with unstructured data is really, really hard. Oftentimes this requires the manual work of analyzing documents one by one.
  82. convince you that you can do a lot without messing too much with actual unstructured data.
  83. Hello, my name is Eugene Wu. I’m actually a student right across the river at MIT. I study databases. It’s not part of my PhD, but what I’m interested in is how reporters are dealing with and analyzing their data.
  84. When I was asked to talk about analyzing unstructured data for stories, I had a hard time coming up with a talk. This is a fairly open-ended topic, and I could talk about data scraping, visualization, extraction. The reason why there are so many techniques is that dealing with unstructured data is very difficult and computers are terrible at it.
  85. I also didn’t want to talk about a single tool because tools are often used for specific types of data/analyses. I was looking for something that is useful for a general audience. Then I thought, hey, I’m a database student, and we work with tables all the time!
  86. The best ones are numerical data types. Computers are really, really good at processing numerical values. They can easily show you the sum, or average, or look for trends. In fact, pretty much every visualization tool and analysis program will expect numerical data.
  87. If you can specify the type of numeric, even better. For example, if it’s lat, lon, then you can plot it on a map.
  88. Next are structured strings. These are words where the meaning is different if the values are different. That is, there’s one way to say Florida: capitalized Florida. This is important when you want to ask “what’s the total number of addresses in Florida?”
  89. Finally is random text. This is very akin to saying “this attribute is unstructured text”. Computers are horrible with this type of data because it’s so ambiguous. ----- Meeting Notes (6/12/12 18:11) ----- know is a number, we know we can sort them, lat lon we can put it on a map. stop.
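
The structured-strings point in notes 28 and 88 can be illustrated with a small Python sketch. The raw spellings and counts are the ones shown on slide 29; the lookup table `CANONICAL` and the helper names `normalize` and `total_by_state` are hypothetical and built by hand for this example. Real data would likely need a much larger table or fuzzy matching.

```python
from collections import Counter

# Raw spellings and counts as shown on slide 29.
raw_counts = {
    "FLORIDA": 5,
    "FL": 10,
    "Flroida": 1,
    "FloridaState": 1,
    "Florida's": 1,
}

# Hypothetical hand-built lookup: lowercase variant -> one canonical spelling.
CANONICAL = {
    "florida": "Florida",
    "fl": "Florida",
    "flroida": "Florida",
    "floridastate": "Florida",
    "florida's": "Florida",
}


def normalize(value: str) -> str:
    """Map a raw spelling to its canonical form; unknown values pass through."""
    cleaned = value.strip().replace("\u2019", "'")  # curly apostrophe -> straight
    return CANONICAL.get(cleaned.lower(), cleaned)


def total_by_state(counts: dict) -> Counter:
    """Add up counts after collapsing spelling variants."""
    totals = Counter()
    for raw, n in counts.items():
        totals[normalize(raw)] += n
    return totals


print(total_by_state(raw_counts)["Florida"])  # 18
```

Without the normalization step, a "total for Florida" query would see five different values and report five separate totals, which is the problem the slide is pointing at.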