5. 1056. Plaintiffs - Intervenors, Robert and Tasha Lambert
are citizens of Alabama and together own real property
located at 541 Lynn Hurst Court, Montgomery, Alabama
36117. Plaintiffs are participating as class representatives
in the class and subclasses as set forth in the schedules
accompanying this complaint which are incorporated
herein by reference. 1057. Plaintiff-Intervenor, Brenda
Owens, is a citizen of Alabama and owns real property
located at 2105 Lane Avenue, Birmingham, Alabama
35217. Plaintiff is participating as a class representative in
the class and subclasses as set forth in the schedules
accompanying this complaint which are incorporated
herein by reference. 1058. Plaintiffs-Intervenors, Daniel
and Nicole Smith are citizens of Alabama and together
own real property located at 766 Tabernacle Road,
Monroeville, Alabama
http://www.propublica.org/documents/item/drywall-plaintiffs-omnibus-class-action-complaint
6. 1056. Plaintiffs - Intervenors, Robert and Tasha Lambert
are citizens of Alabama and together own real property
located at 541 Lynn Hurst Court, Montgomery, Alabama
36117. Plaintiffs are participating as class representatives
in the class and subclasses as set forth in the schedules
accompanying this complaint which are incorporated
herein by reference. 1057. Plaintiff-Intervenor, Brenda
Owens, is a citizen of Alabama and owns real property
located at 2105 Lane Avenue, Birmingham, Alabama
35217. Plaintiff is participating as a class representative in
the class and subclasses as set forth in the schedules
accompanying this complaint which are incorporated
herein by reference. 1058. Plaintiffs-Intervenors, Daniel
and Nicole Smith are citizens of Alabama and together
own real property located at 766 Tabernacle
Road, Monroeville, Alabama
http://www.propublica.org/documents/item/drywall-plaintiffs-omnibus-class-action-complaint
34. Images
Images Text Blob
unstructured structured
1056. Plaintiffs - Intervenors, Robert and Tasha Lambert
are citizens of Alabama and together own real property
located at 541 Lynn Hurst Court, Montgomery, Alabama
36117. Plaintiffs are participating as class representatives
in the class and subclasses as set forth in the schedules
accompanying this complaint which are incorporated herein
by reference. 1057. Plaintiff-Intervenor, Brenda Owens, is a
citizen of Alabama and owns real property located at 2105
Lane Avenue, Birmingham, Alabama 35217. Plaintiff is
participating as a class representative in the
35. Images
Images Text Blob Email
unstructured structured
36. Images
Images Text Blob Email
unstructured structured
Subject Re: IRE conference in Boston
Date June 1, 3:08PM
From jaimi@ire.org
37. Images
Images Text Blob Email Excel
unstructured structured
38. Images Text Blob Email Excel
unstructured structured
39. Images Text Blob Email Excel
unstructured structured
“It’s sunny
in texas”
40. Images Text Blob Email Excel
unstructured structured
“It’s sunny in texas”
Tweet                Weather  Location
It’s sunny in texas  Sunny    Texas
41. Images Text Blob Email Excel
unstructured structured
“It’s sunny in texas”
Tweet                Weather  Location
It’s sunny in texas  Sunny    (37.06, -95.67)
42. When You have unstructured data
Ask What structure do I need?
Find Attributes with simple types
43. What Am I Talking About?
• Structured Data 101
• Structured data continuum
• More Examples
44. 2011 State of the Union
http://www.boston.com/news/politics/specials/obama_state_of_the_union_word_cloud/
46. Mr. Speaker, Mr. Vice President,
members of Congress,
distinguished guests, and fellow
Americans:
Tonight I want to begin by
congratulating the men and
women of the 112th Congress, as
well as your new Speaker, John
Boehner. And as we mark this
occasion, we're also mindful of
the empty chair in this chamber,
and we pray for the health of our
colleague -- and our friend --
Gabby Giffords.
It's no secret that those of us here
tonight have had our differences
over the last two years. The
debates have been contentious;
we have fought fiercely for our
beliefs. And that's a good thing.
47. Mr. Speaker, Mr. Vice President,
members of Congress,
distinguished guests, and fellow
Americans:
Tonight I want to begin by
congratulating the men and
women of the 112th Congress, as
well as your new Speaker, John
Boehner. And as we mark this
occasion, we're also mindful of
the empty chair in this chamber,
and we pray for the health of our
colleague -- and our friend --
Gabby Giffords.
It's no secret that those of us here
tonight have had our differences
over the last two years. The
debates have been contentious;
we have fought fiercely for our
beliefs. And that's a good thing.
Word
Mr
Speaker
Vice
President
Members
Congress
Distinguished
Guests
Americans
People
Jobs
New
years
71. Structure = Super Valuable
When You have unstructured data
Ask What structure do I need?
Find Attributes with simple types
72. Structure = Super Valuable
When You have unstructured data
Ask What structure do I need?
Find Attributes with simple types
tinyurl.com/iredatatipsheet
eugenewu@mit.edu
@sirrice
Editor's notes
Hi, I’m Eugene Wu. I was asked to talk about unstructured data, and after some thought, I figured I’ll...
Actually talk about structured data. In particular, I want you to walk away with three things. First, what structured data is and why you should care. Second, how to think about structured data in contrast to unstructured data, specifically that data isn’t just … And finally, a bunch of visualizations and quick stories of how the author went from unstructured to structured data. Let me start with an example before talking about what structure means.
Jeff Larson and Joaquin Sapien of ProPublica and Aaron Kessler of the Sarasota Herald-Tribune did a really nice data journalism piece on the impact of tainted drywall on homeowners. A lot of homes built using drywall from China emitted foul odors and frequently caused mysterious electronics failures and health problems in residents. They produced a really nice visualization of the counties affected by the tainted drywall. Darker blue = more tainted homes. Let’s walk through how they went from unstructured data to this visualization.
They started with court documents from class action lawsuits and tax forms
And extracted the plain text. For example, this is a partial list of plaintiffs. There were about 2000 in this document.
And they manually extracted the state and address information from the text.
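The reporters did this step by hand. As a rough sketch of how pattern matching might help, assuming every plaintiff paragraph uses the same "located at <street>, <city>, <state> <ZIP>" wording seen on the slide, something like the following could pull out the address attributes. The sample text is abridged from the slide; the names ADDRESS_PATTERN and extract_addresses are made up for this illustration.

```python
import re

# Abridged from the complaint text shown on the slide.
COMPLAINT = (
    "1056. Plaintiffs - Intervenors, Robert and Tasha Lambert are citizens of "
    "Alabama and together own real property located at 541 Lynn Hurst Court, "
    "Montgomery, Alabama 36117. 1057. Plaintiff-Intervenor, Brenda Owens, is a "
    "citizen of Alabama and owns real property located at 2105 Lane Avenue, "
    "Birmingham, Alabama 35217."
)

# Assumption: addresses always follow "located at" and end with a 5-digit ZIP.
ADDRESS_PATTERN = re.compile(
    r"located at (?P<street>[^,]+), (?P<city>[^,]+), "
    r"(?P<state>[A-Za-z ]+?) (?P<zip>\d{5})"
)

def extract_addresses(text):
    """Return one dict of named attributes per address found in the text."""
    return [match.groupdict() for match in ADDRESS_PATTERN.finditer(text)]

addresses = extract_addresses(COMPLAINT)
for row in addresses:
    print(row["street"], "|", row["city"], "|", row["state"], "|", row["zip"])
```

Paragraphs that do not follow this wording (or have no ZIP code) are silently skipped, so a manual check of the leftovers would still be needed.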
They then geocoded the addresses to get latitude and longitude information,
and finally the county that each house belongs in. Doing this process for nearly 7000 addresses
reveals the number of tainted homes in each of the 150 counties. This table is imported into a visualization tool to construct…
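Once every address has a county attribute, the per-county table is just a count. A minimal sketch, using placeholder rows invented for illustration (they are not the real ProPublica data) and a hypothetical helper named homes_per_county:

```python
from collections import Counter

# Placeholder rows, invented for illustration only.
homes = [
    {"address": "placeholder address 1", "state": "Florida", "county": "Lee"},
    {"address": "placeholder address 2", "state": "Florida", "county": "Lee"},
    {"address": "placeholder address 3", "state": "Florida", "county": "Sarasota"},
    {"address": "placeholder address 4", "state": "Alabama", "county": "Montgomery"},
]

def homes_per_county(rows):
    """Count rows per (state, county) pair; county names repeat across states."""
    return Counter((row["state"], row["county"]) for row in rows)

counts = homes_per_county(homes)
for (state, county), n in counts.most_common():
    print(f"{county}, {state}: {n}")
```

Counting on the (state, county) pair rather than the county name alone avoids merging same-named counties in different states.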
The map that is shown on the ProPublica page. That was a fairly large number of steps.
If we take a quick look at their process, we can grossly simplify it down to the following steps: take text from the docs, specifically address information, and plot it on Google Maps.
And stepping back, to bring this back into the context of this talk: they start with unstructured information, extract specific structured data, and visualize it.
What we’ll talk about in this talk is how to go from unstructured information to structured data.
But the first thing to do is to describe…
why the heck we care, and what structured data is.
Who cares? Structured data makes your life easier in a number of ways. There’s lots of software, such as databases and pandas, to help you store and analyze structured data.
In a similar vein, practically all visualization tools expect your data in some kind of structured format.
It can easily take a long time to extract structured data from your documents. But now that you’ve got structured data about tainted homes in each county, it can be easier to create mashups with other data. In contrast, there are not a lot of tools that work with unstructured data.
The canonical example of structured data is a table like this, which I’m sure you’ve seen either in the wild on the web or on sites like Google Fusion Tables. What makes structured data... structured?
For practical purposes, think of structured data as a bunch of attributes, for example each of the 3 columns. Each attribute has a name and a data type.
Why are names important? Let’s say you want to create that ProPublica map of each county.
If I just stored the data in the table in a text file like on the right, Google Maps has no idea what it’s trying to plot. I can’t point a map at that text.
What I can say is “create a map and use county”. Since the attribute has a name the map can easily get the county names
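To make the point about names concrete, here is a small sketch using Python's csv module. The column names and the numbers are placeholders chosen for the example, not values from the real table.

```python
import csv
import io

# A tiny table with a header row; the values are placeholders.
TABLE_TEXT = "county,state,tainted_homes\nLee,Florida,2\nSarasota,Florida,1\n"

# DictReader uses the header row as attribute names,
# so each row can be addressed by name instead of by position.
rows = list(csv.DictReader(io.StringIO(TABLE_TEXT)))
counties = [row["county"] for row in rows]
print(counties)
```

Without the header row, the program would only see anonymous columns and would have to be told which position holds the county.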
The data type embodies the “meaning” of the attribute. It says “what does this attribute represent?” The more specific you can be, the better.
If the data type is a number then we can sort it, or take the sum or average. If we know it’s a type of number (date/time) then we can use the hour or month data. Lat, lon can be plotted on a map.
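A short sketch of why the type matters, using made-up values. The timestamp below is an assumption for illustration (the year is invented).

```python
from datetime import datetime
from statistics import mean

# The same values stored as text and as numbers (made-up values).
as_text = ["12", "7", "30"]
as_numbers = [int(value) for value in as_text]

text_order = sorted(as_text)        # compares character by character
number_order = sorted(as_numbers)   # compares numeric magnitude
total = sum(as_numbers)
average = mean(as_numbers)

# A date/time type exposes its parts directly.
stamp = datetime.strptime("2012-06-01 15:08", "%Y-%m-%d %H:%M")

print(text_order, number_order, total, average)
print(stamp.hour, stamp.month)
```

Sorting the text version puts "30" before "7", which is exactly the kind of surprise a proper numeric type avoids.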
Non-numeric but still important are structured strings. These are special because for any given thing like florida, there’s only one way to spell it.
This is important because something like florida could be spelled in numerous ways. The computer doesn’t know how to reconcile the differences. If we wanted the total number of tainted walls in florida, we would end up with
Getting a program to extract florida in a single unambiguous way is generally pretty hard, but it’s important.
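One simple (and admittedly incomplete) way to get a single spelling is a hand-built alias table. The sketch below assumes a tiny ALIASES dictionary and made-up counts; real data would need many more variants.

```python
from collections import Counter

# Hypothetical alias table; a real one would be much longer.
ALIASES = {
    "fl": "Florida", "fla": "Florida", "florida": "Florida",
    "al": "Alabama", "ala": "Alabama", "alabama": "Alabama",
}

def canonical_state(raw):
    """Map a raw spelling to one canonical name; unknown values pass through."""
    key = raw.strip().lower().rstrip(".")
    return ALIASES.get(key, raw.strip())

# Made-up (state spelling, number of homes) pairs.
records = [("Florida", 2), ("FL", 1), ("Fla.", 3), ("Alabama", 4)]

naive = Counter()
cleaned = Counter()
for state, homes in records:
    naive[state] += homes                      # three separate "Floridas"
    cleaned[canonical_state(state)] += homes   # one Florida

print(dict(naive))
print(dict(cleaned))
```

The naive totals split Florida into three rows, while the cleaned totals give one answer per state.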
Finally, they should be consistent, in the sense that each row in your table, or each document in your dataset, contains these attributes. Sometimes your structured data may not be in this kind of tabular format, but rather data attached to individual documents.
Hopefully I’ve convinced you that structured data is a good idea. Now I want to describe how structured data relates to unstructured data…
Specifically, that data isn’t unstructured or structured. It all lies on a continuum. I want to give you examples that span this spectrum and what data we may want out of them.
The name of the game is moving towards the right,
Concretely, let’s say we have a bunch of tweets and we want to understand how the weather reported by the twitterverse differs across geographic areas.
We want to extract two pieces of structured data. Weather is a string containing “sunny”. Location is a string corresponding to the location. Or we could extract an even more specific data type
by using a geocoding app to turn the string “texas” into latitude/longitude coordinates.
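The tweet example can be sketched in a few lines. This is only an illustration: WEATHER_WORDS is a made-up vocabulary, and GAZETTEER is a hypothetical stand-in for a real geocoding service, holding just the coordinates shown on the slide.

```python
import re

# Made-up vocabulary of weather terms.
WEATHER_WORDS = {"sunny", "rainy", "cloudy", "snowy"}

# Stand-in for a geocoding service; the coordinates are the ones on the slide.
GAZETTEER = {"texas": (37.06, -95.67)}

def extract(tweet):
    """Turn one tweet into a row with weather, location and lat/lon attributes."""
    words = re.findall(r"[a-z]+", tweet.lower())
    weather = next((w for w in words if w in WEATHER_WORDS), None)
    place = next((w for w in words if w in GAZETTEER), None)
    return {
        "tweet": tweet,
        "weather": weather,
        "location": place,
        "latlon": GAZETTEER.get(place),
    }

row = extract("It’s sunny in texas")
print(row)
```

Tweets that mention no known weather word or place simply get None for that attribute.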
I’ve summarized the process into something that helps calm my nerves: when I have … I ask, what structure do I need? Is it dollars? Addresses? That helps target my search for finding…
I figure it would be nice to end with more examples.
http://www.wordle.net/create
Last year, the Globe produced a word cloud of Obama’s State of the Union speech.
An attribute that represents a single word in the speech. Perhaps with the punctuation removed
So we would start from the speech text and
Construct this single attribute table
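A minimal sketch of building that single-attribute table from the opening line of the speech. The function name words is arbitrary, and treating anything that is not a letter or apostrophe as punctuation is a simplifying assumption.

```python
import re
from collections import Counter

SPEECH = (
    "Mr. Speaker, Mr. Vice President, members of Congress, "
    "distinguished guests, and fellow Americans:"
)

def words(text):
    """Split text into words, dropping punctuation."""
    return re.findall(r"[A-Za-z']+", text)

word_table = words(SPEECH)                              # one row per word
frequencies = Counter(w.lower() for w in word_table)    # what a word cloud sizes by

print(word_table)
print(frequencies.most_common(3))
```

A word cloud tool would typically also drop very common words such as "of" and "and" before sizing the rest.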
Twitter released this graphic of the number of tweets per second referencing Bin Laden when he was killed last year.
In this case tweets already contain the information we want – time.
Per capita availability of boneless, trimmed meat
We need to extract two pieces of info. Similar to the Iraq map, we need location information, but this time shapes of regions rather than single lat/lon coordinates. The nice part of this data is that it is often considered important, and can be found in a consistent location in the documents.
Another example is the Deadly Day in Baghdad visualization produced by Jacob Harris and others at the NYTimes, which depicts the distribution of deaths in Baghdad for a single day. The location of each circle is the lat/lon of where it happened. Size is how many people.
This is an example of a wikileaks document the NYTimes had to work with.
KIA = killed in action. In this case, the NYTimes extracted the data by hand. And sometimes this may be the case. But if the documents all looked like this (KIA at the top, WHERE:), it _may_ be possible to use pattern matching to extract this data.
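As a sketch of what such pattern matching could look like: the report layout below is invented for illustration (the real documents are not reproduced here), and the numbers are placeholders. The function name parse_report is also hypothetical.

```python
import re

# Invented layout and placeholder values, for illustration only.
REPORT = """\
KIA: 3
WHERE: 33.31, 44.36
Free-text description of the incident follows here.
"""

KIA_PATTERN = re.compile(r"^KIA:\s*(\d+)", re.MULTILINE)
WHERE_PATTERN = re.compile(r"^WHERE:\s*(-?\d+\.\d+),\s*(-?\d+\.\d+)", re.MULTILINE)

def parse_report(text):
    """Return {'kia', 'lat', 'lon'} or None if the expected fields are missing."""
    kia = KIA_PATTERN.search(text)
    where = WHERE_PATTERN.search(text)
    if not (kia and where):
        return None  # layout differs; fall back to reading it by hand
    return {
        "kia": int(kia.group(1)),
        "lat": float(where.group(1)),
        "lon": float(where.group(2)),
    }

parsed = parse_report(REPORT)
print(parsed)
```

Returning None for documents that do not match keeps the odd ones visible, so they can still be handled manually.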
Since much data about our lives is inextricably tied to where we live, we are often concerned with the regions that we live in. This visualization shows the number of households per 1,000, in regions throughout MA, that have lived there for 3+ generations, as an indicator of commitment to the region.
We need to extract two pieces of info. Similar to the Iraq map, we need location information, but this time shapes of regions rather than single lat/lon coordinates.
In this case, we are starting with what looks like structured data, and further extracting info.
Person’s name. Extracting this type of information is called entity extraction, where an entity may be a business name, famous person, etc. This is typically quite difficult, and requires an existing dataset of “important entities”.
Finally, a popular analysis is to classify the unstructured documents, categorizing by topic or emotion. Twitinfo is a tool by marcua to analyze tweets about particular topics. One of its features is analyzing the sentiment of the tweets. Here are 4 example tweets from last year talking about the Christchurch earthquake. Blue = positive, red = negative. The pie chart shows that the tweets are overwhelmingly positive.
The structured data would then be happiness, and its type is a number between -1 and 1. There exist tools for specific types of analysis like sentiment or topic. However,
Be really careful with these types of automatic categorization tools
In all of the examples until that last one, what we’ve talked about amounted to pattern matching. This is really good: there are tons of tools that do a good job.
For example, the extracted sentiment of tweets about the New Zealand earthquake was really positive! This is surprising, because earthquakes are generally considered not so good. It happens because the tweets are all wishing the survivors the best, but these extractors don’t understand that.
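A toy word-counting scorer shows how this can happen. This is not how Twitinfo works; it is a deliberately naive sketch with made-up word lists and a made-up tweet, producing a score between -1 and 1.

```python
import re

# Made-up word lists for illustration.
POSITIVE = {"best", "wishes", "love", "hope", "safe"}
NEGATIVE = {"terrible", "awful", "sad", "dead"}

def sentiment(text):
    """Score in [-1, 1]: (positive hits - negative hits) / total hits."""
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

# Made-up tweet in the spirit of the earthquake examples.
tweet = "Best wishes and love to everyone in Christchurch, stay safe"
score = sentiment(tweet)
print(score)
```

The scorer sees only kind words and reports a happy tweet, even though the event being discussed is a disaster.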
You can give your pile of documents to a thousand people who will extract the data you want quickly and cheaply. MTurk and CrowdFlower have more of an “anonymous workers” approach, where someone will do your work but you don’t know who. oDesk is more like directly hiring a contractor. In both cases, you’ll need to train the worker and deal with quality issues.
If you have a bunch of the same forms, handwritten or not, Captricity is a new startup that will take your forms, extract the parts you care about and return a nice, structured table containing the data.
If you are looking for people or places, Open Calais is a tool that automatically finds entities. Example text: Mario Monti is prime minister of Italy.
But I’m going to give you a tip sheet later that also contains this and the other tools.
Just say the text!
Number of users, number of posts per day. Major posts that have been censored
----- Meeting Notes (6/12/12 00:16) -----
put chi chu here instead
Thankfully the journalism and media studies program
----- Meeting Notes (6/12/12 00:39) -----
change tweet to post
Shorter. Bo Xilai falls from power.
We extract information such as the ip address of the post, the post contents, the post date, the deletion date, the poster, and other information.
The most difficult is completely unstructured data. For example, handwritten letters, where we want the sender and recipient names.
Or a scanned typewritten letter, where we want company and date information.
Or text files like the ProPublica example, where we want state and address data.
A non-text example would be scanned forms. In this case, federal election contribution reports, where we want the committee name and donation amounts and dates.
Going towards the structured end, there is data that smells unstructured, but actually contains some structured data.For example, a tweet I wrote about trends in the database community contains more than just the text
In addition to the tweet text, which is unstructured, the Twitter API provides structured information: the timestamp of when the tweet was posted, my username, number of retweets, etc. These are all valuable to analyze without needing to process the actual text.
Similarly, emails contain structured data in the form of….
Subject, date, sender and tons more information. Later, Sudheendra will describe his email analysis tool that extracts specific pieces of structured data and visualizes it.
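As a small sketch of how those attributes can be read without touching the message body, Python's standard email package parses the headers directly. The subject and sender below come from the slide; the year and time zone in the Date header and the body text are assumptions added so the example parses.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Subject and From are from the slide; year, time zone and body are assumed.
RAW_EMAIL = (
    "Subject: Re: IRE conference in Boston\n"
    "Date: Fri, 1 Jun 2012 15:08:00 -0400\n"
    "From: jaimi@ire.org\n"
    "\n"
    "(unstructured message text goes here)\n"
)

message = message_from_string(RAW_EMAIL)

subject = message["Subject"]                      # structured string
sender = message["From"]                          # structured string
sent_at = parsedate_to_datetime(message["Date"])  # proper date/time type

print(subject, "|", sender, "|", sent_at.hour, sent_at.month)
```

The body stays unstructured text, but the headers already give named attributes with simple types.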
Working directly with unstructured data is really, really hard. Oftentimes this requires the manual work of analyzing documents one by one.
convince you that you can do a lot without messing too much with actual unstructured data.
Hello, my name is Eugene Wu. I’m actually a student right across the river at MIT. I study databases. Not part of my PhD, but what I’m interested in is how reporters are dealing with and analyzing your data.
When I was asked to talk about analyzing unstructured data for stories, I had a hard time coming up with a talk. This is a fairly open-ended topic, and I could talk about data scraping, visualization, extraction. The reason why there are so many techniques is that dealing with unstructured data is very difficult and computers are terrible at it.
I also didn’t want to talk about a single tool, because tools are often used for specific types of data and analyses. I was looking for something that is useful for a general audience. Then I thought, hey, I’m a database student, and we work with tables all the time!
The best ones are numerical data types. Computers are really, really good at processing numerical values. They can easily show you the sum, or average, or look for trends. In fact, pretty much every visualization tool and analysis program will expect numerical data.
If you can specify the type of numeric, then better. For example, lat lon then you can plot it on a map
Next are structured strings. These are words where the meaning is different if the values are different. That is, there’s one way to say Florida: capitalized “Florida”. This is important when you want to ask “what’s the total number of addresses in Florida?”
Finally, there is random text. This is very akin to saying “this attribute is unstructured text”. Computers are horrible with this type of data because it’s so ambiguous.
----- Meeting Notes (6/12/12 18:11) -----
know is a number, we know we can sort them, lat lon we can put it on a map. stop.