1. Big Data = Bigger Meta
O’Reilly Strata Conference
February 29 2012
2. Pivot/Skate, etc…
Founded 2003
Poor man’s GIS
Panamap
Refounded 2006
Neighborhood boundaries
Mass transit data
Refocused 2009
SaaS for mapping + on-demand data
3. Achtung!
NoSQL is no panacea
Big Data isn’t about data
Big Data isn’t new
Big Data doesn’t present a Boolean quandary
With power comes responsibility
AWS bills
Lady Gaga tweets
Innumeracy (correlation v causation)
4. Big v Important
Big | Important
Heterogeneous | Well-defined schema
Raw | High value (not free)
Distributed | Test-driven
Streaming/real time | Relational
Search for meaning | Historical
Time-sensitive | Enterprise-focused
Philosophical
5. Data Exhaust
Analytics Probes
Social Media Gov 2.0
Some background on Urban Mapping. It wasn’t a straightforward path, but it’s very relevant. We started close to 10 years ago with a printed map that reveals different layers of thematic imagery (streets, subways, neighborhoods) depending on the viewing angle. We all know what happened to print, so I shifted the business to a new medium: in 2006 or so we collected much of the same data, but now using a spatial database as opposed to regular old vector files in Adobe Illustrator. The writing was on the wall for licensing content to local web publishers, so we shifted again, this time moving upstream: we continue to develop our own data, but have greatly expanded that effort to include commercial data, and we deliver it through our own mapping service. We do this for customers in various market segments, like Tableau Software, where we perform a few geo-services such as hosting the base map and overlaying data.
I can be a bit of a curmudgeon, and I hope a cautionary point of view has a place. Let’s talk about what Big Data is not; I’ll talk later about what it is. First, Big Data isn’t really about data at all. But I am. It’s about tools and processes to manage and exploit info-nuggets. There’s nothing revolutionary about saying this, but I wanted to make it explicit. Second, Big Data isn’t especially new. Wall St and Walmart have been processing data and deriving value from it for decades, but they don’t talk about it. Why? Because they make money doing so and don’t need to alert the competition. Anybody hear of Teradata? Whenever companies want to talk about what they are doing, it’s usually a red flag for me, meaning the technology, industry or something else hasn’t sufficiently evolved. But I’m also not saying Big Data is a rehash of enterprise software. More on that later… Finally, Big Data has democratized access to powerful tools at little cost. This doesn’t necessarily mean everybody knows how to use these tools. There can be some blowback, such as high credit card bills, analysis without direction or objective, and a lack of knowledge about basic statistics.
There’s been exponential growth in data, and it comes from any number of places. Some are shown here: mobile devices as probes, with vast capabilities to record all kinds of environmental variables; open government; social media; and a desire for analytics, which has been rebranded as business intelligence.
Processing and storage costs drop like rocks. Enterprise software has been offering big solutions to banking and others for decades, but with incredibly low barriers to entry virtually anybody can now participate.
Callimachus (Kal-i-um-akus) was a noted poet at the Library of Alexandria in the 3rd century BC.
He created the Pinakes (pin-a-keez), or Lists, a way of organizing works in the library. He embarked on the effort to organize 120k scrolls by title, author, birthplace, father, education, summary of contents and other info. This was the first effort to systematically create a bibliographic system, and it is a direct link to metadata two millennia later.
In 1595, Johan van der Does published the Nomenclator, the first instance of a printed catalog of library holdings. It represented a significant advancement over Callimachus’s lists, but it took close to two millennia to get here.
The modern cataloging system: the Dewey Decimal System, created in 1876. Its father was Melvil Dewey. The Dewey Decimal System attempted to organize all knowledge into ten main classes. Each class is further subdivided into ten divisions, and each division into ten sections, giving ten main classes, 100 divisions and 1000 sections. The notation allows for infinite hierarchy and is both numerical and faceted (linking content from different areas). Other systems followed: Universal Decimal Classification, Library of Congress, etc…
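The ten/100/1000 arithmetic can be sketched in a few lines. This is only an illustration of the three-digit hierarchy described above; the function name `dewey_levels` is made up for this example, and it ignores the decimal extensions that follow the three digits.

```python
def dewey_levels(number: int) -> dict:
    """Split a three-digit Dewey number into main class, division and section."""
    if not 0 <= number <= 999:
        raise ValueError("expected a whole number from 0 to 999")
    return {
        "main_class": f"{number // 100 * 100:03d}",  # hundreds digit only
        "division": f"{number // 10 * 10:03d}",      # hundreds and tens digits
        "section": f"{number:03d}",                  # all three digits
    }


# Counting the distinct values at each level reproduces 10 / 100 / 1000.
classes = {dewey_levels(n)["main_class"] for n in range(1000)}
divisions = {dewey_levels(n)["division"] for n in range(1000)}
sections = {dewey_levels(n)["section"] for n in range(1000)}

print(len(classes), len(divisions), len(sections))  # 10 100 1000
print(dewey_levels(516))
```

Each level is just the number with trailing digits zeroed out, which is why the hierarchy can be read straight off the call number.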
This photo is from the Card Division at the Library of Congress in the 1920s. The amount of physical metadata is astounding: millions of library cards with metadata.
The next major advancement was in the late 1960s. Early attempts at electronic indexing focused on a taxonomy of keywords and related information. This was efficient for reporting on what the system contained, but it also kept the long-running divorce between artifact and metadata. The Online Computer Library Center (OCLC) was created as a nonprofit to further access to library resources across institutions and decrease costs. The OCLC acquired the Dewey Decimal System and, as any standards body does, sought to perpetuate its existence over the decades. Then the internet happened.
That meant out with the old, in with the new. This photo is library cards going into storage. Not sure why they’d even be archived after the transition to databases was made, but that’s for another time.
So this is the situation. Beginning in the late 60s, electronically stored metadata began to grow. The library cards (at left) went away, but the bifurcation was complete: total separation of the thing from the description of the thing. And it sort of made sense. IT was in its infancy, so storage and processing costs were high. Publishers also exerted a great deal of control over how they permitted libraries to index works and make them available.
To put the last 2000 years in perspective: Callimachus created the first crude schema, leaving a place for metadata to be stored. The Nomenclator gave us the first bibliographic catalog, printed and bound, produced annually. The Dewey Decimal System was born in 1876 and was the basis of an extensive metadata system for published works. Then… the internet happened. In the top right you see the corner of a cloud. That’s my way of representing what happens next. The volume of data produced grows exponentially, overtaking 2000-plus years of history in no time.
So how about the bifurcation/divorce I mentioned? The web brought the artifact and metadata together again.
Google Books. Sure, we have the Dewey Decimal type stuff along with ISBN, retail price, etc… but we also threw in the whole damn book: full text search. Amazon does it too.
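To make the reunion of artifact and metadata concrete, here is a rough sketch of a record that carries both, with a search that looks at either. Everything in it is hypothetical: the `Book` class, the `search` function and the sample records are invented for illustration and say nothing about how Google Books or Amazon actually work.

```python
from dataclasses import dataclass


@dataclass
class Book:
    title: str      # metadata
    author: str     # metadata
    isbn: str       # metadata (placeholder strings below, not real ISBNs)
    full_text: str  # the artifact itself


def search(books, term):
    """Return (title, where) pairs for books matching the term."""
    term = term.lower()
    hits = []
    for book in books:
        in_metadata = term in book.title.lower() or term in book.author.lower()
        in_text = term in book.full_text.lower()
        if in_metadata or in_text:
            hits.append((book.title, "metadata" if in_metadata else "full text"))
    return hits


# Hypothetical sample records.
books = [
    Book("A History of Catalogs", "A. Librarian", "ISBN-PLACEHOLDER-1",
         "The scrolls of Alexandria were listed by author and title."),
    Book("Maps and Layers", "B. Cartographer", "ISBN-PLACEHOLDER-2",
         "Neighborhood boundaries rarely match a catalog of postal codes."),
]

print(search(books, "catalog"))
print(search(books, "scrolls"))
```

A card catalog could only ever answer the first kind of match; the second kind needs the book itself sitting next to its description.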
In my industry, the state of metadata is horrendous. We’re stuck in the green screen days. Proprietary data formats and slow-moving vendors don’t help. While I’m the first person to admit GIS needs to get off its ass and change, radically, there’s also something the real-time streaming web can learn from us.
We hear about the rise of the curator: part social scientist, part librarian, part RDBMS wiz and statistician. This is increasingly important across all industries. When dealing with a torrent of data, domain experts will be required to help make sense of it.
The Knowledge Hierarchy, as it is sometimes known, has been used to represent relationships between the stuff that turns into something meaningful. You could look at this as going from a letter to a sentence to a paragraph, or an ingredient to a recipe to a meal, or something else. The details don’t matter here, but I think about the fundamental building block of data. One geocoded tweet has little or no value on its own. Contrast that with per capita income for this ZIP code. By amassing enough geocoded tweets, it’s clear we can get to something meaningful, but I don’t know how many tweets that is. I do know that per capita income can directly inform my marketing plans for selling a new shampoo.
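A small sketch of that contrast, under loud assumptions: the tweets, ZIP codes and income figures below are made up, and `MIN_TWEETS` is an arbitrary threshold standing in for the number I said I don’t know.

```python
from collections import Counter

# Hypothetical sample: each geocoded tweet reduced to the ZIP code it falls in.
tweet_zips = ["94103", "94103", "10001", "94103", "10001", "60614"]

# Hypothetical per capita income by ZIP code (invented numbers).
income_by_zip = {"94103": 50000, "10001": 60000, "60614": 55000}

MIN_TWEETS = 3  # arbitrary; the real threshold for "meaningful" is unknown


def tweet_signal(zips, min_tweets):
    """Keep only the ZIP codes with at least min_tweets tweets."""
    counts = Counter(zips)
    return {zip_code: n for zip_code, n in counts.items() if n >= min_tweets}


signal = tweet_signal(tweet_zips, MIN_TWEETS)
print(signal)                  # tweets only say something once aggregated
print(income_by_zip["94103"])  # one income record is usable as-is
```

The income lookup is a single well-defined value, while the tweets need counting and a judgment call about the cutoff before they say anything at all.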
With that, here’s some more wet blanket for everybody. Using Google Trends, I looked at a number of terms that might indicate the old-fashioned RDBMS/SQL way of life, and most seem to follow the blue line, which represents the term ‘metadata.’ Big Data, coincidentally, first appears a few months before the first Strata conference in 2011. ‘Curation’ has a longer life but doesn’t show the surge of Big Data, and everybody’s favorite, ‘data scientist,’ doesn’t register as much more than a rounding error. I’m not using Google Trends to fully substantiate my argument, but I do hope you take a dose of skepticism before fully embracing ‘this.’
In closing, I’d like to leave you with an emergent cliché. It’s also my measure of how geeky an audience I have: one person’s metadata is another person’s data.