Plenary talk for http://datadaytexas.com/ (2014-01-11) discussing state of the art for Data circa early 2014, and presenting a rubric called "Just Enough Math" to show how some areas of advanced math (beyond the killing fields of calculus) tie together the business cases in Big Data and a variety of open source distributed frameworks.
1. Mission: within a 40 minute talk, construct perspectives circa
early 2014 about the state of the art in Data technologies,
encompassing the content covered in DDTx14 talks,
and pointing toward “Where does it go from here?”
(8mm tape cartridge begins smoldering in a Dell LTO2 Ultrium-2)
———
[ Great to be back in Austin! ]
Glancing at our schedule for today,
I had browser tabs open for Data Day Texas and O’Reilly Strata,
and was momentarily perplexed about which was showing :)
We’re in for a whole lot of wonderful insights:
expert talks, tutorials, panels, book signings, BOFs, etc.
There’s immense brainpower and expertise
represented in our agenda,
and very much so within our audience!
One thing I’ve noticed about Lynn’s events,
especially here in Austin,
the networking can be the best part —
hallway conversations, lunches, office hours:
start-ups formed, consulting gigs landed,
deals struck between major firms, etc.
Definitely take advantage of these opportunities!
[ See you at Strata, OSCON, etc. ]
Also, putting on my other hat —
there are several O’Reilly authors gathered here,
disproportionately so for a conf this size.
Definitely check out Strata, OSCON, Velocity, Solid, etc.
I hope to run into many folks from Texas at those.
———
Recently I’ve been fortunate to catch lots of “Big Data” confs:
Hadoop Summit, Spark Summit, GOTO, DataBeat, etc…
One issue voiced repeatedly at these (and in my workshops):
Enterprise IT folks go to Big Data events, and it seems
like there’s 150 different vendors on the floor
saying similar things, all claiming to be different.
Q: Does that seem at all familiar?
And it’s bewildering to Enterprise IT folks
accustomed to talking with maybe THREE vendors.
There’s an overrun vendor floor downstairs,
meanwhile, upstairs, talks are all about
somebody doing rocket surgery,
far away from what you need for Monday morning.
Upstairs, downstairs.
Fortunately, here at DDTx we have a focus on
the practices of Big Data, the apps, use cases,
and, most quintessentially, how to build things!
Almost a Maker Faire for Big Data, if you will.
[ practice, use cases, apps, how to build things ]
To name a few talks coming up:
Charity Majors @Facebook,
Josh Wills and Eric Sammer @Cloudera,
excellent preso’s from Pivotal, Tableau, Revolution, etc.
We could keep going down the list.
The practitioners are here —
at the podiums and in the audience.
That’s something I really like about Data Day Texas.
Where I live, the usual elevator pitch seems to go…
“Something something hipster early-stage tech venture,
something something much like Uber, but for iguana owners.”
Which the VCs claim will be worth billions at IPO.
Or something.
I’m so grateful to be back in Austin!
———
Many years ago, I worked for a start-up here called Odin Corporation…
[ Odin vs. Ice Giants, etc. ]
We did work commercializing artificial neural networks.
That was in the early 1990s, way before neural networks were cool.
Or at least, before ANNs were known to
more than a handful of people.
Our “international” conferences were held at “discount” hotels.
Or something.
Some experts who consulted on that
Odin project are speakers here today.
One of them, Brad Martin, will present about
data security in a world that’s moved beyond trust.
Highly recommended.
Though we’ve been through dozens of companies in the time since,
I’m grateful for my friends, that we work together many years later.
That’s an important point to keep in mind:
tech start-ups come and go (a lot, really!)
but good people you meet stick around for a long time.
———
Back at Odin we had to be careful about compute costs…
We were running on embedded microprocessors at the time,
for example recognizing handwritten Kanji characters
on circa 1990 mobile devices.
Device production, oddly enough, now owned by Google.
[ circa 1990 mobile devices => Google ]
What a hoot!
That’s reaching back into ancient days for machine learning.
I think my keychain has more computing power now.
Google runs huge clusters of neural nets ten layers deep
in warehouse-scale datacenters, for image recognition, etc.
Deep Learning, as it’s now called.
The punchline about that tech history arc is that
Processing power is catching up with the math
That’s a truism in many ways about cluster computing —
Let’s get to that in a moment…
———
One summer day in San Francisco at the top of the Westfield,
Lynn and I were discussing a kind of “divide”
observed through professional workshops about large-scale data.
Half of the audience tends to
have background in analytics,
they understand the math,
but lack enough coding experience to
move into writing production apps.
Another half has great chops
when it comes to systems programming,
and many of these people manage
large Linux clusters all day every day;
however, they perhaps lack enough math to
dive into the use cases
for apps running on those clusters.
Building interdisciplinary teams is key to industry success,
except that the work is partly opaque to each discipline.
Crossing that divide compels priorities on learning –
human learning about machine learning, if you will...
[ ¿quién es más macho? opaque code v. opaque math ]
The limiting factor is having “just enough math”
in the context of business use cases,
along with simple code examples.
Unfortunately, many people take math at university level
only to run headlong into the “killing fields” of calculus.
Q: Does that seem at all familiar?
Does anyone need three years of pre-calc, calc I, calc II, etc.,
before being allowed to learn how to do really useful things
with, say, graph theory, or monoids, or eigenvectors?
Eigenvectors and monoids are all over this field of Big Data,
just not enough people learn to be fluent with them,
so we must whisper about them in hushed tones.
The “killing fields” of calculus were put in place
to weed out the ranks of potential employees going into
mechanical engineering and related fields.
How positively industrial.
I was a math major, and I have enormous respect for
ME’s because they had to take more math than us math majors.
However, in reality, that was a Cold War priority.
Producing lots of rocket surgeons, people fluent with
partial differential equations and thermodynamics,
building ICBMs, missile shields, strategic bomber fleets, etc.
[ Yeeee Haaawwww! ]
Um, Bokay.
But now we need millions of people fluent
with graph theory and linear algebra,
with tools for leveraging HPC clusters,
contending with enormous VVV of Data.
Thermodynamics is still important,
just maybe not as much right now.
Priorities have shifted.
[ interdisciplinary teams ]
We require interdisciplinary teams.
We need analytics people comfortable with writing code,
we need systems engineers comfortable solving the math.
Otherwise, we’re going to be drowning in Data.
———
I’ve been working on material called “Just Enough Math”,
essentially, advanced math for business people
in the context of Big Data, machine learning,
parallel processing, cluster computing, etc.
Tied to industry use cases,
with simple coding examples in Python and Scala.
Plus lots of history and primary sources,
plus lots of links for further study.
And almost no calculus.
Almost.
[ almost no calculus, we promise ]
We previewed some of that material in the
“Intro to Machine Learning” workshop on Fri.
How was that?
I’d like to share some analysis
from industry use cases,
from considering the
foundational math and how to teach it…
Let’s call it “The Big Picture”
———
For the next bits, kudos to Chris Severs @eBay;
conversations we had reinforced insights about
an approach to machine learning in the general case.
To wit:
[ real-world => graph => sparse matrix => cluster compute ]
Starting with the Real-World™, that’s messy…
In Data Science, we say that
we spend 80% of our efforts just cleaning up data.
Some people (DJ Patil, et al.) even go as far as to say
that’s the science in Data Science, cleaning up the data.
Perhaps so. YMMV.
The point is that Machine Learning is about
generalizing, predicting patterns and indicators,
based on prior data and insights.
Generalization in ML is based on 3 components:
representation, evaluation, optimization
Pedro Domingos @U Washington says so,
and I’m drinking that kool-aid.
———
Representing data collected in the real world
is largely about geospatial and time series,
however, representing patterns in the data
is generally about graphs.
Why is there so much interest in graph databases,
graph queries, etc., these days?
Because real problems are generally graphs:
whether they are ad-tech, or anti-fraud, or social networks,
or educational insights, location services, agricultural planning…
real problems are almost always represented as some kind of graph.
Before we had SQL, we used CODASYL, etc., i.e., graph databases,
or IMS, which has trees plus links, i.e., graphs.
[ Ceci n’est pas un graphe ]
Now we have Titan, GraphLab, Giraph, GraphX, etc.
Processing power is catching up with the math
Steve Kramer will present about leveraging complex graphs.
Last year we had another friend and colleague,
Matthias Broecheler @Titan
The idea is: graph queries are highly effective at
*proving* the predictive power in your data
much more so than SQL!!
Start there, prove your strategy with graphs first.
Step 1 in ML: representation,
probably as graphs
———
Graphs can grow very large.
Measures that I hear around Facebook, Twitter, etc.,
those number into trillions of elements.
Graphs are simple to annotate with even more data
Scale-out, however, can become a problem.
Graphs are also quite interesting for two reasons:
1. at scale they tend to be sparse
2. graphs can generally be turned into matrices
Proving our approach through graph queries first,
then we take ginormous amounts of data,
woven into big graphs,
converting those into large, sparse matrices.
In terms of “Just Enough Math”,
algebraic graph theory provides a bridge, if you will,
between graphs and linear algebra — matrices
Bueno.
Because “vee havf vays” to handle sparse matrices.
Step 2 in ML: evaluation,
grounded in data at scale
Sparse matrices are quite awesome for parallel processing.
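To make the graph-to-matrix step concrete, here is a minimal sketch
in plain Python, not tied to any particular graph framework.
The toy edge list and the helper names (edges_to_sparse, matvec)
are made up for illustration.
It stores only the nonzero cells of the adjacency matrix,
then uses one matrix-vector product to read off node degrees.

```python
from collections import defaultdict

def edges_to_sparse(edges):
    """Convert an undirected edge list into a sparse adjacency matrix.

    The matrix is stored as {row: {col: weight}}, so only nonzero
    cells take up memory.
    """
    index = {}                      # node label -> row/column number
    rows = defaultdict(dict)
    for src, dst in edges:
        for node in (src, dst):
            index.setdefault(node, len(index))
        i, j = index[src], index[dst]
        rows[i][j] = 1
        rows[j][i] = 1              # undirected: keep the matrix symmetric
    return index, dict(rows)

def matvec(rows, vec):
    """Multiply the sparse matrix by a dense vector, touching only nonzeros."""
    return [sum(w * vec[j] for j, w in rows.get(i, {}).items())
            for i in range(len(vec))]

# hypothetical toy graph, made up for illustration
edges = [("ann", "bob"), ("bob", "cy"), ("cy", "dee"), ("ann", "cy")]
index, adj = edges_to_sparse(edges)

n = len(index)
nonzeros = sum(len(cols) for cols in adj.values())
density = nonzeros / (n * n)        # fraction of cells that are nonzero

# multiplying by a vector of ones gives each node's degree
degrees = matvec(adj, [1] * n)
```

A four-node toy is not sparse, of course.
In graphs at real scale each node touches a tiny fraction of the others,
so the density drops toward zero while the same row-wise layout still works,
and each row can be handled by a different worker.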
[ suggestion: take the red pill . . . ]
Sparse matrices fit well with column stores:
Cassandra, HBase, Vertica, Accumulo, etc.
A variety of speakers will be talking about those today!
Russell Jurney will present about Agile Data Science,
also highly recommended.
———
For the Scala fans out there —
and moreover for Clojure, Haskell, F#, too —
there are excellent ways to leverage functional programming
to write concise programs that do enormous amounts of work
defining data workflows at scale.
[ data is fully functional ]
From a software engineering management perspective,
read: less code to maintain, less expensive QA.
In terms of “Just Enough Math”,
this needs a wee bit o’ abstract algebra
but it’s really fun!
Whenever I think about data workflows,
the first thing that comes to mind is abstract algebra…
(um, awkward)
If you’re familiar with the notion of, say,
Docker for containerizing apps on Linux,
there’s an analogy… monoids, semigroups, rings, etc.,
these alien artifacts from abstract algebra
serve to “containerize” the business logic in workflows,
so that apps can be parallelized.
[ containerized business logic ]
In other words, compiler hints on a grand scale.
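A minimal sketch of that idea in plain Python,
assuming a word-count workload made up for illustration.
This is not the Summingbird API, and the names
(merge_counts, run_partition, EMPTY) are hypothetical.
The point is only that an associative "plus" with an identity element
lets us split the data any way we like and still get the same answer.

```python
from functools import reduce

def merge_counts(a, b):
    """Monoid 'plus' for word counts: associative, with {} as the identity."""
    out = dict(a)
    for key, n in b.items():
        out[key] = out.get(key, 0) + n
    return out

EMPTY = {}  # identity element: merging with it changes nothing

def count_words(line):
    """Map step: turn one line of text into a small count table."""
    return reduce(merge_counts, ({w: 1} for w in line.split()), EMPTY)

def run_partition(lines):
    """What one worker would do with its share of the data."""
    return reduce(merge_counts, map(count_words, lines), EMPTY)

# hypothetical input, made up for illustration
lines = ["big data", "just enough math", "big math", "data data"]

# one machine, one pass
serial = run_partition(lines)

# two pretend "workers", then merge their partial results
partials = [run_partition(lines[:1]), run_partition(lines[1:])]
parallel = reduce(merge_counts, partials, EMPTY)
```

Because merge_counts is associative, where we cut the list does not matter,
which is exactly the freedom a scheduler needs to spread work across a cluster.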
One of the best realizations of this so far in open source
has got to be https://github.com/twitter/summingbird/wiki
Seriously. Filled. With. Awesome.
That work drives the revenue apps which drove the Twitter IPO
tying together Hadoop, Storm, Spark, etc., into a common framework
with lots of abstract algebra applied.
Sam Ritchie will present about Summingbird today —
For the FP folks, this is one of my top recommends.
———
While we’re talking about large-scale data workflows and metadata,
I wrote some stuff about that…
There are a number of excellent frameworks for data workflows.
This is one of the most interesting areas
evolving within Big Data in the past two years.
Because system integration is either gold or a money pit.
That’s a truly hard problem.
[ system integration, you know you love it ]
Some excellent workflow systems come to mind:
* KNIME is another of my favorite examples for production workflow systems
Michael Berthold will present today
* Py: Anaconda / IPython Notebook / Pandas / scikit-learn / Augustus / etc.
Matthew Russell has a tutorial today, highly recommended
* R, RStudio, Revolution Analytics
Paul Ingram will present today
* I’ve heard that Cloudera has a few of these frameworks, too ;)
* Actian has an entire product line, integrating KNIME and other tools
* Julia is also coming up in the world!
———
Bokay,
so we take real-world messy data,
represent it as graphs,
transform graphs into sparse matrices
leveraging algebraic graph theory,
evaluate using a bunch o’ linear algebra,
apply loads of abstract algebra
to leverage functional programming,
defining complex, large-scale data workflows
and then parallelism pops out the end of the tube.
What do we do with parallelism?
We run all those apps on clusters!
[ clusters, FTW! ]
What’s interesting about cluster computing —
for example, in Google papers about datacenter computing
which analyze cluster traces across several companies —
is that we can break down the work into a few categories:
* moving data around (what others think your Data Scientists do)
* stochastic gradient descent (what your Data Scientists think they do)
* cleaning up data (what your Data Scientists actually do)
* highly-available services (what your Data Scientists should be doing)
We throw enormous amounts of cluster cost at the hard problem of
provisioning and scheduling resources so that
HA services can meet the required SLAs.
Q: Does that seem at all familiar?
Google showed it, and I believe it.
Clusters run more than batch jobs:
Memcached, Ruby on Rails, MySQL, all manner of Python, etc.
I’ve been working with a project called Apache Mesos,
based on work from Google about datacenter computing
focusing on distributed kernel, low-latency HA services, etc.
Essentially, how to “roll your own” distributed frameworks
in parallel, at scale, on commodity hardware,
leveraging Linux kernel features in place since 2006,
in a few hundred lines of functional programming code.
Please join me for a tutorial about Mesos later today…
———
But otherwise, once we peel off the services
and the moving/cleaning data
we’re left with SGD or something closely akin
Our clusters spend lots of time optimizing learners
or optimizing schedules, or plans,
Or something.
[ optimization rocks ]
In terms of “Just Enough Math”,
that is, well, that’s optimization
Step 3 in ML: optimization
parallel processing for lots of optimization
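For anyone who has not seen SGD written out,
here is a minimal single-machine sketch in plain Python.
It assumes a made-up, noise-free line y = 3x + 1 as the data,
and the function name, learning rate, and epoch count are illustrative choices,
not taken from any production system.

```python
import random

def sgd_fit(points, lr=0.05, epochs=200, seed=42):
    """Fit y = w*x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(points)
    for _ in range(epochs):
        rng.shuffle(data)              # visit the samples in random order
        for x, y in data:
            err = (w * x + b) - y      # prediction error on one sample
            w -= lr * err * x          # gradient of 0.5*err**2 w.r.t. w
            b -= lr * err              # gradient of 0.5*err**2 w.r.t. b
    return w, b

# hypothetical noise-free data drawn from y = 3x + 1
points = [(x / 10.0, 3.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]
w, b = sgd_fit(points)
```

Each update looks at one sample only, which is why the cluster versions
can shard the data and keep many of these loops busy at once.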
Say, is anybody planning to talk about SGD today?
———
Two really interesting things about optimization come to mind…
One: companies like Twitter, Google, etc., spent lots of capital
to formulate high-ROI apps running in parallel on clusters at scale,
so that the heavy lifting boils down to lots of gradient descent.
What happens when quantum computing becomes commodity hardware?
Quantum algorithms knock down the cost of gradient descent exponentially.
Exponential decrease in cost for the critical code running
in those multi-billion dollar datacenters.
Imagine that…
Twitter spent a lot of capital to build out Summingbird, etc.,
and make that approach to parallel workflows open source.
Continuum has some *really* interesting work
with distrib Py at scale, also heading into that neighborhood.
Words to the wise.
Processing power is catching up with the math
For those of us who have kiddos, know kiddos, etc.,
I can summarize part of that experience in one word:
Minecraft
I have two daughters, ages 9 and 8,
who seem permanently attached to Minecraft.
YouTube currently has 93,000,000 videos about Minecraft,
many of which were created by kiddos
teaching other kiddos how to do programming…
Like how to pull apart and reassemble a JAR file
so you can add new features to your Minecraft server.
My 9 y.o. has learned some Linux sys admin skills,
running our neighborhood Minecraft server on AWS.
That’s all quite awesome.
Google is working to identify 10 y.o.’s who can hack quantum:
[ superposition and creepers ]
How brilliant is that?
Doing the math… not so many years from now, some of these
10 y.o. Minecraft experts may become Google AI interns…
That’s all quite awesome too.
Two: sometimes optimization doesn’t work that neatly.
Suppose you don’t have a differentiable objective function,
or cannot approximate one effectively:
common optimization techniques sez too bad.
There’s a body of work called Evolutionary Algorithms,
which handles optimization problems that SGD, etc., cannot.
GPs (genetic programs) have been around since the 1970s,
based on really interesting math to leverage,
but it was quite costly on processors back then…
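A minimal sketch of the evolutionary idea, in plain Python.
To be clear, this is a simple (1+1) evolutionary loop over bit strings,
a much smaller cousin of genetic programming, not GP itself.
The target pattern, mutation rate, and function names are made up for illustration.
The score is a count of matching bits, so there is no gradient to follow,
yet mutate-and-select still climbs to the answer.

```python
import random

def fitness(bits, target):
    """Count matching positions: a step-like score with no gradient to follow."""
    return sum(1 for a, b in zip(bits, target) if a == b)

def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]

def evolve(target, max_generations=5000, seed=7):
    """A (1+1) evolutionary loop: keep the child only if it is no worse."""
    rng = random.Random(seed)
    n = len(target)
    parent = [rng.randint(0, 1) for _ in range(n)]
    score = fitness(parent, target)
    for _ in range(max_generations):
        if score == n:                 # perfect match, stop early
            break
        child = mutate(parent, 1.0 / n, rng)
        child_score = fitness(child, target)
        if child_score >= score:       # selection: never accept a worse child
            parent, score = child, child_score
    return parent, score

# hypothetical target pattern, made up for illustration
target = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
best, best_score = evolve(target)
```

Swap in any scoring function you like, differentiable or not,
and the loop does not care. That is the appeal, and also why it burns so many cycles.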
[ evolution vs. design ]
Bill Worzel will present today.
BTW, ask Bill about quantum algorithms used for GPs.
For the FP folks, this is another of my top recommends,
due to ample use of combinators… #justsayin
———
Wow, I completely forgot to get into compressed sensing,
probabilistic data structures, plus some other areas of math
which are really interesting and useful.
Perhaps we’ll catch those next time.
In any case, I hope this gives some indication of where
advanced math intersects with
Big Data use cases.
There’s much more to cover, but you’re better off
hearing it from the experts.
The Big Picture for state of the art in Data.
———
I wish y’all an excellent Data Day Texas!
Paco Nathan
http://liber118.com/pxn/
@pacoid