Mission: within a 40 minute talk, construct perspectives circa
early 2014 about the state of the art in Data technologies,
encompassing the content covered in DDTx14 talks,
and pointing toward “Where does it go from here?”
(8mm tape cartridge begins smoldering in a Dell LTO2 Ultrium-2)
———

[ Great to be back in Austin! ]
Glancing at our schedule for today,
I had browser tabs open for Data Day Texas and O’Reilly Strata,
and was momentarily perplexed about which was showing :)
We’re in for a whole lot of wonderful insights:
expert talks, tutorials, panels, book signings, BOFs, etc.
There’s immense brainpower and expertise
represented in our agenda,
and very much so within our audience!
One thing I’ve noticed about Lynn’s events,
especially here in Austin,
the networking can be the best part —
hallway conversations, lunches, office hours:
start-ups formed, consulting gigs landed,
deals struck between major firms, etc.
Definitely take advantage of these opportunities!

[ See you at Strata, OSCON, etc. ]
Also, putting on my other hat —
there are several O’Reilly authors gathered here,
disproportionately so for a conf this size.
Definitely check out Strata, OSCON, Velocity, Solid, etc.
I hope to run into many folks from Texas at those.
———
Recently I’ve been fortunate to catch lots of “Big Data” confs:
Hadoop Summit, Spark Summit, GOTO, DataBeat, etc…
One issue voiced repeatedly at these (and in my workshops):
Enterprise IT folks go to Big Data events, and it seems
like there’s 150 different vendors on the floor
saying similar things, all claiming to be different.

Q: Does that seem at all familiar?
And it’s bewildering to Enterprise IT folks
accustomed to talking with maybe THREE vendors.
There’s an overrun vendor floor downstairs,
meanwhile, upstairs, talks are all about
somebody doing rocket surgery,
far away from what you need for Monday morning.
Upstairs, downstairs.
Fortunately, here at DDTx we have a focus on
the practices of Big Data, the apps, use cases,
and, most quintessentially, how to build things!
Almost a Maker Faire for Big Data, if you will.

[ practice, use cases, apps, how to build things ]
To name a few talks coming up:
Charity Majors @Facebook,
Josh Wills and Eric Sammer @Cloudera,
excellent preso’s from Pivotal, Tableau, Revolution, etc.
We could keep going down the list.
The practitioners are here —
at the podiums and in the audience.
That’s something I really like about Data Day Texas.
Where I live, the usual elevator pitch seems to go…
“Something something hipster early-stage tech venture,
something something much like Uber, but for iguana owners.”
Which the VCs claim will be worth billions at IPO.
Or something.
I’m so grateful to be back in Austin!
———
Many years ago, I worked for a start-up here called Odin Corporation…

[ Odin vs. Ice Giants, etc. ]
We did work commercializing artificial neural networks.
That was in the early 1990s, way before neural networks were cool.
Or at least, before ANNs were known to
more than a handful of people.
Our “international” conferences were held at “discount” hotels.
Or something.
Some experts who consulted on that
Odin project are speakers here today.
One of them, Brad Martin, will present about
data security in a world that’s moved beyond trust.
Highly recommended.
Though we’ve been through dozens of companies in the time since,
I’m grateful for my friends, that we work together many years later.
That’s an important point to keep in mind:
tech start-ups come and go (a lot, really!)
but good people you meet stick around for a long time.
———
Back at Odin we had to be careful about compute costs…
We were running on embedded microprocessors at the time,
for example recognizing handwritten Kanji characters
on circa 1990 mobile devices.
Device production, oddly enough, now owned by Google.

[ circa 1990 mobile devices => Google ]
What a hoot!
That’s reaching back into ancient days for machine learning.
I think my keychain has more computing power now.
Google runs huge clusters of neural nets ten layers deep
in warehouse-scale datacenters, for image recognition, etc.
Deep Learning, as it’s now called.
The punchline about that tech history arc is that

Processing power is catching up with the math
That’s a truism in many ways about cluster computing —
Let’s get to that in a moment…
———
One summer day in San Francisco at the top of the Westfield,
Lynn and I were discussing a kind of “divide”
observed through professional workshops about large-scale data.
Half of the audience tends to
have background in analytics,
they understand the math,
but lack enough coding experience to
move into writing production apps.
Another half has great chops
when it comes to systems programming,
and many of these people manage
large Linux clusters all day every day;
however, they perhaps lack enough math to
dive into the use cases
for apps running on those clusters.
Building interdisciplinary teams is key to industry success,
except that the work is partly opaque to each discipline.
Crossing that divide compels priorities on learning –
human learning about machine learning, if you will...

[ ¿quién es más macho? opaque code v. opaque math ]
The limiting factor is having “just enough math”
in the context of business use cases,
along with simple code examples.
Unfortunately, many people take math at university level
only to run headlong into the “killing fields” of calculus.

Q: Does that seem at all familiar?
Does anyone need three years of pre-calc, calc I, calc II, etc.,
before being allowed to learn how to do really useful things
with, say, graph theory, or monoids, or eigenvectors?
Eigenvectors and monoids are all over this field of Big Data,
just not enough people learn to be fluent with them,
so we must whisper about them in hushed tones.
The “killing fields” of calculus were put in place
to weed out the ranks of potential employees going into
mechanical engineering and related fields.
How positively industrial.
I was a math major, and I have enormous respect for
ME’s because they had to take more math than us math majors.
However, in reality, that was a Cold War priority.
Producing lots of rocket surgeons, people fluent with
partial differential equations and thermodynamics,
building ICBMs, missile shields, strategic bomber fleets, etc.

[ Yeeee Haaawwww! ]
Um, Bokay.
But now we need millions of people fluent
with graph theory and linear algebra,
with tools for leveraging HPC clusters,
contending with enormous VVV of Data.
Thermodynamics is still important,
just maybe not as much right now.
Priorities have shifted.

[ interdisciplinary teams ]
We require interdisciplinary teams.
We need analytics people comfortable with writing code,
we need systems engineers comfortable solving the math.
Otherwise, we’re going to be drowning in Data.
———
I’ve been working on material called “Just Enough Math”,
essentially, advanced math for business people
in the context of Big Data, machine learning,
parallel processing, cluster computing, etc.
Tied to industry use cases,
with simple coding examples in Python and Scala.
Plus lots of history and primary sources,
plus lots of links for further study.
And almost no calculus.
Almost.

[ almost no calculus, we promise ]
We previewed some of that material in the
“Intro to Machine Learning” workshop on Fri.
How was that?
I’d like to share some analysis
from industry use cases,
from considering the
foundational math and how to teach it…

Let’s call it “The Big Picture”
———
For the next bits, kudos to Chris Severs @eBay;
conversations we had reinforced insights about
an approach to machine learning in the general case.
To wit:

[ real-world => graph => sparse matrix => cluster compute ]
Starting with the Real-World™, that’s messy…
In Data Science, we say that
we spend 80% of our efforts just cleaning up data.
Some people (DJ Patil, et al.) even go as far as to say
that’s the science in Data Science, cleaning up the data.
Perhaps so. YMMV.
The point is that Machine Learning is about
generalizing, predicting patterns and indicators,
based on prior data and insights.
Generalization in ML is based on 3 components:
representation, evaluation, optimization
Pedro Domingos @U Washington says so,
and I’m drinking that kool-aid.
———
Representing data collected in the real world
is largely about geospatial and time series,
however, representing patterns in the data
is generally about graphs.
Why is there so much interest in graph databases,
graph queries, etc., these days?
Because real problems are generally graphs:
whether they are ad-tech, or anti-fraud, or social networks,
or educational insights, location services, agricultural planning…
real problems are almost always represented as some kind of graph.
Before we had SQL, we used CODASYL, etc., i.e., graph databases,
or IMS, which has trees plus links, i.e., graphs.

[ Ceci n’est pas un graphe ]
Now we have Titan, GraphLab, Giraph, GraphX, etc.

Processing power is catching up with the math
Steve Kramer will present about leveraging complex graphs.
Last year we had another friend and colleague,
Matthias Broecheler @Titan.
The idea is: graph queries are highly effective at
*proving* the predictive power in your data
much more so than SQL!!
Start there, prove your strategy with graphs first.
Step 1 in ML: representation,
probably as graphs
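
To make "prove your strategy with graphs first" a bit more concrete, here is a minimal sketch in plain Python, not tied to Titan or any other graph database. The account/device data and the `two_hop` helper are hypothetical, made up purely for illustration: it asks an anti-fraud style question, namely which accounts share a device with a given account.

```python
from collections import defaultdict

# Hypothetical toy data: (account, device) edges, as in an anti-fraud graph.
edges = [
    ("alice", "phone-1"),
    ("bob", "phone-1"),
    ("bob", "laptop-7"),
    ("carol", "laptop-7"),
    ("dave", "tablet-3"),
]

# Represent the graph as an undirected adjacency map.
adjacency = defaultdict(set)
for account, device in edges:
    adjacency[account].add(device)
    adjacency[device].add(account)


def two_hop(node):
    """Return the nodes reachable in two hops, excluding the start node."""
    found = set()
    for neighbor in adjacency[node]:
        found |= adjacency[neighbor]
    found.discard(node)
    return found


# Which accounts share a device with bob?
print(sorted(two_hop("bob")))
```

The same question in SQL would typically need a self-join on a link table; in graph form it is just "walk two hops".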
———
Graphs can grow very large.
Measures that I hear around Facebook, Twitter, etc.,
those number in the trillions of elements.
Graphs are simple to annotate with even more data.
Scale-out, however, can become a problem.
Graphs are also quite interesting for two reasons:
1. at scale they tend to be sparse
2. graphs can generally be turned into matrices
Proving our approach through graph queries first,
then we take ginormous amounts of data,
woven into big graphs,
converting those into large, sparse matrices.
In terms of “Just Enough Math”,
algebraic graph theory provides a bridge, if you will,
between graphs and linear algebra — matrices
Bueno.
Because “vee havf vays” to handle sparse matrices.
Step 2 in ML: evaluation,
grounded in data at scale
Sparse matrices are quite awesome for parallel processing.
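
As a rough sketch of that bridge from graphs to linear algebra, the plain-Python example below turns an edge list into a sparse adjacency matrix and evaluates it with matrix-vector products. The edge list, the dict-of-rows storage, and the `matvec` helper are all assumptions chosen for illustration; a production system would more likely use a dedicated sparse matrix library.

```python
# Hypothetical directed graph as an edge list: (source, target).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]

# Map each node label to a row/column index.
nodes = sorted({n for edge in edges for n in edge})
index = {name: i for i, name in enumerate(nodes)}

# Sparse adjacency matrix: store only the non-zero entries, keyed by row.
# rows[i][j] == 1.0 means there is an edge from node i to node j.
rows = {i: {} for i in range(len(nodes))}
for src, dst in edges:
    rows[index[src]][index[dst]] = 1.0


def matvec(sparse_rows, vector):
    """Multiply the sparse matrix by a dense vector, touching only non-zeros."""
    return [
        sum(value * vector[j] for j, value in sparse_rows[i].items())
        for i in range(len(vector))
    ]


# A * ones gives the out-degree of each node.
out_degree = matvec(rows, [1.0] * len(nodes))

# A * (A * ones) counts the walks of length two that start at each node.
two_step_walks = matvec(rows, out_degree)

density = len(edges) / (len(nodes) ** 2)
print(dict(zip(nodes, out_degree)))
print(dict(zip(nodes, two_step_walks)))
print(f"non-zero fraction: {density:.2f}")
```

Each output row depends only on its own non-zero entries, so the rows can be computed independently, which is one reason sparse matrices suit parallel processing.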

[ suggestion: take the red pill . . . ]
Sparse matrices fit well with column stores:
Cassandra, HBase, Vertica, Accumulo, etc.
A variety of speakers will be talking about those today!
Russell Jurney will present about Agile Data Science,
also highly recommended.
———
For the Scala fans out there —
and moreover for Clojure, Haskell, F#, too —
there are excellent ways to leverage functional programming
to write concise programs that do enormous amounts of work
defining data workflows at scale.

[ data is fully functional ]
From a software engineering management perspective,
read: less code to maintain, less expensive QA.
In terms of “Just Enough Math”,
this needs a wee bit o’ abstract algebra
but it’s really fun!
Whenever I think about data workflows,
the first thing that comes to mind is abstract algebra…
(um, awkward)
If you’re familiar with the notion of, say,
Docker for containerizing apps on Linux,
there’s an analogy… monoids, semigroups, rings, etc.,
these alien artifacts from abstract algebra
serve to “containerize” the business logic in workflows,
so that apps can be parallelized.

[ containerized business logic ]
In other words, compiler hints on a grand scale.
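
To illustrate the "containerize" analogy, here is a minimal sketch of a monoid in plain Python. This is not Summingbird's API; the `combine` and `count_words` helpers and the sample lines are hypothetical. The point is only that an associative combine plus an identity value lets the work be split across partitions and merged in any grouping.

```python
from collections import Counter
from functools import reduce

# A monoid is a combine operation that is associative, plus an identity value.
# Here: word-count maps under addition, with the empty Counter as identity.
IDENTITY = Counter()


def combine(left, right):
    """Associative merge of two partial word counts."""
    return left + right


def count_words(lines):
    """Business logic for one partition: fold its lines into a single Counter."""
    return reduce(combine, (Counter(line.split()) for line in lines), IDENTITY)


# Hypothetical input, split into partitions as a cluster scheduler might.
lines = ["big data big math", "graphs and math", "data data data", "math"]
partitions = [lines[:1], lines[1:3], lines[3:]]

# Each partition could be counted on a different worker, in any grouping...
partials = [count_words(part) for part in partitions]
parallel_style = reduce(combine, partials, IDENTITY)

# ...and associativity guarantees the same answer as one sequential pass.
sequential = count_words(lines)
print(parallel_style == sequential)
print(parallel_style.most_common(2))
```

Because the framework only needs to know that `combine` is associative, it is free to reorder and batch the merges, which is the "compiler hint" in miniature.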
One of the best realizations of this so far in open source
has got to be https://github.com/twitter/summingbird/wiki
Seriously. Filled. With. Awesome.
That work drives the revenue apps which drove the Twitter IPO
tying together Hadoop, Storm, Spark, etc., into a common framework
with lots of abstract algebra applied.
Sam Ritchie will present about Summingbird today —
For the FP folks, this is one of my top recommends.
———
While we’re talking about large-scale data workflows and metadata,
I wrote some stuff about that…
There are a number of excellent frameworks for data workflows.
This is one of the most interesting areas
evolving within Big Data in the past two years.
Because system integration is either gold or a money pit.
That’s a truly hard problem.

[ system integration, you know you love it ]
Some excellent workflow systems come to mind:
* KNIME is another of my favorite examples for production workflow systems
Michael Berthold will present today
* Py: Anaconda / IPython Notebook / Pandas / scikit-learn / Augustus / etc.
Matthew Russell has a tutorial today, highly recommended
* R, RStudio, Revolution Analytics
Paul Ingram will present today
* I’ve heard that Cloudera has a few of these frameworks, too ;)
* Actian has an entire product line, integrating KNIME and other tools
* Julia is also coming up in the world!
———
Bokay,
so we take real-world messy data,
represent it as graphs,
transform graphs into sparse matrices
leveraging algebraic graph theory,
evaluate using a bunch o’ linear algebra,
apply loads of abstract algebra
to leverage functional programming,
defining complex, large-scale data workflows
and then parallelism pops out the end of the tube.
What do we do with parallelism?
We run all those apps on clusters!

[ clusters, FTW! ]
What’s interesting about cluster computing —
for example, in Google papers about datacenter computing
which analyze cluster traces across several companies —
is that we can break down the work into a few categories:
* moving data around (what others think your Data Scientists do)
* stochastic gradient descent (what your Data Scientists think they do)
* cleaning up data (what your Data Scientists actually do)
* highly-available services (what your Data Scientists should be doing)

We throw enormous amounts of cluster cost at the hard problem of
provisioning and scheduling resources so that
HA services can meet the required SLAs.

Q: Does that seem at all familiar?
Google showed it, and I believe it.
Clusters run more than batch jobs:
Memcached, Ruby on Rails, MySQL, all manner of Python, etc.
I’ve been working with a project called Apache Mesos,
based on work from Google about datacenter computing
focusing on distributed kernel, low-latency HA services, etc.
Essentially, how to “roll your own” distributed frameworks
in parallel, at scale, on commodity hardware,
leveraging Linux kernel features in place since 2006,
in a few hundred lines of functional programming code.
Please join me for a tutorial about Mesos later today…
———
But otherwise, once we peel off the services
and the moving/cleaning data
we’re left with SGD or something closely akin
Our clusters spend lots of time optimizing learners
or optimizing schedules, or plans,
Or something.

[ optimization rocks ]
In terms of “Just Enough Math”,
that is, well, that’s optimization
Step 3 in ML: optimization
parallel processing for lots of optimization
Say, is anybody planning to talk about SGD today?
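
For anyone who has not seen SGD spelled out, here is a minimal single-machine sketch in plain Python. It fits a line to noise-free data; the data, learning rate, epoch count, and the `sgd_linear` helper are all assumptions picked for illustration, not a description of how any particular cluster framework does it.

```python
import random

# Hypothetical noise-free training data drawn from y = 2x + 1.
xs = [i / 49 for i in range(50)]
data = [(x, 2.0 * x + 1.0) for x in xs]


def sgd_linear(samples, learning_rate=0.1, epochs=200, seed=0):
    """Fit y = w*x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(samples)
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a fresh random order
        for x, y in order:
            error = (w * x + b) - y
            # Gradient of 0.5 * error**2 with respect to w and b.
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b


w, b = sgd_linear(data)
print(f"w = {w:.3f}, b = {b:.3f}")
```

Each update touches a single sample, which is why the heavy lifting spreads so naturally across many workers and many passes over the data.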
———
Two really interesting things about optimization come to mind…
One: companies like Twitter, Google, etc., spent lots of capital
to formulate high-ROI apps running in parallel on clusters at scale,
so that the heavy-lifting boils down to lots of gradient descent.
What happens when quantum computing becomes commodity hardware?
Quantum algorithms knock down the cost of gradient descent
exponentially.
Exponential decrease in cost for the critical code running
in those multi-billion dollar datacenters.
Imagine that…
Twitter spent a lot of capital to build out Summingbird, etc.,
and make that approach to parallel workflows open source.
Continuum has some *really* interesting work
with distrib Py at scale, also heading into that neighborhood.
Words to the wise.

Processing power is catching up with the math
For those of us who have kiddos, know kiddos, etc.,
I can summarize part of the experience in one word:
Minecraft
I have two daughters, ages 9 and 8,
who seem permanently attached to Minecraft.
YouTube currently has 93,000,000 videos about Minecraft,
many of which were created by kiddos
teaching other kiddos how to do programming…
Like how to pull apart and reassemble a JAR file
so you can add new features to your Minecraft server.
My 9 y.o. has learned some Linux sys admin skills,
running our neighborhood Minecraft server on AWS.
That’s all quite awesome.
Google is working to identify 10 y.o.’s who can hack quantum:

[ superposition and creepers ]
How brilliant is that?
Doing the math… not so many years from now, some of these
10 y.o. Minecraft experts may become Google AI interns…
That’s all quite awesome too.
Two: sometimes optimization doesn’t work that neatly.
Suppose you don’t have a differentiable objective function,
or cannot approximate one effectively,
common optimization techniques sez too bad.
There’s a body of work called Evolutionary Algorithms,
which handles optimization problems that SGD, etc., cannot.
Genetic programming (GP) has been around since the 1970s,
based on really interesting math to leverage,
but it was quite costly on processors back then…
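
As a small taste of the idea, here is a sketch of a (1+1) evolutionary algorithm in plain Python. To be clear, this is a much simpler relative of genetic programming, not GP itself, and the `TARGET` pattern, `fitness`, and `evolve` names are hypothetical. The objective only counts matching bits, so there is no gradient to follow; mutation and selection do the searching instead.

```python
import random

# Hypothetical target pattern; the objective only counts matching positions,
# so it is a step function with no useful gradient.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0]


def fitness(candidate):
    """Number of positions where the candidate agrees with TARGET."""
    return sum(1 for c, t in zip(candidate, TARGET) if c == t)


def evolve(generations=2000, seed=42):
    """(1+1) evolutionary algorithm: mutate one parent, keep the child if no worse."""
    rng = random.Random(seed)
    n = len(TARGET)
    parent = [rng.randint(0, 1) for _ in range(n)]
    for _ in range(generations):
        # Flip each bit independently with probability 1/n.
        child = [bit ^ 1 if rng.random() < 1.0 / n else bit for bit in parent]
        if fitness(child) >= fitness(parent):
            parent = child
    return parent


best = evolve()
print(fitness(best), "of", len(TARGET))
```

Every fitness evaluation is independent of the others, so larger populations map well onto clusters, which is part of why this approach is far cheaper to run now than it was decades ago.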

[ evolution vs. design ]
Bill Worzel will present today.
BTW, ask Bill about quantum algorithms used for GPs.
For the FP folks, this is another of my top recommends,
due to ample use of combinators… #justsayin
———
Wow, I completely forgot to get into compressed sensing,
probabilistic data structures, plus some other areas of math
which are really interesting and useful.
Perhaps we’ll catch those next time.
In any case, I hope this gives some indication of where
advanced math intersects with
Big Data use cases.
There’s much more to cover, but you’re better off
hearing it from the experts.
The Big Picture for state of the art in Data.
———

I wish y’all an excellent Data Day Texas!

Paco Nathan
http://liber118.com/pxn/
@pacoid

Contenu connexe

En vedette

A GeoSocial Intelligence Framework for Studying & Promoting Resilience to Sea...
A GeoSocial Intelligence Framework for Studying & Promoting Resilience to Sea...A GeoSocial Intelligence Framework for Studying & Promoting Resilience to Sea...
A GeoSocial Intelligence Framework for Studying & Promoting Resilience to Sea...SMART Infrastructure Facility
 
Les médias sociaux au service des stratégies de contenus marketing
Les médias sociaux au service des stratégies de contenus marketingLes médias sociaux au service des stratégies de contenus marketing
Les médias sociaux au service des stratégies de contenus marketingEmilie Marquois
 
Ermes, internet veloce per la regione Friuli Venezia Giulia
Ermes, internet veloce per la regione Friuli Venezia GiuliaErmes, internet veloce per la regione Friuli Venezia Giulia
Ermes, internet veloce per la regione Friuli Venezia GiuliaSimone Puksic
 
Elastic Apache Mesos on Amazon EC2
Elastic Apache Mesos on Amazon EC2Elastic Apache Mesos on Amazon EC2
Elastic Apache Mesos on Amazon EC2Paco Nathan
 
Datacenter Computing with Apache Mesos - シリコンバレー日本人駐在員Meetup
Datacenter Computing with Apache Mesos - シリコンバレー日本人駐在員MeetupDatacenter Computing with Apache Mesos - シリコンバレー日本人駐在員Meetup
Datacenter Computing with Apache Mesos - シリコンバレー日本人駐在員MeetupPaco Nathan
 
Hadoop and Beyond
Hadoop and BeyondHadoop and Beyond
Hadoop and BeyondPaco Nathan
 
Fast matrix computations for pair-wise and column-wise Katz scores and commut...
Fast matrix computations for pair-wise and column-wise Katz scores and commut...Fast matrix computations for pair-wise and column-wise Katz scores and commut...
Fast matrix computations for pair-wise and column-wise Katz scores and commut...David Gleich
 
Some Lessons for Startups (pdf with notes)
Some Lessons for Startups (pdf with notes)Some Lessons for Startups (pdf with notes)
Some Lessons for Startups (pdf with notes)Tim O'Reilly
 
AWS Start-Up Tour 2007 / HeadCase
AWS Start-Up Tour 2007 / HeadCaseAWS Start-Up Tour 2007 / HeadCase
AWS Start-Up Tour 2007 / HeadCasePaco Nathan
 
Zero Waste à Gipuzkoa (Pays basque espagnol)
Zero Waste à Gipuzkoa (Pays basque espagnol)Zero Waste à Gipuzkoa (Pays basque espagnol)
Zero Waste à Gipuzkoa (Pays basque espagnol)Zero Waste France, Cniid
 
TOC Bologna 2012: How to Receive Funding and Support for New Digital and Prin...
TOC Bologna 2012: How to Receive Funding and Support for New Digital and Prin...TOC Bologna 2012: How to Receive Funding and Support for New Digital and Prin...
TOC Bologna 2012: How to Receive Funding and Support for New Digital and Prin...OReillyTOC
 
Obtaining, Scrubbing, and Exploring Data at the Command Line by Jeroen Janssens
Obtaining, Scrubbing, and Exploring Data at the Command Line by Jeroen Janssens	Obtaining, Scrubbing, and Exploring Data at the Command Line by Jeroen Janssens
Obtaining, Scrubbing, and Exploring Data at the Command Line by Jeroen Janssens Hakka Labs
 
Awakening India - Jago Party
Awakening India - Jago PartyAwakening India - Jago Party
Awakening India - Jago PartyKapil Mohan
 
Flow Engines - Hack The Way You Work, Not The Time You Have
Flow Engines - Hack The Way You Work, Not The Time You HaveFlow Engines - Hack The Way You Work, Not The Time You Have
Flow Engines - Hack The Way You Work, Not The Time You HaveJohn V Willshire
 
Digital analytics & privacy: it's not the end of the world
Digital analytics & privacy: it's not the end of the worldDigital analytics & privacy: it's not the end of the world
Digital analytics & privacy: it's not the end of the worldOReillyStrata
 
Seattle Data Geeks: Hadoop and Beyond
Seattle Data Geeks: Hadoop and BeyondSeattle Data Geeks: Hadoop and Beyond
Seattle Data Geeks: Hadoop and BeyondPaco Nathan
 

En vedette (19)

Government 2.0
Government 2.0Government 2.0
Government 2.0
 
A GeoSocial Intelligence Framework for Studying & Promoting Resilience to Sea...
A GeoSocial Intelligence Framework for Studying & Promoting Resilience to Sea...A GeoSocial Intelligence Framework for Studying & Promoting Resilience to Sea...
A GeoSocial Intelligence Framework for Studying & Promoting Resilience to Sea...
 
Les médias sociaux au service des stratégies de contenus marketing
Les médias sociaux au service des stratégies de contenus marketingLes médias sociaux au service des stratégies de contenus marketing
Les médias sociaux au service des stratégies de contenus marketing
 
Ermes, internet veloce per la regione Friuli Venezia Giulia
Ermes, internet veloce per la regione Friuli Venezia GiuliaErmes, internet veloce per la regione Friuli Venezia Giulia
Ermes, internet veloce per la regione Friuli Venezia Giulia
 
Elastic Apache Mesos on Amazon EC2
Elastic Apache Mesos on Amazon EC2Elastic Apache Mesos on Amazon EC2
Elastic Apache Mesos on Amazon EC2
 
Datacenter Computing with Apache Mesos - シリコンバレー日本人駐在員Meetup
Datacenter Computing with Apache Mesos - シリコンバレー日本人駐在員MeetupDatacenter Computing with Apache Mesos - シリコンバレー日本人駐在員Meetup
Datacenter Computing with Apache Mesos - シリコンバレー日本人駐在員Meetup
 
Hadoop and Beyond
Hadoop and BeyondHadoop and Beyond
Hadoop and Beyond
 
Fast matrix computations for pair-wise and column-wise Katz scores and commut...
Fast matrix computations for pair-wise and column-wise Katz scores and commut...Fast matrix computations for pair-wise and column-wise Katz scores and commut...
Fast matrix computations for pair-wise and column-wise Katz scores and commut...
 
Some Lessons for Startups (pdf with notes)
Some Lessons for Startups (pdf with notes)Some Lessons for Startups (pdf with notes)
Some Lessons for Startups (pdf with notes)
 
Timorexpony
TimorexponyTimorexpony
Timorexpony
 
AWS Start-Up Tour 2007 / HeadCase
AWS Start-Up Tour 2007 / HeadCaseAWS Start-Up Tour 2007 / HeadCase
AWS Start-Up Tour 2007 / HeadCase
 
Zero Waste à Gipuzkoa (Pays basque espagnol)
Zero Waste à Gipuzkoa (Pays basque espagnol)Zero Waste à Gipuzkoa (Pays basque espagnol)
Zero Waste à Gipuzkoa (Pays basque espagnol)
 
TOC Bologna 2012: How to Receive Funding and Support for New Digital and Prin...
TOC Bologna 2012: How to Receive Funding and Support for New Digital and Prin...TOC Bologna 2012: How to Receive Funding and Support for New Digital and Prin...
TOC Bologna 2012: How to Receive Funding and Support for New Digital and Prin...
 
Obtaining, Scrubbing, and Exploring Data at the Command Line by Jeroen Janssens
Obtaining, Scrubbing, and Exploring Data at the Command Line by Jeroen Janssens	Obtaining, Scrubbing, and Exploring Data at the Command Line by Jeroen Janssens
Obtaining, Scrubbing, and Exploring Data at the Command Line by Jeroen Janssens
 
Awakening India - Jago Party
Awakening India - Jago PartyAwakening India - Jago Party
Awakening India - Jago Party
 
Flow Engines - Hack The Way You Work, Not The Time You Have
Flow Engines - Hack The Way You Work, Not The Time You HaveFlow Engines - Hack The Way You Work, Not The Time You Have
Flow Engines - Hack The Way You Work, Not The Time You Have
 
Publishers “in” Libraries: New Agents, New Roles, New Challenges
Publishers “in” Libraries:New Agents, New Roles, New ChallengesPublishers “in” Libraries:New Agents, New Roles, New Challenges
Publishers “in” Libraries: New Agents, New Roles, New Challenges
 
Digital analytics & privacy: it's not the end of the world
Digital analytics & privacy: it's not the end of the worldDigital analytics & privacy: it's not the end of the world
Digital analytics & privacy: it's not the end of the world
 
Seattle Data Geeks: Hadoop and Beyond
Seattle Data Geeks: Hadoop and BeyondSeattle Data Geeks: Hadoop and Beyond
Seattle Data Geeks: Hadoop and Beyond
 

Plus de Paco Nathan

Human in the loop: a design pattern for managing teams working with ML
Human in the loop: a design pattern for managing  teams working with MLHuman in the loop: a design pattern for managing  teams working with ML
Human in the loop: a design pattern for managing teams working with MLPaco Nathan
 
Human-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLHuman-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLPaco Nathan
 
Human-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLHuman-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLPaco Nathan
 
Humans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIHumans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIPaco Nathan
 
Humans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryHumans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryPaco Nathan
 
Computable Content
Computable ContentComputable Content
Computable ContentPaco Nathan
 
Computable Content: Lessons Learned
Computable Content: Lessons LearnedComputable Content: Lessons Learned
Computable Content: Lessons LearnedPaco Nathan
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonPaco Nathan
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsPaco Nathan
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving UpPaco Nathan
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?Paco Nathan
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusPaco Nathan
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataPaco Nathan
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learningPaco Nathan
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in SparkPaco Nathan
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataPaco Nathan
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingPaco Nathan
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MorePaco Nathan
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedPaco Nathan
 

Plus de Paco Nathan (20)

Human in the loop: a design pattern for managing teams working with ML
Human in the loop: a design pattern for managing  teams working with MLHuman in the loop: a design pattern for managing  teams working with ML
Human in the loop: a design pattern for managing teams working with ML
 
Human-in-the-loop: a design pattern for managing teams that leverage ML
Data Day Texas 2014 keynote: "The Big Picture"
