SlideShare une entreprise Scribd logo
1  sur  8
Télécharger pour lire hors ligne
Programming a Parallel Future
Joe Hellerstein
UC Berkeley Computer Science

                                                                                 Inside this Whitepaper
Things change fast in computer science, but odds are good that
they will change especially fast in the next few years. Much of
                                                                                 • Technology Trends: A
this change centers on the shift toward parallel computing. In
                                                                                   divergence in Moore’s
the short term, parallelism will thrive in settings with massive                   Law
datasets and analytics. Longer term, the shift to parallelism will
                                                                                 • Bright Spots: SQL, Big
impact all software. In this note, I’ll outline some key changes
                                                                                   Data and MapReduce
that have recently taken place in the computer industry, show
why existing software systems are ill-equipped to handle this new                • What Is Computer
reality, and point toward some bright spots on the horizon.                        Science Doing About It?
                                                                                 • Things to Watch
Technology Trends: A Divergence in Moore’s Law
Like many changes in computer science, the rapid drive toward
parallel computing is a function of technology trends in hardware.
Most technology watchers are familiar with Moore’s Law, and the
general notion that computing performance per dollar has grown
at an exponential rate — doubling about every 18-24 months
— over the last 50 years. But recently, Moore’s Law has been
pushing different aspects of computing in different directions, and
this divergence is substantially changing the rules for computer
technology.

Moore’s Law and exponential growth may be familiar, but they
are so phenomenal they deserve a pause for consideration. In
common parlance, the term “exponential” is often used as slang
for “very fast.” But when computer scientists talk about something
exponentiating, they tend to use more colorful phrases: blowing up
or exploding. The storage industry has been a continual example
of this, delivering exponentially increasing storage density per
dollar over time. In concrete terms, consumers today might have
a few Terabyte disks in the basement for home movies and laptop
backups — after all, they only cost about $150 at the local office
supply store. But only 10 years ago, when Google was starting
out, that was roughly the storage capacity of the largest databases
on the planet.

If all aspects of computer hardware followed Moore’s Law together,
software developers could simply enjoy increasing performance
1
 Given exponentially decreasing prices per byte, that comment nearly instantly
makes this note obsolete!
without changing their behavior much. This has been true, more or
less, for decades now. The reason things are changing so radically
now is that not all of the computing substrate is exponentiating in
the same way. And when things “blow up” in different directions,
the changes are swift and significant.

To explain, recall that what Moore’s Law actually predicts is the
number of transistors that can be placed on an integrated circuit.
Until recently, these extra transistors had been used to increase CPU
speed. But, in recent years, limits on heat and power dissipation
have prevented computer architects from continuing this trend.
So although Moore’s Law continues in its strict definition, CPUs
are not getting much faster, even though many other computing
trends – RAM sizes, disk density, network bandwidth – continue
at their usual pace.

Rather than providing speed improvements, the extra transistors
from Moore’s Law are being used to pack more CPUs into each chip.
Most computers being sold today have a single chip containing
between 2 and 8 processor “cores.” In the short term, this still
seems to make our existing software go faster: one core can
run operating systems utilities, another can run the currently
active application, another can drive the display, and so on. But
remember, Moore’s Law continues doubling every 18 months. That
means your laptop in nine years will have 128 processors, and a
typical corporate rack of 40-odd computers will have something
in the neighborhood of 20,000 cores. Parallel software should, in
principle, take advantage not only of the hundreds of processors
per machine, but of the entire rack — even an entire datacenter
of machines.

What does this mean for software? Since individual cores will
not get appreciably faster, we need massively parallel programs
that can scale up with the increasing number of cores, or we will
effectively drop off of the exponential growth curve of Moore’s
Law. Unfortunately, the large majority of today’s software is
written for a single processor, and there is no technique known
to “auto-parallelize” these programs. Worse yet, this is not just a
“legacy software” problem. Programmers still find it notoriously
difficult to reason about multiple, simultaneous tasks in the parallel
model, which is much harder for the human brain to grasp than
writing “plain old” algorithms. So this is a problem that threatens
to plague even new, green-field software projects.

This is the radically new environment facing software technologists
over the coming years. Things will have to change substantially
for progress to continue.




                                                 Page 2
Bright Spots: SQL, Big Data and MapReduce
To look for constructive ways forward, it helps to revisit past
successes and failures. Back in the 1980’s and 1990’s, there were
big research efforts targeting what was then termed “massively
parallel” computing, on dozens and sometimes hundreds of
processors. Much of that research targeted computationally
intensive scientific applications, and spawned startup companies
like Thinking Machines, Kendall Square and Sequent. The challenge
at that time was to provide programming languages and models
that could take a complex computational task, and tease apart
separate pieces that could run on different processors. In rough
terms, that burst of energy did not have much short-term success:
very few programmers were able to reason in the programming
models that were developed, and the various startup companies
mostly failed or were quietly absorbed.

Parallel Databases and SQL
But one branch of parallel research from that time did meet with
success: parallel databases. Companies like Teradata – and
research projects like Gamma at Wisconsin and Bubba at MCC
– demonstrated that the Relational data model and SQL lend
themselves quite naturally to what is now called “data parallelism.”
Rather than requiring the programmer to unravel an algorithm into
separate threads to be run on separate cores, parallel databases
let them chop up the input data tables into pieces, and pump
each piece through the same single-machine program on each
processor. This “parallel dataflow” model makes programming a
parallel machine as easy as programming a single machine. It also
works on “shared-nothing” clusters of computers in a datacenter:
the machines involved can communicate via simple streams of
data messages, without a need for an expensive shared RAM or
disk infrastructure.

The parallel database revolution was enormously successful, but
at the time it only touched a modest segment of the computing
industry: a few high-end enterprise data warehousing customers
who had unusually large-scale data entry and data capture
efforts. But this niche has grown radically in recent years, and
the importance of these ideas is being rethought.

Big Data
We’re now entering what I call the “Industrial Revolution of Data,”
where the vast majority of data will be stamped out by machines:
software logs, cameras, microphones, RFID readers, wireless
sensor networks, and so on. These machines generate data a
lot faster than people can, and their production rates will grow
exponentially with Moore’s Law. Storing this data is cheap, and it
can be mined for valuable information.
                                                Page 3
This trend has convinced nearly everyone in computing that the
next leaps forward will revolve around what many computer
scientists have dubbed “Big Data” problems. In previous years
this focus was largely confined to the database field, and SQL was
a popular language to tackle the problem. But a broader array
of developers are now interested, and they want to wrangle their
data in typical programming languages, rather than use SQL,
which they often find unfamiliar and restrictive.

MapReduce
The MapReduce programming model has turned a new page
in the parallelism story. In the late 1990s, the pioneering web
search companies built new parallel software infrastructure to
manage web crawls and indexes. As part of this effort, they were
forced to reinvent a number of ideas from parallel databases —
in part because the commercial database products at the time
did not handle their workload well. Google developed a business
model that depended not on crawling and indexing per se, but
instead on running massive analytics tasks over their data, e.g.,
to ensure that advertisements were relevant to searches so that
ads could be sold for measurable financial benefit. To facilitate
this kind of task, they implemented a programming framework
called MapReduce, based on two simple operations from functional
programming languages.

The Google MapReduce framework is a parallel dataflow system
that works by partitioning data across machines, each of which
runs the same single-node logic. But unlike SQL, MapReduce
largely asks programmers to write traditional code, in languages
like C, Java, Python, and Perl, whereas SQL provides a higher-level
language that is more flexible and optimizable, but less familiar to
many programmers. In addition to its familiar syntax, MapReduce
allows programs to be written to and read from traditional files in
a filesystem, rather than requiring database schema definitions.

Q: SQL or MapReduce?         A: Yes.
Technically speaking, SQL has some advantages over MapReduce,
including easy ways to combine multiple data sets, and the
opportunity for deeper code analysis and just-in-time query
optimizations. In this context, one of the most exciting developments
on the scene is the emergence of platforms that provide both SQL
and MapReduce interfaces within a single runtime environment.
These are especially useful when they support parallel access to
both database tables and filesystem files from either language.
Examples of these frameworks include the commercial Greenplum
system (which provides all of the above), the commercial Aster
Data system (which provides SQL and MapReduce over database


                                                Page 4
tables), and the open-source Hive framework from Facebook
(which provides an SQL-like language over files, layered on
the open-source Hadoop MapReduce engine.) DryadLINQ from
Microsoft Research is another interesting design point in this
space that merges a SQL-like syntax with a MapReduce-style
parallel dataflow.

MapReduce has brought a new wave of excited, bright developers
to the challenge of writing parallel programs against Big Data.
This is critical: a revolution in parallel software development can
only be achieved by a broad base of enthusiastic, productive
programmers. SQL also has legions of experienced programmers,
and integration with a wide variety of software tools and
frameworks. The new combined platforms for data parallelism
expand the options for these programmers, and should foster
synergies between the SQL and MapReduce communities. There
are interesting days ahead for massively parallel programs against
Big Data – which is good news for the future of Moore’s Law and
computing in general.


On the Commercial Front
In the context of the previous discussion, I should note that I
am an advisor at Greenplum, and strongly encouraged their
development of a MapReduce interface for many of the reasons
listed above. I was pleasantly surprised to learn of Aster’s entry
into that space at the same time Greenplum made their first public
announcements. Meanwhile, I am in close touch with friends
and colleagues in the Hadoop world, and the various companies
extending that work.

For the moment, these platforms have various distinguishing
features that differentiate them in the marketplace. But these are
systems in quick evolution, and there will undoubtedly be a lot of
change in this realm in upcoming months and years. In addition
to evolution from the first set of players, we can expect the large
enterprise players like Oracle, IBM and Microsoft to make moves
in this space if it can be shown to be lucrative. The role of the web
search and cloud hosting companies will be interesting to watch
as well, since they have the expertise and infrastructure to host
large MapReduce jobs.

If the startups grow and the large players follow, this will only fuel
the fire for data-centric parallel programming. That in turn will
drive the training of more and more programmers to think about
parallel programs on Big Data. And once the programmers are
on board in large numbers, bigger and more unexpected changes
may emerge.




                                                 Page 5
What Is Computer Science Doing About It?
The computing industry is actively pursuing parallel data
management with both SQL and MapReduce frameworks. And
the pipeline is filling from the academic front as well.

In terms of education, MapReduce is such a compelling entryway
into parallel programming, it is being used to nurture a new
generation of parallel programmers. Every Berkeley CS freshman
now learns MapReduce. Other schools have undertaken similar
programs, and a consortium of companies is eagerly supporting
these efforts with shared computing platforms, curriculum
development, and support of the Hadoop open-source backend.
This is a great example of an academic/industrial collaboration
working toward a common goal.

But it’s the news on the research front that I find most intriguing.
The focus on data-parallelism today surrounds Big Data, and that
is an important application domain. But what about the rest of
the software industry? How will parallel computing transcend Big
Data problems, so that broad classes of software can leverage
multicore and cluster parallelism?

This is a hard problem, but based on research in the last 5-10
years, I am optimistic that academic computer science can play a
leadership role here. In recent years, the data-centric approach
to programming exemplified by SQL and MapReduce has been
gaining footholds well outside of batch-oriented data parallelism.
There has been a groundswell of work on “declarative”, data-
centric languages for a variety of domain-specific tasks, mostly
using extensions of Datalog, a formal language popular with
theoreticians. These new data-centric languages have been
popping up in networking and distributed systems, natural language
processing, compiler analysis, modular robotics, security, machine
learning, and video games, among other applications. And they
are being proposed for tasks that are not embarrassingly parallel.
It turns out that focusing on the data can make a broad class of
programs simpler — much simpler! — to express.

As one example from my research group at Berkeley, our version of
the Chord Distributed Hash Table (DHT) is 47 lines of our Overlog
language; the reference implementation is over 10,000 lines of
C++. (DHTs are a key component of cloud services like Amazon’s
Dynamo). That is the kind of scenario where the quantitative
difference is best captured qualitatively. You can print out our
Chord implementation on one sheet of paper, take it down to the
coffee shop, and figure it out. Doing that with 10,000 lines of
C++ would be a superhuman feat of Programmer-Fu, and a big
waste of paper. Now, Overlog is a fairly academic language, but
we are following it up with a much more complete language called
Lincoln that is targeted at a much wider range of programmers.
Other such languages are under development in various groups.
                                                Page 6
Things to Watch
It is easy to see the need for new parallel programming approaches,
but hard to envision where and how the next generation of big
ideas will play out. Here are some things I am keeping an eye
on:

   •   MapReduce Extensions and Integration. I am not a
       big fan of the specifics of Google’s MapReduce, nor of the
       cloning of that model in the open-source Hadoop framework.
       Its biggest drawback is the need to stage data to disk
       over and over, which prevents it from providing real-time
       feedback, and dooms it to poor overall performance. (It is
       not only slower than it should be, it is an enormous energy
       hog, which should even bother resource-rich companies
       like Google and Yahoo). But I do not view MapReduce as a
       fixed target. Most of the companies I talk to using Hadoop
       have modified it significantly, and there are a number of
       languages layered on top including Yahoo!’s Pig, IBM’s JAQL,
       and Facebook’s Hive. New MapReduce implementations like
       those of Greenplum and Aster Data will presumably remove
       some of these design roadblocks, and open up interesting
       integrations with database technology. MapReduce is a
       movement, not an artifact, and the landscape there is likely
       to change substantially in the next months and years.

   •   The Next Programming Language(s). The need for
       Parallelism opens the door for a new programming language.
       MapReduce is almost certainly too simple, and SQL too
       cumbersome to be a broadly useful language. If the next
       popular language has to provide parallelism, what will it
       look like? I am placing a bet on data-centric declarative
       languages like our new effort, Lincoln. But these have yet
       to have their day in the sun, and there are certainly plenty
       of fans of other paradigms. Given that languages succeed
       for both technical and non-technical reasons, how will this
       space get staked out?

   •   Cloud Computing.         One of the major challenges in
       exploiting parallelism is to make improvements in
       legacy software, including operating systems, desktop
       applications, games etc. The emergence of software in
       the cloud is intriguing in its timing, because it provides an
       opportunity to rewrite everything from scratch. Shouldn’t
       this new platform be programmed for parallelism, both at
       its core and in its applications? That’s not really the case
       today. Are the programs being written today for early cloud
       platforms doomed to irrelevance because they predate the
       next parallel language? Is this an opportunity to leapfrog
       the early movers in the Cloud Computing space?


                                                Page 7
•   Data and Statistics. Data volumes will continue to grow
       as never before; everybody will want to turn information
       into value. As part of that evolution, statisticians, machine
       learning experts, and other data analysts will play an
       increasingly important role in effective organizations, and
       will need to be skilled at handling very big data sets. The
       practice of “Data Mining” arose with roughly this premise,
       but the research has grown increasingly sophisticated, and I
       expect this sophistication to become much more widespread
       in industry over the coming years. Practitioners who can
       master this combination of skills will be highly valued, as
       will the tools that they embrace for their work.


About Joe Hellerstein
Joseph M. Hellerstein, an Advisor to Greenplum, is a Professor
of Computer Science at the University of California, Berkeley,
whose research focuses on data management and networking.
His work has been recognized via awards including an Alfred P.
Sloan Research Fellowship, MIT Technology Review’s inaugural
TR100 list, and two ACM-SIGMOD “Test of Time” awards. Key
ideas from his research have been incorporated into commercial
and open-source database software released by IBM, Oracle,
and PostgreSQL. He has also held industrial posts including
Director of Intel Research Berkeley, and Chief Scientist of Cohera
Corporation.




                                                                       Contact
                                                                       Greenplum
                                                                       1900 S. Norfolk Drive,
                                                                       Suite 224
                                                                       San Mateo, CA 94403
                                                                       United States
                                                                       +1 650 286 8012 (ph)
                                                                       +1 650 286 8010 (fax)

                                                Page 8

Contenu connexe

Tendances

Aginity Big Data Research Lab V3
Aginity Big Data Research Lab V3Aginity Big Data Research Lab V3
Aginity Big Data Research Lab V3
mcacicio
 
Analytics as a Service in SL
Analytics as a Service in SLAnalytics as a Service in SL
Analytics as a Service in SL
SkylabReddy Vanga
 
Technology Review: Innovation Strategies for Business Impact Feb_2011
Technology Review: Innovation Strategies for Business Impact Feb_2011Technology Review: Innovation Strategies for Business Impact Feb_2011
Technology Review: Innovation Strategies for Business Impact Feb_2011
Kevin Carter
 

Tendances (19)

Cloud Computing: Big Data Technology
Cloud Computing: Big Data TechnologyCloud Computing: Big Data Technology
Cloud Computing: Big Data Technology
 
NJVC Implementation of Cloud Computing Solutions in Federal Agencies
NJVC Implementation of Cloud Computing Solutions in Federal AgenciesNJVC Implementation of Cloud Computing Solutions in Federal Agencies
NJVC Implementation of Cloud Computing Solutions in Federal Agencies
 
Leveraging the-digital-future
Leveraging the-digital-futureLeveraging the-digital-future
Leveraging the-digital-future
 
Aginity Big Data Research Lab V3
Aginity Big Data Research Lab V3Aginity Big Data Research Lab V3
Aginity Big Data Research Lab V3
 
Analytics as a Service in SL
Analytics as a Service in SLAnalytics as a Service in SL
Analytics as a Service in SL
 
IBM Big Data References
IBM Big Data ReferencesIBM Big Data References
IBM Big Data References
 
Big Data for Defense and Security
Big Data for Defense and SecurityBig Data for Defense and Security
Big Data for Defense and Security
 
W-JAX Keynote - Big Data and Corporate Evolution
W-JAX Keynote - Big Data and Corporate EvolutionW-JAX Keynote - Big Data and Corporate Evolution
W-JAX Keynote - Big Data and Corporate Evolution
 
Big Data & Future - Big Data, Analytics, Cloud, SDN, Internet of things
Big Data & Future - Big Data, Analytics, Cloud, SDN, Internet of thingsBig Data & Future - Big Data, Analytics, Cloud, SDN, Internet of things
Big Data & Future - Big Data, Analytics, Cloud, SDN, Internet of things
 
The promise and challenge of Big Data
The promise and challenge of Big DataThe promise and challenge of Big Data
The promise and challenge of Big Data
 
Technology Review: Innovation Strategies for Business Impact Feb_2011
Technology Review: Innovation Strategies for Business Impact Feb_2011Technology Review: Innovation Strategies for Business Impact Feb_2011
Technology Review: Innovation Strategies for Business Impact Feb_2011
 
Big Data
Big DataBig Data
Big Data
 
Big Data and Computer Science Education
Big Data and Computer Science EducationBig Data and Computer Science Education
Big Data and Computer Science Education
 
Top 10 renowned big data companies
Top 10 renowned big data companiesTop 10 renowned big data companies
Top 10 renowned big data companies
 
Big data upload
Big data uploadBig data upload
Big data upload
 
Data Science Introduction - Data Science: What Art Thou?
Data Science Introduction - Data Science: What Art Thou?Data Science Introduction - Data Science: What Art Thou?
Data Science Introduction - Data Science: What Art Thou?
 
Big Data : Risks and Opportunities
Big Data : Risks and OpportunitiesBig Data : Risks and Opportunities
Big Data : Risks and Opportunities
 
Big data overview external
Big data overview externalBig data overview external
Big data overview external
 
Smart Data and real-world semantic web applications (2004)
Smart Data and real-world semantic web applications (2004)Smart Data and real-world semantic web applications (2004)
Smart Data and real-world semantic web applications (2004)
 

En vedette

Portifólio Apresentação PDF
Portifólio Apresentação PDFPortifólio Apresentação PDF
Portifólio Apresentação PDF
Diego Braz
 
Gimasio fitness sport-life-mujer (1)
Gimasio fitness sport-life-mujer (1)Gimasio fitness sport-life-mujer (1)
Gimasio fitness sport-life-mujer (1)
Michel Jaral
 
Webマイニングと情報論的学習理論
Webマイニングと情報論的学習理論Webマイニングと情報論的学習理論
Webマイニングと情報論的学習理論
Hiroshi Ono
 
Viver não doi
Viver não doiViver não doi
Viver não doi
valdiroliv
 
Hyper Estraierの設計と実装
Hyper Estraierの設計と実装Hyper Estraierの設計と実装
Hyper Estraierの設計と実装
Hiroshi Ono
 
O sentido da vida
O sentido da vidaO sentido da vida
O sentido da vida
valdiroliv
 
Page 1 Splog Detection Using Self-similarity Analysis on Blog Temporal Dynamics
Page 1 Splog Detection Using Self-similarity Analysis on Blog Temporal Dynamics Page 1 Splog Detection Using Self-similarity Analysis on Blog Temporal Dynamics
Page 1 Splog Detection Using Self-similarity Analysis on Blog Temporal Dynamics
Hiroshi Ono
 

En vedette (8)

Portifólio Apresentação PDF
Portifólio Apresentação PDFPortifólio Apresentação PDF
Portifólio Apresentação PDF
 
Gimasio fitness sport-life-mujer (1)
Gimasio fitness sport-life-mujer (1)Gimasio fitness sport-life-mujer (1)
Gimasio fitness sport-life-mujer (1)
 
Webマイニングと情報論的学習理論
Webマイニングと情報論的学習理論Webマイニングと情報論的学習理論
Webマイニングと情報論的学習理論
 
Viver não doi
Viver não doiViver não doi
Viver não doi
 
Muhammed Riswan-QC rev1
Muhammed Riswan-QC rev1Muhammed Riswan-QC rev1
Muhammed Riswan-QC rev1
 
Hyper Estraierの設計と実装
Hyper Estraierの設計と実装Hyper Estraierの設計と実装
Hyper Estraierの設計と実装
 
O sentido da vida
O sentido da vidaO sentido da vida
O sentido da vida
 
Page 1 Splog Detection Using Self-similarity Analysis on Blog Temporal Dynamics
Page 1 Splog Detection Using Self-similarity Analysis on Blog Temporal Dynamics Page 1 Splog Detection Using Self-similarity Analysis on Blog Temporal Dynamics
Page 1 Splog Detection Using Self-similarity Analysis on Blog Temporal Dynamics
 

Similaire à Parallel_Computing_future

locotalk-whitepaper-2016
locotalk-whitepaper-2016locotalk-whitepaper-2016
locotalk-whitepaper-2016
Anthony Wijnen
 
The Growth Of Data Centers
The Growth Of Data CentersThe Growth Of Data Centers
The Growth Of Data Centers
Gina Buck
 
Aginity Big Data Research Lab
Aginity Big Data Research LabAginity Big Data Research Lab
Aginity Big Data Research Lab
asifahmed
 
Aginity Big Data Research Lab
Aginity Big Data Research LabAginity Big Data Research Lab
Aginity Big Data Research Lab
dkuhn
 
The Trouble With Enterprise SoftwareF A L L 2 0 0 7 .docx
The Trouble With Enterprise SoftwareF A L L  2 0 0 7    .docxThe Trouble With Enterprise SoftwareF A L L  2 0 0 7    .docx
The Trouble With Enterprise SoftwareF A L L 2 0 0 7 .docx
ssusera34210
 
Aginity "Big Data" Research Lab
Aginity "Big Data" Research LabAginity "Big Data" Research Lab
Aginity "Big Data" Research Lab
kevinflorian
 
Parallel disrtibuted computing_Lecture_Number four. 1.pptx
Parallel disrtibuted computing_Lecture_Number four. 1.pptxParallel disrtibuted computing_Lecture_Number four. 1.pptx
Parallel disrtibuted computing_Lecture_Number four. 1.pptx
RidaAmir10
 

Similaire à Parallel_Computing_future (20)

locotalk-whitepaper-2016
locotalk-whitepaper-2016locotalk-whitepaper-2016
locotalk-whitepaper-2016
 
The Growth Of Data Centers
The Growth Of Data CentersThe Growth Of Data Centers
The Growth Of Data Centers
 
Priorities Shift In IC Design
Priorities Shift In IC DesignPriorities Shift In IC Design
Priorities Shift In IC Design
 
Shamit khemka list outs 6 technology trends for 2015
Shamit khemka list outs 6 technology trends for 2015Shamit khemka list outs 6 technology trends for 2015
Shamit khemka list outs 6 technology trends for 2015
 
The Evolving Role of the Data Engineer - Whitepaper | Qubole
The Evolving Role of the Data Engineer - Whitepaper | QuboleThe Evolving Role of the Data Engineer - Whitepaper | Qubole
The Evolving Role of the Data Engineer - Whitepaper | Qubole
 
Aginity Big Data Research Lab
Aginity Big Data Research LabAginity Big Data Research Lab
Aginity Big Data Research Lab
 
Aginity Big Data Research Lab
Aginity Big Data Research LabAginity Big Data Research Lab
Aginity Big Data Research Lab
 
Idc analyst report a new breed of servers for digital transformation
Idc analyst report a new breed of servers for digital transformationIdc analyst report a new breed of servers for digital transformation
Idc analyst report a new breed of servers for digital transformation
 
The Trouble With Enterprise SoftwareF A L L 2 0 0 7 .docx
The Trouble With Enterprise SoftwareF A L L  2 0 0 7    .docxThe Trouble With Enterprise SoftwareF A L L  2 0 0 7    .docx
The Trouble With Enterprise SoftwareF A L L 2 0 0 7 .docx
 
Technology Disruption
Technology DisruptionTechnology Disruption
Technology Disruption
 
Learning from Machine Intelligence: The Next Wave of Digital Transformation
Learning from Machine Intelligence: The Next Wave of Digital TransformationLearning from Machine Intelligence: The Next Wave of Digital Transformation
Learning from Machine Intelligence: The Next Wave of Digital Transformation
 
Aginity "Big Data" Research Lab
Aginity "Big Data" Research LabAginity "Big Data" Research Lab
Aginity "Big Data" Research Lab
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely heading
 
Data Structures in the Multicore Age : Notes
Data Structures in the Multicore Age : NotesData Structures in the Multicore Age : Notes
Data Structures in the Multicore Age : Notes
 
Parallel disrtibuted computing_Lecture_Number four. 1.pptx
Parallel disrtibuted computing_Lecture_Number four. 1.pptxParallel disrtibuted computing_Lecture_Number four. 1.pptx
Parallel disrtibuted computing_Lecture_Number four. 1.pptx
 
Is Multicore Hardware For General-Purpose Parallel Processing Broken? : Notes
Is Multicore Hardware For General-Purpose Parallel Processing Broken? : NotesIs Multicore Hardware For General-Purpose Parallel Processing Broken? : Notes
Is Multicore Hardware For General-Purpose Parallel Processing Broken? : Notes
 
Enterprise Software at Web-Scale
Enterprise Software at Web-ScaleEnterprise Software at Web-Scale
Enterprise Software at Web-Scale
 
Future of Data Strategy (ASEAN)
Future of Data Strategy (ASEAN)Future of Data Strategy (ASEAN)
Future of Data Strategy (ASEAN)
 
Big Data using NoSQL Technologies
Big Data using NoSQL TechnologiesBig Data using NoSQL Technologies
Big Data using NoSQL Technologies
 
Fog Computing: A Platform for Internet of Things and Analytics
Fog Computing: A Platform for Internet of Things and AnalyticsFog Computing: A Platform for Internet of Things and Analytics
Fog Computing: A Platform for Internet of Things and Analytics
 

Plus de Hiroshi Ono

Voltdb - wikipedia
Voltdb - wikipediaVoltdb - wikipedia
Voltdb - wikipedia
Hiroshi Ono
 
Gamecenter概説
Gamecenter概説Gamecenter概説
Gamecenter概説
Hiroshi Ono
 
EventDrivenArchitecture
EventDrivenArchitectureEventDrivenArchitecture
EventDrivenArchitecture
Hiroshi Ono
 
program_draft3.pdf
program_draft3.pdfprogram_draft3.pdf
program_draft3.pdf
Hiroshi Ono
 
nodalities_issue7.pdf
nodalities_issue7.pdfnodalities_issue7.pdf
nodalities_issue7.pdf
Hiroshi Ono
 
genpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfgenpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdf
Hiroshi Ono
 
kademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfkademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdf
Hiroshi Ono
 
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdfpragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
Hiroshi Ono
 
downey08semaphores.pdf
downey08semaphores.pdfdowney08semaphores.pdf
downey08semaphores.pdf
Hiroshi Ono
 
BOF1-Scala02.pdf
BOF1-Scala02.pdfBOF1-Scala02.pdf
BOF1-Scala02.pdf
Hiroshi Ono
 
TwitterOct2008.pdf
TwitterOct2008.pdfTwitterOct2008.pdf
TwitterOct2008.pdf
Hiroshi Ono
 
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfstateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
Hiroshi Ono
 
SACSIS2009_TCP.pdf
SACSIS2009_TCP.pdfSACSIS2009_TCP.pdf
SACSIS2009_TCP.pdf
Hiroshi Ono
 
scalaliftoff2009.pdf
scalaliftoff2009.pdfscalaliftoff2009.pdf
scalaliftoff2009.pdf
Hiroshi Ono
 
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfstateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
Hiroshi Ono
 
program_draft3.pdf
program_draft3.pdfprogram_draft3.pdf
program_draft3.pdf
Hiroshi Ono
 
nodalities_issue7.pdf
nodalities_issue7.pdfnodalities_issue7.pdf
nodalities_issue7.pdf
Hiroshi Ono
 
genpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfgenpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdf
Hiroshi Ono
 
kademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfkademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdf
Hiroshi Ono
 

Plus de Hiroshi Ono (20)

Voltdb - wikipedia
Voltdb - wikipediaVoltdb - wikipedia
Voltdb - wikipedia
 
Gamecenter概説
Gamecenter概説Gamecenter概説
Gamecenter概説
 
EventDrivenArchitecture
EventDrivenArchitectureEventDrivenArchitecture
EventDrivenArchitecture
 
program_draft3.pdf
program_draft3.pdfprogram_draft3.pdf
program_draft3.pdf
 
nodalities_issue7.pdf
nodalities_issue7.pdfnodalities_issue7.pdf
nodalities_issue7.pdf
 
genpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfgenpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdf
 
kademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfkademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdf
 
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdfpragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
 
downey08semaphores.pdf
downey08semaphores.pdfdowney08semaphores.pdf
downey08semaphores.pdf
 
BOF1-Scala02.pdf
BOF1-Scala02.pdfBOF1-Scala02.pdf
BOF1-Scala02.pdf
 
TwitterOct2008.pdf
TwitterOct2008.pdfTwitterOct2008.pdf
TwitterOct2008.pdf
 
camel-scala.pdf
camel-scala.pdfcamel-scala.pdf
camel-scala.pdf
 
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfstateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
 
SACSIS2009_TCP.pdf
SACSIS2009_TCP.pdfSACSIS2009_TCP.pdf
SACSIS2009_TCP.pdf
 
scalaliftoff2009.pdf
scalaliftoff2009.pdfscalaliftoff2009.pdf
scalaliftoff2009.pdf
 
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfstateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
 
program_draft3.pdf
program_draft3.pdfprogram_draft3.pdf
program_draft3.pdf
 
nodalities_issue7.pdf
nodalities_issue7.pdfnodalities_issue7.pdf
nodalities_issue7.pdf
 
genpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfgenpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdf
 
kademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfkademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdf
 

Dernier

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Dernier (20)

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 

Parallel_Computing_future

  • 1. Programming a Parallel Future Joe Hellerstein UC Berkeley Computer Science Inside this Whitepaper Things change fast in computer science, but odds are good that they will change especially fast in the next few years. Much of • Technology Trends: A this change centers on the shift toward parallel computing. In divergence in Moore’s the short term, parallelism will thrive in settings with massive Law datasets and analytics. Longer term, the shift to parallelism will • Bright Spots: SQL, Big impact all software. In this note, I’ll outline some key changes Data and MapReduce that have recently taken place in the computer industry, show why existing software systems are ill-equipped to handle this new • What Is Computer reality, and point toward some bright spots on the horizon. Science Doing About It? • Things to Watch Technology Trends: A Divergence in Moore’s Law Like many changes in computer science, the rapid drive toward parallel computing is a function of technology trends in hardware. Most technology watchers are familiar with Moore’s Law, and the general notion that computing performance per dollar has grown at an exponential rate — doubling about every 18-24 months — over the last 50 years. But recently, Moore’s Law has been pushing different aspects of computing in different directions, and this divergence is substantially changing the rules for computer technology. Moore’s Law and exponential growth may be familiar, but they are so phenomenal they deserve a pause for consideration. In common parlance, the term “exponential” is often used as slang for “very fast.” But when computer scientists talk about something exponentiating, they tend to use more colorful phrases: blowing up or exploding. The storage industry has been a continual example of this, delivering exponentially increasing storage density per dollar over time. In concrete terms, consumers today might have a few Terabyte disks in the basement for home movies and laptop backups — after all, they only cost about $150 at the local office supply store. But only 10 years ago, when Google was starting out, that was roughly the storage capacity of the largest databases on the planet. If all aspects of computer hardware followed Moore’s Law together, software developers could simply enjoy increasing performance 1 Given exponentially decreasing prices per byte, that comment nearly instantly makes this note obsolete!
  • 2. without changing their behavior much. This has been true, more or less, for decades now. The reason things are changing so radically now is that not all of the computing substrate is exponentiating in the same way. And when things “blow up” in different directions, the changes are swift and significant. To explain, recall that what Moore’s Law actually predicts is the number of transistors that can be placed on an integrated circuit. Until recently, these extra transistors had been used to increase CPU speed. But, in recent years, limits on heat and power dissipation have prevented computer architects from continuing this trend. So although Moore’s Law continues in its strict definition, CPUs are not getting much faster, even though many other computing trends – RAM sizes, disk density, network bandwidth – continue at their usual pace. Rather than providing speed improvements, the extra transistors from Moore’s Law are being used to pack more CPUs into each chip. Most computers being sold today have a single chip containing between 2 and 8 processor “cores.” In the short term, this still seems to make our existing software go faster: one core can run operating systems utilities, another can run the currently active application, another can drive the display, and so on. But remember, Moore’s Law continues doubling every 18 months. That means your laptop in nine years will have 128 processors, and a typical corporate rack of 40-odd computers will have something in the neighborhood of 20,000 cores. Parallel software should, in principle, take advantage not only of the hundreds of processors per machine, but of the entire rack — even an entire datacenter of machines. What does this mean for software? Since individual cores will not get appreciably faster, we need massively parallel programs that can scale up with the increasing number of cores, or we will effectively drop off of the exponential growth curve of Moore’s Law. Unfortunately, the large majority of today’s software is written for a single processor, and there is no technique known to “auto-parallelize” these programs. Worse yet, this is not just a “legacy software” problem. Programmers still find it notoriously difficult to reason about multiple, simultaneous tasks in the parallel model, which is much harder for the human brain to grasp than writing “plain old” algorithms. So this is a problem that threatens to plague even new, green-field software projects. This is the radically new environment facing software technologists over the coming years. Things will have to change substantially for progress to continue. Page 2
  • 3. Bright Spots: SQL, Big Data and MapReduce To look for constructive ways forward, it helps to revisit past successes and failures. Back in the 1980’s and 1990’s, there were big research efforts targeting what was then termed “massively parallel” computing, on dozens and sometimes hundreds of processors. Much of that research targeted computationally intensive scientific applications, and spawned startup companies like Thinking Machines, Kendall Square and Sequent. The challenge at that time was to provide programming languages and models that could take a complex computational task, and tease apart separate pieces that could run on different processors. In rough terms, that burst of energy did not have much short-term success: very few programmers were able to reason in the programming models that were developed, and the various startup companies mostly failed or were quietly absorbed. Parallel Databases and SQL But one branch of parallel research from that time did meet with success: parallel databases. Companies like Teradata – and research projects like Gamma at Wisconsin and Bubba at MCC – demonstrated that the Relational data model and SQL lend themselves quite naturally to what is now called “data parallelism.” Rather than requiring the programmer to unravel an algorithm into separate threads to be run on separate cores, parallel databases let them chop up the input data tables into pieces, and pump each piece through the same single-machine program on each processor. This “parallel dataflow” model makes programming a parallel machine as easy as programming a single machine. It also works on “shared-nothing” clusters of computers in a datacenter: the machines involved can communicate via simple streams of data messages, without a need for an expensive shared RAM or disk infrastructure. The parallel database revolution was enormously successful, but at the time it only touched a modest segment of the computing industry: a few high-end enterprise data warehousing customers who had unusually large-scale data entry and data capture efforts. But this niche has grown radically in recent years, and the importance of these ideas is being rethought. Big Data We’re now entering what I call the “Industrial Revolution of Data,” where the vast majority of data will be stamped out by machines: software logs, cameras, microphones, RFID readers, wireless sensor networks, and so on. These machines generate data a lot faster than people can, and their production rates will grow exponentially with Moore’s Law. Storing this data is cheap, and it can be mined for valuable information. Page 3
  • 4. This trend has convinced nearly everyone in computing that the next leaps forward will revolve around what many computer scientists have dubbed “Big Data” problems. In previous years this focus was largely confined to the database field, and SQL was a popular language to tackle the problem. But a broader array of developers are now interested, and they want to wrangle their data in typical programming languages, rather than use SQL, which they often find unfamiliar and restrictive. MapReduce The MapReduce programming model has turned a new page in the parallelism story. In the late 1990s, the pioneering web search companies built new parallel software infrastructure to manage web crawls and indexes. As part of this effort, they were forced to reinvent a number of ideas from parallel databases — in part because the commercial database products at the time did not handle their workload well. Google developed a business model that depended not on crawling and indexing per se, but instead on running massive analytics tasks over their data, e.g., to ensure that advertisements were relevant to searches so that ads could be sold for measurable financial benefit. To facilitate this kind of task, they implemented a programming framework called MapReduce, based on two simple operations from functional programming languages. The Google MapReduce framework is a parallel dataflow system that works by partitioning data across machines, each of which runs the same single-node logic. But unlike SQL, MapReduce largely asks programmers to write traditional code, in languages like C, Java, Python, and Perl, whereas SQL provides a higher-level language that is more flexible and optimizable, but less familiar to many programmers. In addition to its familiar syntax, MapReduce allows programs to be written to and read from traditional files in a filesystem, rather than requiring database schema definitions. Q: SQL or MapReduce? A: Yes. Technically speaking, SQL has some advantages over MapReduce, including easy ways to combine multiple data sets, and the opportunity for deeper code analysis and just-in-time query optimizations. In this context, one of the most exciting developments on the scene is the emergence of platforms that provide both SQL and MapReduce interfaces within a single runtime environment. These are especially useful when they support parallel access to both database tables and filesystem files from either language. Examples of these frameworks include the commercial Greenplum system (which provides all of the above), the commercial Aster Data system (which provides SQL and MapReduce over database Page 4
  • 5. tables), and the open-source Hive framework from Facebook (which provides an SQL-like language over files, layered on the open-source Hadoop MapReduce engine.) DryadLINQ from Microsoft Research is another interesting design point in this space that merges a SQL-like syntax with a MapReduce-style parallel dataflow. MapReduce has brought a new wave of excited, bright developers to the challenge of writing parallel programs against Big Data. This is critical: a revolution in parallel software development can only be achieved by a broad base of enthusiastic, productive programmers. SQL also has legions of experienced programmers, and integration with a wide variety of software tools and frameworks. The new combined platforms for data parallelism expand the options for these programmers, and should foster synergies between the SQL and MapReduce communities. There are interesting days ahead for massively parallel programs against Big Data – which is good news for the future of Moore’s Law and computing in general. On the Commercial Front In the context of the previous discussion, I should note that I am an advisor at Greenplum, and strongly encouraged their development of a MapReduce interface for many of the reasons listed above. I was pleasantly surprised to learn of Aster’s entry into that space at the same time Greenplum made their first public announcements. Meanwhile, I am in close touch with friends and colleagues in the Hadoop world, and the various companies extending that work. For the moment, these platforms have various distinguishing features that differentiate them in the marketplace. But these are systems in quick evolution, and there will undoubtedly be a lot of change in this realm in upcoming months and years. In addition to evolution from the first set of players, we can expect the large enterprise players like Oracle, IBM and Microsoft to make moves in this space if it can be shown to be lucrative. The role of the web search and cloud hosting companies will be interesting to watch as well, since they have the expertise and infrastructure to host large MapReduce jobs. If the startups grow and the large players follow, this will only fuel the fire for data-centric parallel programming. That in turn will drive the training of more and more programmers to think about parallel programs on Big Data. And once the programmers are on board in large numbers, bigger and more unexpected changes may emerge. Page 5
  • 6. What Is Computer Science Doing About It? The computing industry is actively pursuing parallel data management with both SQL and MapReduce frameworks. And the pipeline is filling from the academic front as well. In terms of education, MapReduce is such a compelling entryway into parallel programming, it is being used to nurture a new generation of parallel programmers. Every Berkeley CS freshman now learns MapReduce. Other schools have undertaken similar programs, and a consortium of companies is eagerly supporting these efforts with shared computing platforms, curriculum development, and support of the Hadoop open-source backend. This is a great example of an academic/industrial collaboration working toward a common goal. But it’s the news on the research front that I find most intriguing. The focus on data-parallelism today surrounds Big Data, and that is an important application domain. But what about the rest of the software industry? How will parallel computing transcend Big Data problems, so that broad classes of software can leverage multicore and cluster parallelism? This is a hard problem, but based on research in the last 5-10 years, I am optimistic that academic computer science can play a leadership role here. In recent years, the data-centric approach to programming exemplified by SQL and MapReduce has been gaining footholds well outside of batch-oriented data parallelism. There has been a groundswell of work on “declarative”, data- centric languages for a variety of domain-specific tasks, mostly using extensions of Datalog, a formal language popular with theoreticians. These new data-centric languages have been popping up in networking and distributed systems, natural language processing, compiler analysis, modular robotics, security, machine learning, and video games, among other applications. And they are being proposed for tasks that are not embarrassingly parallel. It turns out that focusing on the data can make a broad class of programs simpler — much simpler! — to express. As one example from my research group at Berkeley, our version of the Chord Distributed Hash Table (DHT) is 47 lines of our Overlog language; the reference implementation is over 10,000 lines of C++. (DHTs are a key component of cloud services like Amazon’s Dynamo). That is the kind of scenario where the quantitative difference is best captured qualitatively. You can print out our Chord implementation on one sheet of paper, take it down to the coffee shop, and figure it out. Doing that with 10,000 lines of C++ would be a superhuman feat of Programmer-Fu, and a big waste of paper. Now, Overlog is a fairly academic language, but we are following it up with a much more complete language called Lincoln that is targeted at a much wider range of programmers. Other such languages are under development in various groups. Page 6
  • 7. Things to Watch It is easy to see the need for new parallel programming approaches, but hard to envision where and how the next generation of big ideas will play out. Here are some things I am keeping an eye on: • MapReduce Extensions and Integration. I am not a big fan of the specifics of Google’s MapReduce, nor of the cloning of that model in the open-source Hadoop framework. Its biggest drawback is the need to stage data to disk over and over, which prevents it from providing real-time feedback, and dooms it to poor overall performance. (It is not only slower than it should be, it is an enormous energy hog, which should even bother resource-rich companies like Google and Yahoo). But I do not view MapReduce as a fixed target. Most of the companies I talk to using Hadoop have modified it significantly, and there are a number of languages layered on top including Yahoo!’s Pig, IBM’s JAQL, and Facebook’s Hive. New MapReduce implementations like those of Greenplum and Aster Data will presumably remove some of these design roadblocks, and open up interesting integrations with database technology. MapReduce is a movement, not an artifact, and the landscape there is likely to change substantially in the next months and years. • The Next Programming Language(s). The need for Parallelism opens the door for a new programming language. MapReduce is almost certainly too simple, and SQL too cumbersome to be a broadly useful language. If the next popular language has to provide parallelism, what will it look like? I am placing a bet on data-centric declarative languages like our new effort, Lincoln. But these have yet to have their day in the sun, and there are certainly plenty of fans of other paradigms. Given that languages succeed for both technical and non-technical reasons, how will this space get staked out? • Cloud Computing. One of the major challenges in exploiting parallelism is to make improvements in legacy software, including operating systems, desktop applications, games etc. The emergence of software in the cloud is intriguing in its timing, because it provides an opportunity to rewrite everything from scratch. Shouldn’t this new platform be programmed for parallelism, both at its core and in its applications? That’s not really the case today. Are the programs being written today for early cloud platforms doomed to irrelevance because they predate the next parallel language? Is this an opportunity to leapfrog the early movers in the Cloud Computing space? Page 7
  • 8. Data and Statistics. Data volumes will continue to grow as never before; everybody will want to turn information into value. As part of that evolution, statisticians, machine learning experts, and other data analysts will play an increasingly important role in effective organizations, and will need to be skilled at handling very big data sets. The practice of “Data Mining” arose with roughly this premise, but the research has grown increasingly sophisticated, and I expect this sophistication to become much more widespread in industry over the coming years. Practitioners who can master this combination of skills will be highly valued, as will the tools that they embrace for their work. About Joe Hellerstein Joseph M. Hellerstein, an Advisor to Greenplum, is a Professor of Computer Science at the University of California, Berkeley, whose research focuses on data management and networking. His work has been recognized via awards including an Alfred P. Sloan Research Fellowship, MIT Technology Review’s inaugural TR100 list, and two ACM-SIGMOD “Test of Time” awards. Key ideas from his research have been incorporated into commercial and open-source database software released by IBM, Oracle, and PostgreSQL. He has also held industrial posts including Director of Intel Research Berkeley, and Chief Scientist of Cohera Corporation. Contact Greenplum 1900 S. Norfolk Drive, Suite 224 San Mateo, CA 94403 United States +1 650 286 8012 (ph) +1 650 286 8010 (fax) Page 8