OpenStack Trove Day (19 Aug 2014, Cambridge MA) - Sahara
1. Hadoop on OpenStack with Sahara
August 19, 2014
Matthew Farrellee (@spinningmatt)
Emerging Technology and Strategy
CTO Office, Red Hat
2. Hadoop is
8/19/14 tesora.com
• Narrow definition - Apache Hadoop - a specific Apache project originally from Yahoo!, based on papers from Google
• Broad definition - the ecosystem of projects, primarily within Apache, that integrate in some form with Apache Hadoop
• I’m going to use the broad definition
3. Hadoop often looks like
• Multiple, loosely coupled projects focused on data storage and processing
• Includes: workload, resource, and system management; data ingest & storage; compute frameworks and domain languages
4. Hadoop is often used to
• Store data
• ETL data
• Analyze data
• Structured and unstructured
5. Data today
• Structured or unstructured
• >2.5x more unstructured
• Rate of growth for unstructured is 2x structured
6. Data problems
• It’s not just that processing data is expensive
    • In hardware costs
    • In computational time
    • Most of all, in human time
• Data creation outpaces storage capacity
8. The analysis itself is hard
• Data sources are hard to find, or create
• Data is always dirty and needs cleaning
• Clean data is always approximate
• Figuring out the right question to ask takes iterations
10. Sahara’s history
• Started at the Portland summit (April 2013)
• Joint effort by Red Hat, Mirantis and Hortonworks
• Originally called Savanna
• Incubated in Icehouse (released April 2014)
• Supported Apache and Hortonworks Hadoop
• Integrated for Juno (release October 2014)
11. Sahara’s use cases
• Cluster
    • Start / stop / scale
    • Different shapes and sizes
    • Repeatable (template mechanism)
• Workload (Elastic Data Processing, a.k.a. EDP)
    • Job = Analysis code + Data URLs
    • Queued and run across clusters (ephemeral or persistent)
15. Sahara’s basic structures
• Plugins - controller for specific software collections
• Images - in Glance, w/ special plugin-specific tags
• Templates
    • Two kinds, node group and cluster
    • Combine node groups to form a cluster template
• Clusters - the live clusters
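
One way to picture how node group templates combine into a cluster template, and how one template yields repeatable clusters, is the sketch below. It is plain Python and does not use Sahara's API: every class, function and field name is hypothetical, and the plugin, flavor and process names are illustrative values only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class NodeGroupTemplate:
    """Hypothetical stand-in for a node group template: one kind of node."""
    name: str
    flavor: str                   # instance size (illustrative value)
    processes: Tuple[str, ...]    # services that run on each node in the group


@dataclass(frozen=True)
class ClusterTemplate:
    """Hypothetical stand-in for a cluster template: node groups plus counts."""
    name: str
    plugin: str                   # plugin that controls the software collection
    node_groups: Tuple[Tuple[NodeGroupTemplate, int], ...]

    def total_instances(self) -> int:
        # Sum the instance count requested for every node group.
        return sum(count for _, count in self.node_groups)


def instance_names(template: ClusterTemplate, cluster_name: str) -> List[str]:
    """List the instance names a live cluster built from the template would have."""
    return [
        f"{cluster_name}-{group.name}-{index}"
        for group, count in template.node_groups
        for index in range(count)
    ]


# Two node group templates combined into one cluster template.
master = NodeGroupTemplate("master", "m1.large", ("namenode", "resourcemanager"))
worker = NodeGroupTemplate("worker", "m1.medium", ("datanode", "nodemanager"))
small = ClusterTemplate("small-hadoop", "vanilla", ((master, 1), (worker, 3)))

# The same template can be launched repeatedly under different cluster names.
print(small.total_instances())
print(instance_names(small, "demo"))
```

Changing a count in the cluster template, or swapping in a different node group template, gives a cluster of a different shape or size without redefining the node groups themselves.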
16. Sahara’s EDP structures
• Data sources
    • Input and output locations (Swift/HDFS/etc. URLs)
• Job binaries
    • Often JARs or scripts, stored in a data source
• Jobs
    • Templates for a job w/ parameters empty
• Job executions
    • Instances of templates w/ parameters filled
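
The relationship between these four structures can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not Sahara's API: the class names mirror the slide, but every field, method, URL and parameter name is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class DataSource:
    """An input or output location, identified by URL."""
    name: str
    url: str


@dataclass(frozen=True)
class JobBinary:
    """Analysis code (a JAR or script) referenced by URL."""
    name: str
    url: str


@dataclass(frozen=True)
class JobExecution:
    """A job with its parameters filled in, bound to one cluster."""
    job_name: str
    cluster: str
    input_url: str
    output_url: str
    params: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class Job:
    """A job template: the code is fixed, the parameters are still empty."""
    name: str
    binaries: Tuple[JobBinary, ...]

    def execute(self, cluster: str, source: DataSource, sink: DataSource,
                **params: str) -> JobExecution:
        # Filling in cluster, data sources and parameters yields an execution.
        return JobExecution(self.name, cluster, source.url, sink.url, dict(params))


# Placeholder URLs and names, chosen for illustration only.
logs = DataSource("logs", "swift://example-container/logs")
report = DataSource("report", "hdfs://example-namenode/reports/daily")
script = JobBinary("count.pig", "swift://example-container/count.pig")
job = Job("count-lines", (script,))

# One job template, two executions with different parameters and clusters.
run_a = job.execute("cluster-a", logs, report, reducers="4")
run_b = job.execute("cluster-b", logs, report, reducers="8")
print(run_a)
print(run_b)
```

Keeping the job separate from its executions is what lets the same analysis code be queued against different clusters or data without being redefined.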