© Cloudera, Inc. All rights reserved.

Ibis: Scaling Python Analytics on Hadoop and Impala

Wes McKinney, SF Data Mining Meetup 2015-10-22
@wesmckinn
  
Me

•  R&D at Cloudera
•  Serial creator of structured data tools / user interfaces
•  Mathematician, MIT ’07
•  “Professional SQL programmer” 2007-2010 (@ AQR)
•  Created pandas (Python library) in 2008
•  Wrote bestseller Python for Data Analysis 2012
•  Founder of DataPad
  
Python is popular…

•  Python has become a standard language of data science
•  Why is it popular?
  • Maximizes productivity for data engineers and data scientists
  • Build robust software and do interactive data analysis with 100% Python code
  • Easy to learn, and makes for happy and productive data teams
  • Large, diverse open source development community
  • Comprehensive libraries: data wrangling, ML, visualization, etc.
•  Main use case: data science & engineering Swiss Army knife on small-to-medium size data

…but Python does not scale today

•  Python ecosystem confined to single-node analysis
  • Great for smaller data sets
  • Requires sampling or aggregations for larger data
  • Distributed tools compromise in various ways
•  Extracting samples or aggregations for larger data means:
  • “Scales” by losing more fidelity
  • Additional ETL overhead to extract samples/aggregations
  • Loss of productivity with multiple languages, tools, etc.
  • Blocks certain analysis and use cases

Some simplistic generalizations

Industry Analytics                       Scientific Computing
Heterogeneous data                       Homogeneous data
Flat tables and JSON                     Multidimensional arrays
Spark / MapReduce                        HPC tools
SQL                                      Linear algebra
DFS-friendly / streaming data formats    Scientific data formats (e.g. HDF5)
More physical machines                   Fewer physical machines

Python: heavy investment, generally
Python: light investment, generally

pandas

•  Hugely popular Python table / “data frame” library
  • Labeled table, array, and time series data structures
•  Popular for data preparation, ETL, and in-memory analytics
•  Built using Python’s scientific computing stack
  • User API / domain specific language
  • Bespoke in-memory analytics / relational algebra engine
  • IO interfaces (CSV, SQL, etc.)
  • Expanded data type system (beyond NumPy)
•  Supports flat data only (or semi-structured data that can be flattened)
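
To make the last bullet concrete, here is a minimal sketch of what "flattening" a semi-structured record involves before it can live in a flat table. It uses only the Python standard library; the `flatten` helper and the sample `event` record are hypothetical illustrations, not part of pandas.

```python
def flatten(record, parent_key="", sep="."):
    """Collapse nested dicts into one flat dict with dotted column names."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested structs, prefixing child keys with the parent name.
            flat.update(flatten(value, column, sep))
        else:
            # Scalars (and lists, which this sketch leaves untouched) become columns.
            flat[column] = value
    return flat


# Hypothetical nested event record.
event = {"user": {"id": 7, "geo": {"country": "US"}}, "clicks": 3}
flat_event = flatten(event)
print(flat_event)
```

Each nested path becomes its own column, which is the shape a data frame expects. Arrays inside a record have no such one-to-one mapping, which is one reason flattening can lose structure.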
  
Many SQL engines

… and more

The “Great Decoupling” for Big Data

UI: Ibis, SQL, Spark API, …
Compute: Analytic SQL, Spark, MapReduce
Storage: HDFS, Kudu, HBase

A sample big data architecture

[Diagram labels: Application data, Kafka (×4), HDFS, JSON, Spark/MapReduce, Columnar storage, Analytic SQL Engine, SQL, User]

Nested / Complex types support

•  Arrays, structs, maps, and unions as first-class value types
•  Analyze JSON-like data directly without flattening or normalization
•  Most new SQL engines have some level of support
  • Impala
  • Presto
  • Drill
  • BigQuery
  • Spark SQL
  • Hive
  • …
  
Ibis in a nutshell

•  For Python programmers doing analytics in industry
•  Project Blog: http://blog.ibis-project.org
•  Joint project with Impala team @ Cloudera
•  Apache-licensed, open source http://github.com/cloudera/ibis
•  Crafting a compelling Python-on-Hadoop user experience
  • Remove SQL coding from user workflows
  • Develop high performance Python extension APIs
  
Ibis in a nutshell, cont’d

•  Composable Python DSL (“Ibis expressions”) makes hand-coding SQL SELECT statements unnecessary
•  Ibis for SQL Programmers: http://docs.ibis-project.org/sql.html
•  Development roadmap targets Impala (C++ / LLVM) query engine
  • … but SQL compiler toolchain is general purpose
•  Currently supports Impala and SQLite, but soon other dialects
  • We welcome external contributors for other Analytic SQL engines
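
The idea of a composable expression layer that compiles to SQL can be sketched in a few lines. The `Query` class below is a hypothetical toy written for this illustration; it is not the Ibis API, and predicates are plain strings only to keep it short. It targets SQLite through the standard-library `sqlite3` module, and the `trips` table is made-up sample data.

```python
import sqlite3


class Query:
    """Hypothetical, immutable query builder (NOT the Ibis API)."""

    def __init__(self, table, filters=(), group_keys=(), metrics=()):
        self.table = table
        self.filters = tuple(filters)
        self.group_keys = tuple(group_keys)
        self.metrics = tuple(metrics)

    def filter(self, predicate):
        # Every method returns a new Query, so partial expressions can be reused.
        return Query(self.table, self.filters + (predicate,),
                     self.group_keys, self.metrics)

    def group_by(self, *keys):
        return Query(self.table, self.filters,
                     self.group_keys + keys, self.metrics)

    def aggregate(self, **metrics):
        named = tuple(f"{expr} AS {name}" for name, expr in metrics.items())
        return Query(self.table, self.filters,
                     self.group_keys, self.metrics + named)

    def compile(self):
        """Render the accumulated expression as one SQL SELECT statement."""
        columns = ", ".join(self.group_keys + self.metrics) or "*"
        sql = f"SELECT {columns} FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(f"({p})" for p in self.filters)
        if self.group_keys:
            sql += " GROUP BY " + ", ".join(self.group_keys)
        return sql


# Made-up sample data in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, fare REAL)")
conn.executemany("INSERT INTO trips VALUES (?, ?)",
                 [("SF", 10.0), ("SF", 30.0), ("NYC", 5.0), ("NYC", 50.0)])

expensive = Query("trips").filter("fare > 8")   # reusable building block
by_city = expensive.group_by("city").aggregate(avg_fare="AVG(fare)")

sql = by_city.compile()
rows = conn.execute(sql + " ORDER BY city").fetchall()
print(sql)
print(rows)
```

The user only composes Python objects; the SELECT statement is generated at the end, and the same `expensive` expression could feed several different aggregations.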
  
Benefits of Ibis

•  Maximize developer productivity
  • Mirrors single-node Python experience
  • Solve big data problems without leaving Python
  • Leverage Python skills, ecosystem, and tools
•  Python as first-class language for Hadoop
  • Full-fidelity analysis without extractions
  • Python analysis at any scale
  • Native hardware speeds for a broad set of use cases
  
Brief interactive demo
  
Ibis/Impala Joint Roadmap

•  More natural data modeling
  • Complex types support
•  Integration with full Python data ecosystem
  • Advanced analytics + machine learning
  • Enable use of performance computing tools
•  User extensibility with native performance
  • In-memory columnar format
  • Python-to-LLVM IR compilation
•  Workflow and usability tools
  
Executing data science languages in the compute layer

UI: Ibis, SQL, Spark API, …
Compute: Analytic SQL, Spark, MapReduce (Python, R, Julia, …?)
Storage: HDFS, Kudu, HBase

Enabling interoperability with big data systems

•  Distributed / MPP query engines: implemented in a host language
  • Typically C/C++ or Java/Scala
•  User-defined functions (UDFs) through various means
  • Implement in host language
  • Implement in user language through some external language protocol (often RPC-based)
•  External UDFs are usually very slow (cf. PL/Python, PySpark, etc.)
  
What are UDFs good for?

•  Note: industry data scientists have libraries containing 100s of UDFs for Hive or other distributed query engines
•  Custom data transformations
•  Custom domain logic (date / time / data types)
•  Custom data types
•  Custom aggregations (incl. machine learning / statistics expressible as reductions)
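
As an illustration of a statistic "expressible as a reduction", here is a sketch of sample variance written as an update / merge / finalize aggregate, the general shape a distributed engine needs so that partial results from different workers can be combined. The function names, the `VarState` class, and the two partitions are assumptions made for this example, not any engine's actual UDF interface.

```python
from dataclasses import dataclass


@dataclass
class VarState:
    """Intermediate state: count, running mean, and sum of squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0


def update(state, x):
    # Welford's online update: fold one value into the running state.
    n = state.n + 1
    delta = x - state.mean
    mean = state.mean + delta / n
    return VarState(n, mean, state.m2 + delta * (x - mean))


def merge(a, b):
    # Combine two partial states (e.g. from two workers) without revisiting rows.
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return VarState(n, mean, m2)


def finalize(state):
    # Sample variance; undefined for fewer than two values.
    return state.m2 / (state.n - 1) if state.n > 1 else None


def aggregate(values):
    state = VarState()
    for v in values:
        state = update(state, v)
    return state


# Two hypothetical partitions, as if held by two different workers.
partition_a = [2.0, 4.0, 4.0, 4.0]
partition_b = [5.0, 5.0, 7.0, 9.0]
combined = merge(aggregate(partition_a), aggregate(partition_b))
result = finalize(combined)
print(result)
```

Because `merge` is associative, the engine is free to aggregate each partition locally and combine the small states afterwards, rather than shipping rows to one place.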
  
Why are external UDFs slow?

•  Serialization / deserialization overhead
•  Scalar vs vectorized computations
•  RPC overhead
  
Example: Vectorization for interpreted languages

SUM(CASE WHEN x > y
         THEN x
         ELSE x + y
    END)

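
The expression above can be evaluated through two kinds of UDF interface. The sketch below, in plain Python, contrasts a scalar function that the engine would call once per row with a batch function it would call once per column chunk. The function names, the call counter, and the sample columns are assumptions for illustration; in a real system each call across the engine/interpreter boundary would also carry serialization or RPC cost.

```python
calls = {"scalar": 0, "batch": 0}


def case_scalar(x, y):
    """Row-at-a-time UDF: the engine crosses into Python once per row."""
    calls["scalar"] += 1
    return x if x > y else x + y


def case_batch(xs, ys):
    """Vectorized UDF: the engine hands over whole columns in one call."""
    calls["batch"] += 1
    return [x if x > y else x + y for x, y in zip(xs, ys)]


# Hypothetical input columns.
xs = list(range(10_000))
ys = [5_000] * 10_000

total_scalar = sum(case_scalar(x, y) for x, y in zip(xs, ys))
total_batch = sum(case_batch(xs, ys))

print(total_scalar == total_batch, calls)
```

Both paths compute the same sum, but the batch interface crosses the boundary once instead of ten thousand times. Inside the batch function, an array library such as NumPy (`numpy.where`) could also replace the Python-level loop.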
Vectorized vs Interpreted perf
  
How to make them fast?

•  Common runtime memory representation for tabular data
•  Shared-memory (zero-copy or memcpy-only) external UDF protocol
•  Vectorized UDF interface (for interpreted languages)
•  Impala is uniquely positioned to play well with Ibis
  • Best-in-class performance and scalability
  • C++ and LLVM-based (JIT compiler) runtime
  • Unified, efficient data interchange amongst Ibis, Impala, and Kudu will enable high performance real time analytics from Python

Memory representation

•  Many query engines are standardizing on in-memory columnar representation of materialized transient data
  • Impala: http://blog.cloudera.com/blog/2015/07/whats-next-for-impala-more-reliability-usability-and-performance-at-even-greater-scale/
  • Apache Drill: https://drill.apache.org/faq/
•  Industry-standard serialization format: Apache Parquet
  • https://parquet.apache.org/
  
Serialization vs In-memory

•  Serialization formats (e.g. Parquet)
  • Optimize for IO / DFS throughput at expense of CPU/memory bus throughput
  • Do not consider random access or in-memory analytics as a goal
•  No standardized in-memory containers for materialized data from file / RPC protocols (Parquet, Thrift, protobuf, Avro, etc.)
  
Standardized in-memory columnar (IMC)

•  Compact in-memory representation for semistructured data
•  Part of Impala’s upcoming dev roadmap
•  Some prior IMC-for-SQL work: Apache Drill
•  Standardized memory representation means data can be shared without serialization
•  Create a canonical C/C++ implementation for use in Python / R / Julia
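
For readers unfamiliar with the term, the following sketch shows the basic row-versus-column distinction using the standard-library `array` module. It is a simplified illustration under assumed sample data, not the layout used by Impala or Drill, and it ignores nulls and nested types.

```python
from array import array

# Row-oriented: one Python dict per record (made-up sample data).
rows = [
    {"city": "SF", "fare": 10.0, "riders": 1},
    {"city": "NYC", "fare": 50.0, "riders": 2},
    {"city": "SF", "fare": 30.0, "riders": 4},
]

# Columnar layout: one contiguous, typed buffer per numeric column.
columns = {
    "city": [r["city"] for r in rows],              # strings kept as a plain list here
    "fare": array("d", (r["fare"] for r in rows)),  # 8-byte floats
    "riders": array("q", (r["riders"] for r in rows)),  # 8-byte signed integers
}

# A scan over one column touches only that column's buffer.
total_fare = sum(columns["fare"])
fare_bytes = columns["fare"].itemsize * len(columns["fare"])
print(total_fare, fare_bytes)
```

Because each numeric column is a single typed buffer, it can be handed to another runtime as-is, which is what makes a standardized layout attractive for sharing data without serialization.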
  
Ibis’s Vision

•  Uncompromised Python experience
  • 100% Python end-to-end user workflows
  • Enable integration with the existing Python data ecosystem (pandas, scikit-learn, NumPy, etc.)
•  Interactive at big data scale
  • Full-fidelity analysis without extractions
  • Scalability for big data
  • Native hardware speeds for a broad set of use cases
  
Thank you

Wes McKinney @wesmckinn
Views are my own
  

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney

  • 1. © Cloudera, Inc. All rights reserved.
      Ibis: Scaling Python Analytics on Hadoop and Impala
      Wes McKinney, SF Data Mining Meetup 2015-10-22
      @wesmckinn
  • 2. Me
      • R&D at Cloudera
      • Serial creator of structured data tools / user interfaces
      • Mathematician, MIT '07
      • “Professional SQL programmer” 2007-2010 (@ AQR)
      • Created pandas (Python library) in 2008
      • Wrote bestseller Python for Data Analysis 2012
      • Founder of DataPad
  • 3. Python is popular…
      • Python has become a standard language of data science
      • Why is it popular?
          • Maximizes productivity for data engineers and data scientists
          • Build robust software and do interactive data analysis with 100% Python code
          • Easy to learn, and makes for happy and productive data teams
          • Large, diverse open source development community
          • Comprehensive libraries: data wrangling, ML, visualization, etc.
      • Main use case: data science & engineering Swiss army knife on small-to-medium size data
  • 4. …but Python does not scale today
      • Python ecosystem confined to single-node analysis
          • Great for smaller data sets
          • Requires sampling or aggregations for larger data
          • Distributed tools compromise in various ways
      • Extracting samples or aggregations for larger data means:
          • “Scales” by losing more fidelity
          • Additional ETL overhead to extract samples/aggregations
          • Loss of productivity with multiple languages, tools, etc.
          • Blocks certain analysis and use cases
  • 5. Some simplistic generalizations
      • Industry Analytics
          • Heterogeneous data: flat tables and JSON
          • Spark / MapReduce
          • SQL
          • DFS-friendly / streaming data formats
          • More physical machines
      • Scientific Computing
          • Homogeneous data: multidimensional arrays
          • HPC tools
          • Linear algebra
          • Scientific data formats
          • Fewer physical machines
  • 6. Some simplistic generalizations
      • Industry Analytics
          • Heterogeneous data: flat tables and JSON
          • Spark / MapReduce
          • SQL
          • DFS-friendly / streaming data formats
          • More physical machines
      • Scientific Computing
          • Homogeneous data: multidimensional arrays
          • HPC tools
          • Linear algebra
          • Scientific data formats (e.g. HDF5)
          • Fewer physical machines
      • Slide annotations: “Python: heavy investment, generally”; “Python: light investment, generally”
  • 7. pandas
      • Hugely popular Python table / “data frame” library
          • Labeled table, array, and time series data structures
      • Popular for data preparation, ETL, and in-memory analytics
      • Built using Python’s scientific computing stack
          • User API / domain specific language
          • Bespoke in-memory analytics / relational algebra engine
          • IO interfaces (CSV, SQL, etc.)
          • Expanded data type system (beyond NumPy)
      • Supports flat data only (or semi-structured data that can be flattened)
  • 8. Many SQL engines … and more
  • 9. The “Great Decoupling” for Big Data
      • UI: Ibis, SQL, Spark API, …
      • Compute: Analytic SQL, Spark, MapReduce
      • Storage: HDFS, Kudu, HBase
  • 10. A sample big data architecture
      [Diagram labels: Application data, Kafka (four instances), HDFS, JSON, Spark/MapReduce, Columnar storage, Analytic SQL Engine, User, SQL]
  • 11. Nested / Complex types support
      • Arrays, structs, maps, and unions as first-class value types
      • Analyze JSON-like data directly without flattening or normalization
      • Most new SQL engines have some level of support
          • Impala
          • Presto
          • Drill
          • BigQuery
          • Spark SQL
          • Hive
          • …
  • 12. Ibis in a nutshell
      • For Python programmers doing analytics in industry
      • Project Blog: http://blog.ibis-project.org
      • Joint project with Impala team @ Cloudera
      • Apache-licensed, open source http://github.com/cloudera/ibis
      • Crafting a compelling Python-on-Hadoop user experience
          • Remove SQL coding from user workflows
          • Develop high performance Python extension APIs
  • 13. Ibis in a nutshell, cont’d
      • Composable Python DSL (“Ibis expressions”) makes hand-coding SQL SELECT statements unnecessary
      • Ibis for SQL Programmers: http://docs.ibis-project.org/sql.html
      • Development roadmap targets Impala (C++ / LLVM) query engine
          • … but SQL compiler toolchain is general purpose
      • Currently supports Impala and SQLite, with other dialects to follow soon
          • We welcome external contributors for other Analytic SQL engines
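
The idea of composable expressions that compile to a SQL SELECT can be sketched in a few lines of plain Python. This is not the Ibis API: the `Query` class and its `filter`, `group_by_sum`, and `compile` methods are hypothetical names invented for illustration, the `sales` table is made-up data, and SQLite is used only because it ships with Python. Strings are interpolated without escaping, so treat it as a toy.

```python
import sqlite3


class Query:
    """A toy, immutable query description (hypothetical, not the Ibis API)."""

    def __init__(self, table, predicates=(), group_key=None, sum_column=None):
        self.table = table
        self.predicates = tuple(predicates)
        self.group_key = group_key
        self.sum_column = sum_column

    def filter(self, predicate):
        # Each call returns a new Query, so expressions compose without mutation.
        return Query(self.table, self.predicates + (predicate,),
                     self.group_key, self.sum_column)

    def group_by_sum(self, key, column):
        # Record a GROUP BY key and the column to SUM within each group.
        return Query(self.table, self.predicates, key, column)

    def compile(self):
        """Render the accumulated expression as a SQL SELECT string."""
        if self.group_key is None:
            select = "SELECT *"
        else:
            select = (f"SELECT {self.group_key}, "
                      f"SUM({self.sum_column}) AS total")
        sql = f"{select} FROM {self.table}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(f"({p})" for p in self.predicates)
        if self.group_key is not None:
            sql += f" GROUP BY {self.group_key} ORDER BY {self.group_key}"
        return sql


def run_demo():
    """Build an expression step by step, compile it, and run it on SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL, year INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [("east", 10.0, 2014), ("east", 5.0, 2015),
         ("west", 7.0, 2015), ("west", 3.0, 2015)],
    )
    base = Query("sales")
    recent = base.filter("year = 2015")          # reusable intermediate expression
    expr = recent.filter("amount > 2").group_by_sum("region", "amount")
    sql = expr.compile()
    rows = conn.execute(sql).fetchall()
    conn.close()
    return sql, rows


if __name__ == "__main__":
    sql, rows = run_demo()
    print(sql)
    print(rows)
```

The user only ever manipulates Python objects; the SQL text is generated at the last moment, which is what lets intermediate expressions such as `recent` be reused in several queries.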
  • 14. [No text on this slide]
  • 15. Benefits of Ibis
      • Maximize developer productivity
          • Mirrors single-node Python experience
          • Solve big data problems without leaving Python
          • Leverage Python skills, ecosystem, and tools
      • Python as first-class language for Hadoop
          • Full-fidelity analysis without extractions
          • Python analysis at any scale
          • Native hardware speeds for a broad set of use cases
  • 16. Brief interactive demo
  • 17. Ibis/Impala Joint Roadmap
      • More natural data modeling
          • Complex types support
      • Integration with full Python data ecosystem
          • Advanced analytics + machine learning
          • Enable use of performance computing tools
      • User extensibility with native performance
          • In-memory columnar format
          • Python-to-LLVM IR compilation
      • Workflow and usability tools
  • 18. Executing data science languages in the compute layer
      • UI: Ibis, SQL, Spark API, …
      • Compute: Analytic SQL, Spark, MapReduce
      • Storage: HDFS, Kudu, HBase
      • Python, R, Julia, …?
  • 19. Enabling interoperability with big data systems
      • Distributed / MPP query engines: implemented in a host language
          • Typically C/C++ or Java/Scala
      • User-defined functions (UDFs) through various means
          • Implement in host language
          • Implement in user language through some external language protocol (often RPC-based)
      • External UDFs are usually very slow (cf: PL/Python, PySpark, etc.)
  • 20. What are UDFs good for?
      • Note: industry data scientists have libraries containing 100s of UDFs for Hive or other distributed query engines
      • Custom data transformations
      • Custom domain logic (date / time / data types)
      • Custom data types
      • Custom aggregations (incl. machine learning / statistics expressible as reductions)
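
As a small illustration of a custom aggregation expressed as a reduction, the sketch below registers a user-defined aggregate with SQLite through Python's standard `sqlite3.Connection.create_aggregate`. The `SampleVariance` class, the `sample_var` SQL name, and the table `t` are assumptions made for this example; the running update is Welford's method. It runs on a single node; a distributed engine would typically also need a way to merge partial states, which is not shown here.

```python
import sqlite3


class SampleVariance:
    """User-defined aggregate: step() folds one value into running state,
    finalize() emits the result (Welford's online variance)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def step(self, value):
        if value is None:  # skip NULLs, as built-in SQL aggregates do
            return
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def finalize(self):
        if self.count < 2:
            return None  # sample variance is undefined for fewer than 2 values
        return self.m2 / (self.count - 1)


def variance_by_group(rows):
    """Load (group, value) rows into SQLite and aggregate with the UDF."""
    conn = sqlite3.connect(":memory:")
    conn.create_aggregate("sample_var", 1, SampleVariance)
    conn.execute("CREATE TABLE t (grp TEXT, x REAL)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT grp, sample_var(x) FROM t GROUP BY grp ORDER BY grp"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(variance_by_group([("a", 1.0), ("a", 2.0), ("a", 3.0), ("b", 5.0)]))
```

Because the aggregate keeps only `count`, `mean`, and `m2`, it never materializes the group's values, which is the property that makes reductions attractive for large data.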
  • 21. Why are external UDFs slow?
      • Serialization / deserialization overhead
      • Scalar vs vectorized computations
      • RPC overhead
  • 22. Example: Vectorization for interpreted languages
      SUM(CASE WHEN x > y THEN x ELSE x + y END)
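
The difference between a scalar and a vectorized UDF interface for the expression above can be sketched as follows. All function names and the default `batch_size` are hypothetical. The point is how many times the engine has to cross into the interpreted function, not absolute speed: in this pure-Python sketch the batch function still loops in the interpreter, whereas a real vectorized UDF would more likely operate on contiguous arrays. SQLite evaluates the same expression as a reference result.

```python
import sqlite3


def scalar_udf(x, y):
    """Row-at-a-time form: the engine crosses into Python once per row."""
    return x if x > y else x + y


def vectorized_udf(xs, ys):
    """Batch form: the engine crosses into Python once per batch of rows."""
    return [x if x > y else x + y for x, y in zip(xs, ys)]


def sum_scalar(xs, ys):
    """Return (total, number of UDF calls) using the scalar interface."""
    calls = 0
    total = 0
    for x, y in zip(xs, ys):
        total += scalar_udf(x, y)
        calls += 1
    return total, calls


def sum_vectorized(xs, ys, batch_size=1024):
    """Return (total, number of UDF calls) using the batch interface."""
    calls = 0
    total = 0
    for start in range(0, len(xs), batch_size):
        batch = vectorized_udf(xs[start:start + batch_size],
                               ys[start:start + batch_size])
        total += sum(batch)
        calls += 1
    return total, calls


def sum_sql(xs, ys):
    """Reference result: let SQLite evaluate the expression from the slide."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (x INTEGER, y INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", zip(xs, ys))
    (total,) = conn.execute(
        "SELECT SUM(CASE WHEN x > y THEN x ELSE x + y END) FROM t"
    ).fetchone()
    conn.close()
    return total


if __name__ == "__main__":
    xs = list(range(5000))
    ys = [2500] * 5000
    print(sum_scalar(xs, ys), sum_vectorized(xs, ys), sum_sql(xs, ys))
```

With 5000 rows and batches of 1024, the scalar interface makes 5000 calls across the boundary while the batch interface makes 5, and both agree with the SQL result.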
  • 23. Vectorized vs Interpreted perf
  • 24. How to make them fast?
      • Common runtime memory representation for tabular data
      • Shared-memory (zero-copy or memcpy-only) external UDF protocol
      • Vectorized UDF interface (for interpreted languages)
      • Impala is uniquely positioned to play well with Ibis
          • Best-in-class performance and scalability
          • C++ and LLVM-based (JIT compiler) runtime
          • Unified, efficient data interchange amongst Ibis, Impala, and Kudu will enable high performance real-time analytics from Python
  • 25. Memory representation
      • Many query engines are standardizing on an in-memory columnar representation of materialized transient data
          • Impala: http://blog.cloudera.com/blog/2015/07/whats-next-for-impala-more-reliability-usability-and-performance-at-even-greater-scale/
          • Apache Drill: https://drill.apache.org/faq/
      • Industry-standard serialization format: Apache Parquet
          • https://parquet.apache.org/
  • 26. Serialization vs In-memory
      • Serialization formats (e.g. Parquet)
          • Optimize for IO / DFS throughput at expense of CPU/memory bus throughput
          • Do not consider random access or in-memory analytics as a goal
      • No standardized in-memory containers for materialized data from file / RPC protocols (Parquet, Thrift, protobuf, Avro, etc.)
  • 27. Standardized in-memory columnar (IMC)
      • Compact in-memory representation for semistructured data
      • Part of Impala’s upcoming dev roadmap
      • Some prior IMC-for-SQL work: Apache Drill
      • Standardized memory representation means data can be shared without serialization
      • Create a canonical C/C++ implementation for use in Python / R / Julia
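
The contrast between sharing a columnar buffer and passing data through a serialization step can be sketched with the standard library alone. This is only an analogy within a single Python process: the column names, values, and helper functions are made up for illustration, `array` stands in for a typed columnar buffer, and `pickle` stands in for any serialization format. Sharing across processes or languages would additionally require an agreed memory layout, which is the standardization the slide argues for.

```python
import pickle
from array import array


def make_columns():
    # Column-oriented layout: one contiguous typed buffer per column.
    return {
        "price": array("d", [1.5, 2.5, 4.0]),   # 64-bit floats
        "qty": array("q", [10, 20, 30]),        # 64-bit signed integers
    }


def share_zero_copy(columns):
    # A memoryview exposes the same underlying buffer; no bytes are copied.
    return {name: memoryview(col) for name, col in columns.items()}


def share_by_serialization(columns):
    # Serialize, then deserialize: the consumer gets an independent copy.
    return pickle.loads(pickle.dumps(columns))


def demo():
    columns = make_columns()
    views = share_zero_copy(columns)
    copies = share_by_serialization(columns)

    # Writing through the view changes the producer's data (shared memory),
    # while the deserialized copy is unaffected.
    views["qty"][0] = 99
    return columns["qty"][0], copies["qty"][0]


if __name__ == "__main__":
    print(demo())
```

The consumer holding a `memoryview` pays no per-value encoding or decoding cost, whereas the serialized path touches every byte twice before any analysis starts.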
  • 28. Ibis’s Vision
      • Uncompromised Python experience
          • 100% Python end-to-end user workflows
          • Enable integration with the existing Python data ecosystem (pandas, scikit-learn, NumPy, etc.)
      • Interactive at big data scale
          • Full-fidelity analysis without extractions
          • Scalability for big data
          • Native hardware speeds for a broad set of use cases
  • 29. Thank you
      Wes McKinney @wesmckinn
      Views are my own