SlideShare une entreprise Scribd logo
1  sur  16
Télécharger pour lire hors ligne
`
Big Data Pipeline
@tuplejump
A data engineering startup, with
a vision to simplify data
engineering and empower the
next generation of data powered
miracles!
tuplejump
Who Am I?
• Founder and CEO @ Tuplejump
• Earlier worked at Pramati, Cordys and couple of startups.
• A polyglot developer
• Started with Perl and PHP, have worked with VB.Net, C#, VC++, Erlang and Haskell
• Love data hacking in R and Python
• Java and Javascript fed me for a long long time
• Committed to Scala
• Believe in choosing the best tool for the task
• Open Source fanatic
@milliondreams | mytechrantings.blogspot.com
The big data pipeline
COLLECT TRANSFORM
PREDICT
STORE
EXPLORE VISUALIZE
The Tuplejump Platform
COLLECT TRANSFORM
PREDICT
STORE
EXPLORE VISUALIZE
Hydra
The tentacled
framework to
gather high
volume and
velocity data
from push and
pull powered
by Akka,
reacting on
demands to
events and
streaming to
Spark to batch
process.
COLLECT
Spark +
Calliope
Using the
friendly Spark
API with added
features to
easily consume
or load data
from and to
Cassandra
powered
storage.
TRANSFORM
Cassandra++
Cassandra provides a single storage
mechanism for Files, (un)structured
data, Generic data.
STORE
MinerBot
Building on Spark's ML framework,
going towards machine assisted
insights, we are in building our own
EA and ANN/DL frameworks to take
ML to the next level.
PREDICT
Shark + Calliope
Ad Hoc querying with shark on your
data in Dstore.
UberCube
A OLAP cube engine
EXPLORE
Pissaro
A modern, game changing data frontend, which
is “not just dashboards”, providing highly
interactive and reactive visualization frontend.
VIZUALIZE
Advantages
• All the advantages of Spark + All the advantages of Cassandra + Much more!
• Over 500x (100x in case of filtered data) faster than traditional Hadoop solutions
• Shark + C* provide for superfast ad hoc querying.
• UberCube empowers sub-millisecond responses on very large cubes
• MinerBot provides ready to use ML Algos, plus a possibility of much more complex
algos and mechanisms than just map reduce.
• Ready to use, no integration required
• Easy to develop, deploy, monitor and scale
Why Scala?
• Object oriented and functional
• Runs on the JVM
• 100% compatible with Java
• Modern, evolving, scalable
• Concise, flexible and high performance
• Excellent support for DSL development
• Spark and Play use Scala as their primary language
• We used it for long and we love it!!!
The ultimate gyan!
You can flirt with other languages,
you can have short affairs with few,
You will fall in love with Scala at the first sight,
You have to marry her to know her!
Let’s call in some friends
• Akka - Actors to build concurrent and distributed applications
• Spark - The blue eyed whiz kid of the Big Data class
• Play - The web development champion
• SBT - The best builder in town
• ScalaTest - The story teller
• Shapeless and Scalaz - Masters of the Dark Arts
Concurrency With Akka
• Inspired by Erlang’s Actor Model
• Runs on the JVM
• Actors define behavior to handle typed messages
• Actors process one message at a time
• Can use Group/Pool of actors behind routers for concurrency
• Can run thousands of actos on a modern server
• Location transparency for clustering
• Supervision and state recovery for HA
Batch Processing with Spark
• Resilient Distributed Datasets
• Fast in-memory big data
• Map/Reduce on steroids
• Iterative and interactive
• Code in scala, java, python and now R
• Streaming (DStreams - Batch processing on streams)
• MLLib, Shark, Spark SQL, GraphX and more
Web development with Play
• Modern high velocity, highly scalable
• Built on Akka and Netty (NIO)
• Reactive in design (reactive I/O)
• Async HTTP, streaming HTTP, Comet, Websockets, build your own protocol
• Feature rich yet flexible
Build with SBT
• I hate writing XML
• Very easy to get started
• All the power of Scala in the build
• Maven dependency management + more
Testing with ScalaTest
• Write specs not tests and excellent tool for BDD
• Specs DSL very close to english
• Many testing styles
• Powerful matchers (“should be”)
• Fixtures
• Mock objects with ScalaMock
Taking functional further
• Shapeless
• Scrap your boilerplate
• Generic programming
• Existential types
• ScalaZ
• Bringing Haskel to Scala
• Monads, Functors and all the theory!
Thank you!
• http://www.tuplejump.com/
• http://github.com/tuplejump/
• http://tuplejump.github.com/calliope/
• http://tuplejump.github.com/stargate/
• @tuplejump on twitter

Contenu connexe

Tendances

Devops Days, 2019 - Charlotte
Devops Days, 2019 - CharlotteDevops Days, 2019 - Charlotte
Devops Days, 2019 - Charlottebotsplash.com
 
Stream All Things—Patterns of Modern Data Integration with Gwen Shapira
Stream All Things—Patterns of Modern Data Integration with Gwen ShapiraStream All Things—Patterns of Modern Data Integration with Gwen Shapira
Stream All Things—Patterns of Modern Data Integration with Gwen ShapiraDatabricks
 
Distributed Data Quality - Technical Solutions for Organizational Scaling
Distributed Data Quality - Technical Solutions for Organizational ScalingDistributed Data Quality - Technical Solutions for Organizational Scaling
Distributed Data Quality - Technical Solutions for Organizational ScalingJustin Cunningham
 
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 20190-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019confluent
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1DataStax Academy
 
Análisis del roadmap del Elastic Stack
Análisis del roadmap del Elastic StackAnálisis del roadmap del Elastic Stack
Análisis del roadmap del Elastic StackElasticsearch
 
Open Source DataViz with Apache Superset
Open Source DataViz with Apache SupersetOpen Source DataViz with Apache Superset
Open Source DataViz with Apache SupersetCarl W. Handlin
 
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, PresetStreaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, PresetHostedbyConfluent
 
Hybrid Kafka, Taking Real-time Analytics to the Business (Cody Irwin, Google ...
Hybrid Kafka, Taking Real-time Analytics to the Business (Cody Irwin, Google ...Hybrid Kafka, Taking Real-time Analytics to the Business (Cody Irwin, Google ...
Hybrid Kafka, Taking Real-time Analytics to the Business (Cody Irwin, Google ...HostedbyConfluent
 
Nordstrom's Event-Sourced Architecture and Kafka-as-a-Service | Adam Weyant a...
Nordstrom's Event-Sourced Architecture and Kafka-as-a-Service | Adam Weyant a...Nordstrom's Event-Sourced Architecture and Kafka-as-a-Service | Adam Weyant a...
Nordstrom's Event-Sourced Architecture and Kafka-as-a-Service | Adam Weyant a...HostedbyConfluent
 
Creating a Kafka Topic. Super easy? | Andrew Stevenson and Marios Andreopoulo...
Creating a Kafka Topic. Super easy? | Andrew Stevenson and Marios Andreopoulo...Creating a Kafka Topic. Super easy? | Andrew Stevenson and Marios Andreopoulo...
Creating a Kafka Topic. Super easy? | Andrew Stevenson and Marios Andreopoulo...HostedbyConfluent
 
EventHub for kafka ecosystems kafka meetup
EventHub for kafka ecosystems   kafka meetupEventHub for kafka ecosystems   kafka meetup
EventHub for kafka ecosystems kafka meetupNitin Kumar
 
Kafka at the core of an AIOps pipeline | Sunanda Kommula, Selector.ai and Ala...
Kafka at the core of an AIOps pipeline | Sunanda Kommula, Selector.ai and Ala...Kafka at the core of an AIOps pipeline | Sunanda Kommula, Selector.ai and Ala...
Kafka at the core of an AIOps pipeline | Sunanda Kommula, Selector.ai and Ala...HostedbyConfluent
 
Using Kafka as a Database For Real-Time Transaction Processing | Chad Preisle...
Using Kafka as a Database For Real-Time Transaction Processing | Chad Preisle...Using Kafka as a Database For Real-Time Transaction Processing | Chad Preisle...
Using Kafka as a Database For Real-Time Transaction Processing | Chad Preisle...HostedbyConfluent
 
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...HostedbyConfluent
 
Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Zekeriya Besiroglu
 
AtlasCamp 2014: Preparing Your Plugin for JIRA Data Center
AtlasCamp 2014: Preparing Your Plugin for JIRA Data CenterAtlasCamp 2014: Preparing Your Plugin for JIRA Data Center
AtlasCamp 2014: Preparing Your Plugin for JIRA Data CenterAtlassian
 

Tendances (20)

Devops Days, 2019 - Charlotte
Devops Days, 2019 - CharlotteDevops Days, 2019 - Charlotte
Devops Days, 2019 - Charlotte
 
Stream All Things—Patterns of Modern Data Integration with Gwen Shapira
Stream All Things—Patterns of Modern Data Integration with Gwen ShapiraStream All Things—Patterns of Modern Data Integration with Gwen Shapira
Stream All Things—Patterns of Modern Data Integration with Gwen Shapira
 
Distributed Data Quality - Technical Solutions for Organizational Scaling
Distributed Data Quality - Technical Solutions for Organizational ScalingDistributed Data Quality - Technical Solutions for Organizational Scaling
Distributed Data Quality - Technical Solutions for Organizational Scaling
 
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 20190-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1
 
Análisis del roadmap del Elastic Stack
Análisis del roadmap del Elastic StackAnálisis del roadmap del Elastic Stack
Análisis del roadmap del Elastic Stack
 
Open Source DataViz with Apache Superset
Open Source DataViz with Apache SupersetOpen Source DataViz with Apache Superset
Open Source DataViz with Apache Superset
 
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, PresetStreaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
 
Hybrid Kafka, Taking Real-time Analytics to the Business (Cody Irwin, Google ...
Hybrid Kafka, Taking Real-time Analytics to the Business (Cody Irwin, Google ...Hybrid Kafka, Taking Real-time Analytics to the Business (Cody Irwin, Google ...
Hybrid Kafka, Taking Real-time Analytics to the Business (Cody Irwin, Google ...
 
Nordstrom's Event-Sourced Architecture and Kafka-as-a-Service | Adam Weyant a...
Nordstrom's Event-Sourced Architecture and Kafka-as-a-Service | Adam Weyant a...Nordstrom's Event-Sourced Architecture and Kafka-as-a-Service | Adam Weyant a...
Nordstrom's Event-Sourced Architecture and Kafka-as-a-Service | Adam Weyant a...
 
Data streaming
Data streamingData streaming
Data streaming
 
Creating a Kafka Topic. Super easy? | Andrew Stevenson and Marios Andreopoulo...
Creating a Kafka Topic. Super easy? | Andrew Stevenson and Marios Andreopoulo...Creating a Kafka Topic. Super easy? | Andrew Stevenson and Marios Andreopoulo...
Creating a Kafka Topic. Super easy? | Andrew Stevenson and Marios Andreopoulo...
 
EventHub for kafka ecosystems kafka meetup
EventHub for kafka ecosystems   kafka meetupEventHub for kafka ecosystems   kafka meetup
EventHub for kafka ecosystems kafka meetup
 
Elk meetup
Elk meetupElk meetup
Elk meetup
 
Kafka at the core of an AIOps pipeline | Sunanda Kommula, Selector.ai and Ala...
Kafka at the core of an AIOps pipeline | Sunanda Kommula, Selector.ai and Ala...Kafka at the core of an AIOps pipeline | Sunanda Kommula, Selector.ai and Ala...
Kafka at the core of an AIOps pipeline | Sunanda Kommula, Selector.ai and Ala...
 
Using Kafka as a Database For Real-Time Transaction Processing | Chad Preisle...
Using Kafka as a Database For Real-Time Transaction Processing | Chad Preisle...Using Kafka as a Database For Real-Time Transaction Processing | Chad Preisle...
Using Kafka as a Database For Real-Time Transaction Processing | Chad Preisle...
 
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...
 
Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...
 
AtlasCamp 2014: Preparing Your Plugin for JIRA Data Center
AtlasCamp 2014: Preparing Your Plugin for JIRA Data CenterAtlasCamp 2014: Preparing Your Plugin for JIRA Data Center
AtlasCamp 2014: Preparing Your Plugin for JIRA Data Center
 
ASPgems - kappa architecture
ASPgems - kappa architectureASPgems - kappa architecture
ASPgems - kappa architecture
 

Similaire à Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scala Symposium, ThoughtWorks.

Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleScala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleDomino Data Lab
 
Scala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big DataScala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big DataJohn Nestor
 
20160524 ibm fast data meetup
20160524 ibm fast data meetup20160524 ibm fast data meetup
20160524 ibm fast data meetupshinolajla
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...Simon Ambridge
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaHelena Edelson
 
Introducing Apache Kafka and why it is important to Oracle, Java and IT profe...
Introducing Apache Kafka and why it is important to Oracle, Java and IT profe...Introducing Apache Kafka and why it is important to Oracle, Java and IT profe...
Introducing Apache Kafka and why it is important to Oracle, Java and IT profe...Lucas Jellema
 
Building Asynchronous Applications
Building Asynchronous ApplicationsBuilding Asynchronous Applications
Building Asynchronous ApplicationsJohan Edstrom
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019Adam Doyle
 
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...Uwe Korn
 
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Spark Summit
 
Transitioning from Java to Scala for Spark - March 13, 2019
Transitioning from Java to Scala for Spark - March 13, 2019Transitioning from Java to Scala for Spark - March 13, 2019
Transitioning from Java to Scala for Spark - March 13, 2019Gravy Analytics
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Chris Fregly
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlySarah Guido
 
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Chris Fregly
 
Hadoop world overview trends and topics
Hadoop world overview trends and topicsHadoop world overview trends and topics
Hadoop world overview trends and topicsValentin Kropov
 
Chicago Microservices Integration Talk
Chicago Microservices Integration TalkChicago Microservices Integration Talk
Chicago Microservices Integration TalkChristian Posta
 

Similaire à Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scala Symposium, ThoughtWorks. (20)

Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleScala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
 
Scala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big DataScala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big Data
 
20160524 ibm fast data meetup
20160524 ibm fast data meetup20160524 ibm fast data meetup
20160524 ibm fast data meetup
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 
Introducing Apache Kafka and why it is important to Oracle, Java and IT profe...
Introducing Apache Kafka and why it is important to Oracle, Java and IT profe...Introducing Apache Kafka and why it is important to Oracle, Java and IT profe...
Introducing Apache Kafka and why it is important to Oracle, Java and IT profe...
 
Building Asynchronous Applications
Building Asynchronous ApplicationsBuilding Asynchronous Applications
Building Asynchronous Applications
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019
 
Pig on Spark
Pig on SparkPig on Spark
Pig on Spark
 
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...
 
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
 
Transitioning from Java to Scala for Spark - March 13, 2019
Transitioning from Java to Scala for Spark - March 13, 2019Transitioning from Java to Scala for Spark - March 13, 2019
Transitioning from Java to Scala for Spark - March 13, 2019
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Hadoop world overview trends and topics
Hadoop world overview trends and topicsHadoop world overview trends and topics
Hadoop world overview trends and topics
 
Stackato v2
Stackato v2Stackato v2
Stackato v2
 
Chicago Microservices Integration Talk
Chicago Microservices Integration TalkChicago Microservices Integration Talk
Chicago Microservices Integration Talk
 

Plus de Thoughtworks

Design System as a Product
Design System as a ProductDesign System as a Product
Design System as a ProductThoughtworks
 
Designers, Developers & Dogs
Designers, Developers & DogsDesigners, Developers & Dogs
Designers, Developers & DogsThoughtworks
 
Cloud-first for fast innovation
Cloud-first for fast innovationCloud-first for fast innovation
Cloud-first for fast innovationThoughtworks
 
More impact with flexible teams
More impact with flexible teamsMore impact with flexible teams
More impact with flexible teamsThoughtworks
 
Culture of Innovation
Culture of InnovationCulture of Innovation
Culture of InnovationThoughtworks
 
Developer Experience
Developer ExperienceDeveloper Experience
Developer ExperienceThoughtworks
 
When we design together
When we design togetherWhen we design together
When we design togetherThoughtworks
 
Hardware is hard(er)
Hardware is hard(er)Hardware is hard(er)
Hardware is hard(er)Thoughtworks
 
Customer-centric innovation enabled by cloud
 Customer-centric innovation enabled by cloud Customer-centric innovation enabled by cloud
Customer-centric innovation enabled by cloudThoughtworks
 
Amazon's Culture of Innovation
Amazon's Culture of InnovationAmazon's Culture of Innovation
Amazon's Culture of InnovationThoughtworks
 
When in doubt, go live
When in doubt, go liveWhen in doubt, go live
When in doubt, go liveThoughtworks
 
Don't cross the Rubicon
Don't cross the RubiconDon't cross the Rubicon
Don't cross the RubiconThoughtworks
 
Your test coverage is a lie!
Your test coverage is a lie!Your test coverage is a lie!
Your test coverage is a lie!Thoughtworks
 
Docker container security
Docker container securityDocker container security
Docker container securityThoughtworks
 
Redefining the unit
Redefining the unitRedefining the unit
Redefining the unitThoughtworks
 
Technology Radar Webinar UK - Vol. 22
Technology Radar Webinar UK - Vol. 22Technology Radar Webinar UK - Vol. 22
Technology Radar Webinar UK - Vol. 22Thoughtworks
 
A Tribute to Turing
A Tribute to TuringA Tribute to Turing
A Tribute to TuringThoughtworks
 
Rsa maths worked out
Rsa maths worked outRsa maths worked out
Rsa maths worked outThoughtworks
 

Plus de Thoughtworks (20)

Design System as a Product
Design System as a ProductDesign System as a Product
Design System as a Product
 
Designers, Developers & Dogs
Designers, Developers & DogsDesigners, Developers & Dogs
Designers, Developers & Dogs
 
Cloud-first for fast innovation
Cloud-first for fast innovationCloud-first for fast innovation
Cloud-first for fast innovation
 
More impact with flexible teams
More impact with flexible teamsMore impact with flexible teams
More impact with flexible teams
 
Culture of Innovation
Culture of InnovationCulture of Innovation
Culture of Innovation
 
Dual-Track Agile
Dual-Track AgileDual-Track Agile
Dual-Track Agile
 
Developer Experience
Developer ExperienceDeveloper Experience
Developer Experience
 
When we design together
When we design togetherWhen we design together
When we design together
 
Hardware is hard(er)
Hardware is hard(er)Hardware is hard(er)
Hardware is hard(er)
 
Customer-centric innovation enabled by cloud
 Customer-centric innovation enabled by cloud Customer-centric innovation enabled by cloud
Customer-centric innovation enabled by cloud
 
Amazon's Culture of Innovation
Amazon's Culture of InnovationAmazon's Culture of Innovation
Amazon's Culture of Innovation
 
When in doubt, go live
When in doubt, go liveWhen in doubt, go live
When in doubt, go live
 
Don't cross the Rubicon
Don't cross the RubiconDon't cross the Rubicon
Don't cross the Rubicon
 
Error handling
Error handlingError handling
Error handling
 
Your test coverage is a lie!
Your test coverage is a lie!Your test coverage is a lie!
Your test coverage is a lie!
 
Docker container security
Docker container securityDocker container security
Docker container security
 
Redefining the unit
Redefining the unitRedefining the unit
Redefining the unit
 
Technology Radar Webinar UK - Vol. 22
Technology Radar Webinar UK - Vol. 22Technology Radar Webinar UK - Vol. 22
Technology Radar Webinar UK - Vol. 22
 
A Tribute to Turing
A Tribute to TuringA Tribute to Turing
A Tribute to Turing
 
Rsa maths worked out
Rsa maths worked outRsa maths worked out
Rsa maths worked out
 

Dernier

Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Mark Simos
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentMahmoud Rabie
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Nikki Chapple
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsYoss Cohen
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessWSO2
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 

Dernier (20)

Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career Development
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with Platformless
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 

Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scala Symposium, ThoughtWorks.

  • 2. A data engineering startup, with a vision to simplify data engineering and empower the next generation of data powered miracles! tuplejump
  • 3. Who Am I? • Founder and CEO @ Tuplejump • Earlier worked at Pramati, Cordys and couple of startups. • A polyglot developer • Started with Perl and PHP, have worked with VB.Net, C#, VC++, Erlang and Haskell • Love data hacking in R and Python • Java and Javascript fed me for a long long time • Committed to Scala • Believe in choosing the best tool for the task • Open Source fanatic @milliondreams | mytechrantings.blogspot.com
  • 4. The big data pipeline COLLECT TRANSFORM PREDICT STORE EXPLORE VISUALIZE
  • 5. The Tuplejump Platform COLLECT TRANSFORM PREDICT STORE EXPLORE VISUALIZE Hydra The tentacled framework to gather high volume and velocity data from push and pull powered by Akka, reacting on demands to events and streaming to Spark to batch process. COLLECT Spark + Calliope Using the friendly Spark API with added features to easily consume or load data from and to Cassandra powered storage. TRANSFORM Cassandra++ Cassandra provides a single storage mechanism for Files, (un)structured data, Generic data. STORE MinerBot Building on Spark's ML framework, going towards machine assisted insights, we are in building our own EA and ANN/DL frameworks to take ML to the next level. PREDICT Shark + Calliope Ad Hoc querying with shark on your data in Dstore. UberCube A OLAP cube engine EXPLORE Pissaro A modern, game changing data frontend, which is “not just dashboards”, providing highly interactive and reactive visualization frontend. VIZUALIZE
  • 6. Advantages • All the advantages of Spark + All the advantages of Cassandra + Much more! • Over 500x (100x in case of filtered data) faster than traditional Hadoop solutions • Shark + C* provide for superfast ad hoc querying. • UberCube empowers sub-millisecond responses on very large cubes • MinerBot provides ready to use ML Algos, plus a possibility of much more complex algos and mechanisms than just map reduce. • Ready to use, no integration required • Easy to develop, deploy, monitor and scale
  • 7. Why Scala? • Object oriented and functional • Runs on the JVM • 100% compatible with Java • Modern, evolving, scalable • Concise, flexible and high performance • Excellent support for DSL development • Spark and Play use Scala as their primary language • We used it for long and we love it!!!
  • 8. The ultimate gyan! You can flirt with other languages, you can have short affairs with few, You will fall in love with Scala at the first sight, You have to marry her to know her!
  • 9. Let’s call in some friends • Akka - Actors to build concurrent and distributed applications • Spark - The blue eyed whiz kid of the Big Data class • Play - The web development champion • SBT - The best builder in town • ScalaTest - The story teller • Shapeless and Scalaz - Masters of the Dark Arts
  • 10. Concurrency With Akka • Inspired by Erlang’s Actor Model • Runs on the JVM • Actors define behavior to handle typed messages • Actors process one message at a time • Can use Group/Pool of actors behind routers for concurrency • Can run thousands of actos on a modern server • Location transparency for clustering • Supervision and state recovery for HA
  • 11. Batch Processing with Spark • Resilient Distributed Datasets • Fast in-memory big data • Map/Reduce on steroids • Iterative and interactive • Code in scala, java, python and now R • Streaming (DStreams - Batch processing on streams) • MLLib, Shark, Spark SQL, GraphX and more
  • 12. Web development with Play • Modern high velocity, highly scalable • Built on Akka and Netty (NIO) • Reactive in design (reactive I/O) • Async HTTP, streaming HTTP, Comet, Websockets, build your own protocol • Feature rich yet flexible
  • 13. Build with SBT • I hate writing XML • Very easy to get started • All the power of Scala in the build • Maven dependency management + more
  • 14. Testing with ScalaTest • Write specs not tests and excellent tool for BDD • Specs DSL very close to english • Many testing styles • Powerful matchers (“should be”) • Fixtures • Mock objects with ScalaMock
  • 15. Taking functional further • Shapeless • Scrap your boilerplate • Generic programming • Existential types • ScalaZ • Bringing Haskel to Scala • Monads, Functors and all the theory!
  • 16. Thank you! • http://www.tuplejump.com/ • http://github.com/tuplejump/ • http://tuplejump.github.com/calliope/ • http://tuplejump.github.com/stargate/ • @tuplejump on twitter