2. ● Ganesha Yadiyala
● Big data consultant at datamantra.io
● Consultant in Spark and Scala
● ganeshayadiyala@gmail.com
3. Agenda
● Problem Statement
● Business view
● Why Spark
● Thinking REST
● Load API
● Transform API
● Machine learning
● Pipeline API
● Save API
4. Problem Statement
Build a generic solution that can be used to transform data
and then analyse it to extract useful results.
5. Business view
● This is the era of big data.
● All companies are trying to extract something useful from
their data and solve problems with it.
● Many frameworks exist in the big data space, but we need
a tool that leverages most of them and solves problems
easily.
● A general solution or tool that can solve many of these
problems would be a big plus.
6. Why we used Spark
There are many big data frameworks out there that can be
used for data analysis, but we chose Spark because of its:
● Capability to handle multiple data sources
● Easy binding with external data
● Good support for machine learning through Spark ML and
Spark MLlib
7. Thinking REST
To expose all this transformation and analysis we provide a
REST API because:
● It minimises the coupling between client and server
● Different clients can use the REST API to interact with the
tool
● We used Akka-http for the REST service
8. Akka-http
Akka-http is an actor-based toolkit for interacting with web
services and clients.
● It is also written in Scala and uses the same configuration
management library (Typesafe Config) as Spark
● It is actor and Future based
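As a flavour of the toolkit, here is a minimal Akka-http server
sketch (Akka HTTP 10.0.x-era API); the route and port are
illustrative, not the tool's actual endpoints:

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import akka.stream.ActorMaterializer

object RestServer extends App {
  // One actor system and materializer back both routes and streams
  implicit val system = ActorSystem("rest-server")
  implicit val materializer = ActorMaterializer()

  // Illustrative health-check route, not an endpoint from the talk
  val route = path("ping") {
    get { complete("pong") }
  }

  Http().bindAndHandle(route, "0.0.0.0", 8080)
}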
10. Rest server design
● Instead of going with spark-jobserver we built our own
REST server
● Once the REST server is started, a Spark context is
created
● All configuration is passed to the Spark context through
Typesafe Config during its creation
● The same context is used for all operations
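A minimal sketch of that context creation; the Typesafe Config
key names (spark.master, spark.app-name) are assumptions, not
the tool's actual keys:

import com.typesafe.config.ConfigFactory
import org.apache.spark.{SparkConf, SparkContext}

// Load application.conf via Typesafe Config and feed it to Spark
val config = ConfigFactory.load()
val sparkConf = new SparkConf()
  .setMaster(config.getString("spark.master"))
  .setAppName(config.getString("spark.app-name"))

// Created once at server startup and shared by all requests
val sc = new SparkContext(sparkConf)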
11. Loading from different sources
We supported different types of data sources:
● CSV data source
● JSON data source
● Parquet data source
● XML data source
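A sketch of those file-format loads through Spark's
DataFrameReader (Spark 2.x style; the paths are placeholders,
and XML needs the external spark-xml package):

// Assuming an existing SparkSession named spark
val csvDf = spark.read.option("header", "true").csv("data/input.csv")
val jsonDf = spark.read.json("data/input.json")
val parquetDf = spark.read.parquet("data/input.parquet")
val xmlDf = spark.read
  .format("com.databricks.spark.xml")  // external spark-xml package
  .option("rowTag", "row")
  .load("data/input.xml")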
12. Loading from different sources
We also supported external sources such as:
● MongoDB
● Kafka
● JDBC
● Cassandra
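Most of these plug in as DataFrame sources through their
connector packages. A sketch with illustrative connection
options (Kafka is normally consumed as a stream, so it is
omitted here):

val mongoDf = spark.read
  .format("com.mongodb.spark.sql")  // mongo-spark connector
  .option("uri", "mongodb://localhost/db.collection")
  .load()
val jdbcDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/db")
  .option("dbtable", "events")
  .load()
val cassandraDf = spark.read
  .format("org.apache.spark.sql.cassandra")  // spark-cassandra-connector
  .options(Map("keyspace" -> "ks", "table" -> "events"))
  .load()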
13. Transformation
In the big data world, the data coming into the system often
cannot be used as it is; we may have to transform it as
needed for the operation.
We provide REST APIs to do these transformations, which
internally call Spark DataFrame APIs (see the sketch below).
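As a sketch of how a REST call maps onto a DataFrame call,
here is a hypothetical cast endpoint; the route name, the
query parameters, and the in-scope DataFrame df are all
assumptions, not the tool's actual API:

// Hypothetical route: POST /transform/cast?column=price&to=double
val castRoute =
  path("transform" / "cast") {
    post {
      parameters("column", "to") { (column, to) =>
        // The REST layer only parses parameters; Spark does the work
        val casted = df.withColumn(column, df(column).cast(to))
        complete(s"casted column '$column' to $to")
      }
    }
  }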
14. Example
Some of the transformations we provided are:
● Cast - cast the datatype of a column
● Filter - filter based on a formula or condition
● Aggregation - max, min, sum, median, etc.
● Joins - joining two datasets
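The DataFrame calls these endpoints delegate to look roughly
like this (column names and the DataFrames df, left and right
are assumptions for illustration):

import org.apache.spark.sql.functions._

val casted = df.withColumn("price", col("price").cast("double"))  // Cast
val filtered = df.filter(col("price") > 100)                      // Filter
val aggregated = df.groupBy("category")
  .agg(max("price"), min("price"), sum("price"))                  // Aggregation
val joined = left.join(right, Seq("id"))                          // Join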
15. Machine learning - Spark ML
Spark ML provides a higher-level API which is built on top of
the DataFrame.
● We did not use MLlib because it is built on top of the
RDD.
● We provided a REST API which talks to these ML APIs.
16. Example
Some of the ML APIs we provided are:
● Linear regression
● Decision tree (regressor and classifier)
● Ridge regression
● KMeans etc...
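A minimal spark.ml linear regression sketch; the feature
columns, the input DataFrame rawDf, and the hyperparameter
values are illustrative. Ridge regression is the same
estimator with elasticNetParam set to 0 (pure L2 penalty):

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

// spark.ml expects a single vector column of features plus a label
val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))
  .setOutputCol("features")
val training = assembler.transform(rawDf).select("features", "label")

val lr = new LinearRegression().setMaxIter(10).setRegParam(0.1)
val model = lr.fit(training)
val predictions = model.transform(training)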
17. Challenges in Spark ML
● It was very difficult to write a generic API because not
all ML algorithms expect similar inputs
● Not all the APIs are documented properly
● Validating the types of the columns that can be passed to
these APIs is really difficult
18. Save API
Once the transformation is done or the ML step produces
output, the user may want to save the result. We support:
● text
● JSON
● Parquet
● MongoDB
● Cassandra etc...
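These map onto Spark's DataFrameWriter; a sketch with
placeholder output locations and connector options:

// text output expects a single string column
result.select("value").write.text("out/result.txt")
result.write.mode("overwrite").json("out/result.json")
result.write.parquet("out/result.parquet")
result.write
  .format("com.mongodb.spark.sql")
  .option("uri", "mongodb://localhost/db.results")
  .save()
result.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "results"))
  .mode("append")  // the Cassandra connector requires append mode
  .save()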
19. Pipeline and scheduling
We also implemented a pipeline API which pipes together the
load, transform and ML APIs (a toy sketch follows).
If the user wants to run these operations at a scheduled
time, that is possible through the schedule API we have
provided.
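A toy sketch of the pipelining idea: each stage is a
DataFrame-to-DataFrame function and the stages run in order.
The real API is driven over REST, so this only illustrates
the composition:

import org.apache.spark.sql.DataFrame

type Stage = DataFrame => DataFrame

// Fold the input through the stages in order
def runPipeline(input: DataFrame, stages: Seq[Stage]): DataFrame =
  stages.foldLeft(input)((df, stage) => stage(df))

// e.g. runPipeline(csvDf, Seq(_.filter("price > 100"), _.select("id", "price")))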
20. Summary
No single solution can solve all big data problems, but we
tried to build a tool that is generic enough to let you write
your own transformations on data, analyse it, and solve many
of these problems.