This document summarizes a talk on Python in the big data ecosystem using the SMACK stack. SMACK stands for Spark, Mesos, Akka, Cassandra, and Kafka. Spark provides fast in-memory processing; Kafka handles data streaming; Cassandra provides scalable data storage across multiple machines; Mesos provides a containerized environment for scalability and management; Akka supports high concurrency. The talk outlines how SMACK suits mixed volume/velocity data, ETL/ELT processes, and near-real-time analytics at scale, gives examples of using each tool in the stack, and discusses when SMACK is applicable.
1. Python in the Big Data Ecosystem
Nicholas Lu (Chee Seng)
PyCon Malaysia 2017
2. About me:
Physics and Mathematics major. ETL developer for Warner Chappell. A glowing
passion for the yellow-elephant (Hadoop) ecosystem. A pip and apt-get guy. Uses vim and
tab.
github.com/lucheeseng827
3. Why do we need Py
in the Tonne
A world of heavy JVM and low-level languages: performance vs.
simplicity
The total volume of data is immense
RAM is getting cheaper
Less code = fewer errors = less development time
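The "less code = fewer errors" point can be seen with a tiny example: counting words over a set of lines takes a few lines of standard-library Python, where a JVM language would typically need classes and boilerplate. The input lines here are made up for illustration.

```python
# Counting words across lines with only the standard library:
# a one-expression job in Python.
from collections import Counter

lines = [
    "spark kafka cassandra",
    "kafka mesos akka",
    "spark spark kafka",
]

counts = Counter(word for line in lines for word in line.split())
print(counts["spark"])  # 3
print(counts["kafka"])  # 3
```

The same conciseness is why PySpark's API feels natural: transformations read like ordinary Python comprehensions.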
6. 1. Intro to SMACK
➔ Spark
In-memory processing makes things
faster and more efficient.
➔ Kafka
How many straws are we using
to drain a water tank?
➔ Cassandra
Storing data across multiple computers
makes it faster.
7. ➔ Mesos
A containerized environment for
ease of scalability and
management.
➔ Akka
High concurrency for better utilization
and more effective processes.
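The concurrency model behind Akka is the actor: each actor owns a mailbox and processes one message at a time, so no locks on shared state are needed. Below is a minimal sketch of that idea using only Python's standard library; real Akka actors run on the JVM, and the `CounterActor` class here is invented for illustration.

```python
# A toy actor: a mailbox (queue) drained by a single worker thread,
# so only that thread ever touches the actor's state.
import threading
import queue

class CounterActor:
    def __init__(self):
        self.mailbox = queue.Queue()
        self.total = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            msg = self.mailbox.get()
            if msg is None:      # poison pill: stop the actor
                break
            self.total += msg    # safe: no other thread mutates total

    def tell(self, msg):
        self.mailbox.put(msg)    # fire-and-forget, like Akka's tell

    def stop(self):
        self.mailbox.put(None)
        self._thread.join()

actor = CounterActor()
for n in range(1, 101):
    actor.tell(n)
actor.stop()
print(actor.total)  # 5050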
15. Dealing with mixed volume and velocity
Doing ETL/ELT (fixed schedule, moving data around)
Preferring speedy micro-batches over classic batch processing (seconds vs.
minutes)
Planning to add more features over time
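The micro-batch vs. classic-batch trade-off above can be sketched in plain Python: instead of accumulating the whole dataset before processing, records are grouped into small fixed-size batches and handled as they arrive, which is the model Spark Streaming uses. The batch size and event stream here are illustrative values.

```python
# Group an incoming event stream into small batches, yielding each
# batch as soon as it fills (seconds of latency, not minutes).
def micro_batches(events, batch_size):
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:            # flush the final partial batch
        yield batch

events = range(10)
batches = list(micro_batches(events, batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```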
18. 2. Flow
The sequence of the data-processing flow:
➔ Pipe them in
Show me the data.
➔ Collect and Subscribe
Customer data on channel 4 and
finance on channel 2.
➔ Process in Batch
Release the Kraken!
➔ Process On-The-Go
Near-real-time processing for higher-
urgency data.
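The "collect and subscribe" step above can be sketched as a toy publish/subscribe router: records are routed to named channels (as Kafka topics would be) and consumers read only the channel they subscribed to. The channel names and records below are made up for illustration; a real pipeline would use a Kafka client instead of an in-memory dict.

```python
# A toy pub/sub: each channel is a list of records keyed by name.
from collections import defaultdict

channels = defaultdict(list)

def publish(channel, record):
    channels[channel].append(record)

def subscribe(channel):
    return list(channels[channel])

# Customer data goes to channel 4, finance to channel 2.
publish("channel-4", {"customer": "alice", "plan": "pro"})
publish("channel-2", {"account": "ops", "amount": 120.0})
publish("channel-4", {"customer": "bob", "plan": "free"})

print(len(subscribe("channel-4")))          # 2
print(subscribe("channel-2")[0]["amount"])  # 120.0
```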
32. Then, Marcos discovered
SMACK
His interest in Python was
completely revived.
He's able to give every project a great
SMACK: projects that give clients
fast analytics at scale.
33. What’s next?
Flink implementation
Apache Beam implementation
ML implementation
Implementation of caching
DataFrames
SQL in Spark Streaming
DC/OS (multi-cloud tenancy)
Many more
35. Thank you!
To help make this demo better,
please send feedback to
lu.cheeseng827@gmail.com
Editor's notes
Problem statement
How does Python fare in the big data world?
A world of heavy JVM and low-level languages: performance vs. simplicity
The total volume of data is immense
RAM is getting cheaper
When you want to scale up processing speed, handling a high bandwidth of logs and transactions
Explain what is happening in the backend, from data collection onward