This document summarizes a talk on Python in the big data ecosystem using the SMACK stack. SMACK stands for Spark, Mesos, Akka, Cassandra, and Kafka. Spark provides fast in-memory processing; Kafka handles data streaming; Cassandra provides scalable data storage across multiple machines; Mesos provides a containerized environment for scalability and management; Akka supports high concurrency. The talk outlines how SMACK suits mixed volume/velocity data, ETL/ELT processes, and near-real-time analytics at scale, gives examples of using each tool in the stack, and discusses when SMACK is applicable.
1. Python in the Big Data Ecosystem
Nicholas Lu (Chee Seng)
PyCon Malaysia 2017
2. About me:
Physics and Mathematics major. ETL developer for Warner Chappell. A glowing
passion for the yellow-elephant (Hadoop) ecosystem. A pip and apt-get guy. Uses vim and
tab.
github.com/lucheeseng827
3. Why do we need Py
in the Tonne
A world of heavy JVM and low-level languages: performance vs.
simplicity
The total volume of data is immense
RAM is getting cheaper
Less code = fewer errors = less development time
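The "less code = fewer errors" point can be seen with a tiny example: counting words over a set of lines takes a few lines of standard-library Python, where a JVM language would typically need classes and boilerplate. The input lines here are made up for illustration.

```python
# Counting words across lines with only the standard library:
# a one-expression job in Python.
from collections import Counter

lines = [
    "spark kafka cassandra",
    "kafka mesos akka",
    "spark spark kafka",
]

counts = Counter(word for line in lines for word in line.split())
print(counts["spark"])  # 3
print(counts["kafka"])  # 3
```

The same conciseness is why PySpark's API feels natural: transformations read like ordinary Python comprehensions.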
6. 1. Intro to SMACK
➔ Spark
In-memory processing makes things
faster and more efficient.
➔ Kafka
How many straws are we using
to drain a water tank?
➔ Cassandra
Storing data across multiple computers
makes it faster.
7. ➔ Mesos
A containerized environment for
ease of scalability and
management.
➔ Akka
High concurrency for better utilization
and more effective processes.
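The concurrency model behind Akka is the actor: each actor owns a mailbox and processes one message at a time, so no locks on shared state are needed. Below is a minimal sketch of that idea using only Python's standard library; real Akka actors run on the JVM, and the `CounterActor` class here is invented for illustration.

```python
# A toy actor: a mailbox (queue) drained by a single worker thread,
# so only that thread ever touches the actor's state.
import threading
import queue

class CounterActor:
    def __init__(self):
        self.mailbox = queue.Queue()
        self.total = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            msg = self.mailbox.get()
            if msg is None:      # poison pill: stop the actor
                break
            self.total += msg    # safe: no other thread mutates total

    def tell(self, msg):
        self.mailbox.put(msg)    # fire-and-forget, like Akka's tell

    def stop(self):
        self.mailbox.put(None)
        self._thread.join()

actor = CounterActor()
for n in range(1, 101):
    actor.tell(n)
actor.stop()
print(actor.total)  # 5050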
15. Dealing with mixed volume and velocity
Doing ETL/ELT (fixed schedule, moving data around)
Preferring speedy micro-batches over classic batch processing (seconds vs.
minutes)
Planning to add more features over time
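The micro-batch vs. classic-batch trade-off above can be sketched in plain Python: instead of accumulating the whole dataset before processing, records are grouped into small fixed-size batches and handled as they arrive, which is the model Spark Streaming uses. The batch size and event stream here are illustrative values.

```python
# Group an incoming event stream into small batches, yielding each
# batch as soon as it fills (seconds of latency, not minutes).
def micro_batches(events, batch_size):
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:            # flush the final partial batch
        yield batch

events = range(10)
batches = list(micro_batches(events, batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```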
18. 2. Flow
The sequence of the data-processing flow:
➔ Pipe them in
Show me the data.
➔ Collect and Subscribe
Customer data on channel 4 and
finance on channel 2.
➔ Process in Batch
Release the Kraken!
➔ Process On-The-Go
Near-real-time processing for higher-
urgency data.
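The "collect and subscribe" step above can be sketched as a toy publish/subscribe router: records are routed to named channels (as Kafka topics would be) and consumers read only the channel they subscribed to. The channel names and records below are made up for illustration; a real pipeline would use a Kafka client instead of an in-memory dict.

```python
# A toy pub/sub: each channel is a list of records keyed by name.
from collections import defaultdict

channels = defaultdict(list)

def publish(channel, record):
    channels[channel].append(record)

def subscribe(channel):
    return list(channels[channel])

# Customer data goes to channel 4, finance to channel 2.
publish("channel-4", {"customer": "alice", "plan": "pro"})
publish("channel-2", {"account": "ops", "amount": 120.0})
publish("channel-4", {"customer": "bob", "plan": "free"})

print(len(subscribe("channel-4")))          # 2
print(subscribe("channel-2")[0]["amount"])  # 120.0
```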
32. Then, Marcos discovered
SMACK
His interest in Python was
completely revived.
He's able to give every project a great
SMACK: projects that give clients
fast analytics at scale.
33. What’s next?
Flink implementation
Apache Beam implementation
ML implementation
Implementation of caching
DataFrames
SQL in Spark Streaming
DC/OS (multi-cloud tenancy)
Many more
35. Thank you!
To help make this demo better,
please send feedback to
lu.cheeseng827@gmail.com
Editor's notes
Problem statement
How does Python fare in the big data world?
A world of heavy JVM and low-level languages: performance vs. simplicity
The total volume of data is immense
RAM is getting cheaper
When you want to scale up processing speed, handling a high bandwidth of logs and transactions
Explain what is happening in the backend, from data collection onward