This document provides an overview of various Python packages for data science and analytics including pandas, pyarrow, dask, matplotlib, numpy, and scipy. It introduces the purpose and basic usage of each package. The document also lists Jupyter notebooks demonstrating examples using random walk simulations, Monte Carlo pi estimation, and data manipulation. Source code and homework examples are provided in a GitHub repository for further practice.
2. About me
• Education
• NCU (MIS), NCCU (CS)
• Work Experience
• Telecom big data Innovation
• AI projects
• Retail marketing technology
• User Group
• TW Spark User Group
• TW Hadoop User Group
• Taiwan Data Engineer Association Director
• Research
• Big Data / ML / AIoT / AI columnist
5. numpy
• Provides high-performance numerical calculations
• Linear algebra, Fourier transform, …
• numpy array operations run in compiled code, so they are typically much
faster than equivalent pure-Python loops. numpy is generally used for basic
operations such as sorting, indexing, and array manipulation
• Manipulating numpy
• Create array
• Create matrix
• Matrix slicing
• Matrix axis
• Matrix computation
• add, multiply
• dot
numpy.ipynb
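The operations listed on this slide can be sketched as follows. This is a minimal illustration, not the content of numpy.ipynb; the array values and variable names are made up for the example.

```python
import numpy as np

# Create a 1-D array and a 2-D array (matrix).
a = np.array([1, 2, 3])
m = np.arange(1, 7).reshape(2, 3)   # [[1, 2, 3], [4, 5, 6]]

# Matrix slicing: the first row, and the last two columns of every row.
first_row = m[0, :]                 # [1, 2, 3]
last_two_cols = m[:, 1:]            # [[2, 3], [5, 6]]

# Matrix axis: axis=0 sums down each column, axis=1 sums across each row.
col_sums = m.sum(axis=0)            # [5, 7, 9]
row_sums = m.sum(axis=1)            # [6, 15]

# Element-wise add and multiply (a is broadcast across each row of m).
added = np.add(m, a)                # [[2, 4, 6], [5, 7, 9]]
scaled = np.multiply(m, a)          # [[1, 4, 9], [4, 10, 18]]

# dot: a (2x3) matrix times a (3x2) matrix gives a (2x2) matrix.
product = np.dot(m, m.T)            # [[14, 32], [32, 77]]
```

Note that `multiply` is element-wise, while `dot` is the matrix product; mixing the two up is a common source of shape errors.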
6. Random walk
• Two persons (A, B) bet on coin tosses, heads or tails
• If the toss is heads, A wins one dollar from B; if it is tails, A loses one dollar to B
• The goal is to investigate how the distribution of the total dollar amount
changes with the number of draws. (We found that the typical size of the
total dollar amount grows like the square root of the number of draws.)
[Figure: a persons × drawing-times matrix of toss outcomes (rows such as 1, 1, … and -1, 1, …) and the matching accumulated-amount matrix (rows such as 1, 2, … and -1, 0, …)]
random_walk.ipynb
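The coin-betting experiment can be sketched with numpy as below. This is an assumed reconstruction, not the notebook's code: the number of persons, the number of draws, and the seed are arbitrary choices, and the spread is measured here as the standard deviation across persons.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed only for reproducibility

n_persons, n_draws = 2000, 400        # illustrative sizes, not from the slides

# One row per person, one column per draw: +1 for heads, -1 for tails.
steps = rng.choice([-1, 1], size=(n_persons, n_draws))

# Accumulate along the drawing-times axis to get each running total.
totals = steps.cumsum(axis=1)

# Spread of the totals across persons after each draw.
spread = totals.std(axis=0)

# Compare the final spread with the square-root curve sqrt(n_draws).
ratio = spread[-1] / np.sqrt(n_draws)
print(f"spread after {n_draws} draws / sqrt({n_draws}) = {ratio:.3f}")
```

A ratio close to 1 is what the square-root behaviour predicts; plotting `spread` against `np.sqrt(np.arange(1, n_draws + 1))` shows the two curves tracking each other.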
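7. PI
• Monte Carlo method
• Scatter random points (x1, y1), (x2, y2), … over a square with an inscribed circle of radius 1
• The ratio of the circle area to the square area is π/4, so
π ≈ 4 × (points inside the circle) / (all points)
• Only the points inside the circle (distance from the center < 1) need to be counted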
pi.ipynb
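A minimal sketch of the Monte Carlo estimate, assuming points drawn uniformly from the unit square and tested against the quarter circle of radius 1 centered at the origin (the quarter circle gives the same π/4 area ratio as a full inscribed circle). The sample size and seed are arbitrary, and this is not necessarily how pi.ipynb does it.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed only for reproducibility
n_points = 1_000_000                  # illustrative sample size

# Uniform random points in the unit square [0, 1) x [0, 1).
x = rng.random(n_points)
y = rng.random(n_points)

# A point is inside the quarter circle when its distance to the origin is < 1.
inside = (x**2 + y**2) < 1.0

# Fraction inside approximates pi/4, so multiply by 4.
pi_estimate = 4.0 * inside.mean()
print(f"estimated pi = {pi_estimate:.4f}")
```

The error shrinks roughly like one over the square root of the number of points, so each extra digit of accuracy costs about 100 times more samples.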
8. scipy
• SciPy provides algorithms for optimization, integration, interpolation,
eigenvalue problems, algebraic equations, differential equations,
statistics and many other classes of problems.
• NumPy stands for Numerical Python while SciPy stands for Scientific
Python.
Ref: https://docs.scipy.org/doc/scipy/reference/linalg.html#module-scipy.linalg
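A few of the problem classes named above, sketched with small made-up inputs. The matrix, the integrand, and the objective function are illustrative assumptions chosen so the results are easy to check by hand.

```python
import numpy as np
from scipy import integrate, linalg, optimize

# Algebraic equations: solve 3x + y = 9 and x + 2y = 8.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
solution = linalg.solve(A, b)                 # x = 2, y = 3

# Eigenvalue problems: eigenvalues of the symmetric matrix A.
eigenvalues = linalg.eigvalsh(A)

# Integration: integral of sin(x) from 0 to pi, which equals 2.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# Optimization: minimize (t - 1.5)^2, whose minimum is at t = 1.5.
result = optimize.minimize_scalar(lambda t: (t - 1.5) ** 2)

print(solution, eigenvalues, area, result.x)
```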
9. pandas
• Works like an Excel spreadsheet
• Transforms data cells with formulas
• series
• dataframe
• iloc, loc
• groupby
• time-series
• visualization
pandas.ipynb
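The items on this slide can be sketched on a tiny made-up sales table. The column names, values, and dates are hypothetical, and this is not the content of pandas.ipynb.

```python
import pandas as pd

# Series: a labeled 1-D column of values.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# DataFrame: a table with a daily time-series index.
df = pd.DataFrame(
    {
        "store": ["north", "south", "north", "south"],
        "units": [3, 5, 2, 4],
        "price": [2.0, 1.5, 2.0, 1.5],
    },
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

# Spreadsheet-style formula: a new column computed from two others.
df["revenue"] = df["units"] * df["price"]

# iloc selects by position, loc selects by label.
first_units = df.iloc[0, 1]                                 # row 0, column 1
second_revenue = df.loc[pd.Timestamp("2024-01-02"), "revenue"]

# groupby: total revenue per store.
revenue_by_store = df.groupby("store")["revenue"].sum()

# Time series: units summed over 2-day windows.
two_day_units = df["units"].resample("2D").sum()

print(revenue_by_store)
print(two_day_units)
```

For the visualization item, `df["revenue"].plot()` draws a line chart when matplotlib is installed.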
10. pyarrow
• Apache Arrow is a development platform for in-memory analytics. It
contains a set of technologies that enable big data systems to store,
process and move data fast.
• The Arrow Python bindings (also named “PyArrow”) have first-class
integration with NumPy, pandas, and built-in Python objects. They are
based on the C++ implementation of Arrow.
Ref: https://arrow.apache.org/docs/python/index.html
pyarrow.ipynb
11. dask
• Dask is open source and freely available. It is developed in
coordination with other community projects like NumPy, pandas, and
scikit-learn.
Ref: https://dask.org/
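The slide does not show any Dask code, so the following is only a minimal sketch of how the NumPy-like side of Dask is commonly used: `dask.array` splits a large array into chunks and evaluates lazily. The array size and chunk size are arbitrary.

```python
import dask.array as da

# A 1000 x 1000 array of ones, split into sixteen 250 x 250 chunks.
x = da.ones((1000, 1000), chunks=(250, 250))

# Building the expression is lazy: nothing is computed yet.
total = (x + x.T).sum()

# compute() runs the task graph, chunk by chunk, and returns the result.
result = total.compute()
print(result)
```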
12. matplotlib
• Matplotlib is a comprehensive library for creating static, animated,
and interactive visualizations in Python
Ref: https://matplotlib.org/
matplotlib.ipynb
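A minimal static plot, sketched with made-up data. The `Agg` backend is selected so the script also runs without a display, and the figure is written to an in-memory buffer; both are choices for this example rather than anything taken from matplotlib.ipynb.

```python
import io

import matplotlib
matplotlib.use("Agg")                  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.set_title("Sine and cosine")
ax.legend()

# Save as PNG; passing a file name instead of the buffer writes to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=100)
```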
13. Reference
• Try using these packages to find all the prime numbers from 0 to
100000
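One possible approach to this exercise is a Sieve of Eratosthenes on a numpy boolean array. This is only a sketch of one solution, not the reference answer from the repository; the function name `primes_up_to` is hypothetical.

```python
import numpy as np


def primes_up_to(limit):
    """Return all primes p with 0 <= p <= limit as a numpy array."""
    is_prime = np.ones(limit + 1, dtype=bool)
    is_prime[:2] = False                       # 0 and 1 are not prime
    for n in range(2, int(limit ** 0.5) + 1):
        if is_prime[n]:
            # Cross out every multiple of n, starting from n * n.
            is_prime[n * n :: n] = False
    return np.flatnonzero(is_prime)


primes = primes_up_to(100_000)
print(len(primes), primes[:10])
```

The slice assignment crosses out all multiples of `n` in one vectorized step, which is where numpy helps compared with a nested Python loop.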