Data Science for Folks Without (or With!) a Ph.D.

Merely having more data does not
give Amazon a strategic advantage.

Data Science for Folks Without (or With!) a Ph.D.
Douglas Starnes
Kansas City Developers Conference 2019

Who Am I?
• Memphis, TN area
• Polyglot ninja
• Co-director of Memphis Python User Group
• Conference Speaker
• Pluralsight Author

So what does it take to be a
data scientist?

Ask 10 data scientists what they do
And you’ll get 20 different answers.

Data science is
multidisciplinary

Why Python
Simple, clean, easy to learn
‘Close to the metal’
‘Can keep it in my head’
Cross-platform and open source
Vibrant and diverse community support
Become dangerous in a weekend
And useful in a week

Why Python
Hello World in Java
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World");
    }
}
Hello World in Python
print('Hello, World')

Why Python
Reverse String in Java
import java.util.Scanner;

class ReverseofaString
{
    public static void main(String[] arg)
    {
        ReverseofaString rev = new ReverseofaString();
        Scanner sc = new Scanner(System.in);
        System.out.print("Enter a string : ");
        String str = sc.nextLine();
        System.out.println("Reverse of a String is : " + rev.reverse(str));
    }

    static String reverse(String s)
    {
        String rev = "";
        for (int j = s.length(); j > 0; --j)
        {
            rev = rev + (s.charAt(j - 1));
        }
        return rev;
    }
}
Reverse String in Python
def reverse(s):
    reverse_string = ''
    for c in s:
        reverse_string = c + reverse_string
    return reverse_string

word = input('Enter a string')
print(reverse(word))

Why Python
Reverse String in Java
import java.util.Scanner;

class ReverseofaString
{
    public static void main(String[] arg)
    {
        ReverseofaString rev = new ReverseofaString();
        Scanner sc = new Scanner(System.in);
        System.out.print("Enter a string : ");
        String str = sc.nextLine();
        System.out.println("Reverse of a String is : " + rev.reverse(str));
    }

    static String reverse(String s)
    {
        String rev = "";
        for (int j = s.length(); j > 0; --j)
        {
            rev = rev + (s.charAt(j - 1));
        }
        return rev;
    }
}
Reverse String in Python
word = input('Enter a string')
print("".join(reversed(word)))
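A third spelling, not on the slide, is a slice with a negative step. A minimal sketch; the function name `reverse_slice` is only illustrative:

```python
def reverse_slice(s):
    """Return s reversed by slicing with a step of -1."""
    return s[::-1]


word = 'Hello, World'
# The slice agrees with the join/reversed version shown on the slide.
assert reverse_slice(word) == ''.join(reversed(word))
print(reverse_slice(word))
```

All three Python versions return the same string; which one reads best is largely a matter of taste.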

Wait! Does this mean
I’m not finished with
school?

“The Jupyter Notebook is an open-source web
application that allows you to create and share
documents that contain live code, equations,
visualizations and narrative text.”
Interactive computing environment
Based on IPython

IPython enhances the default Python REPL
Automatic Indentation
Syntax highlighting
Tab Completion
And more!

“NumPy is the fundamental package for scientific
computing with Python.”
Arrays
Random number generation

>>> list_fib = [0, 1, 1, 2, 3, 5, 8]
>>> list_mat = [
        [1, 2, ... 99, 100],
        [2, 3, ... 100, 101],
        ...
        [100, 101, ... 199, 200]
    ]
>>> list_mat += 1
TypeError: 'int' object is not iterable
>>> for row in list_mat:
...     for idx, el in enumerate(row):
...         row[idx] = el + 1
>>> [1, 2, 3] + [4, 5, 6]
[1, 2, 3, 4, 5, 6]
>>> [1, 2, 3] - [4, 5, 6]
TypeError: unsupported operand type(s) for -: 'list' and 'list'
>>> [x - y for (x, y) in zip([1, 2, 3], [4, 5, 6])]
[-3, -3, -3]

ndarray
>>> import numpy as np
>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> b = a + 1
>>> b
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
>>> a + b
array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19])
>>> a - b
array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1])
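To put the list workaround and the ndarray version side by side, here is a small sketch. The variable names are illustrative; the point is only that both routes give the same numbers:

```python
import numpy as np

# Plain lists: every elementwise operation needs an explicit loop.
list_a = list(range(10))
list_b = [el + 1 for el in list_a]
list_diff = [x - y for x, y in zip(list_a, list_b)]

# ndarray: the scalar 1 is applied to every element, and
# subtraction between two arrays is elementwise.
a = np.arange(10)
b = a + 1
arr_diff = a - b

assert list_diff == arr_diff.tolist()
print(arr_diff)
```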

But I thought ‘nd’ meant
‘n-dimensional’?

>>> a = np.arange(18)
>>> a.shape
(18,)
>>> b = a.reshape(6, 3)
>>> b
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14],
       [15, 16, 17]])
>>> mat = np.random.randint(0, 10, 18).reshape(6, 3)
>>> mat
array([[7, 7, 6],
       [5, 4, 2],
       [6, 0, 0],
       [8, 6, 7],
       [7, 8, 4],
       [3, 6, 3]])
>>> mat + 1
array([[8, 8, 7],
       ...
       [4, 7, 4]])
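Once an array has more than one dimension, most reductions take an `axis` argument. A short sketch on the same 6 x 3 array; the use of `-1` in `reshape` and the `axis` sums are additions for illustration, not from the slides:

```python
import numpy as np

a = np.arange(18)
b = a.reshape(6, 3)       # 6 rows, 3 columns
c = a.reshape(-1, 3)      # -1 asks NumPy to infer the row count

col_totals = b.sum(axis=0)   # collapse the rows: one total per column
row_totals = b.sum(axis=1)   # collapse the columns: one total per row

print(b.ndim, b.shape)
print(col_totals)
print(row_totals)
print(np.shares_memory(a, b))  # here reshape returned a view, not a copy
```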

>>> mat
array([[7, 7, 6],
       [5, 4, 2],
       [6, 0, 0],
       [8, 6, 7],
       [7, 8, 4],
       [3, 6, 3]])
>>> mat[1][1]
4
>>> mat[1, 1]
4
>>> mat[:, 1]
array([7, 4, 0, 6, 8, 6])
>>> mat < 5
array([[False, False, False],
       ...
       [ True, False,  True]])
>>> mat[mat < 5]
array([4, 2, 0, 0, 4, 3, 3])
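A boolean mask is an array in its own right, so it can be counted, reduced, or passed to `np.where`. A sketch using the same values as `mat` above, typed in by hand because the original came from `np.random.randint`:

```python
import numpy as np

mat = np.array([[7, 7, 6],
                [5, 4, 2],
                [6, 0, 0],
                [8, 6, 7],
                [7, 8, 4],
                [3, 6, 3]])

mask = mat < 5
count_small = int(mask.sum())            # True counts as 1
clipped = np.where(mask, 5, mat)         # new array; mat itself is unchanged
rows_with_small = mat[mask.any(axis=1)]  # rows holding at least one value < 5

print(count_small)
print(clipped)
print(rows_with_small)
```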

>>> np.linspace(0, 10, 11)
array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.])
>>> np.linspace(0, 1, 11)
array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
>>> x = np.linspace(-np.pi*2, np.pi*2, 721)
>>> x
array([-6.28318531, -6.26573201, -6.24827872, -6.23082543, -6.21337214,
       -6.19591884, -6.17846555, -6.16101226, -6.14355897, -6.12610567,
       -6.10865238, -6.09119909, -6.0737458 , -6.0562925 , -6.03883921,
       ...
        6.10865238,  6.12610567,  6.14355897,  6.16101226,  6.17846555,
        6.19591884,  6.21337214,  6.23082543,  6.24827872,  6.26573201,
        6.28318531])
>>> y = np.sin(x)
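Why 721 points? Spanning -2π to 2π is 720 degrees, so 721 samples land on every whole degree. A quick check of that reading, using the real `retstep` option of `np.linspace`; the index arithmetic is my own:

```python
import numpy as np

# retstep=True also returns the spacing between samples.
x, step = np.linspace(-2 * np.pi, 2 * np.pi, 721, retstep=True)
y = np.sin(x)

one_degree = np.deg2rad(1)
print(np.isclose(step, one_degree))    # spacing is one degree in radians

quarter_turn = 360 + 90                # index 360 is x = 0, so this is x = pi/2
print(np.isclose(x[quarter_turn], np.pi / 2))
print(np.isclose(y[quarter_turn], 1.0))
```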

“pandas is an open source, BSD-licensed library providing
high-performance, easy-to-use data structures and data
analysis tools for the Python programming language.”
A more intellectually palatable API on top of numpy
No more ‘big globs of numbers’ to worry about

>>> import pandas as pd
>>> df = pd.read_csv('dow_jones_index.csv')
>>> df
    quarter  stock       date    open    high     low   close     volume
0         1     AA   1/7/2011  $15.82  $16.72  $15.78  $16.42  239655616
1         1     AA  1/14/2011  $16.71  $16.71  $15.64  $15.97  242963398
2         1     AA  1/21/2011  $16.19  $16.38  $15.60  $15.79  138428495
3         1     AA  1/28/2011  $15.87  $16.63  $15.82  $16.13  151379173
4         1     AA   2/4/2011  $16.18  $17.39  $16.18  $17.14  154387761
5         1     AA  2/11/2011  $17.33  $17.48  $16.97  $17.37  114691279
6         1     AA  2/18/2011  $17.39  $17.68  $17.28  $17.28   80023895
7         1     AA  2/25/2011  $16.98  $17.15  $15.96  $16.68  132981863
8         1     AA   3/4/2011  $16.81  $16.94  $16.13  $16.58  109493077
9         1     AA  3/11/2011  $16.58  $16.75  $15.42  $16.03  114332562
10        1     AA  3/18/2011  $15.95  $16.33  $15.43  $16.11  130374108
>>> df.columns
Index(['quarter', 'stock', 'date', 'open', 'high', 'low', 'close', 'volume',
       'percent_change_price', 'percent_change_volume_over_last_wk',
       'previous_weeks_volume', 'next_weeks_open', 'next_weeks_close',
       'percent_change_next_weeks_price', 'days_to_next_dividend',
       'percent_return_next_dividend'],
      dtype='object')

>>> df['stock']
>>> df.columns[1:8]
>>> v = df.loc[:, df.columns[1:8]].copy()
>>> v
    stock       date    open    high     low   close     volume
0      AA   1/7/2011  $15.82  $16.72  $15.78  $16.42  239655616
1      AA  1/14/2011  $16.71  $16.71  $15.64  $15.97  242963398
2      AA  1/21/2011  $16.19  $16.38  $15.60  $15.79  138428495
3      AA  1/28/2011  $15.87  $16.63  $15.82  $16.13  151379173
4      AA   2/4/2011  $16.18  $17.39  $16.18  $17.14  154387761
5      AA  2/11/2011  $17.33  $17.48  $16.97  $17.37  114691279
6      AA  2/18/2011  $17.39  $17.68  $17.28  $17.28   80023895
7      AA  2/25/2011  $16.98  $17.15  $15.96  $16.68  132981863
8      AA   3/4/2011  $16.81  $16.94  $16.13  $16.58  109493077
9      AA  3/11/2011  $16.58  $16.75  $15.42  $16.03  114332562
10     AA  3/18/2011  $15.95  $16.33  $15.43  $16.11  130374108
>>> v.volume.max()
1453438639
>>> v.close[0]
'$16.42'

>>> for column in v.columns[2:6]:
...     v[column] = v[column].apply(lambda x: float(x[1:]))  # drop the leading '$'
>>> v
    stock       date    open    high     low   close     volume
0      AA   1/7/2011   15.82   16.72   15.78   16.42  239655616
1      AA  1/14/2011   16.71   16.71   15.64   15.97  242963398
2      AA  1/21/2011   16.19   16.38   15.60   15.79  138428495
3      AA  1/28/2011   15.87   16.63   15.82   16.13  151379173
4      AA   2/4/2011   16.18   17.39   16.18   17.14  154387761
5      AA  2/11/2011   17.33   17.48   16.97   17.37  114691279
6      AA  2/18/2011   17.39   17.68   17.28   17.28   80023895
7      AA  2/25/2011   16.98   17.15   15.96   16.68  132981863
8      AA   3/4/2011   16.81   16.94   16.13   16.58  109493077
9      AA  3/11/2011   16.58   16.75   15.42   16.03  114332562
>>> v[v.stock == 'DIS']
>>> v[v.stock == 'DIS']['close']
>>> close_index = v[v.stock == 'DIS']['close'].idxmax()
>>> v.loc[close_index, 'volume']
53096584
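The same clean-then-query pattern can be tried without the CSV. This sketch retypes the first four AA rows from the table above (a subset of the columns) into a DataFrame by hand. It uses the vectorized `.str` accessor instead of `apply`, and also parses the dates, which the slides do not do:

```python
import pandas as pd

# First four rows of the table above, entered by hand so the example
# does not depend on dow_jones_index.csv being present.
v = pd.DataFrame({
    'stock': ['AA', 'AA', 'AA', 'AA'],
    'date': ['1/7/2011', '1/14/2011', '1/21/2011', '1/28/2011'],
    'open': ['$15.82', '$16.71', '$16.19', '$15.87'],
    'close': ['$16.42', '$15.97', '$15.79', '$16.13'],
    'volume': [239655616, 242963398, 138428495, 151379173],
})

# Strip the dollar sign and convert the price columns to floats.
for column in ['open', 'close']:
    v[column] = v[column].str.lstrip('$').astype(float)

# Turn the date strings into real timestamps.
v['date'] = pd.to_datetime(v['date'], format='%m/%d/%Y')

# Volume traded in the week with the highest close for one stock.
aa = v[v.stock == 'AA']
best_week = aa['close'].idxmax()
print(v.loc[best_week, 'volume'])
```

`idxmax()` returns an index label rather than a value, which is why it can be handed straight back to `.loc`.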

“Matplotlib is a Python 2D plotting library which produces
publication quality figures in a variety of hardcopy formats
and interactive environments across platforms.”
Visualizations

>>> x = np.linspace(0, 2 * np.pi, 361)
>>> y = np.sin(x)
>>> import matplotlib.pyplot as plt
>>> plt.plot(x, y)
>>> y2 = np.cos(x)
>>> plt.plot(x, y)
>>> plt.plot(x, y2, color='r')

>>> plt.figure(figsize=(3, 6))  # height is 2x the width
>>> plt.subplot(2, 1, 1)  # 2 rows, 1 column, position 1
>>> plt.plot(x, y)
>>> plt.subplot(2, 1, 2)  # position 2
>>> plt.plot(x, y2, color='r')
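Matplotlib also has an object-oriented interface in which `plt.subplots` hands back the figure and its axes up front. A sketch of the same two-panel figure in that style. The `Agg` backend and the in-memory buffer are assumptions for running without a display, and the axis names are illustrative:

```python
import io

import matplotlib
matplotlib.use('Agg')   # assumption: no display available, render off-screen
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 361)

# 2 rows, 1 column; each Axes object is one panel.
fig, (ax_sin, ax_cos) = plt.subplots(2, 1, figsize=(3, 6), sharex=True)
ax_sin.plot(x, np.sin(x))
ax_sin.set_title('sin(x)')
ax_cos.plot(x, np.cos(x), color='r')
ax_cos.set_title('cos(x)')
fig.tight_layout()

# Write the PNG to memory; pass a file name instead to save to disk.
buf = io.BytesIO()
fig.savefig(buf, format='png')
```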

fns = [np.sin, np.cos, lambda x: x ** 2, lambda x: np.sin(x) ** 2,
       lambda x: np.cos(x) ** 2, np.log]
colors = list('rgbcmk')
markers = list('.ov+xd')
data = zip(fns, colors, markers)
plt.figure(figsize=(30, 20))
for i, (fn, color, marker) in enumerate(data):
    plt.subplot(2, 3, i + 1)  # 1-3 on first row, 4-6 on second
    xs = x[np.arange(0, 360, 6)]  # every sixth sample
    plt.plot(xs, fn(xs), color=color, marker=marker)
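Two details of that loop are easy to miss: indexing with `np.arange(0, 360, 6)` picks the same points as a plain slice, and the sampled points start at 0, where `np.log` returns `-inf`. A small sketch of both; the `np.errstate` guard is my addition:

```python
import numpy as np

x = np.linspace(0, 2 * np.pi, 361)

idx = np.arange(0, 360, 6)
xs = x[idx]                              # index array: 60 selected points
assert np.array_equal(xs, x[0:360:6])    # a plain slice picks the same points

# log(0) is -inf and normally raises a RuntimeWarning; silence it here.
with np.errstate(divide='ignore'):
    logs = np.log(xs)

print(xs.size, logs[0])
```

A non-finite point like this typically just leaves a gap at the start of the curve.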

>>> import seaborn as sns
>>> sns.distplot(data, rug=True)
>>> sns.jointplot(x='x', y='y', data=df)
>>> sns.catplot(data=data, kind='box')
>>> sns.catplot(data=data, kind='violin')

scikit-learn is an open source Python machine learning package with
implementations of many popular machine learning algorithms.
>>> from sklearn import linear_model
>>> from sklearn.datasets import make_blobs
>>> from sklearn.naive_bayes import GaussianNB
>>> (X_train, X_test), (y_train, y_test) = get_data()  # get_data() is not defined on the slide
>>> reg = linear_model.LinearRegression()
>>> reg.fit(X_train, y_train)
>>> pred = reg.predict(X_test)
>>> points, targets = make_blobs()
>>> clf = GaussianNB()
>>> clf.fit(points, targets)
>>> clf.predict(np.array([5, -10]).reshape(1, -1))
array([0])
>>> clf.predict(np.array([10, 10]).reshape(1, -1))
array([1])
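For a version that runs end to end, here is a sketch of the fit/predict pattern with a held-out test set. The cluster centres, sample count, and random seeds are my assumptions, chosen so the two query points from the slide fall in different clusters. They are not the data behind the slide:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assumption: two well-separated clusters centred on the slide's query points.
points, targets = make_blobs(n_samples=200, centers=[[5, -10], [10, 10]],
                             cluster_std=1.0, random_state=0)

# Hold back a quarter of the points to check the fitted model.
X_train, X_test, y_train, y_test = train_test_split(
    points, targets, test_size=0.25, random_state=0)

clf = GaussianNB()
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)      # fraction of test points classified correctly
queries = np.array([[5, -10], [10, 10]])  # one row per point to classify
predicted = clf.predict(queries)
print(accuracy, predicted)
```

Every scikit-learn estimator follows the same `fit` then `predict` shape, so swapping `GaussianNB` for another classifier changes only one line.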

SymPy is a Python library for symbolic mathematics. It aims to
become a full-featured computer algebra system (CAS) while
keeping the code as simple as possible in order to be
comprehensible and easily extensible.
>>> from sympy import symbols, init_printing, diff, Derivative
>>> (x, y) = symbols('x y')
>>> z = x ** 2 + y
>>> z
x**2 + y
>>> z.subs([(x, 3), (y, 4)])
13
>>> init_printing()
>>> z
>>> diff(4 * x ** 3 + 2 * x ** 2 - 5, x)
>>> Derivative(2 * x ** 2, x)
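The last two calls look alike but behave differently: `diff` evaluates right away, while `Derivative` stays an unevaluated object until `.doit()` is called. A plain-text sketch of the results, without `init_printing`:

```python
from sympy import Derivative, diff, symbols

x, y = symbols('x y')

z = x ** 2 + y
value = z.subs([(x, 3), (y, 4)])           # 3**2 + 4

d = diff(4 * x ** 3 + 2 * x ** 2 - 5, x)   # evaluated immediately
unevaluated = Derivative(2 * x ** 2, x)    # stays symbolic
evaluated = unevaluated.doit()             # now actually differentiate

print(value)
print(d)
print(evaluated)
```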

Statsmodels is a Python module that provides classes and
functions for the estimation of many different statistical
models, as well as for conducting statistical tests, and statistical
data exploration.
>>> import statsmodels.api as sm
>>> sm.OLS(y, X).fit().summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.215
Model:                            OLS   Adj. R-squared:                  0.198
Method:                 Least Squares   F-statistic:                     13.25
Date:                Mon, 24 Jun 2019   Prob (F-statistic):           8.15e-06
Time:                        17:30:48   Log-Likelihood:                -15.067
No. Observations:                 100   AIC:                             36.13
Df Residuals:                      97   BIC:                             43.95
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.5386      0.077     19.869      0.000       1.385       1.692
x1            -0.0672      0.101     -0.664      0.508      -0.268       0.134
x2             0.5048      0.099      5.090      0.000       0.308       0.702
==============================================================================
Omnibus:                       24.327   Durbin-Watson:                   2.228
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                6.273
Skew:                           0.253   Prob(JB):                       0.0434
Kurtosis:                       1.883   Cond. No.                         5.44
==============================================================================
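To see what the `coef` column and `R-squared` mean, the same least-squares fit can be done by hand. This sketch deliberately uses `np.linalg.lstsq` rather than statsmodels, and the data is synthetic with a made-up true model, so its numbers will not match the summary above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Hypothetical true model: y depends on x2 but not on x1.
y = 1.5 + 0.5 * x2 + rng.normal(scale=0.3, size=n)

# Design matrix with a leading column of ones for the intercept ("const").
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares: minimise the sum of squared residuals.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ coef
r_squared = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

print(coef)
print(r_squared)
```

statsmodels adds the standard errors, t statistics, and diagnostics around these point estimates.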

R
• The de facto data science language
• Open source implementation of the S language
• Does one thing and does it well
• Very quirky

Julia
• High-level and fast
• Can rival native languages
• Like a Pythonic R (kinda sorta)
• Relatively new

DATA IS WORTHLESS
ANALYTICS ARE WORTH PENNIES
DECISIONS ARE WORTH DOLLARS

http://bit.ly/jupyter-notebook

THANK YOU!
douglas@douglasstarnes.com
@poweredbyaltnet
http://douglasstarnes.com
(returning soon)
