GOOD MORNING
DATA IS WORTHLESS
Merely having more data does not
give Amazon a strategic advantage.
ANALYTICS ARE WORTH PENNIES
DECISIONS ARE WORTH DOLLARS
VS
Data Science for Folks Without (or With!) a Ph.D.
Douglas Starnes
Kansas City Developers Conference 2019
Who Am I?
• Memphis, TN area
• Polyglot ninja
• Co-director of Memphis Python User Group
• Conference Speaker
• Pluralsight Author
So what does it take to be a
data scientist?
Ask 10 data scientists what they do
And you’ll get 20 different answers.
^
Data science is
multidisciplinary
Programming
Hi! I’m Python!
Why Python
Simple, clean, easy to learn
‘Close to the metal’
‘Can keep it in my head’
Cross-platform and open source
Vibrant and diverse community support
Become dangerous in a weekend
And useful in a week
Why Python
Hello World in Java
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World");
    }
}
Hello World in Python
print('Hello, World')
Why Python
Reverse String in Java
import java.util.Scanner;
class ReverseofaString
{
    public static void main(String[] arg)
    {
        ReverseofaString rev=new ReverseofaString();
        Scanner sc=new Scanner(System.in);
        System.out.print("Enter a string : ");
        String str=sc.nextLine();
        System.out.println("Reverse of a String is : "+rev.reverse(str));
    }
    static String reverse(String s)
    {
        String rev="";
        for(int j=s.length();j>0;--j)
        {
            rev=rev+(s.charAt(j-1));
        }
        return rev;
    }
}
Reverse String in Python
def reverse(s):
    reverse_string = ''
    for c in s:
        reverse_string = c + reverse_string
    return reverse_string

word = input('Enter a string')
print(reverse(word))
Reverse String in Python
word = input('Enter a string')
print("".join(reversed(word)))
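A third spelling, not shown on the slides, is the slice idiom `s[::-1]`. The sketch below puts the two slide versions next to it and checks that all three agree; the function names are made up for illustration.

```python
def reverse_loop(s):
    """Build the reversed string one character at a time (first slide version)."""
    reverse_string = ''
    for c in s:
        reverse_string = c + reverse_string
    return reverse_string


def reverse_join(s):
    """Join the characters produced by the built-in reversed() (second slide version)."""
    return "".join(reversed(s))


def reverse_slice(s):
    """Slice with a step of -1 (an alternative that is not on the slides)."""
    return s[::-1]


word = "Hello, World"
results = [reverse_loop(word), reverse_join(word), reverse_slice(word)]
print(results)
```

All three return the same string; the slice is probably the version most Python programmers would reach for.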
Mathematics
Business
Art
This is NOT
Photoshopped!
Wait! Does this mean
I’m not finished with
school?
www.anaconda.com/download
“The Jupyter Notebook is an open-source web
application that allows you to create and share
documents that contain live code, equations,
visualizations and narrative text.”
Interactive computing environment
Based on IPython
IPython enhances the default Python REPL
Automatic Indentation
Syntax highlighting
Tab Completion
And more!
Azure Notebooks
Google Colab
“NumPy is the fundamental package for scientific
computing with Python.”
Arrays
Random number generation
>>> list_fib = [0, 1, 1, 2, 3, 5, 8]
>>> list_mat = [
        [1, 2, ... 99, 100],
        [2, 3, ... 100, 101],
        ...
        [100, 101, ... 199, 200]
    ]
>>> list_mat += 1
TypeError: 'int' object is not iterable
>>> for row in list_mat:
        for idx, el in enumerate(row):
            row[idx] = el + 1
>>> sum = [1, 2, 3] + [4, 5, 6]
[1, 2, 3, 4, 5, 6]
>>> difference = [1, 2, 3] - [4, 5, 6]
TypeError: unsupported operand type(s)
>>> difference = [x - y for (x, y) in zip([1, 2, 3], [4, 5, 6])]
[-3, -3, -3]
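The list pitfalls above can be reproduced in a plain script. This is a small sketch that uses a 2x3 stand-in for the 100x100 matrix; the variable names `scalar_error` and `joined` are made up here.

```python
list_mat = [[1, 2, 3], [4, 5, 6]]  # small stand-in for the 100x100 matrix

# += on a list expects an iterable, so adding a scalar raises TypeError
try:
    list_mat += 1
except TypeError as err:
    scalar_error = str(err)

# The workaround from the slide: visit every element explicitly
for row in list_mat:
    for idx, el in enumerate(row):
        row[idx] = el + 1

# + concatenates two lists; element-wise subtraction needs zip
joined = [1, 2, 3] + [4, 5, 6]
difference = [x - y for x, y in zip([1, 2, 3], [4, 5, 6])]

print(scalar_error)
print(list_mat, joined, difference)
```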
ndarray
>>> import numpy as np
>>> a = np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> b = a + 1
array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> a + b
array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19])
>>> a - b
array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1])
But I thought ‘nd’ meant
‘n-dimensional’?
>>> a = np.arange(18)
>>> a.shape
(18,)
>>> b = a.reshape(6, 3)
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11],
[12, 13, 14],
[15, 16, 17]])
>>> mat = np.random.randint(0, 10, 18).reshape(6, 3)
array([[7, 7, 6],
[5, 4, 2],
[6, 0, 0],
[8, 6, 7],
[7, 8, 4],
[3, 6, 3]])
>>> mat + 1
array([[8, 8, 7],
...
[4, 7, 4]])
>>> mat
array([[7, 7, 6],
       [5, 4, 2],
       [6, 0, 0],
       [8, 6, 7],
       [7, 8, 4],
       [3, 6, 3]])
>>> mat[1][1]
4
>>> mat[1, 1]
4
>>> mat[:,1]
array([7, 4, 0, 6, 8, 6])
>>> mat < 5
array([[False, False, False],
       ...
       [ True, False,  True]])
>>> mat[mat < 5]
array([4, 2, 0, 0, 4, 3, 3])
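The indexing and masking calls can be tried on the same matrix by typing it in by hand instead of drawing random numbers. A minimal sketch follows; the assignment through a mask at the end is an extra that is not on the slides.

```python
import numpy as np

# The matrix from the slide, entered by hand so the results are reproducible
mat = np.array([[7, 7, 6],
                [5, 4, 2],
                [6, 0, 0],
                [8, 6, 7],
                [7, 8, 4],
                [3, 6, 3]])

second_column = mat[:, 1]   # every row, column index 1
mask = mat < 5              # boolean array with the same shape as mat
small = mat[mask]           # 1-D array of the elements where the mask is True

# A boolean mask also works on the left-hand side of an assignment
clipped = mat.copy()
clipped[clipped < 5] = 5    # raise every value below 5 up to 5

print(second_column)
print(small)
print(clipped.min())
```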
>>> np.linspace(0, 10, 11)
array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])
>>> x = np.linspace(-np.pi*2, np.pi*2, 721)
array([-6.28318531, -6.26573201, -6.24827872, -6.23082543, -6.21337214,
       -6.19591884, -6.17846555, -6.16101226, -6.14355897, -6.12610567,
       -6.10865238, -6.09119909, -6.0737458 , -6.0562925 , -6.03883921,
       ...
        6.10865238,  6.12610567,  6.14355897,  6.16101226,  6.17846555,
        6.19591884,  6.21337214,  6.23082543,  6.24827872,  6.26573201,
        6.28318531])
>>> y = np.sin(x)
>>> np.linspace(0, 1, 11)
array([ 0., 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
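Why 721 points? A quick check, sketched below, suggests the count was chosen so that the samples land one degree apart with both endpoints included. The variable `step` is introduced here for illustration.

```python
import numpy as np

x = np.linspace(-2 * np.pi, 2 * np.pi, 721)
y = np.sin(x)

# linspace includes both endpoints, so 721 points span 720 equal intervals
step = x[1] - x[0]

print(len(x), np.degrees(step))
```

Unlike `np.arange`, which takes a step size, `np.linspace` takes the number of points, which makes it easier to hit the end value exactly.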
“pandas is an open source, BSD-licensed library providing
high-performance, easy-to-use data structures and data
analysis tools for the Python programming language.”
A more intellectually palatable API on top of numpy
No more ‘big globs of numbers’ to worry about
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.read_csv('dow_jones_index.csv')
>>> df
    quarter stock       date    open    high     low   close     volume
0         1    AA   1/7/2011  $15.82  $16.72  $15.78  $16.42  239655616
1         1    AA  1/14/2011  $16.71  $16.71  $15.64  $15.97  242963398
2         1    AA  1/21/2011  $16.19  $16.38  $15.60  $15.79  138428495
3         1    AA  1/28/2011  $15.87  $16.63  $15.82  $16.13  151379173
4         1    AA   2/4/2011  $16.18  $17.39  $16.18  $17.14  154387761
5         1    AA  2/11/2011  $17.33  $17.48  $16.97  $17.37  114691279
6         1    AA  2/18/2011  $17.39  $17.68  $17.28  $17.28   80023895
7         1    AA  2/25/2011  $16.98  $17.15  $15.96  $16.68  132981863
8         1    AA   3/4/2011  $16.81  $16.94  $16.13  $16.58  109493077
9         1    AA  3/11/2011  $16.58  $16.75  $15.42  $16.03  114332562
10        1    AA  3/18/2011  $15.95  $16.33  $15.43  $16.11  130374108
>>> df.columns
Index(['quarter', 'stock', 'date', 'open', 'high', 'low', 'close', 'volume',
       'percent_change_price', 'percent_change_volume_over_last_wk',
       'previous_weeks_volume', 'next_weeks_open', 'next_weeks_close',
       'percent_change_next_weeks_price', 'days_to_next_dividend',
       'percent_return_next_dividend'],
      dtype='object')
>>> df['stock']
>>> df.columns[1:8]
>>> v = df.loc[:, df.columns[1:8]].copy()
>>> v
   stock       date    open    high     low   close     volume
0     AA   1/7/2011  $15.82  $16.72  $15.78  $16.42  239655616
1     AA  1/14/2011  $16.71  $16.71  $15.64  $15.97  242963398
2     AA  1/21/2011  $16.19  $16.38  $15.60  $15.79  138428495
3     AA  1/28/2011  $15.87  $16.63  $15.82  $16.13  151379173
4     AA   2/4/2011  $16.18  $17.39  $16.18  $17.14  154387761
5     AA  2/11/2011  $17.33  $17.48  $16.97  $17.37  114691279
6     AA  2/18/2011  $17.39  $17.68  $17.28  $17.28   80023895
7     AA  2/25/2011  $16.98  $17.15  $15.96  $16.68  132981863
8     AA   3/4/2011  $16.81  $16.94  $16.13  $16.58  109493077
9     AA  3/11/2011  $16.58  $16.75  $15.42  $16.03  114332562
10    AA  3/18/2011  $15.95  $16.33  $15.43  $16.11  130374108
>>> v.volume.max()
1453438639
>>> v.close[0]
'$16.42'
>>> for column in v.columns[2:6]:
        # drop the leading '$' and convert each price to a float
        v[column] = v[column].apply(lambda x: float(x[1:]))
>>> v
  stock       date   open   high    low  close     volume
0    AA   1/7/2011  15.82  16.72  15.78  16.42  239655616
1    AA  1/14/2011  16.71  16.71  15.64  15.97  242963398
2    AA  1/21/2011  16.19  16.38  15.60  15.79  138428495
3    AA  1/28/2011  15.87  16.63  15.82  16.13  151379173
4    AA   2/4/2011  16.18  17.39  16.18  17.14  154387761
5    AA  2/11/2011  17.33  17.48  16.97  17.37  114691279
6    AA  2/18/2011  17.39  17.68  17.28  17.28   80023895
7    AA  2/25/2011  16.98  17.15  15.96  16.68  132981863
8    AA   3/4/2011  16.81  16.94  16.13  16.58  109493077
9    AA  3/11/2011  16.58  16.75  15.42  16.03  114332562
>>> v[v.stock == 'DIS']
>>> v[v.stock == 'DIS']['close']
>>> close_index = v[v.stock == 'DIS']['close'].idxmax()
>>> v.loc[close_index, 'volume']
53096584
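The clean-and-query steps can be tried without the CSV file. The sketch below reads four of the AA rows shown earlier from an in-memory string, so it filters on 'AA' rather than 'DIS'. It uses the vectorised `.str` accessor instead of `apply`, which is a swapped-in alternative and not the slide's method, and `csv_text` and `volume_at_max_close` are names made up here.

```python
import io

import pandas as pd

# Four rows copied from the table above, held in memory instead of a file
csv_text = """stock,date,open,high,low,close,volume
AA,1/7/2011,$15.82,$16.72,$15.78,$16.42,239655616
AA,1/14/2011,$16.71,$16.71,$15.64,$15.97,242963398
AA,1/21/2011,$16.19,$16.38,$15.60,$15.79,138428495
AA,1/28/2011,$15.87,$16.63,$15.82,$16.13,151379173
"""
v = pd.read_csv(io.StringIO(csv_text))

# Strip the leading "$" and convert the four price columns to floats
for column in ["open", "high", "low", "close"]:
    v[column] = v[column].str.lstrip("$").astype(float)

# Row label of the highest close for one stock, then the volume on that row
close_index = v[v.stock == "AA"]["close"].idxmax()
volume_at_max_close = v.loc[close_index, "volume"]

print(close_index, volume_at_max_close)
```

`idxmax` returns the row label, not the value, which is why it can be fed straight into `v.loc` to look up another column on the same row.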
“Matplotlib is a Python 2D plotting library which produces
publication quality figures in a variety of hardcopy formats
and interactive environments across platforms.”
Visualizations
>>> import numpy as np
>>> x = np.linspace(0, 2 * np.pi, 361)
>>> y = np.sin(x)
>>> import matplotlib.pyplot as plt
>>> plt.plot(x, y)
>>> y2 = np.cos(x)
>>> plt.plot(x, y)
plt.plot(x, y2, color='r')
>>> plt.figure(figsize=(3, 6)) # height is 2x the width
plt.subplot(2, 1, 1) # 2 rows, 1 column, position 1
plt.plot(x, y)
plt.subplot(2, 1, 2) # position 2
plt.plot(x, y2, color='r')
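The same stacked layout can also be written with Matplotlib's object-oriented interface. This is a sketch of that alternative, not the slide's code; the `Agg` backend line is only there so it runs without a display, and the names `ax_sin` and `ax_cos` are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 361)

# plt.subplots creates the figure and both Axes (2 rows, 1 column) in one call
fig, (ax_sin, ax_cos) = plt.subplots(2, 1, figsize=(3, 6))
ax_sin.plot(x, np.sin(x))
ax_cos.plot(x, np.cos(x), color='r')

print(len(fig.axes))
```

Holding on to the Axes objects avoids relying on whichever subplot happens to be "current", which gets harder to track as the grid grows.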
fns = [np.sin, np.cos, lambda x: x ** 2, lambda x: np.sin(x) ** 2, lambda x: np.cos(x) ** 2, np.log]
colors = list('rgbcmk')
markers = list('.ov+xd')
data = zip(fns, colors, markers)
plt.figure(figsize=(30, 20))
for i, (fn, color, marker) in enumerate(data):
    plt.subplot(2, 3, i + 1) # 1-3 on first row, 4-6 on second
    plt.plot(x[np.arange(0, 360, 6)], fn(x[np.arange(0, 360, 6)]), color=color, marker=marker)
>>> import seaborn as sns
sns.distplot(data, rug=True)
sns.jointplot(x='x', y='y', data=df)
sns.catplot(data=data, kind='box')
sns.catplot(data=data, kind='violin')
DEMO
scikit-learn is an open source Python machine learning package with
implementations of many popular machine learning algorithms.
>>> from sklearn import linear_model
>>> (X_train, X_test), (y_train, y_test) = get_data()
>>> reg = linear_model.LinearRegression()
>>> reg.fit(X_train, y_train)
>>> pred = reg.predict(X_test)
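The slide does not show `get_data()`, so the sketch below stands in for it with made-up, noise-free data that follows y = 3x + 2, split with scikit-learn's `train_test_split`. The data and the split sizes are assumptions for illustration only.

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Synthetic, noise-free data following y = 3x + 2 (made up for this sketch)
X = np.arange(20, dtype=float).reshape(-1, 1)   # 20 samples, 1 feature
y = 3 * X.ravel() + 2

# Hold out a quarter of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)
pred = reg.predict(X_test)

print(reg.coef_, reg.intercept_)
```

Because the data has no noise, the fitted slope and intercept should come back as roughly 3 and 2, and the predictions should match the held-out targets.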
>>> from sklearn.datasets import make_blobs
>>> from sklearn.naive_bayes import GaussianNB
>>> points, targets = make_blobs()
>>> clf = GaussianNB()
>>> clf.fit(points, targets)
>>> clf.predict(np.array([5, -10]).reshape(1, -1))
array([0])
>>> clf.predict(np.array([10, 10]).reshape(1, -1))
array([1])
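With default arguments `make_blobs()` picks its own random cluster centers, so the predictions above depend on the run. The sketch below pins two centers at the query points, an assumption made here so the outcome is repeatable; it is not necessarily how the slide's data was generated.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

# Two well-separated clusters; class 0 sits around (5, -10), class 1 around (10, 10)
points, targets = make_blobs(
    n_samples=200,
    centers=[[5, -10], [10, 10]],
    cluster_std=1.0,
    random_state=0,
)

clf = GaussianNB()
clf.fit(points, targets)

# predict expects a 2-D array: one row per sample, one column per feature
pred_a = clf.predict(np.array([5, -10]).reshape(1, -1))
pred_b = clf.predict(np.array([10, 10]).reshape(1, -1))

print(pred_a, pred_b)
```

The `reshape(1, -1)` call turns a single point into a one-row matrix, which is the shape scikit-learn estimators expect.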
SymPy is a Python library for symbolic mathematics. It aims to
become a full-featured computer algebra system (CAS) while
keeping the code as simple as possible in order to be
comprehensible and easily extensible.
>>> from sympy import symbols, init_printing, diff, Derivative
>>> (x, y) = symbols('x y')
>>> z = x ** 2 + y
>>> z
x**2 + y
>>> z.subs([(x, 3), (y, 4)])
13
>>> init_printing()
>>> z
>>> diff(4 * x ** 3 + 2 * x ** 2 - 5, x)
>>> Derivative(2 * x ** 2, x)
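The rendered results of the last three calls did not survive as text, so here is a plain-script sketch of the same calls. It also shows the difference between `diff`, which evaluates immediately, and `Derivative`, which stays unevaluated until `.doit()` is called. The names `value`, `dz`, and `lazy` are made up.

```python
from sympy import Derivative, diff, symbols

x, y = symbols('x y')
z = x ** 2 + y

value = z.subs([(x, 3), (y, 4)])            # substitute numbers: 3**2 + 4
dz = diff(4 * x ** 3 + 2 * x ** 2 - 5, x)   # derivative, evaluated right away
lazy = Derivative(2 * x ** 2, x)            # unevaluated derivative object

print(value)
print(dz)
print(lazy, "->", lazy.doit())
```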
Statsmodels is a Python module that provides classes and
functions for the estimation of many different statistical
models, as well as for conducting statistical tests, and statistical
data exploration.
>>> import statsmodels.api as sm
>>> sm.OLS(y, X).fit().summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.215
Model:                            OLS   Adj. R-squared:                  0.198
Method:                 Least Squares   F-statistic:                     13.25
Date:                Mon, 24 Jun 2019   Prob (F-statistic):           8.15e-06
Time:                        17:30:48   Log-Likelihood:                -15.067
No. Observations:                 100   AIC:                             36.13
Df Residuals:                      97   BIC:                             43.95
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.5386      0.077     19.869      0.000       1.385       1.692
x1            -0.0672      0.101     -0.664      0.508      -0.268       0.134
x2             0.5048      0.099      5.090      0.000       0.308       0.702
==============================================================================
Omnibus:                       24.327   Durbin-Watson:                   2.228
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                6.273
Skew:                           0.253   Prob(JB):                       0.0434
Kurtosis:                       1.883   Cond. No.                         5.44
==============================================================================
• The de facto data science language
• Open source implementation of the S language
• Does one thing and does it well
• Very quirky
• High-level and fast
• Can rival native languages
• Like a Pythonic R (kinda sorta)
• Relatively new
DATA IS WORTHLESS
ANALYTICS ARE WORTH PENNIES
DECISIONS ARE WORTH DOLLARS
http://bit.ly/jupyter-notebook
THANK YOU!
douglas@douglasstarnes.com
@poweredbyaltnet
http://douglasstarnes.com
(returning soon)
