Simpler Machine Learning with SKLL 1.0
Dan Blanchard
Educational Testing Service
dblanchard@ets.org
PyData NYC 2014
Survived or Perished?
• first class, female, 1 sibling, 35 years old
• third class, female, 2 siblings, 18 years old
• second class, male, 0 siblings, 50 years old
Can we predict survival from data?
SciKit-Learn Laboratory
SKLL 
It's where the learning happens
Learning to Predict Survival 
1. Split the provided training set into train (80%) and dev (20%)
$ ./make_titanic_example_data.py 
Loading train.csv... done 
Writing titanic/train/socioeconomic.csv...done 
Writing titanic/train/family.csv...done 
Writing titanic/train/vitals.csv...done 
Writing titanic/train/misc.csv...done 
Writing titanic/train+dev/socioeconomic.csv...done 
Writing titanic/train+dev/family.csv...done 
Writing titanic/train+dev/vitals.csv...done 
Writing titanic/train+dev/misc.csv...done 
Writing titanic/dev/socioeconomic.csv...done 
Writing titanic/dev/family.csv...done 
Writing titanic/dev/vitals.csv...done 
Writing titanic/dev/misc.csv...done 
Loading test.csv... done 
Writing titanic/test/socioeconomic.csv...done 
Writing titanic/test/family.csv...done 
Writing titanic/test/vitals.csv...done 
Writing titanic/test/misc.csv...done
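The script above handles this split for the example data; the core idea is just a seeded shuffle-and-slice. A minimal sketch (not the script's actual code; 891 is the Kaggle training-set row count):

```python
import random

def split_train_dev(rows, dev_fraction=0.2, seed=123456789):
    """Shuffle rows deterministically, then slice off the dev portion."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_dev = int(len(rows) * dev_fraction)
    return rows[n_dev:], rows[:n_dev]  # (train, dev)

train_rows, dev_rows = split_train_dev(range(891))
```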
Learning to Predict Survival 
2. Pick classifiers to try: 
1. Decision Tree 
2. Naive Bayes 
3. Random Forest
4. Support Vector Machine (SVM)
Learning to Predict Survival
3. Create configuration file for SKLL

[General]
experiment_name = Titanic_Evaluate_Untuned
task = evaluate

[Input]
# train_directory: directory with feature files for training learner
train_directory = train
# test_directory: directory with feature files for evaluating performance
test_directory = dev
# family.csv: # of siblings, spouses, parents, children
# misc.csv: departure port
# socioeconomic.csv: fare & passenger class
# vitals.csv: sex & age
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
id_col = PassengerId
label_col = Survived

[Output]
# directories to store evaluation results and trained models
results = output
models = output
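Each feature file is an ordinary CSV whose rows are keyed by id_col and, for training data, labeled in label_col; a hypothetical first row of vitals.csv (illustrative values, not taken from the real dataset) might look like:

```csv
PassengerId,Survived,Sex,Age
1,0,male,22.0
```

SKLL joins the files in a featureset on PassengerId before training.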
Learning to Predict Survival 
4. Run the configuration file with run_experiment 
$ run_experiment evaluate.cfg 
Loading train/family.csv... done 
Loading train/misc.csv... done 
Loading train/socioeconomic.csv... done 
Loading train/vitals.csv... done 
Loading dev/family.csv... done 
Loading dev/misc.csv... done 
Loading dev/socioeconomic.csv... done 
Loading dev/vitals.csv... done 
Loading train/family.csv... done 
Loading train/misc.csv... done 
Loading train/socioeconomic.csv... done 
Loading train/vitals.csv... done 
Loading dev/family.csv... done 
...
Learning to Predict Survival 
5. Examine results 
Experiment Name: Titanic_Evaluate_Untuned 
SKLL Version: 1.0.0 
Training Set: train (712) 
Test Set: dev (179) 
Feature Set: ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"] 
Learner: RandomForestClassifier 
Scikit-learn Version: 0.15.2 
Total Time: 0:00:02.065403 
+-------+------+------+-----------+--------+-----------+ 
| | 0.0 | 1.0 | Precision | Recall | F-measure | 
+-------+------+------+-----------+--------+-----------+ 
| 0.000 | [96] | 19 | 0.865 | 0.835 | 0.850 | 
+-------+------+------+-----------+--------+-----------+ 
| 1.000 | 15 | [49] | 0.721 | 0.766 | 0.742 | 
+-------+------+------+-----------+--------+-----------+ 
(row = reference; column = predicted) 
Accuracy = 0.8100558659217877
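The precision, recall, and F-measure columns follow directly from the confusion-matrix counts; for the class 0.0 row above (96 correct, 19 missed, 15 false alarms):

```python
def prf(tp, fn, fp):
    """Precision, recall, and F1 score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f = prf(96, 19, 15)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.865 0.835 0.85
```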
Aggregate Evaluation Results 
Dev. Accuracy Learner 
0.8101 RandomForestClassifier 
0.7989 DecisionTreeClassifier 
0.7709 SVC 
0.7095 MultinomialNB
Can we do better than default hyperparameters?

[General]
experiment_name = Titanic_Evaluate
task = evaluate

[Input]
train_directory = train
test_directory = dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
id_col = PassengerId
label_col = Survived

# tune each learner via grid search
[Tuning]
grid_search = true
objective = accuracy

[Output]
results = output
Tuned Evaluation Results 
Untuned Accuracy Tuned Accuracy Learner 
0.8101 0.8380 RandomForestClassifier 
0.7989 0.7989 DecisionTreeClassifier 
0.7709 0.8156 SVC 
0.7095 0.7095 MultinomialNB
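Setting grid_search = true tells SKLL to tune each learner over a built-in parameter grid, scored by the chosen objective. The underlying mechanism is exhaustive search over parameter combinations; a toy pure-Python sketch of the idea (not SKLL's implementation, and the parameter values below are illustrative):

```python
from itertools import product

def grid_search(score_fn, param_grid):
    """Try every combination of hyperparameter values; keep the best score."""
    best_score, best_params = float('-inf'), None
    names = sorted(param_grid)
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Toy objective that peaks at C=1, gamma=0.1
best, score = grid_search(
    lambda C, gamma: -(C - 1) ** 2 - (gamma - 0.1) ** 2,
    {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
)
# best == {"C": 1, "gamma": 0.1}
```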
Using All Available Data 
Use training and dev to generate predictions on test 
[General] 
experiment_name = Titanic_Predict 
task = predict 
[Input] 
train_directory = train+dev 
test_directory = test 
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]] 
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
id_col = PassengerId 
label_col = Survived 
[Tuning] 
grid_search = true 
objective = accuracy 
[Output] 
results = output
Test Set Accuracy
Learner                  Train only (Untuned / Tuned)   Train + Dev (Untuned / Tuned)
RandomForestClassifier   0.727 / 0.756                  0.746 / 0.780
DecisionTreeClassifier   0.703 / 0.742                  0.670 / 0.742
SVC                      0.608 / 0.679                  0.612 / 0.679
MultinomialNB            0.627 / 0.627                  0.622 / 0.622
Advanced SKLL Features
• Read & write .arff, .csv, .jsonlines, .libsvm, .megam, .ndj, and .tsv data
• Parameter grids for all supported scikit-learn learners
• Custom learners
• Parallelize experiments on DRMAA clusters via GridMap
• Ablation experiments
• Collapse/rename classes from config file
• Feature scaling
• Rescale predictions to be closer to observed data
• Command-line tools for joining, filtering, and converting feature files
• Python API
Currently Supported Learners
Classifiers only: Linear Support Vector Machine, Logistic Regression, Multinomial Naive Bayes
Regressors only: Elastic Net, Lasso, Linear Regression
Both: AdaBoost, Decision Tree, Gradient Boosting, K-Nearest Neighbors, Random Forest, Stochastic Gradient Descent, Support Vector Machine
Contributors 
• Nitin Madnani 
• Mike Heilman 
• Nils Murrugarra Llerena 
• Aoife Cahill 
• Diane Napolitano 
• Keelan Evanini 
• Ben Leong
References 
• Dataset: kaggle.com/c/titanic-gettingStarted 
• SKLL GitHub: github.com/EducationalTestingService/skll 
• SKLL Docs: skll.readthedocs.org 
• Titanic configs and data splitting script in examples dir on GitHub 
@dsblanch 
dan-blanchard
Bonus Slides
SKLL API

from skll import Learner, Reader

# Load training examples
train_examples = Reader.for_path('myexamples.megam').read()

# Train a linear SVM
learner = Learner('LinearSVC')
learner.train(train_examples)

# Load test examples and evaluate.
# Returns: confusion matrix, overall accuracy on test set,
# precision/recall/f-score for each class, tuned model parameters,
# and objective function score on test set
test_examples = Reader.for_path('test.tsv').read()
conf_matrix, accuracy, prf_dict, model_params, obj_score = learner.evaluate(test_examples)

# Generate predictions from trained model
predictions = learner.predict(test_examples)

# Perform 10-fold cross-validation with a radial SVM.
# Returns: per-fold evaluation results and per-fold training set objective scores
learner = Learner('SVC')
fold_result_list, grid_search_scores = learner.cross_validate(train_examples)
SKLL API

import numpy as np
from os.path import join

from skll import FeatureSet, NDJWriter, Writer

# Create some training examples
num_train_examples = 100
labels = []
ids = []
features = []
for i in range(num_train_examples):
    label = "dog" if i % 2 == 0 else "cat"
    labels.append(label)
    ids.append("{}{}".format(label, i))
    features.append({"f1": np.random.randint(1, 4), "f2": np.random.randint(1, 4)})
feat_set = FeatureSet('training', ids, labels=labels, features=features)

# Write them to a file
train_path = join('train', 'test_summary.jsonlines')
Writer.for_path(train_path, feat_set).write()
# Or
NDJWriter(train_path, feat_set).write()

  • 22. Learning to Predict Survival 3. Create configuration file for SKLL [General] experiment_name = Titanic_Evaluate_Untuned task = evaluate [Input] train_directory = train test_directory = dev featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]] learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"] id_col = PassengerId label_col = Survived [Output] results = output models = output directory to store trained models
  • 23. Learning to Predict Survival 4. Run the configuration file with run_experiment

$ run_experiment evaluate.cfg
Loading train/family.csv... done
Loading train/misc.csv... done
Loading train/socioeconomic.csv... done
Loading train/vitals.csv... done
Loading dev/family.csv... done
Loading dev/misc.csv... done
Loading dev/socioeconomic.csv... done
Loading dev/vitals.csv... done
Loading train/family.csv... done
Loading train/misc.csv... done
Loading train/socioeconomic.csv... done
Loading train/vitals.csv... done
Loading dev/family.csv... done
...
  • 24. Learning to Predict Survival 5. Examine results

Experiment Name: Titanic_Evaluate_Untuned
SKLL Version: 1.0.0
Training Set: train (712)
Test Set: dev (179)
Feature Set: ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]
Learner: RandomForestClassifier
Scikit-learn Version: 0.15.2
Total Time: 0:00:02.065403

+-------+------+------+-----------+--------+-----------+
|       |  0.0 |  1.0 | Precision | Recall | F-measure |
+-------+------+------+-----------+--------+-----------+
| 0.000 | [96] |   19 |     0.865 |  0.835 |     0.850 |
+-------+------+------+-----------+--------+-----------+
| 1.000 |   15 | [49] |     0.721 |  0.766 |     0.742 |
+-------+------+------+-----------+--------+-----------+
(row = reference; column = predicted)
Accuracy = 0.8100558659217877
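The precision, recall, and F-measure columns on this slide follow directly from the confusion matrix. A small sketch (plain Python, not SKLL code) recomputing them from the matrix shown above:

```python
# Confusion matrix from the slide: rows = reference, columns = predicted.
cm = [[96, 19],
      [15, 49]]

metrics = {}
for c in (0, 1):
    tp = cm[c][c]
    fp = sum(cm[r][c] for r in (0, 1)) - tp   # predicted c, reference is other class
    fn = sum(cm[c]) - tp                      # reference c, predicted other class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    metrics[c] = (round(precision, 3), round(recall, 3), round(f_measure, 3))

# Accuracy is the diagonal over the total count.
accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
# metrics[0] → (0.865, 0.835, 0.85); metrics[1] → (0.721, 0.766, 0.742)
```

These reproduce the table's rows and the reported accuracy of 0.8100558659217877 (145 correct out of 179).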
  • 25. Aggregate Evaluation Results

Dev. Accuracy   Learner
0.8101          RandomForestClassifier
0.7989          DecisionTreeClassifier
0.7709          SVC
0.7095          MultinomialNB
  • 26. Tuning learner: Can we do better than default hyperparameters?

[General]
experiment_name = Titanic_Evaluate
task = evaluate

[Input]
train_directory = train
test_directory = dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
id_col = PassengerId
label_col = Survived

[Tuning]
grid_search = true
objective = accuracy

[Output]
results = output
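Behind grid_search = true, SKLL has scikit-learn try every combination in a hyperparameter grid and keep the one with the best objective score on held-out data. A toy illustration of that search loop; the data, parameter grid, and scoring rule (predict survival when a weighted fare minus age clears a threshold) are invented for this example and are not SKLL internals:

```python
from itertools import product

# Hypothetical dev set: (fare, age, survived) triples.
dev = [(70, 35, 1), (8, 18, 0), (15, 50, 0), (80, 4, 1)]

def accuracy(threshold, weight, data):
    """Score one hyperparameter combination on the dev set."""
    correct = 0
    for fare, age, label in data:
        pred = 1 if weight * fare - age > threshold else 0
        correct += (pred == label)
    return correct / len(data)

# The grid: every combination of these values is evaluated.
grid = {"threshold": [0, 10, 20], "weight": [0.5, 1.0]}

best_score, best_params = -1.0, None
for combo in product(*grid.values()):
    params = dict(zip(grid.keys(), combo))
    score = accuracy(params["threshold"], params["weight"], dev)
    if score > best_score:
        best_score, best_params = score, params
# best_params → {"threshold": 0, "weight": 1.0}, best_score → 1.0
```

SKLL does the analogous loop per learner (via scikit-learn's grid search with cross-validation on the training data), scoring each combination with the configured objective, here accuracy.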
  • 27. Tuned Evaluation Results

Untuned Accuracy   Tuned Accuracy   Learner
0.8101             0.8380           RandomForestClassifier
0.7989             0.7989           DecisionTreeClassifier
0.7709             0.8156           SVC
0.7095             0.7095           MultinomialNB
  • 28. Using All Available Data: use training and dev to generate predictions on test

[General]
experiment_name = Titanic_Predict
task = predict

[Input]
train_directory = train+dev
test_directory = test
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
id_col = PassengerId
label_col = Survived

[Tuning]
grid_search = true
objective = accuracy

[Output]
results = output
  • 29. Test Set Accuracy

Learner                  Train only          Train + Dev
                         Untuned   Tuned     Untuned   Tuned
RandomForestClassifier   0.727     0.756     0.746     0.780
DecisionTreeClassifier   0.703     0.742     0.670     0.742
SVC                      0.608     0.679     0.612     0.679
MultinomialNB            0.627     0.627     0.622     0.622
  • 30. Advanced SKLL Features
  • Read & write .arff, .csv, .jsonlines, .libsvm, .megam, .ndj, and .tsv data
  • Parameter grids for all supported scikit-learn learners
  • Custom learners
  • Parallelize experiments on DRMAA clusters via GridMap
  • Ablation experiments
  • Collapse/rename classes from config file
  • Feature scaling
  • Rescale predictions to be closer to observed data
  • Command-line tools for joining, filtering, and converting feature files
  • Python API
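The command-line tools for joining feature files merge files that share an id column, which is what lets the Titanic features live in four separate CSVs. Conceptually the join does something like the following sketch; the column names and in-memory CSV strings are illustrative, not the tool's actual implementation:

```python
import csv
import io

# Two tiny feature files sharing a PassengerId column (stand-ins for
# family.csv and vitals.csv).
family = "PassengerId,SibSp\n1,1\n2,0\n"
vitals = "PassengerId,Age\n1,35\n2,50\n"

def read_features(text, id_col="PassengerId"):
    """Map each id to a dict of its feature columns."""
    return {row[id_col]: {k: v for k, v in row.items() if k != id_col}
            for row in csv.DictReader(io.StringIO(text))}

# Join: union the feature dicts for matching ids.
merged = read_features(family)
for pid, feats in read_features(vitals).items():
    merged.setdefault(pid, {}).update(feats)
# merged["1"] → {"SibSp": "1", "Age": "35"}
```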
  • 31. Currently Supported Learners

Classifiers: Linear Support Vector Machine, Logistic Regression, Multinomial Naive Bayes
Regressors: Elastic Net, Lasso, Linear
Both: AdaBoost, Decision Tree, Gradient Boosting, K-Nearest Neighbors, Random Forest, Stochastic Gradient Descent, Support Vector Machine
  • 32. Contributors
  • Nitin Madnani
  • Mike Heilman
  • Nils Murrugarra Llerena
  • Aoife Cahill
  • Diane Napolitano
  • Keelan Evanini
  • Ben Leong
  • 33. References
  • Dataset: kaggle.com/c/titanic-gettingStarted
  • SKLL GitHub: github.com/EducationalTestingService/skll
  • SKLL Docs: skll.readthedocs.org
  • Titanic configs and data splitting script in examples dir on GitHub
@dsblanch / dan-blanchard
  • 35-40. SKLL API

from skll import Learner, Reader

# Load training examples
train_examples = Reader.for_path('myexamples.megam').read()

# Train a linear SVM
learner = Learner('LinearSVC')
learner.train(train_examples)

# Load test examples and evaluate
test_examples = Reader.for_path('test.tsv').read()
conf_matrix, accuracy, prf_dict, model_params, obj_score = learner.evaluate(test_examples)

Callouts on these slides label the values evaluate() returns:
  • conf_matrix: confusion matrix
  • accuracy: overall accuracy on test set
  • prf_dict: precision, recall, f-score for each class
  • model_params: tuned model parameters
  • obj_score: objective function score on test set
  • 41. SKLL API

from skll import Learner, Reader

# Load training examples
train_examples = Reader.for_path('myexamples.megam').read()

# Train a linear SVM
learner = Learner('LinearSVC')
learner.train(train_examples)

# Load test examples and evaluate
test_examples = Reader.for_path('test.tsv').read()
conf_matrix, accuracy, prf_dict, model_params, obj_score = learner.evaluate(test_examples)

# Generate predictions from trained model
predictions = learner.predict(test_examples)
  • 42-45. SKLL API

from skll import Learner, Reader

# Load training examples
train_examples = Reader.for_path('myexamples.megam').read()

# Train a linear SVM
learner = Learner('LinearSVC')
learner.train(train_examples)

# Load test examples and evaluate
test_examples = Reader.for_path('test.tsv').read()
conf_matrix, accuracy, prf_dict, model_params, obj_score = learner.evaluate(test_examples)

# Generate predictions from trained model
predictions = learner.predict(test_examples)

# Perform 10-fold cross-validation with a radial SVM
learner = Learner('SVC')
fold_result_list, grid_search_scores = learner.cross_validate(train_examples)

Callouts on these slides label the values cross_validate() returns:
  • fold_result_list: per-fold evaluation results
  • grid_search_scores: per-fold training set obj. scores
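cross_validate() splits the training data into 10 folds, holds each fold out in turn, and trains on the remainder. The index bookkeeping behind such a split can be sketched in plain Python; this is an illustration of the idea, not SKLL's actual fold logic (which goes through scikit-learn and can stratify by label):

```python
def kfold_indices(n_examples, n_folds=10):
    """Return (train_indices, test_indices) pairs for k-fold CV."""
    # Deal example indices round-robin into n_folds disjoint test folds.
    folds = [[] for _ in range(n_folds)]
    for i in range(n_examples):
        folds[i % n_folds].append(i)
    # Each fold is held out once; everything else trains the model.
    return [(sorted(set(range(n_examples)) - set(test)), test)
            for test in folds]

splits = kfold_indices(100)
# 10 splits; each has 90 training indices and 10 held-out test indices,
# and every example appears in exactly one test fold.
```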
  • 46. SKLL API

import numpy as np
from os.path import join
from skll import FeatureSet, NDJWriter, Writer

# Create some training examples
labels = []
ids = []
features = []
for i in range(num_train_examples):
    y = "dog" if i % 2 == 0 else "cat"
    labels.append(y)
    ids.append("{}{}".format(y, i))
    features.append({"f1": np.random.randint(1, 4),
                     "f2": np.random.randint(1, 4)})
feat_set = FeatureSet('training', ids, labels=labels, features=features)

# Write them to a file
train_path = join(_my_dir, 'train', 'test_summary.jsonlines')
Writer.for_path(train_path, feat_set).write()
# Or: NDJWriter(train_path, feat_set).write()
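The .jsonlines/.ndj format the slide writes to is newline-delimited JSON: one JSON object per example, one example per line. A minimal stdlib sketch of that layout for the same dog/cat examples; the exact field names ("id", "y", "x") are assumptions based on the jsonlines convention, so check the SKLL docs before relying on them:

```python
import json
import random

random.seed(0)  # deterministic toy features

lines = []
for i in range(4):
    label = "dog" if i % 2 == 0 else "cat"
    example = {"id": "{}{}".format(label, i),          # unique example id
               "y": label,                             # class label
               "x": {"f1": random.randint(1, 3),       # feature dict
                     "f2": random.randint(1, 3)}}
    lines.append(json.dumps(example))

# Newline-delimited JSON: what a tiny .jsonlines file would contain.
ndj = "\n".join(lines)
```

Each line parses independently with json.loads, which is what makes the format easy to stream and to concatenate.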