As machine learning techniques spread to new areas of industry and science, the number of potential machine learning users is growing rapidly. While the excellent scikit-learn library is widely used in the Python community for tackling such tasks, people working on new machine learning problems face two significant hurdles:
• Scikit-learn requires writing a fair amount of boilerplate code to run even simple experiments.
• Obtaining good performance typically requires tuning various model parameters, which can be particularly challenging for beginners.
SciKit-Learn Laboratory (SKLL) is an open source Python package, originally developed by the NLP & Speech group at the Educational Testing Service (ETS), that addresses these issues by providing the ability to run scikit-learn experiments with tuned models without writing any code beyond what generates the features. This talk will provide an overview of performing common machine learning tasks with SKLL, and highlight some of the new features that are present as of the 1.0 release.
6. Survived vs. Perished (example passengers)
• first class, female, 1 sibling, 35 years old
• third class, female, 2 siblings, 18 years old
• second class, male, 0 siblings, 50 years old
7. Can we predict survival from data?
28. Using All Available Data
Use training and dev to generate predictions on test
[General]
experiment_name = Titanic_Predict
task = predict
[Input]
train_directory = train+dev
test_directory = test
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC",
"MultinomialNB"]
id_col = PassengerId
label_col = Survived
[Tuning]
grid_search = true
objective = accuracy
[Output]
results = output
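With a configuration like the one above saved to a file, the entire experiment runs from the command line via SKLL's run_experiment tool, with no Python code required (the config filename used below is just an assumed name):

```shell
# Train each learner on train+dev with grid search and write
# predictions and results to the output directory
run_experiment titanic_predict.cfg

# Force everything to run locally instead of on a DRMAA cluster
run_experiment --local titanic_predict.cfg
```

run_experiment iterates over every learner/featureset combination in the config, so swapping in new models or feature files is just a config edit.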
29. Test Set Accuracy
                        Train only        Train + Dev
Learner                 Untuned  Tuned    Untuned  Tuned
RandomForestClassifier  0.727    0.756    0.746    0.780
DecisionTreeClassifier  0.703    0.742    0.670    0.742
SVC                     0.608    0.679    0.612    0.679
MultinomialNB           0.627    0.627    0.622    0.622
30. Advanced SKLL Features
• Read & write .arff, .csv, .jsonlines, .libsvm, .megam, .ndj, and .tsv data
• Parameter grids for all supported scikit-learn learners
• Custom learners
• Parallelize experiments on DRMAA clusters via GridMap
• Ablation experiments
• Collapse/rename classes from config file
• Feature scaling
• Rescale predictions to be closer to observed data
• Command-line tools for joining, filtering, and converting feature files
• Python API
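The command-line tools mentioned above can be sketched roughly as follows; the file names are made up for illustration, and the exact flags are in the SKLL docs:

```shell
# Convert between any two supported feature-file formats
# (here, MegaM to TSV), inferring formats from the extensions
skll_convert examples.megam examples.tsv

# Merge several feature files into a single combined file
join_features family.csv misc.csv combined.csv

# Keep or drop particular IDs, labels, or features
# (selection flags omitted here; see the docs)
filter_features combined.csv filtered.csv
```

These tools operate on the same formats the Reader/Writer classes handle, so feature files can be prepared on the command line and then loaded through the Python API.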
31. Currently Supported Learners
Classifiers: Linear Support Vector Machine, Logistic Regression, Multinomial Naive Bayes, AdaBoost, Decision Tree, Gradient Boosting, K-Nearest Neighbors, Random Forest, Stochastic Gradient Descent, Support Vector Machine
Regressors: Elastic Net, Lasso, Linear
32. Contributors
• Nitin Madnani
• Mike Heilman
• Nils Murrugarra Llerena
• Aoife Cahill
• Diane Napolitano
• Keelan Evanini
• Ben Leong
33. References
• Dataset: kaggle.com/c/titanic-gettingStarted
• SKLL GitHub: github.com/EducationalTestingService/skll
• SKLL Docs: skll.readthedocs.org
• Titanic configs and data splitting script in examples dir on GitHub
@dsblanch (Twitter) · dan-blanchard (GitHub)
35. SKLL API
from skll import Learner, Reader
# Load training examples
train_examples = Reader.for_path('myexamples.megam').read()
# Train a linear SVM
learner = Learner('LinearSVC')
learner.train(train_examples)
# Load test examples and evaluate; evaluate() returns the confusion
# matrix, overall accuracy on the test set, precision/recall/f-score
# for each class, the tuned model parameters, and the objective
# function score on the test set
test_examples = Reader.for_path('test.tsv').read()
conf_matrix, accuracy, prf_dict, model_params, obj_score = learner.evaluate(test_examples)
42. SKLL API
from skll import Learner, Reader
# Load training examples
train_examples = Reader.for_path('myexamples.megam').read()
# Train a linear SVM
learner = Learner('LinearSVC')
learner.train(train_examples)
# Load test examples and evaluate
test_examples = Reader.for_path('test.tsv').read()
conf_matrix, accuracy, prf_dict, model_params, obj_score = learner.evaluate(test_examples)
# Generate predictions from trained model
predictions = learner.predict(test_examples)
# Perform 10-fold cross-validation with a radial SVM;
# cross_validate() returns the per-fold evaluation results and the
# per-fold training set objective scores
learner = Learner('SVC')
fold_result_list, grid_search_scores = learner.cross_validate(train_examples)
46. SKLL API
import numpy as np
from os.path import join
from skll import FeatureSet, NDJWriter, Writer
# Create some training examples
num_train_examples = 100
labels = []
ids = []
features = []
for i in range(num_train_examples):
    label = "dog" if i % 2 == 0 else "cat"
    labels.append(label)
    ids.append("{}{}".format(label, i))
    features.append({"f1": np.random.randint(1, 4), "f2": np.random.randint(1, 4)})
feat_set = FeatureSet('training', ids, labels=labels, features=features)
# Write them to a file
train_path = join(_my_dir, 'train', 'test_summary.jsonlines')
Writer.for_path(train_path, feat_set).write()
# Or use the format-specific writer directly
NDJWriter(train_path, feat_set).write()