Gentle introduction to Machine Learning
1. A Gentle Introduction to Machine Learning
Roman Orac, 1Tap Machine Learning & Data Analysis
2. 1Tap is a Fully Automated Accounting Platform for the Self Employed*
* Sole Trader, Sole Proprietor, Freelancer, Contractor, Independent, Non Incorporated Businesses
3. The Self Employed can’t buy the stuff they want
Profit…
Welfare…
Taxes…
No idea
That is a problem for the new year...
Denied...
Hopefully I get better real soon...
Credit…
8. What is Machine Learning?
Diagram: Training data → Pre-processing → Machine Learning algorithm → Classifier
Diagram: New samples → Classifier → Prediction
● Machine Learning is the science of getting computers to act without being explicitly programmed
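The flow on this slide (training data, pre-processing, learning algorithm, classifier, prediction) can be sketched in a few lines. This is only an illustration, not the method used in the talk: it assumes a very simple nearest-centroid learner, and the function names and the height/weight samples are made up.

```python
from collections import defaultdict

def preprocess(sample):
    """Pre-processing step: convert raw string values to floats."""
    return [float(value) for value in sample]

def train(samples, labels):
    """Learning step: compute one mean point (centroid) per label."""
    grouped = defaultdict(list)
    for sample, label in zip(samples, labels):
        grouped[label].append(preprocess(sample))
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }

def predict(classifier, sample):
    """Prediction step: return the label whose centroid is closest."""
    point = preprocess(sample)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(point, centroid))

    return min(classifier, key=lambda label: squared_distance(classifier[label]))

# Made-up training data: [height_cm, weight_kg] with a label per sample.
training_data = [["150", "50"], ["160", "55"], ["180", "85"], ["190", "95"]]
training_labels = ["small", "small", "large", "large"]

classifier = train(training_data, training_labels)  # the learned "Classifier"
prediction = predict(classifier, ["185", "90"])      # a "New sample"
print(prediction)
```

Nothing in `predict` mentions "small" or "large" explicitly. The rule comes from the training data, which is the point of the definition above.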
9. Predict survival on the Titanic
In 1912 the Titanic sank, killing 1,502 out of 2,224 passengers and crew.
Some groups of people were more likely to survive than others.
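The two totals above already give the overall survival rate. A small worked calculation, using only the numbers from the slide:

```python
total_aboard = 2224  # passengers and crew, from the slide
deaths = 1502        # from the slide

survivors = total_aboard - deaths
death_rate = deaths / total_aboard
survival_rate = survivors / total_aboard

print(f"survivors: {survivors}")
print(f"death rate: {death_rate:.1%}")
print(f"survival rate: {survival_rate:.1%}")
```

On these totals, a model that always predicts "died" would already be right about two times out of three, so a useful classifier has to beat that baseline.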
10. Let’s look at the data
Abbreviations
● Embarked: Port of embarkation
○ C = Cherbourg
○ Q = Queenstown
○ S = Southampton
● Parch: Number of parents/children aboard
● Pclass: Passenger's class
● SibSp: Number of siblings/spouses aboard
● Survived: Survived (1) or died (0)
● Ticket: Ticket number
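As a rough sketch of how these columns might be loaded, the snippet below reads CSV text into typed records and expands the Embarked codes. The `Passenger` class, the `load_passengers` helper and the two rows are made up for illustration; they are not real passenger records.

```python
import csv
import io
from dataclasses import dataclass

# Port codes as listed on the slide.
EMBARKED_PORTS = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

@dataclass
class Passenger:
    survived: bool  # Survived: 1 = survived, 0 = died
    pclass: int     # Pclass: passenger's class
    sibsp: int      # SibSp: siblings/spouses aboard
    parch: int      # Parch: parents/children aboard
    ticket: str     # Ticket: ticket number
    embarked: str   # Embarked: full port name

# Made-up rows for illustration only.
RAW_CSV = """Survived,Pclass,SibSp,Parch,Ticket,Embarked
0,3,1,0,T-0001,S
1,1,0,2,T-0002,C
"""

def load_passengers(text):
    """Parse CSV text into a list of Passenger records."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        Passenger(
            survived=row["Survived"] == "1",
            pclass=int(row["Pclass"]),
            sibsp=int(row["SibSp"]),
            parch=int(row["Parch"]),
            ticket=row["Ticket"],
            embarked=EMBARKED_PORTS[row["Embarked"]],
        )
        for row in reader
    ]

passengers = load_passengers(RAW_CSV)
for passenger in passengers:
    print(passenger)
```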
11. Understanding the data
● Distribution of fares for passengers who survived and for those who did not
● Many passengers with cheaper fares died
● Is fare a good predictive variable?
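One simple way to probe whether fare is predictive is to compare the two groups directly. The sketch below uses made-up (fare, survived) pairs and an arbitrary threshold of 20, so the numbers only illustrate the idea and say nothing about the real data.

```python
from statistics import median

# Made-up (fare, survived) pairs for illustration only.
records = [
    (7.0, 0), (8.0, 0), (10.0, 0), (30.0, 0),
    (9.0, 1), (25.0, 1), (50.0, 1), (80.0, 1),
]

fares_died = [fare for fare, survived in records if survived == 0]
fares_survived = [fare for fare, survived in records if survived == 1]

median_died = median(fares_died)
median_survived = median(fares_survived)

def survival_rate(rows):
    """Fraction of rows whose survived flag is 1."""
    return sum(survived for _, survived in rows) / len(rows)

THRESHOLD = 20.0  # arbitrary cut-off, chosen only for this example
cheap = [row for row in records if row[0] < THRESHOLD]
expensive = [row for row in records if row[0] >= THRESHOLD]

print(f"median fare, died: {median_died}")
print(f"median fare, survived: {median_survived}")
print(f"survival rate, cheap fares: {survival_rate(cheap):.0%}")
print(f"survival rate, expensive fares: {survival_rate(expensive):.0%}")
```

If the two groups differ clearly, as they do in this toy data, fare carries signal. If the distributions overlap heavily, fare alone is a weak predictor.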
12. Most Important Step: Data preprocessing
Original data → preprocessing → Preprocessed data
● Clean the data
● Encode attributes
● Fill in missing values
● Add new attributes
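The four steps above can be sketched on a few rows. Everything here is an assumption made for illustration: the rows are made up, missing fares are filled with the median, ports are encoded as small integers, and `FamilySize` is a hypothetical new attribute built from SibSp and Parch.

```python
from statistics import median

EMBARKED_CODES = {"C": 0, "Q": 1, "S": 2}

# Made-up raw rows; None marks a missing value.
raw_rows = [
    {"Pclass": 3, "SibSp": 1, "Parch": 0, "Fare": 8.0, "Embarked": " s "},
    {"Pclass": 1, "SibSp": 0, "Parch": 2, "Fare": None, "Embarked": "C"},
    {"Pclass": 2, "SibSp": 0, "Parch": 0, "Fare": 12.0, "Embarked": "q"},
]

def preprocess(rows):
    """Clean, encode, fill and extend the rows; returns new dictionaries."""
    known_fares = [row["Fare"] for row in rows if row["Fare"] is not None]
    fallback_fare = median(known_fares)
    processed = []
    for row in rows:
        # Clean the data: normalise whitespace and letter case.
        port = row["Embarked"].strip().upper()
        # Fill in missing values: use the median of the known fares.
        fare = row["Fare"] if row["Fare"] is not None else fallback_fare
        processed.append({
            "Pclass": row["Pclass"],
            "Fare": fare,
            # Encode attributes: map the port letter to a number.
            "Embarked": EMBARKED_CODES[port],
            # Add new attributes: the passenger plus relatives aboard.
            "FamilySize": row["SibSp"] + row["Parch"] + 1,
        })
    return processed

preprocessed_rows = preprocess(raw_rows)
for row in preprocessed_rows:
    print(row)
```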
13. Decision Tree
● Use training set and build a decision tree model
● Use the model to predict new samples
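To make the two steps concrete, here is a minimal decision tree in plain Python. It is a sketch under stated assumptions: splits are chosen by Gini impurity, the [Pclass, Fare] rows are made up, and a real project would more likely use a library implementation.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure group, higher for mixed groups."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature, threshold) that lowers impurity most, or None."""
    best, best_score = None, gini(labels)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
            right = [l for r, l in zip(rows, labels) if r[feature] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (feature, threshold), score
    return best

def build_tree(rows, labels, max_depth=3):
    """Build the model: nested dicts for splits, plain labels for leaves."""
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    feature, threshold = split
    left = [(r, l) for r, l in zip(rows, labels) if r[feature] <= threshold]
    right = [(r, l) for r, l in zip(rows, labels) if r[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [l for _, l in left], max_depth - 1),
        "right": build_tree([r for r, _ in right], [l for _, l in right], max_depth - 1),
    }

def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Made-up training set: [Pclass, Fare] with Survived (1) or died (0).
train_rows = [[3, 7.0], [3, 8.0], [2, 13.0], [1, 50.0], [1, 80.0], [2, 30.0]]
train_labels = [0, 0, 0, 1, 1, 1]

tree = build_tree(train_rows, train_labels)
print(tree)
print(predict(tree, [1, 60.0]))
```

On this toy data a single split on fare separates the two groups, so the tree has one decision node and two leaves.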
15. Receipt categorization
Initial receipt categorization
● Based on company’s industry
● Deterministic categorization
● Many mis-categorizations
The Numbers
● 600K categorized receipts
● 40K users
● 80K new receipts every month
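The initial, deterministic approach can be pictured as a fixed lookup from industry to category. The mapping below is hypothetical, since the slides do not list the real industry or category names.

```python
# Hypothetical mapping; the real industries and categories are not in the slides.
INDUSTRY_TO_CATEGORY = {
    "restaurant": "Meals",
    "fuel station": "Travel",
    "office supplies": "Office costs",
}
DEFAULT_CATEGORY = "Uncategorized"

def categorize_by_industry(vendor_industry):
    """Deterministic rule: the same industry always yields the same category."""
    return INDUSTRY_TO_CATEGORY.get(vendor_industry.lower(), DEFAULT_CATEGORY)

# A sandwich bought at a fuel station is still filed under "Travel",
# because the rule never looks at what was actually purchased.
print(categorize_by_industry("Fuel Station"))
print(categorize_by_industry("Bakery"))
```

A rule like this is easy to reason about, but it ignores the receipt itself, which is one plausible source of the mis-categorizations mentioned above.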
17. Receipt categorization with ML
● Features:
○ user’s profession
○ vendor name, date, expense total and text
● Preprocessing:
○ Filter receipts
○ Recategorize most obvious receipts
● Train a classifier that categorizes receipts
● This approach improves categorization as receipt text adds more context
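A rough sketch of such a classifier follows. The slides do not say which algorithm 1Tap uses, so this assumes a small Naive Bayes model over words from the profession, vendor name and receipt text, and it leaves out date and expense total. The receipts, vendors and categories are made up.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(receipt):
    """Turn the text features of a receipt into lowercase words."""
    text = f"{receipt['profession']} {receipt['vendor']} {receipt['text']}"
    return re.findall(r"[a-z]+", text.lower())

def train(receipts, categories):
    """Count words per category and receipts per category."""
    word_counts = defaultdict(Counter)
    category_counts = Counter(categories)
    for receipt, category in zip(receipts, categories):
        word_counts[category].update(tokenize(receipt))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, category_counts, vocabulary

def predict(model, receipt):
    """Return the category with the highest log-probability score."""
    word_counts, category_counts, vocabulary = model
    total_receipts = sum(category_counts.values())
    best_category, best_score = None, -math.inf
    for category, count in category_counts.items():
        score = math.log(count / total_receipts)
        total_words = sum(word_counts[category].values())
        for word in tokenize(receipt):
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            seen = word_counts[category][word] + 1
            score += math.log(seen / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Made-up training receipts for illustration only.
receipts = [
    {"profession": "plumber", "vendor": "City Hardware", "text": "copper pipe fittings"},
    {"profession": "plumber", "vendor": "Corner Fuel", "text": "diesel fuel"},
    {"profession": "designer", "vendor": "Corner Fuel", "text": "sandwich coffee"},
    {"profession": "designer", "vendor": "Cafe Roma", "text": "coffee lunch"},
]
categories = ["Materials", "Travel", "Meals", "Meals"]

model = train(receipts, categories)
new_receipt = {"profession": "plumber", "vendor": "Corner Fuel", "text": "coffee sandwich"}
predicted_category = predict(model, new_receipt)
print(predicted_category)
```

In this toy example the vendor is a fuel station, yet the words on the receipt pull the prediction towards meals. That is the kind of extra context the last bullet refers to.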