Reproducible data science and business solutions

From reproducible data science to business solutions
April 21st, 2021

We’ll be talking about
● Translation of business problems to technical solutions
● Secure medical records
● Problems in computer vision
○ Image quality enhancement aka ‘beautification’
○ Image similarity evaluation aka ‘matching’
○ Image classification aka ‘tagging’

About me
Antonio Rueda-Toicen
Senior Data Scientist at Parkling GmbH
● Work on computer vision
● Background in computer science & biomedical applications
● Previously worked in academia, now teach data science at DSR and Thinkful
● Currently host the Berlin Computer Vision Group (look us up on Meetup!)

Airmed Foundation: secure medical records with IPFS and Hyperledger Fabric
https://airmedfoundation.thechain.tech/

Airmed Foundation: secure medical records with IPFS and Hyperledger Fabric
https://github.com/the-chain/airmedfoundation-terminal

What is ‘computer vision’?
What a human sees vs. what the computer ‘sees’

Why do we do computer vision at HomeToGo?
● We are a search engine for vacation rentals
● We have 17 million offers and hundreds of millions of images, the largest vacation rental inventory in the world
● Users want to envision the experience of a rental before booking

Image quality enhancement aka ‘beautification’

Industry story - Airbnb case

https://www.airbnb.com/professional_photography

Why do we need image beautification at HomeToGo?

Problem: we don’t control image acquisition

What does a change in image quality look like?
                 iPhone 3GS camera     Canon 70D (DSLR camera)
Resolution       3 MP                  20 MP
Image size       2048 x 1536           3648 x 2432
Images shown     Original, blurred     Original, blurred

Industry’s current practices for enhancing images

Let’s look at some beautified images

Image similarity evaluation aka ‘matching’

Why do we need to match offers?
● Inventory understanding (we have a lot of it!)
● Providing the best deals for our users (sample use case: strike prices)

Evaluating similarity
● Semantic similarity can be different to perceptual similarity
● We use a variety of distance and similarity metrics
● We also use different models ensembled in a deduplication pipeline

Perceptual Hashing
94088af86c03827 94088af86c03827
Edit distance = 0

Perceptual Hashing
94088af86c03827 94088af86c03899
Edit distance = 2
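The comparison on the two slides above can be reproduced with a short sketch. It assumes the hashes are hex strings and that "edit distance" means Levenshtein distance over their characters; the talk does not say which hashing library or distance was used, so the function names `edit_distance` and `hamming_bits` are illustrative only. A bit-level Hamming distance is included because it is another common way to compare perceptual hashes.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def hamming_bits(hex_a: str, hex_b: str) -> int:
    """Number of differing bits between two hex hashes of equal length."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Hash strings copied from the slides
same = edit_distance("94088af86c03827", "94088af86c03827")
near = edit_distance("94088af86c03827", "94088af86c03899")
print(same, near)
```

A small distance suggests near-duplicate images; where to set the threshold is a tuning decision that depends on the inventory.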

How we evaluate our matching algorithms
True Positive = duplicate labeled as duplicate
True Negative = non-duplicate labeled as non-duplicate
False Positive = non-duplicate labeled as duplicate
False Negative = duplicate labeled as non-duplicate
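The four outcomes above are usually summarized as precision and recall. The sketch below is one way to compute them, assuming each image pair carries a ground-truth flag and a predicted flag (True = duplicate). The function names and the sample labels are hypothetical and are not results from the talk.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for boolean labels where True means 'duplicate'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return tp, tn, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Precision: share of predicted duplicates that are real duplicates.
    Recall: share of real duplicates that were found."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical labels for five image pairs
truth = [True, True, False, False, True]
predicted = [True, False, True, False, True]
tp, tn, fp, fn = confusion_counts(truth, predicted)
print(tp, tn, fp, fn, precision_recall_f1(tp, fp, fn))
```

For deduplication, a false positive merges two different offers and a false negative leaves a duplicate in place, so the two errors may deserve different weights.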

Convolutional neural networks as feature extractors
Cosine similarity = 0.65
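Cosine similarity compares the direction of two feature vectors and ignores their length. The sketch below assumes the vectors have already been extracted, for example from a late layer of a convolutional network; the short vectors used here are made up for illustration and do not reproduce the 0.65 value from the slide.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical feature vectors for two images
features_a = [0.2, 0.8, 0.1, 0.4]
features_b = [0.3, 0.7, 0.0, 0.5]
print(cosine_similarity(features_a, features_b))
```

Values near 1 indicate that the two images activate the network in a similar way; values near 0 indicate unrelated content.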

Image classification aka ‘tagging’

Image Classification
What we see:
● Outdoor
● Building
● Snow

Image Classification
What the computer “sees”:
● Outdoor
● Building
● Snow? Do we care about snow?
○ Enough of these images need to be shown to the algorithm
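Whether "enough of these images" exist for a label can be checked by counting how often each label appears in the annotated set. This is a minimal sketch, assuming each image carries a list of labels; the function name `rare_labels`, the threshold `min_count` and the sample annotations are hypothetical.

```python
from collections import Counter


def rare_labels(annotations, min_count):
    """Return the labels that appear on fewer than min_count images, sorted by name."""
    counts = Counter(label for labels in annotations for label in set(labels))
    return sorted(label for label, n in counts.items() if n < min_count)


# Hypothetical annotations: one list of labels per image
annotations = [
    ["outdoor", "building"],
    ["outdoor", "snow"],
    ["outdoor"],
]
print(rare_labels(annotations, min_count=2))
```

Labels that come back from this check are candidates for collecting more examples or for dropping from the taxonomy.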

Why do we do image classification?
● Inventory understanding
○ How many of our offers have pools, balconies, sea views?
○ Which images have better conversion rates?
● Targeted advertisement (SEO, CRM)
○ Newsletters
○ SEO landing pages

What do users care about?
● We do user research to define data taxonomies
● We also define which rules are convenient/feasible for our algorithms
○ E.g. ‘if the sky is visible but we are looking at it through a window, the image should be labeled as “indoor”’

Labels for hard cases
● Bedroom
● Terrace
● Desk
● Vegetation
● Do we have enough images that combine these things?

Labels for hard cases
● Should we have added ‘neon lights’ to our taxonomy?
● How many of these things do we have?
● Should we invest in this?

Getting more out of the humans in the loop
“Anybody that is trying to solve the problem of image tagging within a company ends up rediscovering ‘active learning’, which is just using your model to guide your labeling. Why should we be labeling everything if the machine is only doing mistakes on these two hard classes?”
Jeremy Howard
● Services like Amazon SageMaker Ground Truth and human labeling in the Google Vision API platform make this easier
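One common way to "use your model to guide your labeling" is uncertainty sampling: send to human annotators the images on which the model is least confident. The sketch below assumes the model outputs a list of class probabilities per image; the function name `select_for_labeling`, the image ids and the probabilities are hypothetical, and this is not presented as the pipeline used in the talk.

```python
def select_for_labeling(predictions, budget):
    """Pick the `budget` images whose top class probability is lowest.

    predictions: dict mapping image id -> list of class probabilities.
    """
    ranked = sorted(predictions.items(), key=lambda item: max(item[1]))
    return [image_id for image_id, _ in ranked[:budget]]


# Hypothetical model outputs for three images and two classes
predictions = {
    "img_a": [0.90, 0.10],  # confident, low labeling priority
    "img_b": [0.55, 0.45],  # uncertain
    "img_c": [0.60, 0.40],  # uncertain
}
print(select_for_labeling(predictions, budget=2))
```

After the selected images are labeled, the model is retrained and the selection is repeated, so labeling effort concentrates on the hard classes.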

Summary
● Creating value starts with a careful consideration of the business problem :)

https://datascienceretreat.com/

https://www.meetup.com/Berlin-Computer-Vision-Group/
