FreeDiscovery - information retrieval and e-Discovery in Python, Roman Yurchak

PyParis 2017
http://pyparis.org

Published in: Technology

  1. [no text recovered]
  2. Understanding of the subject (human); Extract features (human); Search query; Predict search scores; Review a few results (human); Search results
  3. Understanding of the subject (human); Parse and vectorize documents; Predict scores; Review a few results (human); Document labels; Train supervised machine learning model
  4. [no text recovered]
  5. github.com/FreeDiscovery/FreeDiscovery
  6. Raw documents; Text vectorization (BoW, n-grams); Latent Semantic Indexing (LSI/LSA); Logistic Regression, SVM, xgboost, ...; K-Nearest Neighbors; Birch + cluster labeling; DBSCAN, I-Match, simhash; JWZ algorithm; Sparse matrix (10-100k dim); Dense matrix (100-300 dim)
  7. FreeDiscovery core; Model / data persistence; Document ID mapping; REST API server; Nginx proxy (optional)
  8. flask; marshmallow, webargs; Werkzeug, gunicorn; flask-apispec; bootprint-openapi; Sphinx
  9. [no text recovered]
  10. MiniBatchKMeans; BIRCH; DBSCAN; HDBSCAN. ¹ hdbscan.readthedocs.io/en/latest/performance_and_scalability.html
  11. simhash-py
  12. Recall vs Documents Retrieved (Logistic Regression CV)
  13. The average performance variation from the baseline run with Logistic Regression CV (BoW, log TF-IDF weight) for the ERDM dataset (1000 train size, 700000 test size).
  14. Reviewer; Document labels; TAR system; Scores
  15. predict
  16. joblib; pandas; HashingVectorizer
  17. Search query: time grid search. Better search query? scikit-learn project: 9100 issues / PR; 850 open issues; 540 open PR; 90k comments
  18. [no text recovered]
  19. David Grossman, Eugene Yang, Ophir Frieder
  20. [no text recovered]
  21. github.com/FreeDiscovery/FreeDiscovery; freediscovery.io/doc/stable; @RomanYurchak
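
Slide 12 plots recall against the number of documents retrieved. The slides do not show how that curve is computed, so the following is only a minimal sketch of the usual definition: rank documents by predicted score, then track the fraction of all relevant documents found after each review step. The function name `recall_curve` and the toy scores and labels are hypothetical and are not taken from FreeDiscovery.

```python
from typing import List, Sequence, Tuple


def recall_curve(scores: Sequence[float],
                 labels: Sequence[int]) -> List[Tuple[int, float]]:
    """Return (documents_retrieved, recall) pairs for a score-ranked review.

    scores: predicted relevance score per document (higher = more relevant)
    labels: ground-truth relevance per document (1 = relevant, 0 = not)
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    total_relevant = sum(labels)
    if total_relevant == 0:
        raise ValueError("recall is undefined without relevant documents")

    # Review documents in order of decreasing predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    curve = []
    found = 0
    for n_retrieved, (_, label) in enumerate(ranked, start=1):
        found += label
        curve.append((n_retrieved, found / total_relevant))
    return curve


# Toy data (hypothetical): five documents, two of them relevant.
scores = [0.9, 0.2, 0.75, 0.4, 0.1]
labels = [1, 0, 0, 1, 0]
curve = recall_curve(scores, labels)
for n_retrieved, recall in curve:
    print(n_retrieved, recall)
```

In this toy ranking, full recall is reached after reviewing three of the five documents. A better model pushes the curve toward reaching high recall with fewer documents reviewed.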

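
Slides 6 and 11 mention simhash for near-duplicate detection. As a rough illustration of the idea only, here is a pure-Python sketch of a 64-bit simhash over word tokens, compared by Hamming distance. It does not use or reproduce the simhash-py API. The tokenizer, the choice of MD5 as the token hash, and the function names are all assumptions made for this example.

```python
import hashlib
import re

N_BITS = 64  # fingerprint width (assumption for this sketch)


def simhash(text: str) -> int:
    """Compute a 64-bit simhash fingerprint from lowercase word tokens."""
    weights = [0] * N_BITS
    for token in re.findall(r"\w+", text.lower()):
        # Hash each token to 64 bits (first 8 bytes of its MD5 digest).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for bit in range(N_BITS):
            weights[bit] += 1 if (token_hash >> bit) & 1 else -1

    # A fingerprint bit is set when the votes for it are positive.
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


original = simhash("Please review the attached contract before Friday")
edited = simhash("Please review the attached contract before Monday")
unrelated = simhash("Quarterly sales figures for the northern region")

print(hamming_distance(original, edited))
print(hamming_distance(original, unrelated))
```

Documents whose fingerprints differ in only a few bits are candidate near-duplicates. The distance threshold is a tuning choice that the slides do not specify.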