
Document Classification using the Python Natural Language Toolkit


  1. Document Classification using the Natural Language Toolkit
     Ben Healey
     http://benhealey.info
     @BenHealey
  2. Source: iStockPhoto
  3. http://upload.wikimedia.org/wikipedia/commons/b/b6/FileStack_retouched.jpg
     The Need for Automation
  4. Take ur pick!
     http://upload.wikimedia.org/wikipedia/commons/d/d6/Cat_loves_sweets.jpg
  5. The Development Set
     Features:
       - # Words
       - % ALLCAPS
       - Unigrams
       - Sender
       - And so on.
     Class: (known for each development document)
     [Diagram: Development Set -> Classification Algo. -> Trained Classifier (Model);
      New Document (Class Unknown) -> Document Features -> Trained Classifier -> Classified Document]
  6. Relevant NLTK Modules
     Feature extraction:
       from nltk.corpus import words, stopwords
       from nltk.stem import PorterStemmer
       from nltk.tokenize import WordPunctTokenizer
       from nltk.collocations import BigramCollocationFinder
       from nltk.metrics import BigramAssocMeasures
     See http://text-processing.com/demo/ for examples.
     Machine learning algorithms and tools:
       from nltk.classify import NaiveBayesClassifier
       from nltk.classify import DecisionTreeClassifier
       from nltk.classify import MaxentClassifier
       from nltk.classify import WekaClassifier
       from nltk.classify.util import accuracy
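The collocation modules above are the only ones the later slides do not demonstrate. A minimal sketch (not from the deck) of how BigramCollocationFinder and BigramAssocMeasures are typically combined, keeping the top-n bigrams by chi-squared score as boolean features:

    from nltk.collocations import BigramCollocationFinder
    from nltk.metrics import BigramAssocMeasures

    def bigram_features(words, n=20):
        # Score all bigrams in the token list and keep the n most informative.
        finder = BigramCollocationFinder.from_words(words)
        top_bigrams = finder.nbest(BigramAssocMeasures.chi_sq, n)
        return dict((bigram, True) for bigram in top_bigrams)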
  7. NaiveBayesClassifier
     P(label|features) = P(label) * P(features|label) / P(features)
     With the naive independence assumption over features f1..fn:
     P(label|features) = P(label) * P(f1|label) * ... * P(fn|label) / P(features)
     http://61.153.44.88/nltk/0.9.5/api/nltk.classify.naivebayes-module.html
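To make the formula concrete, a hedged toy example (feature names invented for illustration): NLTK's NaiveBayesClassifier estimates P(label) and each P(fi|label) from (feature dict, label) pairs, then classifies new feature dicts.

    from nltk.classify import NaiveBayesClassifier

    # Tiny hand-made development set of (features, label) pairs.
    train_set = [
        ({'contains_free': True,  'message_length': 'Short'}, 'spam'),
        ({'contains_free': False, 'message_length': 'Long'},  'ham'),
        ({'contains_free': True,  'message_length': 'Short'}, 'spam'),
        ({'contains_free': False, 'message_length': 'Short'}, 'ham'),
    ]
    classifier = NaiveBayesClassifier.train(train_set)
    # Applies the Bayes rule above to pick the most probable label.
    print classifier.classify({'contains_free': True, 'message_length': 'Short'})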
  8. http://www.educationnews.org/commentaries/opinions_on_education/91117.html
  9. 517,431 Emails
     Source: iStockPhoto
  10. Prep: Extract and Load
      Sample* of 20,581 plaintext files
      import MySQLdb, os, random, string
      - MySQL accessed via the MySQLdb module
      - File and string manipulation
      - Key fields separated out: To, From, CC, Subject, Body
      * Folders for 7 users with a large number of emails, so not representative!
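A rough sketch of what this extract-and-load step might look like; the table and column names here are hypothetical, and the author's actual code is at benhealey.info.

    import MySQLdb, os

    # Hypothetical connection details and `messages` table.
    conn = MySQLdb.connect(host='localhost', user='enron', passwd='...', db='enron')
    cur = conn.cursor()
    for name in os.listdir('maildir'):
        body = open(os.path.join('maildir', name)).read()
        # Real code would also parse To/From/CC/Subject out of the headers here.
        cur.execute("INSERT INTO messages (filename, body) VALUES (%s, %s)",
                    (name, body))
    conn.commit()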
  11. Prep: Extract and Load (continued)
      A random number allocated to each message (for sampling)
      Some feature extraction: #To, #CCd, #Words, %digits, %CAPS
      Note: more cleaning could be done
      Code at benhealey.info
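A hedged sketch of computing the surface features listed above; the function and key names are illustrative, not the author's.

    def surface_features(body):
        # Word count plus character-level percentages over non-whitespace chars.
        words = body.split()
        chars = [c for c in body if not c.isspace()]
        return {
            'num_words':  len(words),
            'pct_digits': 100.0 * sum(c.isdigit() for c in chars) / max(len(chars), 1),
            'pct_caps':   100.0 * sum(c.isupper() for c in chars) / max(len(chars), 1),
        }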
  12. From: james.steffes@enron.com
      To: louise.kitchen@enron.com
      Subject: Re: Agenda for FERC Meeting RE: EOL
      Louise --
      We had decided that not having Mark in the room gave us the ability to wiggle if questions on CFTC vs. FERC regulation arose. As you can imagine, FERC is starting to grapple with the issue that financial trades in energy commodities is regulated under the CEA, not the Federal Power Act or the Natural Gas Act.
      Thanks,
      Jim
  13. From: pete.davis@enron.com
      To: pete.davis@enron.com
      Subject: Start Date: 1/11/02; HourAhead hour: 5;
      Start Date: 1/11/02; HourAhead hour: 5; No ancillary schedules awarded. No variances detected.
      LOG MESSAGES:
      PARSING FILE -->> O:\Portland\West Desk\California Scheduling\ISO Final Schedules\2002011105.txt
  14. Class[es] assigned for 1,000 randomly selected messages: [chart]
  15. Prep: Show us ur Features
      NLTK toolset:
        from nltk.corpus import words, stopwords
        from nltk.stem import PorterStemmer
        from nltk.tokenize import WordPunctTokenizer
        from nltk.collocations import BigramCollocationFinder
        from nltk.metrics import BigramAssocMeasures
      Custom code:
        def extract_features(record, stemmer, stopset, tokenizer):
            …
      Code at benhealey.info
  16. Prep: Show us ur Features
      Features in boolean or nominal form:
        if record['num_words_in_body'] <= 20:
            features['message_length'] = 'Very Short'
        elif record['num_words_in_body'] <= 80:
            features['message_length'] = 'Short'
        elif record['num_words_in_body'] <= 300:
            features['message_length'] = 'Medium'
        else:
            features['message_length'] = 'Long'
  17. Prep: Show us ur Features
      Features in boolean or nominal form:
        text = record['msg_subject'] + " " + record['msg_body']
        tokens = tokenizer.tokenize(text)
        words = [stemmer.stem(x.lower()) for x in tokens
                 if x not in stopset and len(x) > 1]
        for word in words:
            features[word] = True
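Putting slides 15-17 together, a plausible shape for the full extract_features function. This is a reconstruction under assumptions (that `record` is a dict of the fields loaded earlier), not the author's published code.

    def extract_features(record, stemmer, stopset, tokenizer):
        features = {}
        # Nominal message-length feature (slide 16).
        n = record['num_words_in_body']
        if n <= 20:
            features['message_length'] = 'Very Short'
        elif n <= 80:
            features['message_length'] = 'Short'
        elif n <= 300:
            features['message_length'] = 'Medium'
        else:
            features['message_length'] = 'Long'
        # Boolean unigram features over stemmed, stopword-filtered tokens (slide 17).
        text = record['msg_subject'] + " " + record['msg_body']
        tokens = tokenizer.tokenize(text)
        words = [stemmer.stem(x.lower()) for x in tokens
                 if x not in stopset and len(x) > 1]
        for word in words:
            features[word] = True
        return features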
  18. Sit. Say. Heel.
      random.shuffle(dev_set)
      cutoff = len(dev_set) * 2 / 3
      train_set = dev_set[:cutoff]
      test_set = dev_set[cutoff:]
      classifier = NaiveBayesClassifier.train(train_set)
      print 'accuracy for > ', subject, ':', accuracy(classifier, test_set)
      classifier.show_most_informative_features(10)
  19. Most Important Features
  20. Most Important Features
  21. Most Important Features
  22. Performance: ‘IT’ Model
      IMPORTANT: These are ‘cheat’ scores!
  23. Performance: ‘Deal’ Model
      IMPORTANT: These are ‘cheat’ scores!
  24. Performance: ‘Social’ Model
      IMPORTANT: These are ‘cheat’ scores!
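One way to go beyond accuracy and ‘cheat’ scores is to report precision and recall per class on the held-out test set. A sketch in the style of the StreamHacker blog cited in the resources, assuming the `classifier` and `test_set` from slide 18:

    import collections
    from nltk.metrics import precision, recall

    # Bucket item indices by true label and by predicted label.
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)
    for i, (feats, label) in enumerate(test_set):
        refsets[label].add(i)
        testsets[classifier.classify(feats)].add(i)
    for label in refsets:
        print label, 'precision:', precision(refsets[label], testsets[label])
        print label, 'recall:', recall(refsets[label], testsets[label])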
  25. Don’t get burned.
      - Biased samples
      - Accuracy and rare events
      - Features and prior knowledge
      - Good modelling is iterative!
      - Resampling and robustness (see the sketch after this list)
      - Learning cycles
      http://www.ugo.com/movies/mustafa-in-austin-powers
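For the resampling point, one minimal approach (not from the deck) is to repeat slide 18's shuffle/split/train cycle several times and look at the spread of scores rather than trusting a single split; this assumes the `dev_set` of (features, label) pairs from slide 18.

    import random
    from nltk.classify import NaiveBayesClassifier
    from nltk.classify.util import accuracy

    scores = []
    for trial in range(10):
        # Fresh random 2/3 train, 1/3 test split on each trial.
        random.shuffle(dev_set)
        cutoff = len(dev_set) * 2 / 3
        classifier = NaiveBayesClassifier.train(dev_set[:cutoff])
        scores.append(accuracy(classifier, dev_set[cutoff:]))
    print 'accuracy per trial:', scores
    print 'mean accuracy:', sum(scores) / len(scores)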
  31. Resources
      NLTK:
        http://www.nltk.org/
        http://www.nltk.org/book
      Enron email datasets:
        http://www.cs.umass.edu/~ronb/enron_dataset.html
      Free online machine learning course from Stanford:
        http://ml-class.com/ (starts in October)
      StreamHacker blog by Jacob Perkins:
        http://streamhacker.com
