Empower Public Health
through Social Media
Zhen Wang, Ph.D.
Insight Health Data Science
Natural Language Processing
Text -> Cleaning, Tokenizing -> Convert to Feature Vectors (e.g., TF-IDF: Downweight, Normalize) -> Numbers -> Machine Learning
Machine Learning: “I’m really good with numbers!”
Example documents:
“I like food!”
“Food is good!”
“I had some good food.”
After cleaning and tokenizing:
i, like, food
food, is, good
i, had, some, good, food
Feature vectors (one row per document, one column per word):
i  like  food  is  good  had  some
1  1     1     0   0     0    0
0  0     1     1   1     0    0
1  0     1     0   1     1    1
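The word-count vectors on this slide can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function names (`tokenize`, `build_vocabulary`, `count_vectors`) and the regular expression are illustrative assumptions, not taken from the talk or its repository.

```python
import re

docs = ["I like food!", "Food is good!", "I had some good food."]

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(token_lists):
    """Give each distinct token a column index, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def count_vectors(token_lists, vocab):
    """Build one row of word counts per document."""
    rows = []
    for tokens in token_lists:
        row = [0] * len(vocab)
        for tok in tokens:
            row[vocab[tok]] += 1
        rows.append(row)
    return rows

tokens = [tokenize(d) for d in docs]
vocab = build_vocabulary(tokens)
matrix = count_vectors(tokens, vocab)

print(list(vocab))
for row in matrix:
    print(row)
```

The columns come out in the same order as on the slide because the vocabulary is built in order of first appearance.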
Text Classification
Figure: Distribution of Tweets (x-axis: Normalized Retweet Counts; y-axis: Number of Tweets)
● Sample Imbalance
● Classification (0/1: Not / Retweeted)
● Logistic Regression
Threshold: 0.005
Misclassification Error: 22%
Figure: Train / Test split of classes 0 and 1, with downsampling
Figure: Normalized Confusion Matrix (cell values as extracted: 0.81, 0.74, 0.26, 0.19)
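Downsampling and the normalized confusion matrix named on this slide can be sketched as follows. This is an illustration on toy labels, not the talk's code. It assumes the matrix is row-normalized (each row of true labels sums to 1), which the slide does not state, and the helper names are hypothetical.

```python
import random
from collections import Counter

def downsample(samples, labels, seed=0):
    """Randomly drop samples so every class is as small as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    n_min = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        balanced.extend((x, y) for x in rng.sample(xs, n_min))
    rng.shuffle(balanced)
    return [x for x, _ in balanced], [y for _, y in balanced]

def normalized_confusion_matrix(y_true, y_pred, classes=(0, 1)):
    """Entry [i][j] is the fraction of true class i predicted as class j (assumed row-normalized)."""
    counts = {c: Counter() for c in classes}
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    matrix = []
    for t in classes:
        total = sum(counts[t].values())
        matrix.append([counts[t][p] / total if total else 0.0 for p in classes])
    return matrix

def misclassification_error(y_true, y_pred):
    """Fraction of predictions that differ from the true label."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy data: 7 tweets of class 0 and 3 of class 1 (made-up numbers).
xs, ys = downsample(list(range(10)), [0] * 7 + [1] * 3)
class_sizes = Counter(ys)

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]
cm = normalized_confusion_matrix(y_true, y_pred)
err = misclassification_error(y_true, y_pred)
print(class_sizes, cm, err)
```

On a balanced set, the misclassification error equals the average of the two off-diagonal entries of the row-normalized matrix.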
Codes: github.com/zweinstein/SpreadHealth_dev
Zhen (Jen) Wang
Beta Tester since 2015
Editor since 2015
Traditional Medicine
Science Fiction
Public Speaking
Online Education
Ph.D. in Physical Chemistry
Thank you!
See the App in Action:
Text Preprocessing Pipeline
Text Cleaning:
● Convert to lower case
● Replace URL, #, and @
● Remove special characters other than emoticons
● Remove stopwords
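The cleaning steps above can be sketched as a single function. Everything specific here is an assumption for illustration: the placeholder words (`url`, `user`, `hashtag`), the emoticon pattern, and the tiny stopword list are not from the talk, which does not say what URLs, `#`, and `@` were replaced with.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "and", "of", "to", "in"}

URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
# Simple text emoticons such as :) ;-) =D :P (assumed pattern, not exhaustive).
EMOTICON_RE = re.compile(r"[:;=][\-o\*']?[\)\]\(\[dDpP/\\|]")

def clean_tweet(text):
    """Lowercase, replace URL/#/@, strip special characters, drop stopwords, keep emoticons."""
    # Replace URLs first so the ':/' in 'http://' is not mistaken for an emoticon.
    text = URL_RE.sub(" url ", text)
    # Collect emoticons before lowercasing so ':D' and ':P' keep their form.
    emoticons = EMOTICON_RE.findall(text)
    text = text.lower()
    text = re.sub(r"@\w+", " user ", text)            # replace mentions
    text = re.sub(r"#(\w+)", r" hashtag \1 ", text)   # mark hashtags, keep the word
    text = re.sub(r"[^a-z0-9\s]", " ", text)          # remove remaining special characters
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words + emoticons)

cleaned = clean_tweet("Check THIS out: http://t.co/abc #Diabetes @WHO is great :)")
print(cleaned)
```

Emoticons are appended at the end of the cleaned string, so their position in the tweet is lost. That is acceptable for a bag-of-words model, which ignores word order anyway.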
Tokenizing:
● Split each document into individual elements (tokens)
● Bag-of-Words or N-grams
● Stemming
○ Porter Stemmer was used
○ Snowball and Lancaster stemmers are faster but more aggressive
○ Lemmatization is computationally more expensive but has little impact on the performance of text classification
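Tokenizing into unigrams and n-grams can be sketched as below, using only the standard library and assuming the text has already been cleaned. Stemming is left out: the Porter stemmer mentioned above is available in NLTK as `nltk.stem.porter.PorterStemmer`, but this sketch does not depend on it. The function names are illustrative.

```python
def tokenize(text):
    """Split an already-cleaned document on whitespace."""
    return text.split()

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def unigrams_and_bigrams(text):
    """Features for a 1-gram plus 2-gram model."""
    tokens = tokenize(text)
    return tokens + ngrams(tokens, 2)

tokens = tokenize("i had some good food")
bigrams = ngrams(tokens, 2)
features = unigrams_and_bigrams("i had some good food")
print(bigrams)
```

With n = 1 this reduces to the plain bag-of-words model. Larger n keeps some word order at the cost of a larger and sparser vocabulary.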
Term Frequency-Inverse Document Frequency (tf-idf):
Used to downweight frequently occurring words in the feature vectors.
Term Frequency, tf(t,d): the number of times a term t occurs in a document d.
Document Frequency, df(d,t): the number of documents d that contain a term t.
The implementation in Scikit-learn
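The slide's formula did not survive extraction. As a hedged illustration, the sketch below follows the default behaviour documented for scikit-learn's `TfidfTransformer` (`smooth_idf=True`, `norm='l2'`): idf(t) = ln((1 + n) / (1 + df(t))) + 1, tf-idf(t,d) = tf(t,d) × idf(t), and each document vector is then scaled to unit Euclidean length. It is written in plain Python so it runs without scikit-learn. Whether the talk used these defaults is an assumption.

```python
import math

def tfidf(count_rows):
    """Smoothed tf-idf with L2 row normalization, mirroring scikit-learn's defaults."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # df[j]: number of documents in which term j occurs at least once.
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in count_rows:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result

# Word counts for the three example sentences; columns: i like food is good had some
counts = [
    [1, 1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [1, 0, 1, 0, 1, 1, 1],
]
weights = tfidf(counts)
for row in weights:
    print([round(v, 3) for v in row])
```

In the first document, "food" occurs in all three documents and so gets the lowest weight, while "like" occurs in only one and gets the highest. This is the downweighting of frequent words described above.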
Prediction Algorithms
● Train Dataset: 10000 tweets on diabetes (4782 retweeted)
● Test Set Accuracy (Random Chance 0.49 on positive class):
○ KNN: 60%
○ Naive Bayes: 67%
○ Logistic regression: 75% (chosen and tested on imbalanced test data)
● Potential Improvements:
○ Decision Trees with Bagging/Boosting (e.g., Random Forest, XGBoost)
○ Other Features:
■ Polarity & Sentiment
■ Length
● Out-of-Core Incremental Learning with Stochastic Gradient Descent (Advantages of Logistic Regression…)
● Automatic Update to SQLite Database and to the Classifier
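The last two bullets, incremental learning with stochastic gradient descent and automatic updates to a SQLite database and to the classifier, can be sketched together. This is a standard-library illustration under stated assumptions, not the app's code: the `tweets` table schema, the hashed feature space of 1024 columns, the learning rate, and the example tweets are all hypothetical. In scikit-learn the same idea is usually expressed with `SGDClassifier.partial_fit`.

```python
import math
import sqlite3
import zlib

N_FEATURES = 2 ** 10  # assumed size of the hashed feature space

def hashed_features(text, n_features=N_FEATURES):
    """Map whitespace tokens to column indices with a stable hash (hashing trick)."""
    counts = {}
    for tok in text.lower().split():
        idx = zlib.crc32(tok.encode("utf-8")) % n_features
        counts[idx] = counts.get(idx, 0) + 1
    return counts

class OnlineLogisticRegression:
    """Logistic regression trained one example at a time with SGD."""

    def __init__(self, n_features=N_FEATURES, learning_rate=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = learning_rate

    def predict_proba(self, feats):
        z = self.b + sum(self.w[i] * v for i, v in feats.items())
        z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, feats, y):
        """One gradient step on the log loss for a single labeled example."""
        error = self.predict_proba(feats) - y
        for i, v in feats.items():
            self.w[i] -= self.lr * error * v
        self.b -= self.lr * error

def record_and_learn(conn, model, text, label):
    """Store the labeled tweet in SQLite, then update the classifier with it."""
    conn.execute("INSERT INTO tweets (text, label) VALUES (?, ?)", (text, label))
    conn.commit()
    model.partial_fit(hashed_features(text), label)

conn = sqlite3.connect(":memory:")  # a file path would make the data persistent
conn.execute("CREATE TABLE tweets (text TEXT, label INTEGER)")
model = OnlineLogisticRegression()

# Made-up stream of (tweet, retweeted?) pairs arriving one at a time.
stream = [("new diabetes study shows promise", 1), ("ugh monday again", 0)] * 20
for text, label in stream:
    record_and_learn(conn, model, text, label)

n_rows = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
p_pos = model.predict_proba(hashed_features("new diabetes study shows promise"))
p_neg = model.predict_proba(hashed_features("ugh monday again"))
print(n_rows, round(p_pos, 3), round(p_neg, 3))
```

Because each update touches only one example, the model never needs the full dataset in memory. This is one reason a linear model such as logistic regression fits an out-of-core setting.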


Editor's notes

  1. http://54.191.168.240