Running Word2Vec with Chinese Wikipedia dump
Similarity
1. if two words have high similarity, they are strongly related
2. use Wikipedia to give the machine a general sense of our world
"魯夫" (Luffy) is the main character in "海賊王" (One Piece)
"東京" (Tokyo) is the capital city of "日本" (Japan)
Related Applications
1. voice-driven assistants
(Siri, Google Now, Microsoft Cortana)
2. e-commerce recommendation
(Alibaba, Rakuten)
3. question answering (IBM Watson)
4. others (Flipboard, SmartNews)
Build your own smart AI
My current progress
Download Wikipedia
1. https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
2. it contains both Traditional Chinese and Simplified Chinese articles
3. 1 GB file size, 230,000 articles, 150,000,000 words
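Because the dump is bz2-compressed, it can be inspected without unpacking it to disk first. The following is a minimal sketch, not part of the original slides, that streams a dump and counts article entries. It assumes each article starts with a `<page>` tag on its own line, and the helper names `count_pages` and `count_pages_in_dump` are made up for illustration.

```python
import bz2
import io


def count_pages(stream):
    """Count lines that open a <page> element in a decompressed dump stream."""
    return sum(1 for line in stream if line.strip() == "<page>")


def count_pages_in_dump(path):
    # bz2.open decompresses lazily, so the full XML never sits in memory.
    with bz2.open(path, mode="rt", encoding="utf-8") as stream:
        return count_pages(stream)


if __name__ == "__main__":
    # Tiny in-memory stand-in for the real dump (hypothetical content).
    sample = "<mediawiki>\n  <page>\n  </page>\n  <page>\n  </page>\n</mediawiki>\n"
    compressed = bz2.compress(sample.encode("utf-8"))
    with bz2.open(io.BytesIO(compressed), mode="rt", encoding="utf-8") as stream:
        print(count_pages(stream))  # prints 2
```

For the real file, `count_pages_in_dump("zhwiki-latest-pages-articles.xml.bz2")` would be the call to try, assuming the dump sits in the working directory.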
Preprocessing
1. use OpenCC to convert Simplified Chinese to Traditional Chinese
2. supports C, C++, Python, PHP, Java, Ruby, Node.js
3. compatible with Linux, Windows and Mac
4. “智能手机” -> “智慧手機”, “信息” -> “資訊”
5. you can try it on the website http://opencc.byvoid.com/
opencc -i zhwiki.txt -o twwiki.txt -c /usr/share/opencc/s2twp.json
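The examples in item 4 show that conversion is not purely character by character: "信息" becomes "資訊", a different word, not a glyph-for-glyph mapping. The toy sketch below illustrates that idea with greedy longest-match phrase lookup and a per-character fallback. It is not OpenCC's implementation, and the tables `PHRASES` and `CHARS` are tiny hypothetical stand-ins for its dictionaries.

```python
# Hypothetical toy tables; real OpenCC ships far larger dictionaries.
PHRASES = {"智能手机": "智慧手機", "信息": "資訊"}
CHARS = {"机": "機", "习": "習", "学": "學"}


def convert(text, phrases=PHRASES, chars=CHARS):
    """Greedy longest-match phrase conversion with per-character fallback."""
    max_len = max(len(p) for p in phrases)
    out = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in phrases:
                out.append(phrases[piece])
                i += size
                break
        else:
            # No phrase starts here: convert a single character.
            out.append(chars.get(text[i], text[i]))
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(convert("智能手机"))  # prints 智慧手機
    print(convert("信息"))      # prints 資訊
```

Trying phrases before single characters is what lets regional vocabulary win over a literal glyph mapping.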
Preprocessing
1. use gensim to extract articles from the Wikipedia dump
2. 2 GB of memory is required
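Preprocessing
import sys
from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    inp, outp = sys.argv[1:3]  # input dump path, output text path
    space = " "
    output = open(outp, 'w', encoding='utf-8')
    # lemmatize=False skips lemmatization (argument from older gensim releases)
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    # write one article per line, tokens separated by spaces
    for text in wiki.get_texts():
        output.write(space.join(text) + "\n")
    output.close()
gensim provides an iterator that extracts article text from the compressed wiki dump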
Segmentation
1. English uses delimiters (whitespace, periods, etc.) to separate words,
but not all languages follow this practice
2. the same sentence can be segmented in different ways: "下雨天/留客天/留我/不留" vs. "下雨/天留客/天留/我不留"
3. new words keep being coined (such as "小確幸", "物聯網")
Segmentation
Jieba supports full mode and accurate (default) mode; a separate search-engine mode is available through jieba.cut_for_search
# encoding=UTF-8
import jieba

if __name__ == '__main__':
    input_str = u'今天讓我們來測試中文斷詞'
    seg_list = jieba.cut(input_str, cut_all=True)  # full mode
    print(', '.join(seg_list))
    seg_list = jieba.cut(input_str, cut_all=False)  # accurate (default) mode
    print(', '.join(seg_list))
今天, 讓, 我, 們, 來, 測, 試, 中文, 斷, 詞
今天, 讓, 我們, 來, 測試, 中文, 斷詞
Segmentation
sometimes the result is a little funny
# encoding=UTF-8
import jieba

if __name__ == '__main__':
    input_str = u'張無忌來大都找我吧!哈哈哈哈哈哈'
    seg_list = jieba.cut(input_str, cut_all=False)
    print(', '.join(seg_list))
張無忌, 來, 大都, 找, 我, 吧, !, 哈哈哈, 哈哈哈
Segmentation
good dictionary, good result
# encoding=UTF-8
import jieba

if __name__ == '__main__':
    input_str = u'舒潔衛生紙買一送一'
    # first pass: default dictionary
    seg_list = jieba.cut(input_str, cut_all=False)
    print(', '.join(seg_list))
    # second pass: switch to the larger dictionary
    jieba.set_dictionary('./data/dict.txt.big')
    seg_list = jieba.cut(input_str, cut_all=False)
    print(', '.join(seg_list))
舒潔衛, 生紙, 買, 一送, 一
舒潔, 衛生紙, 買一送一
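To see why the dictionary matters so much, here is a small sketch of forward maximum matching, one of the simplest dictionary-based segmentation techniques. It is a swapped-in illustration, not the algorithm Jieba uses, and both word sets are hypothetical, chosen only to reproduce the two outputs above.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text by repeatedly taking the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


if __name__ == "__main__":
    text = "舒潔衛生紙買一送一"
    small = {"舒潔衛", "生紙", "一送"}      # hypothetical poor dictionary
    large = {"舒潔", "衛生紙", "買一送一"}  # hypothetical better dictionary
    print(", ".join(forward_max_match(text, small)))  # 舒潔衛, 生紙, 買, 一送, 一
    print(", ".join(forward_max_match(text, large)))  # 舒潔, 衛生紙, 買一送一
```

The segmenter itself is unchanged between the two runs; only the vocabulary differs, which is the point of the slide.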
Segmentation
verb? noun? adjective? adverb?
# encoding=UTF-8
import jieba.posseg as pseg

if __name__ == '__main__':
    input_str = u'今天讓我們來測試中文斷詞'
    seg_list = pseg.cut(input_str)
    # each result carries the token and its part-of-speech flag
    for seg, flag in seg_list:
        print(u'{}:{}'.format(seg, flag))
今天:t 讓:v 我們:r 來:v 測試:vn 中文:nz 斷詞:n
Segmentation
keyword extraction
# encoding=UTF-8
import jieba
import jieba.analyse

if __name__ == '__main__':
    input_str = u'我的故鄉在台灣, I am Taiwanese'
    jieba.set_dictionary('./data/dict.txt.big')
    # first pass: no stop words
    seg_list = jieba.analyse.extract_tags(input_str, topK=3)
    print(', '.join(seg_list))
    # second pass: filter stop words before ranking
    jieba.analyse.set_stop_words('./data/stop_words.txt')
    seg_list = jieba.analyse.extract_tags(input_str, topK=3)
    print(', '.join(seg_list))
台灣, am, 故鄉
台灣, 故鄉, Taiwanese
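jieba.analyse.extract_tags ranks words by TF-IDF, so the effect of the stop-word list can be mimicked with a few lines of plain Python. This is only a sketch: the function `extract_keywords`, the IDF weights and the stop-word set are all hypothetical and were picked so that the ranking matches the two outputs above.

```python
from collections import Counter


def extract_keywords(tokens, idf, stop_words=frozenset(), top_k=3, default_idf=1.0):
    """Rank tokens by term frequency times inverse document frequency."""
    kept = [t for t in tokens if t not in stop_words]
    counts = Counter(kept)
    total = sum(counts.values())
    scores = {w: (c / total) * idf.get(w, default_idf) for w, c in counts.items()}
    # Sort by descending score, then alphabetically for a stable order.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


if __name__ == "__main__":
    tokens = ["我", "的", "故鄉", "在", "台灣", "I", "am", "Taiwanese"]
    # Hypothetical IDF weights: rarer words get larger values.
    idf = {"台灣": 5.0, "am": 4.5, "故鄉": 4.0, "Taiwanese": 3.5,
           "I": 1.0, "我": 0.5, "在": 0.3, "的": 0.1}
    print(", ".join(extract_keywords(tokens, idf)))
    print(", ".join(extract_keywords(tokens, idf, stop_words={"I", "am"})))
```

Dropping stop words before scoring is what lets "Taiwanese" move into the top three in the second call.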
Finding Similarity
1. How do we do that? Word2Vec is the superstar!
Word2Vec
transforms each word into a vector; the distance between two vectors indicates their degree of similarity
("首爾" = Seoul, "南韓" = South Korea, "東京" = Tokyo, "日本" = Japan)
distance(vector("首爾"), vector("日本")) > distance(vector("東京"), vector("日本"))
vector("日本") - vector("東京") + vector("首爾") ≈ vector("南韓")
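The comparison and the analogy can be tried by hand on made-up vectors. The sketch below uses cosine similarity as the closeness measure and computes the country-minus-capital offset explicitly, i.e. vector("日本") - vector("東京") + vector("首爾"). The three-dimensional vectors are hypothetical and only chosen to make the geometry obvious; a trained model would have around 100 dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(vectors, a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Hypothetical toy embeddings: axis 0 = Japan, axis 1 = Korea, axis 2 = "is a capital".
VECTORS = {
    "日本": [1.0, 0.0, 0.0],
    "東京": [1.0, 0.0, 1.0],
    "南韓": [0.0, 1.0, 0.0],
    "首爾": [0.0, 1.0, 1.0],
    "蘋果": [0.5, 0.5, -1.0],
}

if __name__ == "__main__":
    # Tokyo is closer to Japan than Seoul is.
    print(cosine(VECTORS["東京"], VECTORS["日本"]) > cosine(VECTORS["首爾"], VECTORS["日本"]))
    print(analogy(VECTORS, "日本", "東京", "首爾"))  # prints 南韓
```

Excluding the three query words from the candidates is a common convention, since one of them is often the nearest vector to the result.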
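Word2Vec
word2vec takes a target word and asks the model to predict its surrounding context
在日本,[ 青森 的 "蘋果" 又 甜 ]又好吃
(In Japan, Aomori's "apples" are sweet and tasty)
今年,新版的[ Macbook 是 "蘋果" 發表 的 ]重點之一
(This year, the new MacBook is one of the highlights "Apple" announced)
"青森" and "Macbook" both have high similarity with "蘋果"
from training on the preceding window, "青森" and "日本" also have high similarity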
Word2Vec
word2vec uses a skip-gram neural network to predict the neighboring context
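The training data for a skip-gram model is just a list of (target, context) pairs taken from a sliding window. The sketch below generates those pairs for an already segmented sentence; the function name `skipgram_pairs` and the window size of 2 are assumptions for illustration, and the neural network itself is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs for every word within the window."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, tokens[j]


if __name__ == "__main__":
    sentence = ["青森", "的", "蘋果", "又", "甜"]
    for target, context in skipgram_pairs(sentence):
        if target == "蘋果":
            print(target, "->", context)
```

With a window of 2, the target "蘋果" is paired with "青森", "的", "又" and "甜", which matches the bracketed window in the apple example.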
Training Word2Vec model with gensim
the words are already preprocessed and separated by whitespace.
# encoding=UTF-8
import multiprocessing
import sys

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    inp = sys.argv[1]  # path to the segmented corpus, one article per line
    # size is the vector dimensionality (named vector_size in gensim 4.x)
    model = Word2Vec(LineSentence(inp),
                     size=100,
                     window=10,
                     min_count=10,
                     workers=multiprocessing.cpu_count())
this did not work for me: gensim's word2vec ran out of memory
Move to Spark MLlib
1. Spark offers over 80 operators that make it easy to build parallel applications
2. Databricks used Spark to set a world record in the 2014 Daytona GraySort (100 TB) sort benchmark competition
3. MLlib is Spark's machine learning library.
Spark cluster overview
1. Spark has a master-slave architecture, similar to YARN
2. the cluster manager is the master; it handles resource management and slave health management.
3. when you launch an application,
the master assigns a slave to be the driver.
the driver requests resources from the master,
executes the main function and assigns tasks to the slaves
Spark cluster deployment
1. use the Linode API to create and boot new instances rapidly
2. use a standalone Spark cluster
it can also be deployed on a Mesos or YARN cluster
3. install Java and Scala, unpack the pre-built Spark package, and finally launch the slave executors!
4. use Ansible to deploy the Spark executors and use LZ4 to speed up decompressing the pre-built Spark package
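Training Word2Vec model with a Spark cluster
RDD is the basic abstraction in Spark.
It represents an immutable, partitioned collection of elements
that can be operated on in parallel
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.rdd.RDD

// read the corpus into 5 partitions and keep it in memory
val input: RDD[String] = sc.textFile(inp, 5).cache()
// tokenize is the author's helper (not shown) that splits an article into words
val token: RDD[Seq[String]] = input.map(article => tokenize(article))
val word2vec = new Word2Vec()
word2vec.setNumPartitions(5)
val model = word2vec.fit(token)
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs://....")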
Querying Word2Vec model with a Spark cluster
val model = sc.objectFile[Word2VecModel]("hdfs://....").first()
val synonyms = model.findSynonyms("熱火", 10)
for ((synonym, cosineSim) <- synonyms) {
  println(synonym + ":" + cosineSim)
}
load the model from HDFS
compared with model training, finding similar words requires few resources
Query Word2Vec with a Spark cluster
Example of "man"
Example of "luffy" (main character of the One Piece comic)
Example of "cell phone"
Thank you

Running Word2Vec with Chinese Wikipedia dump

  • 1. Running Word2Vec with Chinese Wikipedia dump
  • 2. Similarity 1. if two words have high similarity, it means they have strong relationship 2. use wikipedia to let machine has general sense about our world "魯夫" is main charactrer in "海賊王" "東京" is capital city in "日本"
  • 3. Related Application 1. voice-driven assistants (Siri, Google Now, Microsoft Cortana) 2. e-commerce recommandation (Alibaba, Rakuten) 3. question answering(IBM Waston) 4. others(Flipboard, SmartNews)
  • 5. Build you own smart AI
  • 6.
  • 8. Download Wikipedia 1. https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest- pages-articles.xml.bz2 2. it contains traditional chinese and simplified chinese articles 3. 1G file size, 230,000 articles, 150,000,000 words
  • 9. Preprocessing 1. use OpenCC to translate from simplified chinese to traditional chinese 2. support C、C++、Python、PHP、Java、Ruby、Node.js 3. compatible with Linux, Windows and Mac 4. “智能手机” -> “智慧手機”, “信息” -> “資訊” 5. you can play it on the website http://opencc.byvoid.com/ opencc -i zhwiki.txt -o twwiki.txt -c /usr/share/opencc/s2twp.json
  • 10. Preprocessing 1. use gensim to extract article from Wikipedia dump 2. 2G memory is required
  • 11. Preprocessing from gensim.corpora import WikiCorpus if __name__ == '__main__': inp, outp = sys.argv[1:3] output = open(outp,'w') wiki = WikiCorpus(inp, lemmatize=False, dictionary={}) for text in wiki.get_texts(): output.write(space.join(text) + "n") output.close() gensim provides iterator to extract sentences from compressed wiki dump
  • 12. Segmentation 1. english uses some notation(whitespace, dot, etc) to separate words, but not all language follow this practice 2. "下雨天/留客天/留我/不留", "下雨/天留客/天留/我不留" 3. new word keep to be generated(such as "小確幸", "物聯網")
  • 13. Segmentation
      Jieba supports full mode and accurate (default) mode; a separate search-engine mode is available through jieba.cut_for_search.
          #encoding=UTF-8
          import jieba
          if __name__ == '__main__':
              input_str = u'今天讓我們來測試中文斷詞'
              seg_list = jieba.cut(input_str, cut_all=True)   # full mode
              print(', '.join(seg_list))
              seg_list = jieba.cut(input_str, cut_all=False)  # accurate (default) mode
              print(', '.join(seg_list))
      Output:
          今天, 讓, 我, 們, 來, 測, 試, 中文, 斷, 詞
          今天, 讓, 我們, 來, 測試, 中文, 斷詞
  • 14. Segmentation
      Sometimes the result is a little funny:
          #encoding=UTF-8
          import jieba
          if __name__ == '__main__':
              input_str = u'張無忌來大都找我吧!哈哈哈哈哈哈'
              seg_list = jieba.cut(input_str, cut_all=False)
              print(', '.join(seg_list))
      Output:
          張無忌, 來, 大都, 找, 我, 吧, !, 哈哈哈, 哈哈哈
  • 15. Segmentation
      A good dictionary gives a good result:
          #encoding=UTF-8
          import jieba
          if __name__ == '__main__':
              input_str = u'舒潔衛生紙買一送一'
              # first pass: default dictionary
              seg_list = jieba.cut(input_str, cut_all=False)
              print(', '.join(seg_list))
              # second pass: the larger dictionary with better Traditional Chinese coverage
              jieba.set_dictionary('./data/dict.txt.big')
              seg_list = jieba.cut(input_str, cut_all=False)
              print(', '.join(seg_list))
      Output:
          舒潔衛, 生紙, 買, 一送, 一
          舒潔, 衛生紙, 買一送一
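To see why the dictionary matters so much, here is a minimal sketch of greedy forward maximum matching. This is a deliberately simple stand-in, not jieba's actual algorithm, and both toy dictionaries are hypothetical, chosen only so that the two runs mirror the slide's output.

```python
def max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position take the longest
    dictionary word, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

text = "舒潔衛生紙買一送一"
small_dict = {"舒潔衛", "生紙", "一送"}      # hypothetical poor dictionary
big_dict = {"舒潔", "衛生紙", "買一送一"}    # hypothetical better dictionary

print(", ".join(max_match(text, small_dict)))  # 舒潔衛, 生紙, 買, 一送, 一
print(", ".join(max_match(text, big_dict)))    # 舒潔, 衛生紙, 買一送一
```

The segmenter itself is unchanged between the two calls; only the word list differs, which is the point of the slide.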
  • 16. Segmentation
      Part-of-speech tagging: verb? noun? adjective? adverb?
          #encoding=UTF-8
          import jieba.posseg as pseg
          if __name__ == '__main__':
              input_str = u'今天讓我們來測試中文斷詞'
              seg_list = pseg.cut(input_str)
              # each item unpacks into the word and its part-of-speech flag
              for seg, flag in seg_list:
                  print(u'{}:{}'.format(seg, flag))
      Output:
          今天:t
          讓:v
          我們:r
          來:v
          測試:vn
          中文:nz
          斷詞:n
  • 17. Segmentation
      Keyword extraction:
          #encoding=UTF-8
          import jieba
          import jieba.analyse
          if __name__ == '__main__':
              input_str = u'我的故鄉在台灣, I am Taiwanese'
              jieba.set_dictionary('./data/dict.txt.big')
              # first pass: no custom stop words
              seg_list = jieba.analyse.extract_tags(input_str, topK=3)
              print(', '.join(seg_list))
              # second pass: filter out the words listed in the stop-word file
              jieba.analyse.set_stop_words('./data/stop_words.txt')
              seg_list = jieba.analyse.extract_tags(input_str, topK=3)
              print(', '.join(seg_list))
      Output:
          台灣, am, 故鄉
          台灣, 故鄉, Taiwanese
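jieba's extract_tags ranks words by TF-IDF. The sketch below shows that scoring idea and the effect of a stop-word list. It is not jieba's implementation: the function name, the token list and every IDF value are hypothetical, picked only to reproduce the ordering on the slide.

```python
from collections import Counter

def extract_tags_sketch(tokens, idf, top_k=3, stop_words=frozenset(), default_idf=1.0):
    """Score each word as term frequency times inverse document frequency
    and return the top_k highest-scoring words."""
    kept = [t for t in tokens if t not in stop_words]
    counts = Counter(kept)
    total = len(kept)
    scores = {w: (c / total) * idf.get(w, default_idf) for w, c in counts.items()}
    ranked = sorted(scores, key=lambda w: scores[w], reverse=True)
    return ranked[:top_k]

# Hypothetical tokens and IDF weights (not taken from jieba's data files)
tokens = ["故鄉", "台灣", "am", "Taiwanese"]
idf = {"台灣": 9.0, "am": 8.0, "故鄉": 7.0, "Taiwanese": 6.0}

print(", ".join(extract_tags_sketch(tokens, idf)))                     # 台灣, am, 故鄉
print(", ".join(extract_tags_sketch(tokens, idf, stop_words={"am"})))  # 台灣, 故鄉, Taiwanese
```

Removing "am" before scoring frees a slot in the top three, which is why "Taiwanese" appears only in the second result.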
  • 18. Finding Similarity
      1. How do we do that? Word2Vec is the star of the show.
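  • 19. Word2Vec
      Word2Vec transforms each word into a vector; the distance between two vectors indicates how similar the words are.
          |vector("首爾") - vector("日本")| > |vector("東京") - vector("日本")|
          vector("日本") - vector("東京") + vector("首爾") ≈ vector("南韓")
      Seoul (首爾) is farther from Japan (日本) than Tokyo (東京) is, and applying the capital-to-country offset to Seoul lands near South Korea (南韓).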
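A toy illustration of the vector arithmetic, assuming hand-made 3-D vectors whose axes loosely mean "Japan", "Korea" and "capital city". Real word2vec vectors are learned, have around 100 dimensions and no such readable axes, so every number here is hypothetical.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(sum(x * x for x in a))

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

# Hypothetical embeddings: (Japan axis, Korea axis, capital-city axis)
vectors = {
    "日本": (1.0, 0.0, 0.0),
    "東京": (1.0, 0.0, 1.0),
    "南韓": (0.0, 1.0, 0.0),
    "首爾": (0.0, 1.0, 1.0),
}

# Seoul is farther from Japan than Tokyo is
dist_seoul = norm(sub(vectors["首爾"], vectors["日本"]))
dist_tokyo = norm(sub(vectors["東京"], vectors["日本"]))
print(dist_seoul > dist_tokyo)  # True

# Country minus its capital, plus another capital, lands near the other country
query = add(sub(vectors["日本"], vectors["東京"]), vectors["首爾"])
nearest = max(vectors, key=lambda w: cosine(query, vectors[w]))
print(nearest)  # 南韓
```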
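  • 20. Word2Vec
      Given a target word, word2vec is trained to predict the surrounding context window:
          在日本,[ 青森 的 "蘋果" 又 甜 ]又好吃 (In Japan, Aomori's "apples" are sweet and tasty)
          今年,新版的[ Macbook 是 "蘋果" 發表 的 ]重點之一 (This year, the new Macbook is one of the highlights "Apple" announced)
      "青森" and "Macbook" both have high similarity with "蘋果".
      From training on the preceding window, "青森" and "日本" also have high similarity.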
  • 21. Word2Vec
      word2vec uses a skip-gram neural network to predict the neighboring context.
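The windowing behind skip-gram can be sketched as generating (target, context) training pairs. This shows only the pair generation, not the neural network, and it assumes a fixed window of 2 and a hand-segmented, punctuation-free version of the example sentence; real implementations typically vary the window size per word.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within `window`
    positions of each target word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# Hypothetical segmentation of the example sentence, punctuation dropped
sentence = ["在", "日本", "青森", "的", "蘋果", "又", "甜", "又", "好吃"]
pairs = skipgram_pairs(sentence, window=2)

apple_context = [c for t, c in pairs if t == "蘋果"]
print(apple_context)              # ['青森', '的', '又', '甜']
print(("青森", "日本") in pairs)  # True
```

The second check is the "previous window" effect: when "青森" is the target, "日本" falls inside its window, so the two words are pulled together as well.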
  • 22. Training Word2Vec model by gensim
      The words are already preprocessed and separated by whitespace.
          #encoding=UTF-8
          import sys
          import multiprocessing
          from gensim.models import Word2Vec
          from gensim.models.word2vec import LineSentence
          if __name__ == '__main__':
              inp = sys.argv[1]  # path to the segmented, whitespace-separated corpus
              # size= is the vector dimensionality in gensim versions before 4.0
              # (gensim 4.x renamed it to vector_size=)
              model = Word2Vec(LineSentence(inp), size=100, window=10, min_count=10,
                               workers=multiprocessing.cpu_count())
      This did not work for me: gensim's word2vec ran out of memory.
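A back-of-the-envelope way to reason about the memory problem, following the rule of thumb in gensim's documentation that the model keeps roughly three vocabulary-by-dimension matrices of 4-byte floats in RAM. The vocabulary size in the second call is a hypothetical figure, not a measurement of this corpus, and the estimate ignores the vocabulary dictionary and the corpus itself.

```python
def word2vec_param_bytes(vocab_size, vector_size, matrices=3, bytes_per_float=4):
    """Rough size of the word2vec parameter matrices in bytes."""
    return vocab_size * vector_size * bytes_per_float * matrices

# Example from the gensim documentation: 100,000 words, 200 dimensions
doc_example = word2vec_param_bytes(100_000, 200)
print(doc_example)  # 240000000

# Hypothetical: 1,000,000 distinct words kept after min_count, 100 dimensions
hypothetical = word2vec_param_bytes(1_000_000, 100)
print(hypothetical / 1024 ** 3)  # size in GiB
```

Raising min_count shrinks the vocabulary and therefore the estimate, which is one knob to try before moving to a cluster.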
  • 23. Move to Spark MLlib
      1. Spark offers over 80 operators that make it easy to build parallel applications.
      2. Databricks used Spark to set a world record in the 2014 Daytona GraySort (100 TB) sort benchmark competition.
      3. MLlib is Spark's machine learning library.
  • 24. Spark cluster overview
      1. Spark uses a master-slave architecture, similar to YARN.
      2. The cluster manager is the master; it handles resource management and slave health management.
      3. When you launch an application, the master assigns a slave to be the driver. The driver requests resources from the master, executes the main function and assigns tasks to the slaves.
  • 25. Spark cluster deployment
      1. Use the Linode API to create and boot new instances rapidly.
      2. Use a standalone Spark cluster; Spark can also be deployed on a Mesos or YARN cluster.
      3. Install Java and Scala, copy over the pre-built Spark package, and finally launch the slave executor.
      4. Use Ansible to deploy the Spark executors, and use LZ4 to speed up decompression of the pre-built Spark package.
  • 26. Training Word2Vec model by Spark cluster
      RDD is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel.
          import org.apache.spark.rdd.RDD
          import org.apache.spark.mllib.feature.Word2Vec
          // sc is the SparkContext and inp the corpus path;
          // tokenize is a helper that is not shown on the slide
          val input: RDD[String] = sc.textFile(inp, 5).cache()
          val token: RDD[Seq[String]] = input.map(article => tokenize(article))
          val word2vec = new Word2Vec()
          word2vec.setNumPartitions(5)
          val model = word2vec.fit(token)
          sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs://....")
  • 27. Querying Word2Vec model by Spark cluster
          import org.apache.spark.mllib.feature.Word2VecModel
          // load the model from HDFS
          val model = sc.objectFile[Word2VecModel]("hdfs://....").first()
          val synonyms = model.findSynonyms("熱火", 10)
          for ((synonym, cosineSim) <- synonyms) {
            println(synonym + ":" + cosineSim)
          }
      The model is loaded from HDFS. Compared with training the model, finding similar words needs few resources.
  • 28. Query Word2Vec by Spark cluster
  • 29.
  • 31. Example of "Luffy" (the main character of the One Piece comic)
  • 32.
  • 34.