RyotaHiguchi_Manpu2022.pdf

Analyzing Textual Sources Attributes of Comics
Based on Word Frequency and Meaning
Kansai University
◎Ryota Higuchi, Ryosuke Yamanishi, Mitsunori Matsushita

Abstract
• The purpose of this research
- Constructing a vocabulary set that characterizes comics.
• This paper
- Focused on two different textual sources : Explanations and Reviews
- Distinguished which vocabulary was common to both sources and which differed

Introduction
• A huge number of new comic books are published
• How does a user choose comics?
- Retrieves them using web services
- Typical queries = Meta-information (e.g., the title "ONE PIECE" or the genre "action/adventure")
• A user cannot retrieve comics based on a story (e.g., "Hero defeats the villain")
Meta-information is not sufficient for retrieving comics based on user preferences.

Differences of Textual Sources
• Multiple sources of information on the same topic
Ex.) Texts on the web
-Explanations :
-Reviews :
-Outlines :
-Q&A :

-Explanations
-Reviews
-Outlines
-Q&A
Explanations :
• Contain an overview of the work
-The character's features
-Significant episodes
• From Wikipedia

-Explanations
-Reviews
-Outlines
-Q&A
Reviews :
• Contain impressions and evaluations
-What readers liked/disliked
-Feedback
• From Amazon

-Explanations : Overview of the work
-Reviews : Reader's impressions and evaluations
-Outlines : Overview of the work
-Q&A : Sharing the knowledge
Whilst these texts from different information sources represent the same content,
the textual details vary from source to source.

Selecting Information Sources is Difficult.
• We should conduct a study using suitable information sources.
Problem
• There are few cases in which the sources are selected for quantitative reasons.
Differences in textual sources in comics have received little attention.

Selecting Information Sources
Are you selecting information sources based on your experience?
"We've found a new use for web text, using trendy AI!!!"
"Well, the results are so-so... but is this the correct data for input???"
Information sources must be selected by discussing quantitative reasons.
• We should conduct a study using suitable information sources.
Purpose
Distinguishing which vocabulary is common/different between 2 textual sources
<Providing two advantages>
1. The different attributes : selecting appropriate information resources
2. The common attributes : accessing a large amount of data by combining different types of sources

Analysis Method
Analyzing how frequently words with which meanings appear in each textual source
1. Datasets Construction
- Two types of sources : Explanations and Reviews about comic books
2. Construction of Classification Dictionary
3. Classification of Words Semantically
- Calculating word frequencies by using the dictionary

1. Datasets Construction : Explanations
• Information sources : Internet encyclopedias
• Data size : 6,250 points, 2,067 comic characters
• Texts describing the comic characters in detail
Website : Features
-Wikipedia : Famous online encyclopedia
-Niconico Pedia, Pixiv encyclopedia : Character dialogue and net slang
-Aniotawiki : Some description rules

1. Datasets Construction : Reviews
• Information sources : review website "Sakuhin Database"
-Different from shopping websites like Amazon
• The purpose of this website
-Evaluating works and collecting information
• Texts about popular comics were included.
• Data size : 6,250 points
1. Datasets Construction : Data Cleaning
• Data cleaning
-Kept only common nouns
-Removed stop words based on SlothLib
-Removed low-frequency words
• The total number of distinct words
-Explanations : 7,136 words
-Reviews : 3,092 words
• Train data : 10,000 points
• Test data : 2,500 points
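
The cleaning steps on this slide could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes the texts have already been tokenized and part-of-speech tagged by a morphological analyzer (the slides do not name one), and the tag name `common_noun`, the two stop words, and the frequency threshold are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical stop-word list; the slides use the SlothLib list instead.
STOP_WORDS = {"こと", "もの"}
# Hypothetical threshold; the slides do not state the cut-off they used.
MIN_FREQUENCY = 2


def clean(documents):
    """Clean documents given as lists of (surface, part_of_speech) tokens."""
    # Keep only common nouns that are not stop words.
    filtered = [
        [word for word, pos in doc if pos == "common_noun" and word not in STOP_WORDS]
        for doc in documents
    ]
    # Count corpus-wide frequencies, then drop low-frequency words.
    counts = Counter(word for doc in filtered for word in doc)
    return [[word for word in doc if counts[word] >= MIN_FREQUENCY] for doc in filtered]


documents = [
    [("兄", "common_noun"), ("こと", "common_noun"), ("走る", "verb")],
    [("兄", "common_noun"), ("姉", "common_noun"), ("ルフィ", "proper_noun")],
]
cleaned = clean(documents)
print(cleaned)
```

In this toy input, only 兄 (brother) survives: it is a common noun, it is not a stop word, and it appears twice across the corpus.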

2. Construction of Classification Dictionary
• To analyze what meanings of words exist, word class sets are obtained using word embedding and k-means clustering.
-The elbow method shows 63 classes.
-The average number of words per class : 118.8
Examples of words in a class
-hard battle (激戦), comrades in arms (戦友), first game (初戦)
-black (黒), white (白), brown (褐色), complexion (顔色)
-idol (アイドル), shortcoming (コンプレックス), gym (ジム), position (ポジション)
The resulting class sets of words form a class dictionary
to use for content analysis of comics.
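
As a rough illustration of how a class dictionary can come out of clustering, the sketch below runs a small hand-written k-means on toy 2-dimensional vectors and prints the within-cluster sum of squares for several values of k, which is the quantity the elbow method inspects. Everything here is an assumption for illustration: the toy vectors stand in for real word embeddings, the deterministic farthest-point initialization is a convenience of this sketch, and k = 3 replaces the 63 classes reported on the slide.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vector, centroids):
    """Index of the centroid closest to the vector."""
    return min(range(len(centroids)), key=lambda c: squared_distance(vector, centroids[c]))


def init_centroids(vectors, k):
    # Deterministic farthest-point initialization (a choice made for this sketch).
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(squared_distance(v, c) for c in centroids))
        )
    return [list(c) for c in centroids]


def kmeans(vectors, k, iterations=20):
    """Lloyd's algorithm; returns (labels, within-cluster sum of squares)."""
    centroids = init_centroids(vectors, k)
    for _ in range(iterations):
        labels = [nearest(v, centroids) for v in vectors]
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    labels = [nearest(v, centroids) for v in vectors]
    inertia = sum(squared_distance(v, centroids[l]) for v, l in zip(vectors, labels))
    return labels, inertia


# Toy stand-ins for word embeddings (hypothetical values).
embeddings = {
    "black": [0.0, 0.1], "white": [0.1, 0.0], "brown": [0.1, 0.1],
    "parent": [5.0, 5.1], "brother": [5.1, 5.0], "sister": [5.0, 5.0],
    "idol": [9.0, 0.0], "gym": [9.1, 0.1], "position": [9.0, 0.2],
}
words = list(embeddings)
vectors = [embeddings[w] for w in words]

# Elbow method: look for the k after which the inertia stops dropping sharply.
inertias = {k: kmeans(vectors, k)[1] for k in range(1, 6)}
for k, inertia in inertias.items():
    print(k, round(inertia, 3))

# Build the class dictionary for the chosen k.
labels, _ = kmeans(vectors, 3)
class_dictionary = {}
for word, label in zip(words, labels):
    class_dictionary.setdefault(label, set()).add(word)
print(class_dictionary)
```

With real embeddings, a library implementation of k-means with several random restarts would be the more usual choice. The hand-written version is used here only to keep the example self-contained.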

3. Classification of Frequently Appearing Words
Classifying frequent words using the class dictionary
A. Input : test data
$t_0$ = [vitality, bravery, male]
$t_1$ = [impression, vitality, anime]
…
$t_{2499}$ = [smile, captain, sister]
B. 63 classes dictionary
Class 0 : vitality, bravery, vigor, smile
Class 1 : male, female, sex, affinity
…
Class 62 : sister, brother, cousin, parent
C. Output : 63-bit vectors (positions correspond to classes 0, 1, …, 62)
$b_0$ = [1, 1, …, 0]
$b_1$ = [1, 0, …, 0]
…
$b_{2499}$ = [1, 0, …, 1]
Example : One test data $t_0$ contains the word "male". Class 1 of the dictionary also contains the word "male". $t_0$ therefore contains an element of class 1, so we put "1" in the corresponding location of $b_0$.
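
The encoding on this slide can be sketched in a few lines. The three classes below are the ones shown on the slide; the function name `encode` and the handling of words that belong to no class (they are simply ignored) are assumptions of this sketch.

```python
NUM_CLASSES = 63

# Only the three classes shown on the slide; the other 60 are omitted here.
class_dictionary = {
    0: {"vitality", "bravery", "vigor", "smile"},
    1: {"male", "female", "sex", "affinity"},
    62: {"sister", "brother", "cousin", "parent"},
}


def encode(words, class_dictionary, num_classes=NUM_CLASSES):
    """Return a binary vector with a 1 for every class that contains one of the words."""
    word_to_class = {w: c for c, ws in class_dictionary.items() for w in ws}
    bits = [0] * num_classes
    for word in words:
        class_id = word_to_class.get(word)
        if class_id is not None:  # words outside the dictionary are ignored
            bits[class_id] = 1
    return bits


b_0 = encode(["vitality", "bravery", "male"], class_dictionary)
b_1 = encode(["impression", "vitality", "anime"], class_dictionary)
b_2499 = encode(["smile", "captain", "sister"], class_dictionary)
print(b_0[0], b_0[1], b_0[62])
print(b_1[0], b_1[1], b_1[62])
print(b_2499[0], b_2499[1], b_2499[62])
```

The three printed triples are the bits for classes 0, 1, and 62 of each vector, matching the positions shown in the slide's output panel.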

Discussion Points and Evaluation
• Calculating the relative difference using the binary array
-It was defined as the absolute value of the difference between the ratios in each source.
Ex.)
- If both sources have the same ratio, the relative difference is 0%.
- If the ratio is biased entirely to one side, the relative difference is 100%.
• Discussion Points
-Which classes the frequent words correspond to in each source
-What meanings the words have in the classes with a big difference between the two sources
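
The slide does not give the exact formula, so the sketch below shows only one possible reading that reproduces both examples (0% for equal ratios, 100% when a class appears in one source only): the absolute difference of the per-class hit counts divided by their sum. The function names are hypothetical, and the sketch assumes both sources contribute the same number of test items. With unequal sizes the counts would first need to be turned into per-source ratios.

```python
def class_counts(bit_vectors, num_classes):
    """Count, per class, how many binary vectors have a 1 at that position."""
    counts = [0] * num_classes
    for bits in bit_vectors:
        for class_id, bit in enumerate(bits):
            counts[class_id] += bit
    return counts


def relative_difference(count_a, count_b):
    """Assumed definition: |a - b| / (a + b), expressed as a percentage."""
    total = count_a + count_b
    if total == 0:
        return 0.0  # the class never appears in either source
    return abs(count_a - count_b) / total * 100


# Toy binary vectors with 3 classes instead of 63 (hypothetical data).
explanation_vectors = [[1, 1, 0], [1, 0, 0]]
review_vectors = [[1, 0, 1], [1, 0, 1]]

explanation_counts = class_counts(explanation_vectors, 3)
review_counts = class_counts(review_vectors, 3)
differences = [
    relative_difference(e, r) for e, r in zip(explanation_counts, review_counts)
]
print(differences)
```

In the toy data, class 0 appears equally often in both sources, while classes 1 and 2 each appear in one source only.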

Results : Frequently Appearing Classes in Explanations
Class words : Relative difference (%)
-body (身), body length (身長), familiar (身近) : 74.2
-parent (親), brother (兄), sister (姉) : 63.7
Examples
• In Wikipedia for Tanjiro Kamado,
"His body length is 165cm."
• In Pixiv encyclopedia for Sabo,
"He is Luffy's brother."
There are many words that
describe a character's features and the content of works.
Application examples :
constructing a vocabulary set
that characterizes comics

Results : Frequently Appearing Classes in Reviews
Class words : Relative difference (%)
-comic (漫画), movie (映画), illustration (イラスト) : 35.5
-work (作品), cartoonist (作家), masterpiece (傑作) : 19.8
Examples
• "I love the illustrations this cartoonist draws!"
• "This cartoonist's style will have
a great influence on future generations."
There are many words that
represent meta-information about comic works.
Application examples : research on genre analysis, topic classification

Result : Common Attribute
• Most of the test data (73% of the total) corresponded to classes containing many foreign words.
-hairstyle (ヘアスタイル), check (チェック), plastic model (プラモ),
character (キャラ), animation (アニメ), diet (ダイエット), etc.
• The words in this class are written in Katakana.
• Japanese language
-3 kinds of characters in Japanese : "Hiragana", "Katakana", "Kanji"
-Katakana is used to describe something from foreign countries.

Result : Common Attribute
• The class is not a semantic set.
-hairstyle (ヘアスタイル), check (チェック), plastic model (プラモ),
character (キャラ), animation (アニメ), diet (ダイエット), etc.
• The reason for this result
-There are new or unknown words
in explanations and reviews.
• Improvement Plan
-Reconsidering word embedding models and the training corpus

Summary
• Background : the same content, but the textual details vary from source to source
• Problems : Differences of sources have received little attention
• Purpose : distinguishing which vocabulary was common/different between 2 textual sources
• Method : Semantic classification of frequently appearing words
• Conclusion :
-Explanations : describe the content of comics
-Reviews : represent meta-information about works
Thank you for your attention.
