[Paper] EDA : easy data augmentation techniques for boosting performance on text classification tasks

Susang Kim(healess1@gmail.com)
Data Augmentation in NLP
EDA: Easy Data Augmentation Techniques for Boosting
Performance on Text Classification Tasks

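Data Augmentation in NLP (2017)
https://www.slideshare.net/healess/python-tensorflow-ai-chatbot
Text Augmentation
W2V (Embedding) + CNN
Classification (2017, before BERT)
Example utterances: 피자주문 하고 싶어 ("I want to order a pizza") / 여행 정보 알려줘 ("Tell me travel information") / 호텔 예약해줘 ("Book a hotel")
For the three intents (order, information, reservation), the data was augmented with
Word Similarity (피자 "pizza" and its spelling variant 피쟈 / 정보 "information" and 갈만한데 "places worth visiting"),
Random Swap, and Random Delete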

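Data Augmentation (NLP vs CV)
In computer vision, augmentation is an essential component: scaling, flipping, and rotation
alone are enough to make a model more robust.
In NLP, every language has its own characteristics, and merely changing the position of a word
can change the meaning completely, so augmentation is a comparatively difficult area.
Ambiguous phrases: “눈을 보다” ("to look at the eyes" or "to look at the snow"), “카메라를 찍다” (literally "to shoot the camera")
Are You Happy <-> You Are Happy
아버지가 방에 들어가신다 ("Father enters the room") <-> 아버지 가방에 들어가신다 ("Father enters the bag")
수상하다 ("to be suspicious"), 수상 스키를 타다 ("to go water skiing")
나는 너를 좋아해 ("I like you") <-> 너는 나를 좋아해 ("You like me")
https://www.kdnuggets.com/2018/05/data-augmentation-deep-learning-limited-data.html
For images, augmentations such as Flip, Scaling, Crop, Translation, Rotation,
and Gaussian Noise are enough to improve performance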

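A paper presented at EMNLP 2019. For text classification tasks, it uses a set of
text-editing techniques called EDA (Easy Data Augmentation) to improve the performance
of existing CNN/RNN models on five benchmark classification tasks.
Without external data or a generative model, it works from a small amount of data and
eases the limitations of NLP data augmentation with simple augmentation alone.
※ Transformer models (BERT, etc.) are not used; EDA is applied only to
classification tasks with CNN and RNN models.
Not Exploratory Data Analysis
(take care not to confuse the abbreviation with exploratory data analysis)
Easy Data Augmentation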

Introduction
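For each sentence in a text classification task, one of four operations is chosen at random and applied:
SR (Synonym Replacement), RI (Random Insertion), RS (Random Swap), and RD (Random Deletion).
To compensate, we vary the number of words changed, n, for SR, RI, and RS based on
the sentence length l with the formula n = αl, where α is a parameter that indicates the
percent of the words in a sentence that are changed (we use p = α for RD).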
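
The four operations and the n = αl rule can be sketched as follows. This is a minimal illustration, not the authors' code: the toy thesaurus TOY_SYNONYMS stands in for WordNet, and the function names, whitespace tokenization, the floor of n to at least 1, and the rule that RD always keeps one word are all assumptions made for the example.

```python
import random

# Hypothetical toy thesaurus standing in for WordNet (assumption for illustration).
TOY_SYNONYMS = {
    "good": ["fine", "great"],
    "movie": ["film"],
    "happy": ["glad", "joyful"],
}


def num_changes(words, alpha):
    """n = alpha * l, rounded down; the minimum of 1 is an assumption."""
    return max(1, int(alpha * len(words)))


def synonym_replacement(words, n, rng, synonyms=TOY_SYNONYMS):
    """SR: replace up to n words that have an entry in the thesaurus."""
    out = list(words)
    candidates = [i for i, w in enumerate(out) if w in synonyms]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(synonyms[out[i]])
    return out


def random_insertion(words, n, rng, synonyms=TOY_SYNONYMS):
    """RI: insert a synonym of a random word at a random position, n times."""
    out = list(words)
    for _ in range(n):
        candidates = [w for w in out if w in synonyms]
        if not candidates:
            break
        synonym = rng.choice(synonyms[rng.choice(candidates)])
        out.insert(rng.randint(0, len(out)), synonym)
    return out


def random_swap(words, n, rng):
    """RS: swap two randomly chosen positions, n times."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(words, p, rng):
    """RD: drop each word independently with probability p."""
    kept = [w for w in words if rng.random() >= p]
    # Keep one word so the sentence never becomes empty (assumption).
    return kept if kept else [rng.choice(words)]


def eda(sentence, alpha=0.1, seed=None):
    """Apply one randomly chosen operation to a whitespace-tokenized sentence."""
    rng = random.Random(seed)
    words = sentence.split()
    if not words:
        return sentence
    n = num_changes(words, alpha)
    op = rng.choice(["SR", "RI", "RS", "RD"])
    if op == "SR":
        words = synonym_replacement(words, n, rng)
    elif op == "RI":
        words = random_insertion(words, n, rng)
    elif op == "RS":
        words = random_swap(words, n, rng)
    else:
        words = random_deletion(words, alpha, rng)
    return " ".join(words)
```

Calling eda("the movie was good and i am happy", alpha=0.1, seed=0) returns one augmented variant; calling it with several different seeds yields several variants of the same sentence.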

Benchmark Datasets
Five benchmark classification tasks
(1) SST-2: Stanford Sentiment Treebank (Socher et al., 2013)
(2) CR: customer reviews (Hu and Liu, 2004; Liu et al., 2015)
(3) SUBJ: subjectivity/objectivity dataset (Pang and Lee, 2004)
(4) TREC: question type dataset (Li and Roth, 2002)
(5) PC: Pro-Con dataset (Ganapathibhotla and Liu, 2008)
Models are trained on a random subset of the full training set with
N_train = {500, 2,000, 5,000, all available data}.
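
Drawing the random training subsets can be sketched as below. The function name sample_training_subsets, the fixed seed, and the choice to skip a size that is not smaller than the dataset are assumptions for illustration; the slides do not say how the subsets were drawn beyond "random".

```python
import random


def sample_training_subsets(examples, sizes=(500, 2000, 5000), seed=0):
    """Return {size: random subset} plus an "all" entry with the full training set."""
    rng = random.Random(seed)
    subsets = {}
    for size in sizes:
        # Skip sizes that are not smaller than the dataset (assumption).
        if size < len(examples):
            subsets[size] = rng.sample(examples, size)
    subsets["all"] = list(examples)
    return subsets
```

Each subset is sampled without replacement, so no training example appears twice within one subset.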

Performance on benchmark text
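Performance compared across fractions of the training data
Models tend to overfit when data is scarce, but the less data
there is, the larger the gain from EDA
With EDA, average accuracy rose from 87.8% to 88.6% on the full datasets
(a 0.8% gain on the full datasets, versus a 3% gain
with EDA when training on 500 examples)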

Does EDA conserve true labels?
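On a binary classification task, the authors checked how well the class of an augmented
sentence agrees with the true label
An RNN was trained on data without augmentation and then tested on augmented data;
t-SNE 2-D representations were used to visualize whether EDA output stays consistent
with the true labels
→ Augmented sentences produced results similar to the originals
Active Learning Literature Survey : http://burrsettles.com/pub/settles.activelearning.pdf
In practice, this is applicable to Active Learning, where additional data is fed in
to make a model more robust (when the score predicted at inference is at or below
a certain threshold, a person reviews the example and the label is then applied)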
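
The threshold rule for active learning can be sketched as follows. The function name route_predictions, the (text, label, confidence) triple format, and the default threshold of 0.8 are hypothetical choices for the example, not values from the slides or the paper.

```python
def route_predictions(predictions, threshold=0.8):
    """Split (text, label, confidence) triples into accepted and human-review queues."""
    accepted, needs_review = [], []
    for text, label, confidence in predictions:
        if confidence <= threshold:
            # At or below the threshold: a person checks the label before it is used.
            needs_review.append((text, label, confidence))
        else:
            accepted.append((text, label))
    return accepted, needs_review
```

Examples in the review queue would be labeled by a person and then added to the training data, which is where an augmentation step such as EDA could be applied again.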

Ablation Study (How much augmentation?)
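Augmentation parameter α, the fraction of words in a sentence that are changed:
α = {0.05, 0.1, 0.2, 0.3, 0.4, 0.5}
Lower α values performed better (changing many words in a
sentence alters its meaning)
The best number of augmented sentences generated per original sentence,
n_aug = {1, 2, 4, 8, 16, 32}, depends on the amount of training data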
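
To make the two ablation parameters concrete, the sketch below enumerates the grid and shows what each setting implies. It assumes n is αl rounded down with a minimum of 1, and that every original sentence is kept alongside its n_aug augmented copies; both are assumptions for illustration, as are the function names.

```python
ALPHAS = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5]
N_AUGS = [1, 2, 4, 8, 16, 32]


def words_changed(sentence_length, alpha):
    """Words changed per operation: n = alpha * l, at least 1 (assumption)."""
    return max(1, int(alpha * sentence_length))


def augmented_size(n_train, n_aug):
    """Training set size when each original is kept and n_aug copies are added."""
    return n_train * (1 + n_aug)


def ablation_grid(sentence_length, n_train):
    """List every (alpha, n_aug) setting with its implied n and dataset size."""
    return [
        {
            "alpha": alpha,
            "n_aug": n_aug,
            "words_changed": words_changed(sentence_length, alpha),
            "train_size": augmented_size(n_train, n_aug),
        }
        for alpha in ALPHAS
        for n_aug in N_AUGS
    ]
```

For example, ablation_grid(20, 500) lists all 36 settings for 20-word sentences and a 500-example training set.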

Implementation Details
Synonym thesaurus : WordNet (synonym dictionary)
Word embeddings : 300 dimensional word embeddings trained using GloVe
CNN Model
input layer, 1D convolutional layer of 128 filters of size 5,
global 1D max pool layer,
dense layer of 20 hidden units with ReLU activation function,
softmax output layer.
We initialize this network with random normal weights
and train against the categorical cross-entropy loss function
with the adam optimizer. We use early stopping with a patience of 3 epochs.
RNN Model
input layer,
bi-directional hidden layer with 64 LSTM cells,
dropout layer with p=0.5, bi-directional layer of 32 LSTM cells,
dropout layer with p=0.5,
dense layer of 20 hidden units with ReLU activation,
softmax output layer.
We initialize this network with random normal weights and
train against the categorical crossentropy loss function with the adam optimizer.
We use early stopping with a patience of 3 epochs.
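
The early-stopping rule used for both models can be sketched in plain Python. The monitored quantity is assumed to be validation loss, which the text above does not specify, and the function name early_stopping_epoch is hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training stops, or None if it never triggers."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None
```

With patience 3, training stops at the third consecutive epoch that fails to improve on the best validation loss seen so far.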

EDA’s Discussion and Limitations
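EDA is meaningful in that it showed large gains on small datasets and applied augmentation without a language model or additional data.
Its limitations: the gain on the full datasets was under 1%, experiments covered only five classification tasks, the effect was small when
ULMFit was applied, and the same outcome is expected with ELMo or BERT.
NLP results vary widely with the model, the nature of the data, and the task, so a fair method of comparison is needed.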
Universal Language Model Fine-tuning for Text Classification : https://arxiv.org/pdf/1801.06146.pdf

Thanks
Any Questions?
You can send mail to
Susang Kim(healess1@gmail.com)
