Research on character-level language modelling using LSTMs for semi-supervised learning. The objective is to learn the inner-layer representations of the language model and transfer them to a classification model.
Generalizing NLP pipelines by using bi-directional LSTMs to learn character (byte) level embeddings of financial news headlines, with one byte per token (values 0 to 2**8 - 1), in order to study the relationships between character vectors in financial news headlines and transfer that learning into classification models via UTF-8 encoding. Many traditional NLP steps (lemmatization, POS tagging, NER, stemming, ...) are skipped when diving to the byte level, making the process universal in scope rather than task-specific.
2. Agenda
• Why Unstructured Data?
• Character2Vector using Byte Level Encoding
• Embeddings
• Tips for LSTM Layer
• Transfer Learning
• Show code implementation with the model
3. Why the hype on Unstructured Data?
• Natural language processing (NLP) has become mainstream through a focus
on creating value from unstructured data.
• The number of firms that only use unstructured data has shot up from 2% in
2018 to 17% in 2020, and only 3% of the firms surveyed report that they do
not use alternative data sources, down from 30% in 2018.
• Refinitiv's 2020 survey, released last week, shows that 72% of firms' models were
negatively impacted by COVID-19. Some 12% of firms declared their models
obsolete, and 15% are building new ones. The main problem was the lack of
agility to quickly adapt and include new data sets in models as
circumstances changed.
4. Benefits of Char2vec
• With character embeddings, a vector can be formed for every word, even out-of-vocabulary
ones (no bag-of-words needed). Word embeddings, on the other hand, can only handle
words seen during training.
• A good fit for misspelled words.
• Handles infrequent words better than word2vec embeddings, as the latter suffer from too
few training opportunities for rare words.
• Reduces model complexity and improves performance (in terms of speed).
• All this comes at the cost of training on longer, sparser sequences, and thus more time to
train and optimize!
5. Why Not Byte Level Even?
• For single-byte encodings there is no difference
between reading characters and reading bytes. A
byte (8 bit slots) allows 256 possible values, so up
to 256 characters can be stored one per byte.
ASCII itself uses only 7 of those bits, encoding the
128 characters 0-127.
• I will use only these 128 ASCII code points, which
cover common English text (7 bit slots):
• 0-31 and 127: control characters (non-printable)
• 32-126: letters (upper/lower case), digits,
symbols and signs
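As a small sketch of the byte-level encoding described above (the headline text here is made up for illustration):

```python
# Minimal sketch: mapping a headline to its byte-level token ids.
# Printable ASCII (32-126) covers typical English headlines; anything
# outside that range would need a fallback (e.g. an "unknown" id).
headline = "Fed raises rates by 0.25%"
byte_ids = list(headline.encode("utf-8"))  # each char -> one integer 0-255

print(byte_ids[:5])  # → [70, 101, 100, 32, 114]  ('F', 'e', 'd', ' ', 'r')
print(all(32 <= b <= 126 for b in byte_ids))  # → True: printable ASCII only
```

For pure-ASCII text, UTF-8 bytes and ASCII codes coincide, which is what makes the byte-level and character-level views interchangeable here.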
6. Embeddings vs One-hot Encoding
Binary mode returns an array denoting which tokens exist at least once
in the input, while int mode replaces each token by an integer, thus
preserving their order.
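The binary/int distinction can be sketched in plain NumPy (the toy vocabulary and text below are made up; in Keras this corresponds to the `output_mode` options of a vectorization layer):

```python
import numpy as np

# Hypothetical toy vocabulary: ids 1-26 for 'a'-'z', 0 reserved for padding.
vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz", start=1)}

text = "cab"

# int mode: one integer per token, order preserved.
int_mode = [vocab[c] for c in text]  # → [3, 1, 2]

# binary mode: multi-hot presence vector; order (and counts) are lost.
binary_mode = np.zeros(len(vocab) + 1, dtype=int)
binary_mode[int_mode] = 1  # marks ids 1, 2, 3 regardless of their order
```

Because int mode keeps order, it is the form a sequence model like an LSTM needs; binary mode only suits order-free models.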
7. Embeddings Layer
• Gives the relationship between characters, based on how
characters accompany each other.
• A dense, n-dimensional vector representation of floating-point
values: a map from each char/byte to a dense vector.
• Embeddings are weights/parameters trainable by the model,
equivalent to the weights learned by a dense layer.
• In our case each unique character/byte is represented by an
N-dimensional vector of floating-point values; the learned
embedding forms a lookup table, and each character is encoded
by "looking up" its dense vector in the table.
• A simple integer encoding of our characters is not efficient for
the model to interpret, since a linear classifier only learns the
weight for a single feature, not the relationship (probability
distribution) between features (characters) or their
encodings.
• A higher-dimensional embedding can capture "fine-grained"
relationships between characters, but takes more data to
learn (256 dimensions in our case).
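The lookup-table view of the embedding layer can be sketched as follows (the random table stands in for the trained weights; sizes match the 7-bit ids and 256 dimensions mentioned above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes from the slides: 128 ASCII ids, 256-d vectors.
vocab_size, embed_dim = 128, 256

# The embedding layer is just a trainable lookup table:
# one dense float vector per character/byte id.
embedding_table = rng.normal(size=(vocab_size, embed_dim)).astype(np.float32)

char_ids = np.frombuffer(b"Fed", dtype=np.uint8)  # byte ids: [70, 101, 100]
vectors = embedding_table[char_ids]               # lookup → shape (3, 256)
```

Training adjusts the rows of this table by backpropagation, exactly as it adjusts dense-layer weights.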
8. Tips for LSTM Inputs
• The LSTM input layer must be 3D [i.e. batch_input_shape=(batch_size, n_timesteps, n_features)].
• The meaning of the 3 input dimensions are: samples, time steps, and features (sequences,
sequence_length, characters).
• The LSTM input layer is defined by the input_shape argument on the first hidden layer.
• The input_shape argument takes a tuple of two values that define the number of time steps and
features.
• The number of samples is assumed to be 1 or more; to accept any batch size, leave the batch
dimension as None in batch_input_shape, or leave input_length unspecified.
• The reshape() function on NumPy arrays can be used to reshape your 1D or 2D data to be 3D.
• The reshape() function takes a tuple as an argument that defines the new shape
• The LSTM returns the entire sequence of outputs for each sample (one vector per timestep per sample) if
you set return_sequences=True.
• A stateful RNN only makes sense if each input sequence in a batch starts exactly where the corresponding
sequence in the previous batch left off. Our RNN model is stateless, since the samples are independent of
each other: they don't form one continuous text corpus but are separate headlines.
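The reshape step described above can be sketched in NumPy (the corpus sizes here are made up for illustration):

```python
import numpy as np

# Hypothetical corpus: 1000 headlines, padded/truncated to 80 byte ids each.
data_2d = np.zeros((1000, 80), dtype=np.int32)

# The LSTM expects 3-D input: (samples, timesteps, features).
# With one byte id per timestep, add a trailing feature axis of size 1.
data_3d = data_2d.reshape((1000, 80, 1))

print(data_3d.shape)  # → (1000, 80, 1)
```

When an embedding layer sits in front of the LSTM, the integer 2-D input is fine as-is, since the embedding lookup itself produces the feature axis.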
10. Semi-Supervised (Transfer Learning)
• Previously, Word2Vec took two embedding layers
per token to predict the probability of the words
before and after. An LSTM can handle this without
random sampling: just take the maximum logit or
probability of the output to predict the next word
or character.
• Unlabeled data can compensate for scarce labeled
data in asset pricing.
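The greedy next-character prediction mentioned above can be sketched as follows (the logits here are made up; in the real model they would come from the LSTM's output layer):

```python
import numpy as np

def predict_next_char(logits):
    """Greedy decoding sketch: return the byte id with the highest score.

    `logits` stands in for the language model's output over the ASCII ids;
    no random sampling is needed for a deterministic prediction.
    """
    shifted = logits - logits.max()                   # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()   # softmax (inspection only)
    return int(np.argmax(probs))  # same argmax as on the raw logits

# Toy example: pretend byte 101 ('e') scores highest.
logits = np.full(128, -1.0)
logits[101] = 3.0
```

Since softmax is monotonic, taking the argmax of the probabilities is identical to taking the argmax of the raw logits; the softmax is shown only to connect logits to probabilities.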