The amount of text data (news articles, blogs, social media, etc.) on the web is growing at a staggering rate. However, irrelevant information, or noise, is growing much faster than the actionable information that can generate alpha. It is becoming increasingly difficult to mine actionable stories on the web using standard, out-of-the-box language processing techniques and libraries. Because the performance, robustness, and reliability of any data-centric model depend directly on the quality of its data, noise reduction is one of the most important steps in the data science pipeline. Thanks to recent research advances in big data, deep learning, and natural language processing, we are now able to mine for actionable stories across millions of information pieces and hundreds of terabytes of data.
In this talk, we will highlight the approaches and technologies we employ as part of the noise cancellation mechanism at Accern. We will also compare the performance of trading strategies that use social analytics derived with standard versus sophisticated noise cancellation techniques, as well as strategies that utilize other advanced metrics.
6. [Diagram: Input → Processing → Output pipeline, annotated with challenges and costs: Ambiguity in Text, Time Consuming, Dedicated Infrastructure, Manual Labor, Storage, Backup, Management, Return on Investment]
7. Why Bother?
Early Access:
-Low-exposure sources
-Faster than popular media
-Lead time on the order of hours
Trading Strategies:
-General sentiment
-Trend following
...
9. Big Data is getting Bigger
1 Accern Year =
-20+ million sources
-7+ billion articles
-200+ TB (200,000+ GB)
10. But Alpha-to-Noise Ratio is Decreasing
2014: Daily 4M analyzed, 14K delivered*
2016: Daily 6M analyzed, 17K delivered*
*approximately
11. How to increase this ratio?
Increase alpha/relevant information
and/or
Decrease noise
12. 3 high-level, practical points.
1. Reliable training data for every model in your system.
a. Requires a lot of manual labor and heuristics.
2. Sequencing noise filters in the order of increasing computational complexity and cost (see the sketch after this list).
a. Will reduce latency and infrastructure cost.
3. Relevancy module -- secret ingredient.
a. Define what is relevant to you. (e.g. M&A, Rumors, etc.)
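A minimal sketch of point 2, with hypothetical source lists, patterns, and a stub classifier standing in for a real model; this illustrates the cost-ordered chaining idea, not Accern's actual stack:

```python
import re

# Illustrative placeholders -- not Accern's actual lists or models.
BLACKLISTED_SOURCES = {"spam-blog.example", "link-farm.example"}
SPAM_PATTERN = re.compile(r"(?i)(buy now|work from home|free \$\$\$)")

class _StubClassifier:
    """Stand-in for a trained spam classifier (the expensive stage)."""
    def predict(self, texts):
        return ["spam" if "$$$" in t else "ham" for t in texts]

spam_model = _StubClassifier()

def blacklist_filter(article):
    """O(1) set lookup -- cheapest check, so it runs first."""
    return article["source"] not in BLACKLISTED_SOURCES

def pattern_filter(article):
    """Regex matching -- cheap, runs second."""
    return not SPAM_PATTERN.search(article["text"])

def classifier_filter(article):
    """Model inference -- most expensive, runs last."""
    return spam_model.predict([article["text"]])[0] == "ham"

# Filters ordered by increasing computational cost.
FILTERS = [blacklist_filter, pattern_filter, classifier_filter]

def is_clean(article):
    # all() short-circuits: an article rejected by a cheap early
    # filter never reaches the costly classifier.
    return all(f(article) for f in FILTERS)

print(is_clean({"source": "newswire.example", "text": "Acme Corp announces merger"}))  # True
```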
13. Accern Noise Cancellation Pipeline (simplified)
Data from 20 million sources →
Spam Reduction:
-Bad Data: Blacklisted Sources
-Structured Spam: Pattern Matching
-Spam Classifiers: Ensemble Learning
Relevancy:
-Language Rules: Semantic Analysis
-Financial Mapping: Taxonomies
-Relevance Scoring: Secret Ingredient
→ Analytics Pipeline (less than 3% noise)
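As a rough illustration of the "Spam Classifiers: Ensemble Learning" stage, here is a minimal scikit-learn sketch of a soft-voting spam ensemble. The tiny inline corpus stands in for a labeled dataset such as the Enron spam corpus mentioned on slide 15; none of this is Accern's actual model:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny stand-in for a labeled spam/ham corpus.
texts = [
    "Meeting moved to 3pm tomorrow",
    "Quarterly earnings report attached",
    "FREE $$$ click here to claim your prize",
    "Work from home and earn cash fast",
]
labels = ["ham", "ham", "spam", "spam"]

# Soft voting averages the predicted probabilities of both models.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        voting="soft",
    ),
)
model.fit(texts, labels)
print(model.predict(["Earn cash fast, work from home!!!"]))  # expected: ['spam']
```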
14. Performance comparison
Strategy: Long-only -- buy stocks when the long condition matches.
Backtest Companies: S&P 500
Benchmark: S&P 500 (SPY)
Metrics:
Average Daily Sentiment
Average of the calculated sentiment of related articles over the last 24 hours.
Impact
Probability (×100) that the article will move the stock price by 1% or more (increase or decrease) by the end of the trading day. Utilizes historical information.
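A minimal sketch of the Average Daily Sentiment computation as defined above; the DataFrame columns and sample values are hypothetical:

```python
import pandas as pd

# Hypothetical article-level data: one row per article about a ticker.
articles = pd.DataFrame({
    "ticker":    ["AAPL", "AAPL", "AAPL", "MSFT"],
    "timestamp": pd.to_datetime([
        "2015-06-30 09:30", "2015-06-30 15:45",
        "2015-06-29 11:00", "2015-06-30 10:15",
    ]),
    "sentiment": [0.8, -0.2, 0.5, 0.3],   # per-article sentiment score
})

def average_daily_sentiment(df, ticker, as_of):
    """Mean sentiment of a ticker's related articles over the last 24 hours."""
    window_start = as_of - pd.Timedelta(hours=24)
    mask = (
        (df["ticker"] == ticker)
        & (df["timestamp"] > window_start)
        & (df["timestamp"] <= as_of)
    )
    return df.loc[mask, "sentiment"].mean()

print(average_daily_sentiment(articles, "AAPL", pd.Timestamp("2015-06-30 16:00")))
# -> 0.3  (mean of 0.8 and -0.2; the 2015-06-29 article falls outside the window)
```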
15. Performance comparison
Standard Model:
-Sentiment: Stanford CoreNLP, Python NLTK
-Impact: CoreNLP Entity
-Training Data: Enron Spam Dataset
Standard+Accern Model:
-Sentiment: Stanford CoreNLP, Python NLTK
-Impact: Accern Impact
-Training Data: Enron Spam Dataset
Accern Model:
-Sentiment: Accern Sentiment
-Impact: Accern Impact
-Training Data: Historical data processed using the Accern Noise Cancellation pipeline
*Models can only trade on common entities for fair comparison.
**Each model can use the best configuration.
***Trading Period: July 1, 2013 - July 1, 2015
16. Standard Model
Using standard spam filters for noise cancellation, a standard sentiment model, and impact calculated using standard entity extraction.
17. Standard+Accern Model
Using standard spam filters for noise cancellation, a standard sentiment model, and the Accern impact model.
18. Accern Model
Using the Accern Noise Cancellation pipeline for noise reduction, and the Accern sentiment and impact models.
19. One more strategy: Drift Following
● The number of longs/shorts determines the weights of securities (see the sketch below).
● Use the 40-day article sentiment average.
● Weekly holding period.
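A minimal sketch of this weighting scheme, assuming equal weights within each side and illustrative signal thresholds; the slide does not specify the exact rules:

```python
import pandas as pd

def drift_following_weights(sentiment, long_thresh=0.25, short_thresh=-0.25):
    """Target weights from the 40-day average sentiment, rebalanced weekly.

    `sentiment` is a DataFrame of daily sentiment averages, indexed by
    date with one column per ticker. Thresholds are illustrative.
    """
    # 40-day rolling article-sentiment average (slide: "40 day average").
    drift = sentiment.rolling(window=40, min_periods=40).mean()

    latest = drift.iloc[-1]                       # signal on rebalance day
    longs = latest[latest > long_thresh].index
    shorts = latest[latest < short_thresh].index

    # Slide: the number of longs/shorts determines the weights --
    # here, equal weight per position, normalized by total signal count.
    n = len(longs) + len(shorts)
    weights = pd.Series(0.0, index=sentiment.columns)
    if n:
        weights[longs] = 1.0 / n
        weights[shorts] = -1.0 / n
    return weights  # hold for one week, then recompute
```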