The amount of text data (news articles, blogs, social media, etc.) on the web is growing at a staggering rate. However, irrelevant information, or noise, is growing much faster than the actionable information that can generate alpha. It is becoming increasingly difficult to mine actionable stories on the web using standard, out-of-the-box language processing techniques and libraries. Because the performance, robustness, and reliability of any data-centric model depend directly on the quality of its data, noise reduction is one of the most important steps in the data science pipeline. Thanks to recent research advances in big data, deep learning, and natural language processing, we are now able to mine for actionable stories across millions of information pieces and hundreds of terabytes of data.
In this talk, we will highlight the approaches and technologies we employ as part of the noise cancellation mechanism at Accern. We will also compare the performance of trading strategies that use social analytics derived with standard versus sophisticated noise cancellation techniques, as well as strategies that utilize other advanced metrics.
6. [Diagram: Input → Processing → Output pipeline, annotated with challenges and costs: Ambiguity in Text, Time Consuming, Dedicated Infrastructure, Manual Labor, Storage, Backup, Management, Return on Investment]
7. Why Bother?
Early Access:
-Low-exposure sources
-Faster than popular media
-Lead time on the order of hours
Trading Strategies:
-General sentiment
-Trend following
...
9. Big Data is getting Bigger
1 Accern Year =
-20+ million sources
-7+ billion articles
-200+ TB (200,000+ GB)
10. But Alpha-to-Noise Ratio is Decreasing
2014: Daily 4M analyzed, 14K delivered*
2016: Daily 6M analyzed, 17K delivered*
*approximately
11. How to increase this ratio?
Increase alpha/relevant information
and/or
Decrease noise
12. 3 high-level, practical points.
1. Reliable training data for every model in your system.
a. Requires a lot of manual labor and heuristics.
2. Sequencing noise filters in the order of increasing computational complexity and cost (see the sketch after this list).
a. Will reduce latency and infrastructure cost.
3. Relevancy module -- secret ingredient.
a. Define what is relevant to you. (e.g. M&A, Rumors, etc.)
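A minimal sketch of point 2, with hypothetical source lists, patterns, and a stub classifier standing in for a real model; this illustrates the cost-ordered chaining idea, not Accern's actual stack:

```python
import re

# Illustrative placeholders -- not Accern's actual lists or models.
BLACKLISTED_SOURCES = {"spam-blog.example", "link-farm.example"}
SPAM_PATTERN = re.compile(r"(?i)(buy now|work from home|free \$\$\$)")

class _StubClassifier:
    """Stand-in for a trained spam classifier (the expensive stage)."""
    def predict(self, texts):
        return ["spam" if "$$$" in t else "ham" for t in texts]

spam_model = _StubClassifier()

def blacklist_filter(article):
    """O(1) set lookup -- cheapest check, so it runs first."""
    return article["source"] not in BLACKLISTED_SOURCES

def pattern_filter(article):
    """Regex matching -- cheap, runs second."""
    return not SPAM_PATTERN.search(article["text"])

def classifier_filter(article):
    """Model inference -- most expensive, runs last."""
    return spam_model.predict([article["text"]])[0] == "ham"

# Filters ordered by increasing computational cost.
FILTERS = [blacklist_filter, pattern_filter, classifier_filter]

def is_clean(article):
    # all() short-circuits: an article rejected by a cheap early
    # filter never reaches the costly classifier.
    return all(f(article) for f in FILTERS)

print(is_clean({"source": "newswire.example", "text": "Acme Corp announces merger"}))  # True
```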
13. Accern Noise Cancellation Pipeline (simplified)
Data from 20 million sources →
Spam Reduction:
-Bad Data: Blacklisted Sources
-Structured Spam: Pattern Matching
-Spam Classifiers: Ensemble Learning
Relevancy:
-Language Rules: Semantic Analysis
-Financial Mapping: Taxonomies
-Relevance Scoring: Secret Ingredient
→ Analytics Pipeline (less than 3% noise)
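As a rough illustration of the "Spam Classifiers: Ensemble Learning" stage, here is a minimal scikit-learn sketch of a soft-voting spam ensemble. The tiny inline corpus stands in for a labeled dataset such as the Enron spam corpus mentioned on slide 15; none of this is Accern's actual model:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny stand-in for a labeled spam/ham corpus.
texts = [
    "Meeting moved to 3pm tomorrow",
    "Quarterly earnings report attached",
    "FREE $$$ click here to claim your prize",
    "Work from home and earn cash fast",
]
labels = ["ham", "ham", "spam", "spam"]

# Soft voting averages the predicted probabilities of both models.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        voting="soft",
    ),
)
model.fit(texts, labels)
print(model.predict(["Earn cash fast, work from home!!!"]))  # expected: ['spam']
```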
14. Performance comparison
Strategy: Long-only -- buy stocks when the long condition matches.
Backtest Companies: S&P 500
Benchmark: S&P 500 (SPY)
Metrics:
Average Daily Sentiment
Average of the calculated sentiment of related articles over the last 24 hours.
Impact
Probability (×100) that the article will move the stock price by 1% or more (increase or decrease) by the end of the trading day. Utilizes historical information.
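A minimal sketch of the Average Daily Sentiment computation as defined above; the DataFrame columns and sample values are hypothetical:

```python
import pandas as pd

# Hypothetical article-level data: one row per article about a ticker.
articles = pd.DataFrame({
    "ticker":    ["AAPL", "AAPL", "AAPL", "MSFT"],
    "timestamp": pd.to_datetime([
        "2015-06-30 09:30", "2015-06-30 15:45",
        "2015-06-29 11:00", "2015-06-30 10:15",
    ]),
    "sentiment": [0.8, -0.2, 0.5, 0.3],   # per-article sentiment score
})

def average_daily_sentiment(df, ticker, as_of):
    """Mean sentiment of a ticker's related articles over the last 24 hours."""
    window_start = as_of - pd.Timedelta(hours=24)
    mask = (
        (df["ticker"] == ticker)
        & (df["timestamp"] > window_start)
        & (df["timestamp"] <= as_of)
    )
    return df.loc[mask, "sentiment"].mean()

print(average_daily_sentiment(articles, "AAPL", pd.Timestamp("2015-06-30 16:00")))
# -> 0.3  (mean of 0.8 and -0.2; the 2015-06-29 article falls outside the window)
```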
15. Performance comparison
Standard Model:
-Sentiment: Stanford CoreNLP, Python NLTK
-Impact: CoreNLP Entity
-Training Data: Enron Spam Dataset
Standard+Accern Model:
-Sentiment: Stanford CoreNLP, Python NLTK
-Impact: Accern Impact
-Training Data: Enron Spam Dataset
Accern Model:
-Sentiment: Accern Sentiment
-Impact: Accern Impact
-Training Data: Historical data processed using the Accern Noise Cancellation pipeline
*Models can only trade on common entities for fair comparison.
**Each model can use the best configuration.
***Trading Period: July 1, 2013 - July 1, 2015
16. Standard Model
Using standard spam filters for noise cancellation, a standard sentiment model, and impact calculated using standard entity extraction.
17. Standard+Accern Model
Using standard spam filters for noise cancellation, a standard sentiment model, and the Accern impact model.
18. Accern Model
Using the Accern Noise Cancellation pipeline for noise reduction, and the Accern sentiment and impact models.
19. One more strategy: Drift Following
● The number of longs/shorts determines the weights of securities (see the sketch below).
● Use the 40-day article sentiment average.
● Weekly holding period.
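A minimal sketch of this weighting scheme, assuming equal weights within each side and illustrative signal thresholds; the slide does not specify the exact rules:

```python
import pandas as pd

def drift_following_weights(sentiment, long_thresh=0.25, short_thresh=-0.25):
    """Target weights from the 40-day average sentiment, rebalanced weekly.

    `sentiment` is a DataFrame of daily sentiment averages, indexed by
    date with one column per ticker. Thresholds are illustrative.
    """
    # 40-day rolling article-sentiment average (slide: "40 day average").
    drift = sentiment.rolling(window=40, min_periods=40).mean()

    latest = drift.iloc[-1]                       # signal on rebalance day
    longs = latest[latest > long_thresh].index
    shorts = latest[latest < short_thresh].index

    # Slide: the number of longs/shorts determines the weights --
    # here, equal weight per position, normalized by total signal count.
    n = len(longs) + len(shorts)
    weights = pd.Series(0.0, index=sentiment.columns)
    if n:
        weights[longs] = 1.0 / n
        weights[shorts] = -1.0 / n
    return weights  # hold for one week, then recompute
```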