Investing in the Age of Big Data

Each day, the internet, smartphones and social media generate oceans of new financial data. This data is often referred to as “big data”, and the term has generated excitement in the investing world because of its potential to vastly improve the decision-making of investors and financial analysts. Modern businesses have begun to store data on everything, from what a consumer likes, to how they interact with a particular product, to the amazing restaurant that opened in Italy last weekend.

As more data is recorded every year, the possibilities of what can be done with so much raw data continue to grow. At the same time, advancements in information technology have led to the emergence of new tools such as machine learning and sentiment analysis. Investors are looking to leverage those new tools and technologies to gain advantages over market players that rely purely on traditional data.

$62.10 billion: market of data analytics in banking by 2025 (Source)
99.5% of collected data is never analyzed (Source)

Bringing Structure to Big Data

In the past, computers could only analyze structured data, that is, data that is easily quantifiable and organized in a set format. However, about 80% of all generated data is unstructured, meaning it is expressed in a format that is not easily quantified. An example is textual news data, which often does not come naturally mapped to a particular company or topic. The mapping process for such data sets is complex and resource-intensive, especially for "hidden" concepts and topics that do not appear explicitly as keywords in the text. Financial analysts face a major problem when working with unstructured textual data: How can a machine understand and interpret text like a human? How can a system understand and apply domain-specific knowledge? About 90% of businesses report that unstructured data is their primary big data problem. As a result, most investors do not yet use textual data to gain new insights.

Filtering and Reducing the Data

Over the last few years, there has been an exponential increase in the data available to investors. Thousands of news items and opinions are shared on social media every second, and an overwhelming amount of metadata can be derived from them. However, not all data is relevant to a particular use case or useful for financial analysis, so choosing the right technological tools is critical. In addition, analysts usually encounter data quality issues such as missing, incomplete or duplicated data, even when obtaining data from reliable sources. Therefore, big data must be "cleaned" and filtered before it can be processed further to improve investment strategies.
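The cleaning step described above can be sketched in a few lines of Python. This is a minimal illustration, not StockBrain's actual implementation: the record layout (`title`, `body`, `source`) and the function name `clean_articles` are assumptions made for the example.

```python
from typing import Iterable

# Hypothetical schema: the fields every usable record is assumed to need.
REQUIRED_FIELDS = ("title", "body", "source")


def clean_articles(articles: Iterable[dict]) -> list[dict]:
    """Drop records with missing fields and near-identical duplicates."""
    seen = set()
    cleaned = []
    for article in articles:
        # Skip records where any required field is absent or blank.
        if any(not str(article.get(field) or "").strip() for field in REQUIRED_FIELDS):
            continue
        # Normalise case and whitespace so trivial variations count as duplicates.
        key = (
            " ".join(article["title"].lower().split()),
            " ".join(article["body"].lower().split()),
        )
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(article)
    return cleaned
```

Only the first copy of a duplicated story is kept, so the order of the input decides which source "wins"; a real pipeline would likely need a more deliberate rule.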

Sentiment Analysis in Finance

Investor sentiment can be described as a belief about future stock performance and investment risks that is not justified by the facts at hand. Extensive research has shown that the question is no longer whether investor sentiment affects stock prices, but how to measure investor sentiment and quantify its effects. Furthermore, recent publications in Natural Language Processing (NLP) have reported promising results when analyzing social media posts or news articles to predict stock prices. StockBrain builds on current research to deliver an accurate "stock sentiment" indicator, which captures whether investors are likely to increase or withdraw their investments in a particular stock.

Introducing the StockBrain Data Pipeline

The following steps give a broad overview of how StockBrain's system empowers equity investors:

1) Data Collection & Relevancy Filtering

Every day, StockBrain collects thousands of news articles from over 500 sources in 6 different languages. All articles are passed through several quality filters to remove unrelated content or spam from our processing pipeline.
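One way to picture a chain of quality filters is as a list of small checks that every article must pass. The sketch below is illustrative only: the language codes, spam phrases, minimum length and the `lang` field are all placeholders invented for the example, not StockBrain's real rules.

```python
# All thresholds and lists below are hypothetical placeholders.
SUPPORTED_LANGUAGES = {"en", "de", "fr"}
SPAM_PHRASES = ("click here", "limited offer")
MIN_WORDS = 5


def long_enough(article: dict) -> bool:
    """Reject stubs that are too short to carry a story."""
    return len(article["body"].split()) >= MIN_WORDS


def supported_language(article: dict) -> bool:
    """Keep only articles tagged with a language the pipeline handles."""
    return article.get("lang") in SUPPORTED_LANGUAGES


def not_spam(article: dict) -> bool:
    """Reject bodies containing any known spam phrase."""
    body = article["body"].lower()
    return not any(phrase in body for phrase in SPAM_PHRASES)


FILTERS = (long_enough, supported_language, not_spam)


def passes_filters(article: dict) -> bool:
    """An article is kept only if every filter accepts it."""
    return all(check(article) for check in FILTERS)
```

Keeping each check as its own function makes it easy to add, remove or reorder filters without touching the rest of the pipeline.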

2) Topic and Knowledge Extraction

StockBrain extracts key concepts and phrases from each news story and uses this information to form relationships between articles and concepts. To achieve this, we use a combined Vector Space Model (VSM) and Latent Dirichlet Allocation (LDA) approach. In addition, we use curated dictionaries to identify concepts that relate to investment, business and economic topics. As a result, StockBrain can track which news topics were dominating the headlines when a particular stock price change occurred.
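To make the vector space idea concrete, the sketch below represents each article as a TF-IDF weighted term vector and compares two articles by cosine similarity. It covers only the VSM half of the approach; the LDA topic model and the curated dictionaries are omitted. The tokenizer, the smoothed IDF formula and all function names are assumptions chosen for the example, not StockBrain's implementation.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Relative term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0
```

Articles that share many distinctive terms end up with a similarity close to 1, while articles with no terms in common score 0, which is one simple way to link stories to each other and to shared concepts.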

3) News Sentiment Modelling

To determine the sentiment of an article, StockBrain uses a mix of stochastic and statistical methods that are based on state-of-the-art sentiment dictionaries such as VADER or NTUSD-Fin. In addition, we increase the precision of existing approaches by also considering the topic knowledge extracted in step 2. This allows us not only to generate one overall sentiment value for a news story, but also to measure and weight the sentiment of each sentence.
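A dictionary-based, sentence-level scoring scheme could look roughly like the following. The tiny lexicon here is a toy stand-in for resources such as VADER or NTUSD-Fin, and the topic terms, the weight of 2.0 and the averaging rule are assumptions made purely for illustration.

```python
import re

# Toy lexicon (word -> polarity); a stand-in for a real sentiment dictionary.
LEXICON = {
    "surge": 0.8, "profit": 0.6, "growth": 0.5,
    "loss": -0.7, "lawsuit": -0.6, "decline": -0.5,
}
# Hypothetical topic terms, assumed to come from an earlier extraction step.
TOPIC_TERMS = {"earnings", "profit", "revenue"}
TOPIC_WEIGHT = 2.0  # assumed boost for sentences that mention a tracked topic


def split_sentences(text: str) -> list[str]:
    """Naive split on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_score(sentence: str) -> float:
    """Mean polarity of the lexicon words in the sentence (0.0 if none match)."""
    words = re.findall(r"[a-z]+", sentence.lower())
    hits = [LEXICON[w] for w in words if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


def article_sentiment(text: str) -> tuple[float, list[tuple[str, float, float]]]:
    """Return (overall score, [(sentence, score, weight), ...])."""
    scored = []
    for sentence in split_sentences(text):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        weight = TOPIC_WEIGHT if words & TOPIC_TERMS else 1.0
        scored.append((sentence, sentence_score(sentence), weight))
    total_weight = sum(weight for _, _, weight in scored)
    # Weighted mean of sentence scores, so topic-relevant sentences count more.
    overall = sum(s * w for _, s, w in scored) / total_weight if total_weight else 0.0
    return overall, scored
```

Because each sentence keeps its own score and weight, the overall value can be traced back to the sentences that drove it.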

4) Stock Sentiment Calculation

All sentiment information that relates to a particular company is combined into a single metric that ranges from -1 (most pessimistic) to 1 (most optimistic). To provide actionable insights, we also display which topics or news stories contribute to the current investor sentiment.
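The text does not say how the individual values are combined, so the following is only one plausible sketch: a recency-weighted average of article scores, clipped to the range -1 to 1, with the per-article contributions returned alongside the score. The exponential decay, the three-day half-life and all names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScoredArticle:
    title: str
    sentiment: float  # article-level sentiment, assumed to lie in [-1, 1]
    age_days: float   # time since publication


HALF_LIFE_DAYS = 3.0  # hypothetical: an article's influence halves every 3 days


def recency_weight(age_days: float) -> float:
    """Exponential decay, so newer articles count more."""
    return 0.5 ** (age_days / HALF_LIFE_DAYS)


def stock_sentiment(articles: list[ScoredArticle]) -> tuple[float, list[tuple[str, float]]]:
    """Return (score in [-1, 1], contributions sorted by absolute impact)."""
    weights = [recency_weight(a.age_days) for a in articles]
    total = sum(weights)
    if total == 0:
        return 0.0, []
    # Each article's share of the final score.
    contributions = [(a.title, a.sentiment * w / total) for a, w in zip(articles, weights)]
    score = max(-1.0, min(1.0, sum(c for _, c in contributions)))
    ranked = sorted(contributions, key=lambda c: abs(c[1]), reverse=True)
    return score, ranked
```

Returning the ranked contributions next to the score is what allows a display of the stories that contribute most to the current value.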