Introduction

In the ever-evolving financial market, investors and financial analysts constantly seek ways to gain an edge in predicting stock price movements. Financial metrics derived from financial statements are widely used as indicators of a company's performance and can provide valuable insights into its future prospects. This project aims to develop a machine learning model that predicts the percentage change in the stock prices of top listed US companies from 2002 to 2022, using key financial metrics derived from their financial statements. If its predictions are accurate, the model could serve as a decision-making tool for investors and analysts, helping them make more informed investment decisions and optimise their portfolio management strategies. The success of the model will be evaluated on its predictive accuracy and on its ability to outperform conventional approaches to forecasting stock price movements, such as fundamental analysis or technical analysis.

Methodology

Data Pipeline

To manage the flow of data efficiently from extraction through to modelling, data pipelines were constructed using Mage, an alternative to Apache Airflow. Mage was used extensively to schedule, manage and orchestrate the entire dataflow.

Architecture

Data Collection

Data was collected from three disparate sources and then collated to generate a novel dataset for this project. This was made possible by building a Mage data pipeline that handles recurring data pull requests and inserts the results into a database as they arrive.
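
The pull-and-insert pattern described above can be illustrated with a minimal sketch. It is not the project's actual pipeline code: the SQLite database, the `metrics` table and its columns, and the `fetch` callable are all hypothetical stand-ins chosen so the example is self-contained.

```python
import sqlite3
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


def pull_and_store(
    conn: sqlite3.Connection,
    tickers: Iterable[str],
    fetch: Callable[[str], List[Row]],
) -> int:
    """Fetch rows for each ticker and write them to the database as they arrive.

    `fetch` is a hypothetical stand-in for a provider API call. The table
    name and columns are illustrative assumptions, not the project's schema.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics ("
        "ticker TEXT, year INTEGER, value REAL, "
        "PRIMARY KEY (ticker, year))"
    )
    written = 0
    for ticker in tickers:
        rows = fetch(ticker)
        # One transaction per ticker: a failed pull leaves earlier tickers intact.
        with conn:
            conn.executemany(
                "INSERT OR REPLACE INTO metrics (ticker, year, value) "
                "VALUES (:ticker, :year, :value)",
                rows,
            )
        written += len(rows)
    return written
```

Because the insert uses `INSERT OR REPLACE` on a `(ticker, year)` key, re-running a pull for the same ticker overwrites rows rather than duplicating them, which is one way to make repeated scheduled runs safe.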

The financial metrics data was collected from a financial data provider, FinancialModelPrep, using its API. The data covers 1360 tickers from 2002 to 2022. To avoid exceeding the daily API limits, a trigger was set up to fetch data for 50 tickers at a time.
The stock price data was extracted from Yahoo Finance via its API for each row of the financial metrics data. Each run was limited to 50 new tickers to avoid hitting API limits.
A separate test dataset comprising 200 tickers was collected using the same process to evaluate the performance of our model on unseen data. The findings and interpretations of the test results are presented in the following section.
The company news data was obtained from FinnHub, a financial news provider, using its free API tier. Because of the API limit, news data was fetched for 50 tickers every 10 minutes. The data consists of news articles for all 1360 tickers and covers the year 2022.
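
The batching described above (50 new tickers per run) can be sketched as follows. This is an illustrative sketch only: the function names and the in-memory `already_fetched` set are assumptions, and in the actual pipeline the scheduling is handled by a Mage trigger rather than by this code.

```python
from typing import Iterable, List, Set

BATCH_SIZE = 50  # tickers per run, as stated in the text


def next_batch(
    all_tickers: Iterable[str],
    already_fetched: Set[str],
    batch_size: int = BATCH_SIZE,
) -> List[str]:
    """Return up to `batch_size` tickers that have not been fetched yet."""
    batch: List[str] = []
    for ticker in all_tickers:
        if ticker in already_fetched:
            continue
        batch.append(ticker)
        if len(batch) == batch_size:
            break
    return batch


def runs_needed(n_tickers: int, batch_size: int = BATCH_SIZE) -> int:
    """Number of scheduled runs needed to cover all tickers (ceiling division)."""
    return -(-n_tickers // batch_size)
```

Under these assumptions, covering all 1360 tickers at 50 per run takes `runs_needed(1360)`, that is 28 runs, with the last run fetching the remaining 10 tickers.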

<aside> A detailed description of the variables and the target is provided in Appendix A.

</aside>

Data Ingestion