Quantitative Financial Statement Analysis

Categorizes companies into clusters using unsupervised learning

Background

Business and investment analysts rely heavily on financial statements, which are key inputs when judging a company's overall performance and whether an investment is likely to yield a return. In recent years, various supervised machine learning models have been applied to extract insights from the large volume of financial data now available.

Problem and Objectives

This project explores how to group companies from various industries based on the financial information provided in their 10-K filings. The objective is to separate these companies into clusters and identify the key attributes of each cluster, so that similarities between companies can be explored.

How to use our Tool:

On the clusters page, each company is listed along with its cluster number. Clicking on a company shows the other companies in the same cluster, along with some of the characteristics that were used to form that cluster. A copy of the Shapley values plot is also included. You can also search for companies by ticker symbol or name, using the Search Box at the top-right of the page.

How the Model Works:

The dataset was first cleaned of NaN values using several strategies, such as removing attributes and rows that had high NaN rates.
Several clustering techniques were then tried, including k-means, DBSCAN, and expectation-maximization, among others. Even after normalizing the data, the attributes still varied widely across companies. As a result, many companies were marked as noise, which made techniques such as DBSCAN ineffective.
However, clustering with Affinity Propagation and Gaussian Mixture produced clearly distinct clusters.
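
The steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual pipeline: the `drop_sparse` helper, the 50% NaN threshold, the median fill, the column names, and every hyperparameter (`eps`, `min_samples`, `n_components`) are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN, AffinityPropagation
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler


def drop_sparse(df, max_nan_rate=0.5):
    """Drop columns, then rows, whose share of NaN values exceeds max_nan_rate."""
    df = df.loc[:, df.isna().mean(axis=0) <= max_nan_rate]
    df = df.loc[df.isna().mean(axis=1) <= max_nan_rate]
    # Fill the remaining gaps with the column median (one possible strategy).
    return df.fillna(df.median())


# Synthetic stand-in for the financial attributes (hypothetical column names).
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 6))
values = np.vstack([c + rng.normal(size=(40, 6)) for c in centers])
df = pd.DataFrame(values, columns=[f"ratio_{i}" for i in range(6)])
df.iloc[::7, 2] = np.nan              # a few scattered gaps
df["mostly_missing"] = np.nan         # an attribute with a high NaN rate
df.loc[:4, "mostly_missing"] = 1.0

clean = drop_sparse(df)
X = StandardScaler().fit_transform(clean)  # normalize each attribute

# DBSCAN labels points it cannot assign to a dense region as -1 (noise).
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
noise_share = float(np.mean(dbscan_labels == -1))

# Affinity Propagation picks the number of clusters itself;
# the Gaussian Mixture needs it up front (3 is an assumption here).
ap_labels = AffinityPropagation(random_state=0).fit_predict(X)
gm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)

print(f"DBSCAN noise share: {noise_share:.0%}")
print(f"Affinity Propagation clusters: {len(set(ap_labels))}")
```

Checking the share of `-1` labels from DBSCAN is a quick way to see whether a density-based method is usable on a given dataset before committing to it.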

The final model uses Affinity Propagation clustering to form the company groups.
Supervised learning with Random Forests, together with Shapley values, was also used to identify some of the factors involved in forming the clusters.
The summary of the top factors impacting all clusters can be found below (analyzed using Shapley Values):
Shapley Values Plot for All Clusters
A more specific example can be seen below, where the top factors for cluster 0 are shown:
Shapley Values Plot for Cluster 0
Here, you can see that the cluster tends to have high Average Receivables and high Average Payables ratios.
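
The idea of explaining a cluster with a supervised model can be sketched as follows. This is a simplified stand-in, not the project's code: it trains a Random Forest to predict membership in cluster 0 and ranks attributes by the forest's impurity-based feature importances, which are not Shapley values. The data, feature names, and cluster labels are all synthetic assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic attributes (hypothetical names) for 500 companies.
rng = np.random.default_rng(0)
feature_names = ["avg_receivables", "avg_payables", "noise_a", "noise_b"]
X = rng.normal(size=(500, 4))

# Hypothetical cluster labels: cluster 0 holds companies that are high
# on both of the first two attributes; everything else is cluster 1.
cluster_labels = np.where((X[:, 0] > 0.5) & (X[:, 1] > 0.5), 0, 1)

# One-vs-rest target: is the company in cluster 0?
in_cluster_0 = (cluster_labels == 0).astype(int)
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, in_cluster_0)

# Rank attributes by impurity-based importance (a stand-in for Shapley values).
ranking = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
top_two = {name for name, _ in ranking[:2]}

# Importances carry no sign, so compare cluster means with overall means
# to see whether the cluster sits high or low on each attribute.
direction = X[in_cluster_0 == 1].mean(axis=0) - X.mean(axis=0)

for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

Unlike a Shapley values plot, feature importances do not show the direction of an effect, which is why the sketch compares the cluster's attribute means against the overall means separately.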