Statistical Arbitrage: Foundations of a Mean-Reversion Model in Oil Futures
November 5, 2024
Author: Lewin Hafner
Background & Disclaimer: This is neither a strategy nor a model that can or should be deployed in the market. It primarily serves to illustrate some knowledge and ideas I've accumulated over time. All text is authored by me, as I do not rely on AI tools such as ChatGPT - therefore, there might be errors in grammar and punctuation. The Python and R code is included at the bottom of this page.
Introduction
Contrary to what a majority of retail participants think, markets can be viewed through either a directional or a relative lens. While the former is concerned with bets on where a certain market is headed (up or down), relative value deals with the relative pricing of two or more assets. Think of Pepsi and Coca-Cola: as both companies share many of the same industry characteristics and their products display a high degree of substitutability, their share prices are expected to follow roughly the same path. History shows that they do - but to varying degrees. This gives rise to statistical arbitrage, a class of strategies that employ statistical rigor to exploit perceived relative mispricings. One key advantage of such strategies is their beta-neutral profile, potentially allowing for returns that are orthogonal to the market. The following piece of work delves deeper into statistical arbitrage by (i) shedding light on some of the theory, assumptions and concepts involved and (ii) developing the foundations of a strategy that aims to capture such relative mispricings in oil futures.
Efficient Market Hypothesis and Market Anomalies
One primary function of markets of all sorts is to match buyers and sellers and ensure efficient pricing of tradables. Central to this tenet is what financial theory refers to as the "efficient market hypothesis" (EMH), which states that asset prices reflect all available information and are thus always aligned with their fair value. Perhaps not explicitly - but EMH implies that alpha generation (excess returns over a benchmark) through stock picking and market timing is not possible, and that participants would be better off with passive exposure to the market. Accordingly, market bubbles would not (or do not) exist, research is obsolete, and market regulation as well as government intervention should be limited. EMH's claim that information is instantly embedded in prices also implies that prices react only to novel information - and as information is not predictable, asset prices are not predictable but random (see Malkiel (2003) and Dupernex (2007) for discussions).
I believe there are three points worth highlighting. First, numerous studies document anomalies, patterns and abnormal behavior in financial markets. Huberman and Regev (2001), for example, present a case where a New York Times article led to a parabolic move in biotech stocks, although the news had been published in a scientific journal five months prior. This goes to show that the diffusion of news can create inefficiencies, i.e. situations where asset prices do not fully reflect their fair values. Another anomaly extensively covered by research is the momentum effect - the observation that rising (declining) prices tend to be self-reinforcing. If market progressions followed a "random walk" (hence the market had no "memory"), there would be no room for self-reinforcing dynamics within the system. Events like the liquidation cascade in GameStop in early 2021, however, make that hard to believe. For a great review and discussion, see Schwert (2002) and Malkiel (2003).
Second, the debate about market efficiency is actually two debates: whether markets are efficient, and whether deviations from efficiency can be exploited. Just because inefficiencies are (seemingly) not exploitable to the extent that they can generate alpha doesn't mean they don't exist. The pursuit of above-market returns is hindered by market frictions such as transaction costs. As a consequence, asset prices can remain dislocated for prolonged periods.
Third, the spirit of EMH seems to be tied to the idea of a homo oeconomicus - an economic model that postulates that humans behave in ways that maximize their utility (i.e. fully rationally). Accordingly, humans are assumed to be rational in the way they process new information, evaluate different options and make financial trading decisions. Insights from behavioral finance clearly reject these assumptions: humans are prone to a variety of cognitive biases such as confirmation bias (i.e. filtering information in favor of pre-existing beliefs), recency bias (i.e. giving more weight to recent events than to those in the distant past), overconfidence (i.e. an overestimation of one's abilities) and loss aversion (i.e. losses instill more pain than wins induce pleasure) (see XYZ for a great review). Taken together, such distortions pave the way for periods in which asset prices decouple from their fundamental value, creating mispricing opportunities. Lastly, my humble opinion is this: ask any practitioner who deals in discretionary trading and you'll realize that markets are anything but rational.
Mean-Reversion and Cointegration
The discussion of EMH above introduced inefficiencies, a key assumption underlying many StatArb models that try to exploit them. The question is how exactly such inefficiencies are identified.
StatArb - and, more narrowly, pairs trading - is centered around the concept of mean-reversion, a statistical phenomenon whereby a time series tends to fluctuate around its mean. Formally known as "regression toward the mean", it refers to the observation that the larger the deviation of a single observation from its population mean, the higher the probability that the next observation lies closer to the mean. Intuitively, the mean acts like a central tendency: the further a value deviates from it, the stronger the gravitational pull back toward it.
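As a toy illustration of this pull toward the mean (my own illustrative example, not part of the model), the sketch below simulates a mean-reverting AR(1) process and checks that observations sitting far above the mean are, on average, followed by a move back down:

```python
import numpy as np

rng = np.random.default_rng(42)

# Mean-reverting AR(1): x_t = mu + phi * (x_{t-1} - mu) + eps_t, with |phi| < 1
# pulling every observation back toward the long-run mean mu.
mu, phi, sigma, n = 0.0, 0.7, 1.0, 5000
x = np.empty(n)
x[0] = mu
for t in range(1, n):
    x[t] = mu + phi * (x[t - 1] - mu) + rng.normal(0.0, sigma)

# Regression toward the mean: conditional on x_t sitting far above mu,
# the expected next step E[x_{t+1} - x_t] = (phi - 1) * (x_t - mu) is negative.
far_above = x[:-1] > 2.0
steps_after = np.diff(x)[far_above]
print(steps_after.mean())  # negative on average: pull back toward the mean
```

The closer phi gets to 1, the weaker that pull becomes; at phi = 1 the process degenerates into a random walk with no mean to revert to.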
Normally, relevant long-short pairs are identified in a process that considers dozens of possible combinations or even entire groups of similarly-behaving assets. Common approaches include the distance approach, which leverages distance metrics such as Euclidean distance; cointegration, which seeks to find long-term equilibria; stochastic control; and others (see Krauss (2017) for a thorough review). We center our work around cointegration and focus on one specific pair for which we have fundamental reasons to believe it is cointegrated.
Cointegration is a statistical property of two or more time series that tells us whether they share a long-term equilibrium relationship. It is well-suited for pairs trading for two reasons. First, pairs trading revolves around the idea of trading similarly-behaving assets - cointegration allows us to identify such pairs in a quantitative manner. Second, models built on the premise of mean-reversion require consistent statistical properties (mean, variance etc.) to ensure model profitability throughout time. As we will see, cointegration naturally tells us whether that is the case.
Formally, two variables X and Y are said to be cointegrated if both are integrated of order d > 0 and there exists a linear combination aX + bY that is integrated of some order d' < d. The order of integration, in this context, refers to the number of times a variable needs to be differenced to make it stationary (stationarity ensures consistency in mean, variance etc.).
\( X \sim I(d) \quad \text{and} \quad Y \sim I(d), \quad d > 0 \)
\( \Delta^d X \quad \text{and} \quad \Delta^d Y \quad \text{are stationary} \)
\( aX + bY \sim I(d') \quad \text{with} \quad d' < d \)
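The definition above can be made concrete with a small simulation (illustrative only): two series that share a random walk component are each I(1), yet the right linear combination strips out the common trend and is stationary. A crude variance comparison stands in for a formal test here:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# X is a random walk, hence I(1): it becomes stationary only after differencing once.
x = np.cumsum(rng.normal(size=n))

# Y inherits X's stochastic trend plus stationary noise, so Y is I(1) as well ...
b = 0.8
y = 5.0 + b * x + rng.normal(size=n)

# ... yet the linear combination y - b*x cancels the common trend and is I(0).
spread = y - b * x

# Crude fingerprint (a real test would use ADF): a random walk's sample variance
# keeps growing with the sample, while a stationary series' variance stabilizes.
print(np.var(x))       # large, dominated by the stochastic trend
print(np.var(spread))  # small, roughly the variance of the noise
```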
Engineering a mean-reverting spread
Now that some of the prerequisites have been visited, we can proceed with the actual development of our model.
Different crude oils mainly differ in their geographical origin, density and sulfur content. Two widely quoted types of oil are WTI and Brent, with the latter serving as a widely used benchmark. These differences, along with market frictions such as transaction and transportation costs, give rise to a spread between the prices of the two (see Geyer-Klingeberg and Rathgeber (2021) for a more nuanced view). Within the context of pairs trading, a spread refers to either a) the absolute difference between two prices, b) a ratio that expresses the price of one asset in terms of another, somewhat like a relative strength measure, or c) the residuals of a linear regression that establishes a linear relationship between two variables. These different ways of engineering a spread have distinct implications. In the case of a synthetic ratio, buying (selling) the spread translates to longing (shorting) Brent and shorting (longing) WTI, effectively hedging our directional exposure in dollar terms. A truly beta-neutral implementation, on the other hand, incorporates a hedge ratio b (derived from linear regression) that establishes how many units of WTI need to be shorted (longed) per unit of longed (shorted) Brent. It will become clear later on how cointegration relates to a beta-neutral ratio.
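The three spread constructions can be sketched as follows; the price series here are purely hypothetical stand-ins for Brent and WTI, and NumPy's polyfit stands in for a full OLS regression:

```python
import numpy as np
import pandas as pd

# Hypothetical price paths standing in for Brent and WTI (not real data).
rng = np.random.default_rng(1)
common = np.cumsum(rng.normal(size=500))
brent = pd.Series(80.0 + common + rng.normal(size=500), name="brent")
wti = pd.Series(76.0 + 0.9 * common + rng.normal(size=500), name="wti")

# a) absolute difference between the two prices
diff_spread = brent - wti

# b) ratio, expressing Brent in terms of WTI (a relative strength measure)
ratio_spread = brent / wti

# c) regression residuals: the slope b is the hedge ratio, i.e. how many units
#    of WTI to short (long) per unit of longed (shorted) Brent.
b, a = np.polyfit(wti, brent, deg=1)
resid_spread = brent - (a + b * wti)  # OLS residuals have mean zero by construction
```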
As laid out above, cointegration is essential to pairs trading as it ensures a) a meaningful statistical relationship between the two assets and b) that a linear combination of the two has consistent properties that align with the concept of mean-reversion. One possible way to test for cointegration is the Engle-Granger test, which is carried out in two steps. First, determine whether X (WTI) and Y (Brent) are integrated of order d > 0; to do so, we test for stationarity using the ADF test. Per the results below, we cannot reject the null hypothesis of a unit root for either WTI or Brent, indicating they are likely non-stationary processes. Second, regress Y on X using OLS and determine whether the residuals are stationary, again using the ADF test. Per the results, our residuals are stationary, leaving us with a spread that aligns with the principle of mean-reversion.
Augmented Dickey-Fuller Test Results

| Metric               | Brent   | WTI     | Residuals (Spread) |
|----------------------|---------|---------|--------------------|
| ADF statistic        | -2.3297 | -2.5596 | -3.9713            |
| p-value              | 0.1626  | 0.1016  | 0.0016             |
| Critical value (1%)  | -3.4319 | -3.4319 | -3.4319            |
| Critical value (5%)  | -2.8622 | -2.8622 | -2.8622            |
| Critical value (10%) | -2.5671 | -2.5671 | -2.5671            |
Taken together, these results suggest that WTI and Brent are indeed cointegrated. More intuitively, they tell us that even if Brent and WTI decouple in the short run, they eventually end up in the same place in the long run. Additionally, we can infer that the residuals of our regression are stationary and, by the same token, that our beta-neutral spread is stationary.
We proceed with modelling by extracting these residuals (the spread) from our linear regression. We normalize the scale by calculating Z-scores, thereby ensuring that values are expressed in terms of standard deviations from the mean. As the plot below shows, our time series does not trend, aligning with the property of stationarity.
As implied above, our aim is to buy the spread (long Brent; short WTI) or sell the spread (short Brent; long WTI) when it significantly deviates from its own mean. We need a quantifiable way to decide whether to buy or sell the spread (i.e. we need to construct an entry model). To do so, we calculate Z-scores based on rolling exponential moving averages and rolling standard deviations, which allows for increased adaptability across different volatility regimes. Further, we set thresholds at Z-scores of 1 and 1.5 standard deviations to generate buy and sell signals. The first plot below visualizes when those signals are flashed; the second plot shows how they translate to our spread (the one we would actually trade).
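A minimal sketch of such an entry model, assuming a hypothetical stationary spread and using +/- 1.5 sigma as the single illustrative entry threshold (the window length is likewise an assumption):

```python
import numpy as np
import pandas as pd

# Hypothetical stationary spread (in practice: the regression residuals).
rng = np.random.default_rng(3)
vals = np.zeros(750)
for t in range(1, 750):
    vals[t] = 0.9 * vals[t - 1] + rng.normal()
spread = pd.Series(vals)

# Z-score against a rolling EWM mean and a rolling standard deviation so the
# entry thresholds adapt to the prevailing volatility regime.
window = 60
ewm_mean = spread.ewm(span=window, adjust=False).mean()
roll_std = spread.rolling(window).std()
z = (spread - ewm_mean) / roll_std

# Entry signals: sell the spread (short Brent / long WTI) when it is stretched
# upward, buy the spread (long Brent / short WTI) when stretched downward.
signal = pd.Series(0, index=spread.index)
signal[z > 1.5] = -1
signal[z < -1.5] = 1
```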
Limitations and Improvements
The results look promising, given the simplistic nature of our model. A great number of short-term swings are captured, indicating that our model could be of use. However, there are various factors we would need to consider before developing a strategy based on it. First, our lookback period is far too short: while our model covers 3.5 years' worth of data, models in a real-world setting require 10+ years of data to be considered significant. Second, models intended for real-world use need to be tested. This is usually done by splitting the data into two parts, using the former (in-sample data) to build the model and the latter (out-of-sample data) to test how the model performs when given new data. Third, costs associated with trading (transaction costs, slippage, rolling fees, market impact) need to be considered in order to render a realistic picture of model profitability. Fourth, in live trading - as the name implies - data is computed in real time. Our model uses daily closing prices, meaning no intraday movements are considered. This could lead to situations in which our model does not flash buy/sell signals, or simply does so too late, thereby limiting profitability.
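The in-sample/out-of-sample split mentioned above might look like this in code; the spread series and the 70/30 split ratio are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical daily spread series; a real-world model would want 10+ years of data.
rng = np.random.default_rng(5)
vals = np.zeros(1000)
for t in range(1, 1000):
    vals[t] = 0.95 * vals[t - 1] + rng.normal()
spread = pd.Series(vals)

# Chronological split (never shuffle a time series): the first 70% builds the
# model, the untouched final 30% tests it on data it has never seen.
split = int(len(spread) * 0.7)
in_sample, out_of_sample = spread.iloc[:split], spread.iloc[split:]

# Parameters (here simply mean and std for z-scoring) come from in-sample only ...
mu, sigma = in_sample.mean(), in_sample.std()

# ... and are applied unchanged out-of-sample to gauge how the model generalizes.
z_oos = (out_of_sample - mu) / sigma
```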
With these things in mind, this piece of work serves as a great foundation for future explorations in relative value and statistical arbitrage.
References
Geyer-Klingeberg, J., & Rathgeber, A. W. (2021). Determinants of the WTI-Brent price spread revisited. The Journal of Futures Markets, 41(5), 736-757. https://doi.org/10.1002/fut.22184
Huberman, G., & Regev, T. (2001). Contagious Speculation and a Cure for Cancer: A Nonevent That Made Stock Prices Soar. The Journal of Finance, 56(1), 387-396. https://www.jstor.org/stable/222474
Krauss, C. (2017). Statistical Arbitrage Pairs Trading Strategies: Review and Outlook. Journal of Economic Surveys, 31(2), 513-545. https://doi.org/10.1111/joes.12153