Hurst exponent with OHLC (#18)

lsbardel · web-flow · commit 33f353f5f601 · 2024-12-30T19:53:49.000Z
diff --git a/notebooks/api/ta/index.rst b/notebooks/api/ta/index.rst
@@ -8,3 +8,4 @@ Timeseries Analysis
     :maxdepth: 1
 
     ohlc
+    paths
diff --git a/notebooks/api/ta/paths.rst b/notebooks/api/ta/paths.rst
@@ -2,7 +2,7 @@
 Paths
 ===========
 
-.. module:: quantflow.utils.paths
+.. module:: quantflow.ta.paths
 
 .. autoclass:: Paths
    :members:
diff --git a/notebooks/api/utils/index.rst b/notebooks/api/utils/index.rst
@@ -7,5 +7,4 @@ Utils
 .. toctree::
     :maxdepth: 1
 
-    paths
     marginal1d
diff --git a/notebooks/applications/hurst.md b/notebooks/applications/hurst.md
@@ -6,38 +6,55 @@ jupytext:
     format_version: 0.13
     jupytext_version: 1.16.6
 kernelspec:
-  display_name: Python 3 (ipykernel)
+  display_name: .venv
   language: python
   name: python3
 ---
 
 # Hurst Exponent
 
-The [Hurst exponent](https://en.wikipedia.org/wiki/Hurst_exponent) is used as a measure of long-term memory of time series. It relates to the autocorrelations of the time series, and the rate at which these decrease as the lag between pairs of values increases.
-
+The [Hurst exponent](https://en.wikipedia.org/wiki/Hurst_exponent) is used as a measure of long-term memory of time series. It relates to the auto-correlations of the time series, and the rate at which these decrease as the lag between pairs of values increases.
 It is a statistics which can be used to test if a time-series is mean reverting or it is trending.
+
+The idea idea behind the Hurst exponent is that if the time-series $x_t$ follows a Brownian motion (aka Weiner process), than variance between two time points will increase linearly with the time difference. that is to say
+
+\begin{align}
+  \text{Var}(x_{t_2} - x_{t_1}) &\propto t_2 - t_1 \\
+  &\propto \Delta t^{2H}\\
+  H &= 0.5
+\end{align}
+
+where $H$ is the Hurst exponent.
+
 Trending time-series have a Hurst exponent H > 0.5, while mean reverting time-series have H < 0.5. Understanding in which regime a time-series is can be useful for trading strategies.
 
+These are some references to understand the Hurst exponent and its applications:
+
 * [Hurst Exponent for Algorithmic Trading](https://robotwealth.com/demystifying-the-hurst-exponent-part-1/)
+* [Basics of Statistical Mean Reversion Testing](https://www.quantstart.com/articles/Basics-of-Statistical-Mean-Reversion-Testing/)
 
 ## Study with the Weiner Process
 
-We want to construct a mechanism to estimate the Hurst exponent via OHLC data because it is widely available from data provider and easily constructed as an online signal during trading.
+We want to construct a mechanism to estimate the Hurst exponent via OHLC data because it is widely available from data providers and easily constructed as an online signal during trading.
 
 In order to evaluate results against known solutions, we consider the Weiner process as generator of timeseries.
 
 The Weiner process is a continuous-time stochastic process named in honor of Norbert Wiener. It is often also called Brownian motion due to its historical connection with the physical model of Brownian motion of particles in water, named after the botanist Robert Brown.
 
+We use the **WeinerProcess** from the stochastic process library and sample one path over a time horizon of 1 (day) with a time step every second.
+
 ```{code-cell} ipython3
 from quantflow.sp.weiner import WeinerProcess
-from quantflow.utils.dates import start_of_day
-p = WeinerProcess(sigma=0.5)
-paths = p.sample(1, 1, 24*60*60)
-paths.plot()
+p = WeinerProcess(sigma=2.0)
+paths = p.sample(n=1, time_horizon=1, time_steps=24*60*60)
+paths.plot(title="A path of Weiner process with sigma=2.0")
 ```
 
+In order to down-sample the timeseries, we need to convert it into a dataframe with dates as indices.
+
 ```{code-cell} ipython3
-df = paths.as_datetime_df(start=start_of_day()).reset_index()
+from quantflow.utils.dates import start_of_day
+df = paths.as_datetime_df(start=start_of_day(), unit="d").reset_index()
 df
 ```
 
@@ -50,9 +67,19 @@ The value should be close to the **sigma** of the WeinerProcess defined above.
 float(paths.paths_std(scaled=True)[0])
 ```
 
-### Range-base Variance estimators
+The evaluation of the hurst exponent is done by calculating the variance for several time windows and by fitting a line to the log-log plot of the variance vs the time window.
 
-We now turn our attention to range-based volatility estimators. These estimators depends on OHLC timeseries, which are widely available from data providers such as [FMP](https://site.financialmodelingprep.com/).
+```{code-cell} ipython3
+paths.hurst_exponent()
+```
+
+As expected, the Hurst exponent should be close to 0.5, since we have calculated the exponent from the paths of a Weiner process.
+
++++
+
+### Range-based Variance Estimators
+
+We now turn our attention to range-based variance estimators. These estimators depends on OHLC timeseries, which are widely available from data providers such as [FMP](https://site.financialmodelingprep.com/).
 To analyze range-based variance estimators, we use he **quantflow.ta.OHLC** tool which allows to down-sample a timeserie to OHLC series and estimate variance with three different estimators
 
 * **Parkinson** (1980)
@@ -65,14 +92,111 @@ For this we build an OHLC estimator as template and use it to create OHLC estima
 
 ```{code-cell} ipython3
 import pandas as pd
+import polars as pl
+import math
 from quantflow.ta.ohlc import OHLC
-ohlc = OHLC(serie="0", period="10m", rogers_satchell_variance=True, parkinson_variance=True, garman_klass_variance=True)
+template = OHLC(serie="0", period="10m", rogers_satchell_variance=True, parkinson_variance=True, garman_klass_variance=True)
+seconds_in_day = 24*60*60
+
+def rstd(pdf: pl.Series, range_seconds: float) -> float:
+    """Calculate the standard deviation from a range-based variance estimator"""
+    variance = pdf.mean()
+    # scale the variance by the number of seconds in the period
+    variance = seconds_in_day * variance / range_seconds
+    return math.sqrt(variance)
 
 results = []
-for period in ("2m", "5m", "10m", "30m", "1h", "4h"):
-    operator = ohlc.model_copy(update=dict(period=period))
-    result = operator(df).sum()
-    results.append(dict(period=period, pk=result["0_pk"].item(), gk=result["0_gk"].item(), rs=result["0_rs"].item()))
-vdf = pd.DataFrame(results)
+for period in ("10s", "20s", "30s", "1m", "2m", "3m", "5m", "10m", "30m"):
+    ohlc = template.model_copy(update=dict(period=period))
+    rf = ohlc(df)
+    ts = pd.to_timedelta(period).to_pytimedelta().total_seconds()
+    data = dict(period=period)
+    for name in ("pk", "gk", "rs"):
+        estimator = rf[f"0_{name}"]
+        data[name] = rstd(estimator, ts)
+    results.append(data)
+vdf = pd.DataFrame(results).set_index("period")
 vdf
 ```
+
+These numbers are different from the realized variance because they are based on the range of the prices, not on the actual prices. The realized variance is a more direct measure of the volatility of the process, while the range-based estimators are more robust to market microstructure noise.
+
+The Parkinson estimator is always higher than both the Garman-Klass and Rogers-Satchell estimators, the reason is due to the use of the high and low prices only, which are always further apart than the open and close prices. The GK and RS estimators are similar and are more accurate than the Parkinson estimator, especially for greater periods.
+
+```{code-cell} ipython3
+pd.options.plotting.backend = "plotly"
+fig = vdf.plot(markers=True, title="Weiner Standard Deviation from Range-based estimators - correct value is 2.0")
+fig.show()
+```
+
+To estimate the Hurst exponent with the range-based estimators, we calculate the variance of the log of the range for different time windows and fit a line to the log-log plot of the variance vs the time window.
+
+```{code-cell} ipython3
+from typing import Sequence
+import numpy as np
+from quantflow.ta.ohlc import OHLC
+from collections import defaultdict
+from quantflow.ta.base import DataFrame
+
+default_periods = ("10s", "20s", "30s", "1m", "2m", "3m", "5m", "10m", "30m")
+
+def ohlc_hurst_exponent(
+    df: DataFrame,
+    series: Sequence[str],
+    periods: Sequence[str] = default_periods,
+) -> DataFrame:
+    results = {}
+    estimator_names = ("pk", "gk", "rs")
+    for serie in series:
+        template = OHLC(
+            serie=serie,
+            period="10m",
+            rogers_satchell_variance=True,
+            parkinson_variance=True,
+            garman_klass_variance=True
+        )
+        time_range = []
+        estimators = defaultdict(list)
+        for period in periods:
+            ohlc = template.model_copy(update=dict(period=period))
+            rf = ohlc(df)
+            ts = pd.to_timedelta(period).to_pytimedelta().total_seconds()
+            time_range.append(ts)
+            for name in estimator_names:
+                estimators[name].append(rf[f"{serie}_{name}"].mean())
+        results[serie] = [float(np.polyfit(np.log(time_range), np.log(estimators[name]), 1)[0])/2.0 for name in estimator_names]
+    return pd.DataFrame(results, index=estimator_names)
+```
+
+```{code-cell} ipython3
+ohlc_hurst_exponent(df, series=["0"])
+```
+
+The Hurst exponent should be close to 0.5, since we have calculated the exponent from the paths of a Weiner process. But the Hurst exponent is not exactly 0.5 because the range-based estimators are not the same as the realized variance. Interestingly, the Parkinson estimator gives a Hurst exponent closer to 0.5 than the Garman-Klass and Rogers-Satchell estimators.
+
+## Mean Reverting Time Series
+
+We now turn our attention to mean reverting time series, where the Hurst exponent is less than 0.5.
+
+```{code-cell} ipython3
+from quantflow.sp.ou import Vasicek
+import pandas as pd
+pd.options.plotting.backend = "plotly"
+
+p = Vasicek(kappa=2)
+paths = {f"kappa={k}": Vasicek(kappa=k).sample(n=1, time_horizon=1, time_steps=24*60*6) for k in (1.0, 10.0, 50.0, 100.0, 500.0)}
+pdf = pd.DataFrame({k: p.path(0) for k, p in paths.items()}, index=paths["kappa=1.0"].dates(start=start_of_day()))
+pdf.plot()
+```
+
+We can now estimate the Hurst exponent from the realized variance. As we can see the Hurst exponent decreases as we increase the mean reversion parameter.
+
+```{code-cell} ipython3
+pd.DataFrame({k: [p.hurst_exponent()] for k, p in paths.items()})
+```
+
+And we can also estimate the Hurst exponent from the range-based estimators. As we can see the Hurst exponent decreases as we increase the mean reversion parameter along the same lines as the realized variance.
+
+```{code-cell} ipython3
+ohlc_hurst_exponent(pdf.reset_index(), list(paths), periods=("10m", "20m", "30m", "1h"))
+```
diff --git a/quantflow/sp/base.py b/quantflow/sp/base.py
@@ -7,9 +7,9 @@
 from pydantic import BaseModel, ConfigDict, Field
 from scipy.optimize import Bounds
 
+from quantflow.ta.paths import Paths
 from quantflow.utils.marginal import Marginal1D, default_bounds
 from quantflow.utils.numbers import sigfig
-from quantflow.utils.paths import Paths
 from quantflow.utils.transforms import lower_bound, upper_bound
 from quantflow.utils.types import FloatArray, FloatArrayLike, Vector
 
diff --git a/quantflow/sp/bns.py b/quantflow/sp/bns.py
@@ -4,7 +4,7 @@
 from pydantic import Field
 from scipy.special import xlogy
 
-from ..utils.paths import Paths
+from ..ta.paths import Paths
 from ..utils.types import FloatArrayLike, Vector
 from .base import Im, StochasticProcess1D
 from .ou import GammaOU
diff --git a/quantflow/sp/cir.py b/quantflow/sp/cir.py
@@ -7,7 +7,7 @@
 
 from quantflow.utils.types import FloatArrayLike, Vector
 
-from ..utils.paths import Paths
+from ..ta.paths import Paths
 from .base import Im, IntensityProcess
 
 
diff --git a/quantflow/sp/heston.py b/quantflow/sp/heston.py
@@ -3,8 +3,8 @@
 import numpy as np
 from pydantic import Field
 
+from ..ta.paths import Paths
 from ..utils.distributions import DoubleExponential, Exponential
-from ..utils.paths import Paths
 from ..utils.types import FloatArrayLike, Vector
 from .base import StochasticProcess1D
 from .cir import CIR
diff --git a/quantflow/sp/jump_diffusion.py b/quantflow/sp/jump_diffusion.py
@@ -5,8 +5,8 @@
 import numpy as np
 from pydantic import Field
 
+from ..ta.paths import Paths
 from ..utils.distributions import Normal
-from ..utils.paths import Paths
 from ..utils.types import FloatArrayLike, Vector
 from .base import StochasticProcess1D
 from .poisson import CompoundPoissonProcess, D
diff --git a/quantflow/sp/ou.py b/quantflow/sp/ou.py
@@ -7,8 +7,8 @@
 from scipy.optimize import Bounds
 from scipy.stats import gamma, norm
 
+from ..ta.paths import Paths
 from ..utils.distributions import Exponential
-from ..utils.paths import Paths
 from ..utils.types import Float, FloatArrayLike, Vector
 from .base import Im, IntensityProcess
 from .poisson import CompoundPoissonProcess, D
diff --git a/quantflow/sp/poisson.py b/quantflow/sp/poisson.py
@@ -7,9 +7,9 @@
 from scipy.optimize import Bounds
 from scipy.stats import poisson
 
+from ..ta.paths import Paths
 from ..utils.distributions import Distribution1D
 from ..utils.functions import factorial
-from ..utils.paths import Paths
 from ..utils.transforms import TransformResult
 from ..utils.types import FloatArray, FloatArrayLike, Vector
 from .base import Im, StochasticProcess1D, StochasticProcess1DMarginal
diff --git a/quantflow/sp/weiner.py b/quantflow/sp/weiner.py
@@ -4,7 +4,7 @@
 from pydantic import Field
 from scipy.stats import norm
 
-from ..utils.paths import Paths
+from ..ta.paths import Paths
 from ..utils.types import FloatArrayLike, Vector
 from .base import StochasticProcess1D
 
diff --git a/quantflow/ta/paths.py b/quantflow/ta/paths.py
@@ -9,10 +9,10 @@
 from pydantic import BaseModel, Field
 from scipy.integrate import cumulative_trapezoid
 
-from . import plot
-from .bins import pdf as bins_pdf
-from .dates import utcnow
-from .types import FloatArray
+from quantflow.utils import plot
+from quantflow.utils.bins import pdf as bins_pdf
+from quantflow.utils.dates import utcnow
+from quantflow.utils.types import FloatArray
 
 
 class Paths(BaseModel, arbitrary_types_allowed=True):
@@ -58,6 +58,10 @@ def ys(self) -> list[list[float]]:
         """Paths as list of list (for visualization tools)"""
         return self.data.transpose().tolist()  # type: ignore
 
+    def path(self, i: int) -> FloatArray:
+        """Path i"""
+        return self.data[:, i]
+
     def dates(
         self, *, start: datetime | None = None, unit: str = "d"
     ) -> pd.DatetimeIndex:
@@ -116,6 +120,22 @@ def integrate(self) -> Paths:
             data=cumulative_trapezoid(self.data, dx=self.dt, axis=0, initial=0),
         )
 
+    def hurst_exponent(self, steps: int | None = None) -> float:
+        """Estimate the Hurst exponent from all paths
+
+        :param steps: number of lags to consider, if not provided it uses
+            half of the time steps capped at 100
+        """
+        ts = self.time_steps // 2
+        n = min(steps or ts, 100)
+        lags = []
+        tau = []
+        for lag in range(2, n):
+            variances = np.var(self.data[lag:, :] - self.data[:-lag, :], axis=0)
+            tau.extend(variances)
+            lags.extend([lag] * self.samples)
+        return float(np.polyfit(np.log(lags), np.log(tau), 1)[0]) / 2.0
+
     def cross_section(self, t: float | None = None) -> FloatArray:
         """Cross section of paths at time t"""
         index = self.time_steps
diff --git a/quantflow_tests/test_utils.py b/quantflow_tests/test_utils.py
@@ -1,7 +1,7 @@
 import numpy as np
 
+from quantflow.ta.paths import Paths
 from quantflow.utils.numbers import round_to_step, to_decimal
-from quantflow.utils.paths import Paths
 
 
 def test_round_to_step():

Original file line number	Diff line number	Diff line change
`@@ -8,3 +8,4 @@ Timeseries Analysis`
`8`	`8`	`:maxdepth: 1`
`9`	`9`
`10`	`10`	`ohlc`
	`11`	`+ paths`