Notes on Hierarchical Risk Parity

If you scroll quantitative finance Twitter for any length of time you will see them: 3D mountain landscapes labeled with asset tickers and days, surfaces that wave and ridge along one axis with sharp drops between ridges. The plot title is almost always some variant of "Hierarchical Risk Parity — Sorted Cumulative % Change."

That image is not arbitrary. It is the single most recognizable visualization in a genre that formed around 2018, when Marcos López de Prado's HRP paper hit the Python tutorial ecosystem and the mlfinlab / Riskfolio-Lib / PyPortfolioOpt libraries shipped the plotting functions out of the box. The genre has deeper roots — Markowitz 1952, Mantegna 1999, Laloux–Bouchaud 1999 — but the "render-a-3D-mountain-out-of-a-covariance-matrix" aesthetic is genuinely a post-2016 phenomenon, downstream of HRP specifically.

This article is about what those landscapes actually show, why they exist, and what mathematics turns a sample covariance matrix into a clean block of cluster-sorted assets that you can stare at and recognize industries in.

Markowitz's curse

Markowitz (1952) framed portfolio choice as a quadratic program: given a covariance matrix $\Sigma$ and an expected-return vector $\mu$ , find weights $w$ that minimize variance subject to a target return:

\min_w \tfrac{1}{2} w^\top \Sigma w \quad \text{s.t.}\quad w^\top \mu = \mu^\star, \ w^\top \mathbf{1} = 1.

With only these two equality constraints the solution is closed-form: $w^\star \propto \Sigma^{-1}(\mu - \lambda \mathbf{1})$. Adding inequality constraints such as long-only bounds requires the Critical Line Algorithm (Markowitz 1956), a finite combinatorial walk along the efficient frontier.
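As a minimal sketch of the equality-constrained case, assuming NumPy and a made-up three-asset covariance and return vector (the function name `markowitz_equality` and all numbers are illustrative, not taken from Markowitz), the weights can be obtained by solving the KKT linear system directly:

```python
import numpy as np

def markowitz_equality(cov, mu, target):
    """Minimum-variance weights subject to w @ mu == target and w.sum() == 1."""
    n = len(mu)
    ones = np.ones(n)
    # KKT matrix: stationarity rows on top, the two equality constraints below.
    kkt = np.zeros((n + 2, n + 2))
    kkt[:n, :n] = cov
    kkt[:n, n] = mu
    kkt[:n, n + 1] = ones
    kkt[n, :n] = mu
    kkt[n + 1, :n] = ones
    rhs = np.concatenate([np.zeros(n), [target, 1.0]])
    return np.linalg.solve(kkt, rhs)[:n]

# Hypothetical inputs, chosen only so that the system is well conditioned.
cov = np.array([[0.04, 0.01, 0.00],
                [0.01, 0.09, 0.02],
                [0.00, 0.02, 0.16]])
mu = np.array([0.05, 0.08, 0.12])
w = markowitz_equality(cov, mu, target=0.08)
print(w, w @ mu, w.sum())
```

Stationarity forces $\Sigma w$ into the span of $\mu$ and $\mathbf{1}$, which is the same statement as the closed form above.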

In sample, this works. Out of sample, it falls over.

The failure has a name — the Markowitz curse — and three coupled mechanisms:

Condition-number explosion. The sample covariance from $T$ observations of $N$ assets has rank at most $\min(N, T-1)$. As $N$ approaches $T$ the smallest eigenvalue collapses toward zero and $\kappa(\hat\Sigma) = \lambda_\text{max}/\lambda_\text{min} \to \infty$. The condition number controls how much estimation noise the inverse amplifies: a $\kappa$ of 1000 (routine for a 200-asset book on a single year of daily data) can turn a 0.1% relative error in $\hat\Sigma$ into a relative error of up to 100% in $\hat\Sigma^{-1}$, which is why the optimal portfolio routinely produces individual weights above 100% of capital.
Error maximization. Michaud (1989) diagnosed mean-variance optimization as a literal error-maximizer: it loads on assets whose estimated $\hat\mu_i$ or $\hat\sigma_i^{-1}$ is a noisy upward draw. The optimizer cannot tell signal from noise; it can only tell large numbers from small ones.
Weight concentration. Without bounds, optimal weights routinely hit $\pm 1000\%$ . With long-only constraints, they peg most assets to zero and concentrate on a handful of corners.
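The rank statement in the first mechanism is easy to check numerically. The sketch below assumes NumPy and pure-noise Gaussian returns with more assets than days (the sizes 20 and 10 are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n_assets, n_obs = 20, 10                             # more assets than observations
returns = rng.standard_normal((n_assets, n_obs))     # rows = assets, columns = days

cov = np.cov(returns)                  # 20 x 20 sample covariance (rows are variables)
rank = np.linalg.matrix_rank(cov)      # demeaning costs one degree of freedom
eigvals = np.linalg.eigvalsh(cov)      # ascending order

print("rank:", rank, "of", n_assets)
print("smallest eigenvalue:", eigvals[0])
```

With the smallest eigenvalues numerically zero, the matrix cannot be inverted at all, which is the limiting case of the condition-number blow-up.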

López de Prado ("The 10 Reasons Most Machine Learning Funds Fail", JPM 44(6), 2018, SSRN 3104816) tightens the diagnosis: the weight error from a perturbation $\Delta\Sigma$ scales as $\|\Sigma^{-1}\|_2^2 \cdot \|\Delta\Sigma\| = \lambda_\text{min}^{-2}\|\Delta\Sigma\|$, which is quadratic in $1/\lambda_\text{min}$ and therefore in the condition number for a fixed $\lambda_\text{max}$. Every approach in this article is, in one way or another, a way of not inverting $\hat\Sigma$ directly.

The pre-HRP toolkit was three things. Linear shrinkage (Ledoit & Wolf 2004) blends $\hat\Sigma$ with a structured target. Resampled efficient frontier (Michaud 1998) bootstraps the inputs and averages the outputs. Black–Litterman (1991) replaces $\hat\mu$ with the Bayesian posterior of a CAPM equilibrium prior plus subjective views. All three help. None of them avoids $\Sigma^{-1}$.

Figure 1.1. Covariance inversion becomes numerically lethal well before the q = 1 rank-deficiency wall.

The Marchenko–Pastur condition-number bound $\kappa(q) = \bigl((1+\sqrt q)/(1-\sqrt q)\bigr)^2$. Inversion noise scales as $\kappa$ and weight error as $\kappa^2$, so the curve enters its "lethal" regime ($\kappa > 100$, shaded amber) at $q \approx 0.67$, well before the rank-deficiency wall at $q = 1$. A retail book with $q = 0.3$ already faces $\kappa \approx 12$.
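The threshold quoted in the caption follows from inverting the bound. A small sketch using only the standard library (the function names are illustrative):

```python
import math

def mp_condition_number(q):
    """Marchenko-Pastur edge ratio lambda_plus / lambda_minus for 0 < q < 1."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    root = math.sqrt(q)
    return ((1.0 + root) / (1.0 - root)) ** 2

def q_at_condition_number(kappa):
    """Invert the bound: the aspect ratio q at which kappa(q) equals kappa."""
    s = math.sqrt(kappa)
    return ((s - 1.0) / (s + 1.0)) ** 2

lethal_q = q_at_condition_number(100.0)   # (9/11)^2
print(round(lethal_q, 3))
```

Setting $(1+\sqrt q)/(1-\sqrt q) = 10$ gives $\sqrt q = 9/11$, so $q = 81/121 \approx 0.67$.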


How much does Markowitz actually cost? Out-of-sample variance is the standard yardstick. López de Prado's Monte Carlo benchmark (2016) runs $N = 10$ assets, $T = 520$ days, block-correlation structure: the Critical Line Algorithm, the literal minimum-variance optimizer, comes in at variance $\approx 0.41$ . Inverse-variance allocation: $\approx 0.24$ . Equal-weight 1/N: $\approx 0.22$ . The dedicated optimizer is the worst of the three. The bar to clear is low.

Marchenko–Pastur and the noise spectrum

Random matrix theory tells you, quantitatively, how much of your sample covariance is signal versus noise. Take an $N \times T$ matrix $X$ of standardized returns drawn from independent $\mathcal{N}(0, 1)$ — pure noise, no structure. The eigenvalue density of $\hat\Sigma = \tfrac{1}{T}X X^\top$ , in the limit $N, T \to \infty$ with $q = N/T$ fixed, converges to the Marchenko–Pastur distribution:

\rho_\text{MP}(\lambda) = \frac{1}{2\pi q \lambda}\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)},\qquad \lambda_\pm = (1 \pm \sqrt{q})^2.

For $q = 0.3$ — a year of daily returns ( $T \approx 250$ ) on $N = 75$ assets — the bulk is supported on $[0.2, 2.4]$ .
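A quick numerical check of the edges and of the density's normalization, assuming NumPy (the helper names `mp_edges` and `mp_density` are illustrative):

```python
import numpy as np

def mp_edges(q):
    """Lower and upper edge of the Marchenko-Pastur bulk for unit-variance entries."""
    return (1.0 - np.sqrt(q)) ** 2, (1.0 + np.sqrt(q)) ** 2

def mp_density(lam, q):
    """Marchenko-Pastur density on an array of eigenvalues (zero outside the bulk)."""
    lo, hi = mp_edges(q)
    lam = np.asarray(lam, dtype=float)
    inside = (lam > lo) & (lam < hi)
    rho = np.zeros_like(lam)
    x = lam[inside]
    rho[inside] = np.sqrt((hi - x) * (x - lo)) / (2.0 * np.pi * q * x)
    return rho

q = 0.3
lo, hi = mp_edges(q)
grid = np.linspace(lo, hi, 200_001)
dens = mp_density(grid, q)
# Trapezoid rule written out by hand to avoid depending on a specific NumPy version.
mass = np.sum(0.5 * (dens[1:] + dens[:-1]) * np.diff(grid))
print(round(lo, 2), round(hi, 2), round(mass, 4))
```

For $q < 1$ the density integrates to one over $[\lambda_-, \lambda_+]$, so the whole noise spectrum sits inside the band.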

Note: Why MP applies to financial returns at all

Financial returns are not iid Gaussian: they are fat-tailed, autocorrelated, and conditionally heteroskedastic. The MP density was derived under iid Gaussian assumptions, so the natural objection is that the result should not transfer. The answer is universality: for independent entries, the limiting bulk distribution depends only on the entries having finite variance, not on their specific distribution. This is the same phenomenon that makes the central limit theorem work for non-Gaussian summands. For heavy-tailed returns the bulk position shifts mildly, but the band structure $[\lambda_-, \lambda_+]$ remains a reliable noise/signal separator.

Laloux, Cizeau, Bouchaud, and Potters (1999, Phys. Rev. Lett. 83:1467, arXiv:cond-mat/9810255) measured the spectrum of the S&P 500 sample correlation matrix and found that the bulk of the empirical eigenvalues is well described by this distribution. The signal is in a handful of outliers above $\lambda_+$:

The single largest eigenvalue $\lambda_1 \approx 25$ — an order of magnitude above the MP edge — with an eigenvector that has all-positive entries roughly proportional to market cap. This is the market mode.
The next 5–10 outliers correspond to sector modes (energy, financials, tech).
The remaining $\sim$ 90% of the spectrum is statistically indistinguishable from random.

Figure 2.1. The Marchenko–Pastur bulk separates random eigenvalue noise from four planted market and sector signals.

Eigenvalue spectrum of a simulated $100 \times 100$ sample correlation matrix with $q = 0.4$ against the Marchenko–Pastur density $\rho_\text{MP}(\lambda)$. The MP bulk $[\lambda_-, \lambda_+] \approx [0.14, 2.67]$ is shaded and contains 96 of 100 empirical eigenvalues, which are statistically indistinguishable from random. Four outliers carry signal: the all-positive market mode and three sector modes. The log $x$-axis is necessary because the market mode sits an order of magnitude above the noise edge.


The practical consequence: if you naively invert $\hat\Sigma$ to get a minimum-variance portfolio, you are inverting a matrix that is 90% noise, and the inversion blows up the noise quadratically. The fix is to clean before optimizing.

Eigenvalue clipping (Laloux et al. 1999) keeps the $k$ outlier eigenvalues, replaces the bulk with a constant chosen to preserve $\operatorname{tr}\hat\Sigma$, and reconstructs. The Rotationally Invariant Estimator (Bun, Bouchaud, and Potters, arXiv:1610.08104, 2017) is the provably optimal nonlinear shrinkage among estimators that preserve the eigenvectors of $\hat\Sigma$: each eigenvalue gets replaced by a function $\xi(\lambda)$ derived from the Stieltjes transform of the empirical spectral density.
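A minimal sketch of eigenvalue clipping, assuming NumPy, a correlation matrix as input, and the Marchenko–Pastur upper edge as the noise threshold. The function name `clip_eigenvalues` and the synthetic one-factor data are illustrative; this is not the RIE.

```python
import numpy as np

def clip_eigenvalues(corr, q):
    """Flatten every eigenvalue at or below the MP upper edge to their common mean."""
    lam_plus = (1.0 + np.sqrt(q)) ** 2
    vals, vecs = np.linalg.eigh(corr)          # ascending eigenvalues, orthonormal vectors
    noise = vals <= lam_plus
    cleaned = vals.copy()
    if noise.any():
        cleaned[noise] = vals[noise].mean()    # replacing by the mean preserves the trace
    return (vecs * cleaned) @ vecs.T           # V diag(cleaned) V^T

# Synthetic data (an assumption): one common market factor plus idiosyncratic noise.
rng = np.random.default_rng(1)
n_assets, n_obs = 50, 125                      # q = 0.4
market = rng.standard_normal((1, n_obs))
returns = 0.5 * market + rng.standard_normal((n_assets, n_obs))
corr = np.corrcoef(returns)
cleaned = clip_eigenvalues(corr, q=n_assets / n_obs)

print("condition number before:", np.linalg.cond(corr))
print("condition number after: ", np.linalg.cond(cleaned))
```

The reconstructed matrix keeps the original eigenvectors and trace, but its diagonal is no longer exactly one, so a final rescaling to unit diagonal is a common extra step that this sketch omits.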

Both methods can substantially reduce out-of-sample portfolio variance relative to the naive approach.

This is the foundation. Now we change strategy: instead of cleaning $\hat\Sigma$ and inverting the result, what if we never invert at all?

Mantegna and the hidden hierarchy

Mantegna ("Hierarchical structure in financial markets", Eur. Phys. J. B 11 (1999), arXiv:cond-mat/9802256) introduced the trick that everything since rests on. Define a distance between assets directly from their correlation:

d_{ij} = \sqrt{2(1 - \rho_{ij})}.

Aside

This is not arbitrary. For demeaned return vectors rescaled to unit length, $\hat r_i = (r_i - \bar r_i)/\|r_i - \bar r_i\|$, the correlation is the inner product $\rho_{ij} = \hat r_i \cdot \hat r_j$, and a direct expansion gives $\|\hat r_i - \hat r_j\|^2 = 2(1 - \rho_{ij})$, so $d_{ij} = \|\hat r_i - \hat r_j\|$ exactly. Euclidean distance on $\mathbb{R}^T$ is a metric (non-negative, symmetric, and satisfying the triangle inequality), so $d_{ij}$ inherits all three properties for free. Pairs that move together have $\rho \approx 1$ and $d \approx 0$; uncorrelated pairs have $\rho \approx 0$ and $d \approx \sqrt 2$; anti-correlated pairs have $\rho \approx -1$ and $d \approx 2$.
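The identity can be verified numerically. The sketch assumes NumPy and random Gaussian returns, and it demeans each series before normalizing so that the dot product matches the Pearson correlation:

```python
import numpy as np

rng = np.random.default_rng(42)
returns = rng.standard_normal((5, 250))              # 5 assets, 250 days (synthetic)

centered = returns - returns.mean(axis=1, keepdims=True)
unit = centered / np.linalg.norm(centered, axis=1, keepdims=True)

rho = np.corrcoef(returns)
# Clip guards against tiny negative values on the diagonal from rounding.
d_formula = np.sqrt(np.clip(2.0 * (1.0 - rho), 0.0, None))
# Pairwise Euclidean distances between the unit vectors.
d_euclid = np.linalg.norm(unit[:, None, :] - unit[None, :, :], axis=2)

print(np.abs(d_formula - d_euclid).max())
```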

Mantegna then computed the minimum spanning tree of the resulting weighted graph: the set of $N-1$ edges that connects all $N$ assets with the smallest total distance. The MST extracts the strongest correlations and discards the rest. On equity universes the MST is recognizable: megacap names are hubs, sector members cluster around their hubs, inter-sector edges hop hub-to-hub.
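A compact way to see the tree is Prim's algorithm on the distance matrix. This is a naive sketch assuming NumPy and a made-up four-asset correlation matrix with two obvious pairs; it is not Mantegna's original code.

```python
import numpy as np

def minimum_spanning_tree(dist):
    """Prim's algorithm on a dense distance matrix; returns (i, j, distance) edges."""
    n = dist.shape[0]
    in_tree = [0]                       # grow the tree from asset 0
    edges = []
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j not in in_tree and (best is None or dist[i, j] < best[2]):
                    best = (i, j, dist[i, j])
        edges.append(best)              # cheapest edge leaving the current tree
        in_tree.append(best[1])
    return edges

# Hypothetical correlations: assets 0-1 and 2-3 form tight pairs.
rho = np.array([[1.0, 0.9, 0.2, 0.1],
                [0.9, 1.0, 0.3, 0.2],
                [0.2, 0.3, 1.0, 0.8],
                [0.1, 0.2, 0.8, 1.0]])
dist = np.sqrt(2.0 * (1.0 - rho))
edges = minimum_spanning_tree(dist)
print([(i, j) for i, j, _ in edges])
```

The two within-pair edges are kept and a single bridge joins the pairs, which is the hub-to-hub pattern described above in miniature.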

Tumminello, Aste, Di Matteo, and Mantegna (PNAS 102(30):10421, 2005) generalized this to the Planar Maximally Filtered Graph, the densest correlation subgraph that can be embedded on a sphere, with $3(N-2)$ edges. These information-filtering networks are the conceptual ancestors of HRP. The intuition: a correlation matrix $\rho \in \mathbb{R}^{N \times N}$ has $\binom{N}{2}$ distinct off-diagonal entries, but only $O(N)$ of them carry signal. The rest is noise dressing, and the filtering operation is how you tell them apart.

Hierarchical Risk Parity, in full

López de Prado ("Building Diversified Portfolios that Outperform Out-of-Sample", JPM 42(4):59–69, 2016, SSRN 2708678) packaged Mantegna's metric, Markowitz's quadratic structure, and a recursive bisection into one algorithm. Four stages.

Stage 1 — Distance. Start from Mantegna's distance $d_{ij} = \sqrt{2(1 - \hat\rho_{ij})} \in [0, 2]$. López de Prado rescales it to live on the unit interval:

d_{ij} = \sqrt{\tfrac{1}{2}(1 - \hat\rho_{ij})}, \qquad d_{ij} \in [0, 1].

The rescaling halves every distance, so it leaves the clustering output unchanged but produces a tidier distance matrix. He then defines a second-order distance, the Euclidean distance between the columns of $d$, which compares assets by their full distance profile rather than just their pairwise distance:

\tilde d_{ij} = \sqrt{\sum_{n=1}^N (d_{ni} - d_{nj})^2}.

The point of the second-order step is robustness. Two assets in different sectors that are linked only through their shared market beta will have a non-trivial pairwise $d_{ij}$ that mostly reflects market-mode noise. But their distance profiles $(d_{ni})_{n=1}^N$ across the entire universe will look very similar — both will be close to every large-cap stock and far from defensive assets in the same way. The transform $\tilde d$ measures that profile similarity directly, so the resulting clusters group assets by how they relate to everything else, not by raw pairwise correlation.
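Both distances of Stage 1 take a few lines. The sketch assumes NumPy and a made-up four-asset correlation matrix; `hrp_distances` is an illustrative name.

```python
import numpy as np

def hrp_distances(corr):
    """Return (d, d_tilde): the rescaled correlation distance and its column-profile distance."""
    d = np.sqrt(np.clip(0.5 * (1.0 - corr), 0.0, 1.0))
    # d_tilde[i, j] = Euclidean distance between column i and column j of d.
    d_tilde = np.linalg.norm(d[:, :, None] - d[:, None, :], axis=0)
    return d, d_tilde

# Hypothetical correlations: assets 0-1 and 2-3 form tight pairs.
corr = np.array([[1.0, 0.9, 0.2, 0.1],
                 [0.9, 1.0, 0.3, 0.2],
                 [0.2, 0.3, 1.0, 0.8],
                 [0.1, 0.2, 0.8, 1.0]])
d, d_tilde = hrp_distances(corr)
print(np.round(d_tilde, 3))
```

In this toy matrix the within-pair entries of $\tilde d$ stay small while the cross-pair entries grow, so the second-order step widens the gap between the two groups.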

Stage 2 — Tree clustering. Run single-linkage agglomerative clustering on $\tilde d$ : recursively merge the two clusters $u, v$ with smallest $\min_{i \in u, j \in v} \tilde d_{ij}$ . The output is a binary dendrogram — a tree whose leaves are the $N$ assets and whose internal nodes are merge events, each at a height equal to the merge distance. Single linkage is the choice here because it preserves the MST topology — the merges follow the same backbone Mantegna extracted.
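For intuition, single linkage can be written naively in a few lines. The sketch assumes NumPy and a toy six-asset matrix with two interleaved clusters; it runs in cubic time, so a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` is the usual choice in practice. The leaf order it returns is obtained by concatenating leaf lists at each merge, which is one valid dendrogram ordering.

```python
import numpy as np

def single_linkage_order(dist):
    """Naive single linkage. Returns (leaf_order, merge_heights)."""
    clusters = [[i] for i in range(dist.shape[0])]
    heights = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between clusters is their closest pair.
                gap = dist[np.ix_(clusters[a], clusters[b])].min()
                if best is None or gap < best[0]:
                    best = (gap, a, b)
        gap, a, b = best
        heights.append(float(gap))
        merged = clusters[a] + clusters[b]          # keeps each cluster's leaves contiguous
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters[0], heights

# Toy universe (an assumption): even-numbered and odd-numbered assets form two clusters.
labels = np.array([0, 1, 0, 1, 0, 1])
corr = np.where(labels[:, None] == labels[None, :], 0.8, 0.1)
np.fill_diagonal(corr, 1.0)
dist = np.sqrt(2.0 * (1.0 - corr))

order, heights = single_linkage_order(dist)
print(order, [round(h, 3) for h in heights])
```

The last merge height is much larger than the earlier ones, which is the gap a dendrogram cut exploits.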

Figure 4.1. The dendrogram exposes three stable asset clusters separated by a clear merge-height gap.

Single-linkage dendrogram on 8 assets with HRP distance $\tilde d$ . The leaf ordering — AAPL, MSFT, NVDA, JPM, GS, BAC, XOM, CVX — is the permutation $\pi$ used in quasi-diagonalization. Each horizontal bar sits at its merge height; the dashed cut at $h = 0.30$ falls in the gap between within-cluster merges ( $\le 0.19$ ) and inter-cluster merges ( $\ge 0.42$ ), producing the natural three-cluster grouping. The vertical separation between these regimes is what makes the partition meaningful — a flat dendrogram would be a noise dendrogram.


Stage 3 — Quasi-diagonalization. Walk the dendrogram in tree order to produce a permutation $\pi : \{1, \ldots, N\} \to \{1, \ldots, N\}$ . The permuted correlation matrix $\hat\rho_{\pi(i), \pi(j)}$ has its large entries concentrated near the diagonal — a block-diagonal-ish structure where each block is a cluster.
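Applying the permutation is a pure reindexing. The sketch assumes NumPy, a toy six-asset matrix with two interleaved clusters, and a leaf order that is given by hand here rather than computed:

```python
import numpy as np

def quasi_diagonalize(corr, order):
    """Reorder rows and columns of corr by the leaf order; no entry is modified."""
    order = np.asarray(order)
    return corr[np.ix_(order, order)]

# Toy universe (an assumption): even-numbered and odd-numbered assets form two clusters.
labels = np.array([0, 1, 0, 1, 0, 1])
corr = np.where(labels[:, None] == labels[None, :], 0.8, 0.1)
np.fill_diagonal(corr, 1.0)

order = [0, 2, 4, 1, 3, 5]              # hypothetical dendrogram leaf order
sorted_corr = quasi_diagonalize(corr, order)
print(sorted_corr)
```

After reordering, the two 3 x 3 diagonal blocks hold all the large entries and the off-diagonal blocks hold only the small ones.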

Figure 4.2. Quasi-diagonalization reveals block correlation structure without changing any matrix entry.

The same 24-asset correlation matrix (intra-cluster $\rho = 0.78$, inter-cluster $\rho = 0.08$) rendered before and after the HRP permutation $\pi$. Left: assets in alphabetical order; the block structure is washed out by the interleaved indexing. Right: $\pi$ places correlated assets next to each other and three diagonal blocks (amber dashed) emerge. No matrix entry has changed; only the order of the rows and columns has.


Stage 4 — Recursive bisection. Initialize $w_i = 1$ for all $i$. Starting from the full quasi-diagonalized asset list, split the current list into two halves $L_1, L_2$ by position in the ordering $\pi$ (HRP halves the sorted list rather than following the dendrogram's merge nodes). Within each half, compute inverse-variance weights

\tilde w^{(k)} = \frac{\operatorname{diag}(\Sigma_k)^{-1}}{\operatorname{tr}(\operatorname{diag}(\Sigma_k)^{-1})}

and the resulting sub-portfolio variance $\tilde V_k = \tilde w^{(k)\top} \Sigma_k \tilde w^{(k)}$. The split factor

\alpha = 1 - \frac{\tilde V_1}{\tilde V_1 + \tilde V_2}

is the fraction of capital assigned to $L_1$; with $\tilde V_1 < \tilde V_2$ we get $\alpha > 1/2$, so more capital flows to the lower-variance half. Multiply all weights in $L_1$ by $\alpha$ and all weights in $L_2$ by $1 - \alpha$. Recurse on each half. Terminate when every sublist contains a single asset; the final $w_i$ for asset $i$ is the product of split factors along its path through the bisection tree, from the full list down to asset $i$.
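The bisection can be sketched as follows, assuming NumPy and assuming the split halves the sorted asset list by position (the convention this article attributes to HRP in the HERC discussion). Function names and the toy covariance are illustrative.

```python
import numpy as np

def inverse_variance_weights(cov):
    """Weights proportional to 1 / variance, normalized to sum to one."""
    iv = 1.0 / np.diag(cov)
    return iv / iv.sum()

def cluster_variance(cov, items):
    """Variance of the inverse-variance portfolio restricted to the listed assets."""
    sub = cov[np.ix_(items, items)]
    w = inverse_variance_weights(sub)
    return float(w @ sub @ w)

def recursive_bisection(cov, order):
    """HRP weights from a covariance matrix and a quasi-diagonal asset order."""
    weights = np.ones(cov.shape[0])
    stack = [list(order)]
    while stack:
        items = stack.pop()
        if len(items) < 2:
            continue
        half = len(items) // 2
        left, right = items[:half], items[half:]
        v_left = cluster_variance(cov, left)
        v_right = cluster_variance(cov, right)
        alpha = 1.0 - v_left / (v_left + v_right)    # fraction sent to the left half
        weights[left] *= alpha
        weights[right] *= 1.0 - alpha
        stack += [left, right]
    return weights

# Toy check (an assumption): with a diagonal covariance HRP reduces to inverse variance.
cov = np.diag([0.04, 0.09, 0.16, 0.25])
w = recursive_bisection(cov, order=[2, 0, 3, 1])
print(np.round(w, 4), w.sum())
```

Only diagonal entries are inverted and every update is a scalar multiplication; for example, sub-portfolio variances of 0.02 and 0.05 give $\alpha = 1 - 0.02/0.07 = 5/7$.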

Figure 4.3. Recursive bisection assigns less weight to each higher-variance cluster.

HRP recursive bisection on four assets. Sub-portfolio variances $\tilde V_1 = 0.02$ and $\tilde V_2 = 0.05$ yield $\alpha = 5/7$: more capital to the lower-variance left half. Each edge multiplies a running weight; the final allocations are products along the root-to-leaf path. Two scalar multiplications per asset, no matrix inversion anywhere in the tree. Compare $w_1 \approx 0.452$ with $w_4 \approx 0.166$: the roughly 2.7-fold spread reflects real variance differences, not optimizer noise.


The algorithm never inverts $\Sigma$ . It only ever inverts diagonals (1×1 blocks) and compares 2×2 sub-portfolios. The propagated error scales as $O(\sqrt N)$ in the worst case versus $O(\kappa^2 N)$ for naive mean-variance. Antonov, Lipton, and López de Prado (SSRN 4748151, 2024) give a formal proof: HRP variance is bounded above by a function that grows polynomially in $\kappa$ , not exponentially in the eigenvalue dispersion.

Returning to López de Prado's Monte Carlo with this machinery in hand: HRP came in at variance $\approx 0.16$ , beating inverse-variance ( $0.24$ ), equal-weight ( $0.22$ ), and the Critical Line Algorithm ( $0.41$ ) on the same data and the same out-of-sample objective. The optimizer designed to minimize variance lost to a tree-walk that never inverted its covariance matrix.

Figure 4.4. On the synthetic dataset, HRP and equal weighting follow the underlying drift more smoothly than the covariance-sensitive Markowitz allocation.

Out-of-sample cumulative returns for the same 24-asset synthetic dataset under three allocators. HRP and $1/N$ track the underlying drift smoothly; the naive Markowitz CLA — fed a noisy short-window sample covariance — produces weights with $\pm 50\%$ estimation jitter and shorts sprinkled throughout, and the resulting portfolio path is dominated by hedging churn rather than the underlying clusters. Realized $\sigma$ is computed from each path at render time.


The cluster-sorted return surface

Figure 5.1. Cluster ordering turns within-cluster correlation into continuous ridges and between-cluster changes into sharp folds.

The HRP sorted-cumulative-return surface: $Z(i, t) = \sum_{s = 1}^{t} r_{\pi(i), s}$ over 24 assets in 3 clusters over 180 days. Adjacent indices belong to correlated assets from the same cluster, so adjacent rows are similar, producing continuous ridges along the time axis. Cluster boundaries appear as sharp folds on the asset axis. The surface uses only the permutation $\pi$, not the HRP weights.


The mathematical object plotted is

Z(i, t) = \sum_{s = 1}^{t} r_{\pi(i), s},

the cumulative return of the asset at HRP-sorted index $i$ on day $t$ . The axes are: $i$ (cluster-sorted asset index), $t$ (calendar day), $Z$ (cumulative percent return).
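Building the plotted array is a reorder followed by a cumulative sum. The sketch assumes NumPy, a returns array with one row per asset, and a permutation that is made up here rather than produced by clustering:

```python
import numpy as np

def sorted_cumulative_surface(returns, order):
    """Z[i, t] = sum of returns of the asset at sorted position i up to day t.

    returns: array of shape (n_assets, n_days); order: the permutation pi.
    """
    return np.cumsum(returns[np.asarray(order)], axis=1)

rng = np.random.default_rng(7)
returns = 0.01 * rng.standard_normal((6, 180))   # synthetic daily returns
order = [0, 2, 4, 1, 3, 5]                       # hypothetical cluster-sorted order
surface = sorted_cumulative_surface(returns, order)
print(surface.shape)
```

Passing `surface` to any 3D surface plotter with the asset index on one axis and the day on the other reproduces the landscape; the HRP weights never enter.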

What makes the surface wavy in a structured way is precisely the permutation $\pi$. Single-linkage clustering puts the most correlated pairs adjacent in $\pi$, so adjacent rows of the surface have similar trajectories. The eye reads that as smooth ridges running parallel to the time axis. Discontinuities along the asset axis mark cluster boundaries: those are the fold lines where one industry ends and the next begins.

The trick is worth stating plainly. The 3D landscape plot does not use the HRP weights $w_i$ at all. It uses only the permutation $\pi$ . The image is a way of staring at what the clustering algorithm produced, before any capital allocation is performed. The smoothness of the surface is diagnostic of clustering quality, not of portfolio optimality.

This is also why the visualization works at all. A correlation matrix has $\binom{N}{2}$ entries — too many to read by hand. A heatmap renders all of them but reads as a 2D blob. The dendrogram captures the merge order but loses the asset positions. The cluster-sorted return landscape collapses correlation, temporal evolution, and asset identity into a single surface, and your visual cortex does the rest. You can pick out a market crash as a synchronous downstroke across all clusters, a sector rotation as a one-cluster trough, and a stable factor exposure as a long persistent ridge. None of that is visible in $\hat\Sigma$ alone.

The HRP family

Single algorithms become families. HRP has produced at least five direct descendants worth knowing.

HERC — Hierarchical Equal Risk Contribution. Raffinot (SSRN 3237540, 2018) replaces single linkage with Ward linkage, which merges the two clusters whose combined within-cluster sum of squares grows least. Where single linkage is a pointwise criterion — merge whoever has the closest pair — Ward is a global one, so it produces compact, similarly-sized clusters rather than long thin chains. HERC also replaces HRP's index-halving bisection with a dendrogram-following bisection, and uses the Gap statistic (Tibshirani–Walther–Hastie 2001) to pick the number of clusters. The Gap statistic compares the observed within-cluster dispersion against the dispersion expected under a uniform null reference; the optimal $K$ is where the data's clustering structure explains the most that random data would not. At each split, allocate by risk-contribution ratio:

\alpha = \frac{\text{RC}(L_2)}{\text{RC}(L_1) + \text{RC}(L_2)}.

You can plug in variance, CVaR, or Conditional Drawdown-at-Risk as the risk measure.
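A sketch of a split factor with a pluggable risk measure, assuming NumPy. As a simplification that is not taken from Raffinot's paper, each cluster's risk is measured on the return series of its equal-weighted portfolio, and CVaR is the plain historical tail average. All names and the synthetic data are illustrative.

```python
import numpy as np

def variance_risk(series):
    """Sample variance of a return series."""
    return float(np.var(series, ddof=1))

def cvar_risk(series, level=0.95):
    """Historical CVaR: average of the worst (1 - level) share of losses."""
    losses = np.sort(-series)
    k = max(1, int(round((1.0 - level) * len(losses))))
    return float(losses[-k:].mean())

def split_factor(returns_left, returns_right, risk):
    """alpha = RC(L2) / (RC(L1) + RC(L2)), the share of capital sent to L1."""
    rc_left = risk(returns_left.mean(axis=0))     # equal-weighted cluster return series
    rc_right = risk(returns_right.mean(axis=0))
    return rc_right / (rc_left + rc_right)

rng = np.random.default_rng(3)
calm = 0.01 * rng.standard_normal((3, 500))       # low-volatility cluster (synthetic)
wild = 0.03 * rng.standard_normal((3, 500))       # high-volatility cluster (synthetic)

alpha_var = split_factor(calm, wild, variance_risk)
alpha_cvar = split_factor(calm, wild, cvar_risk)
print(round(alpha_var, 3), round(alpha_cvar, 3))
```

Both measures send most capital to the calm cluster, but variance is quadratic in scale while CVaR is linear, so the variance-based split is the more extreme of the two.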

HERC fixes two specific HRP pathologies. First, single linkage's chaining effect: because the merge criterion is the closest pair across two clusters, a single bridging asset between two otherwise-distant clusters causes them to merge at a height comparable to within-cluster merges. The resulting "chain" cluster mixes assets that share no real economic similarity beyond the bridge. Second, HRP's bisection halves clusters by leaf-index count, which can put highly correlated assets on opposite sides of a cut purely because they happen to sit on different sides of a midpoint. Dendrogram-following bisection respects the merge structure and keeps correlated groups together.

Figure 6.1. Ward linkage preserves the cluster-separation gap that single linkage collapses through chaining.

Same nine-asset dataset under two linkage criteria. Left — single linkage: B* (amber) bridges the A- and B-chains via their nearest-pair distances, collapsing everything into one mega-cluster at $h = 0.24$ — barely above the within-cluster merges at $h = 0.08$ – $0.17$ . Right — Ward linkage: the squared-dispersion penalty makes bridging too costly; B* is absorbed cleanly into cluster B at $h = 0.35$ , and the A–B merge is deferred to $h = 0.72$ — a threefold gap that makes the cluster boundary unambiguous. HERC uses Ward precisely because this gap is the structure worth preserving.


NCO — Nested Clustered Optimization. López de Prado (SSRN 3469961, 2019). Two-stage Markowitz: cluster $\hat\Sigma$ into $K$ blocks, run MVO within each block (small per-block condition number), then run MVO across blocks using a reduced $K \times K$ inter-cluster covariance. Under reasonable assumptions about block sparsity, the per-stage condition number is bounded by $\max_k \kappa_k \ll \kappa(\hat\Sigma)$ , so the propagated error scales as $\max_k \kappa_k^2$ rather than $\kappa^2$ . NCO recovers most of MVO's in-sample optimality while inheriting HRP's stability.
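The two-stage structure can be sketched for the minimum-variance objective, assuming NumPy and assuming the cluster labels are already given (the denoising and clustering steps are not shown). Names and the toy covariance are illustrative.

```python
import numpy as np

def min_variance_weights(cov):
    """Fully invested minimum-variance weights, proportional to inv(cov) @ 1."""
    raw = np.linalg.solve(cov, np.ones(cov.shape[0]))
    return raw / raw.sum()

def nco_min_variance(cov, labels):
    """Optimize within each cluster, then across the reduced cluster covariance."""
    labels = np.asarray(labels)
    groups = [np.flatnonzero(labels == k) for k in np.unique(labels)]
    intra = np.zeros((cov.shape[0], len(groups)))
    for col, idx in enumerate(groups):
        intra[idx, col] = min_variance_weights(cov[np.ix_(idx, idx)])
    reduced = intra.T @ cov @ intra               # K x K covariance of cluster portfolios
    inter = min_variance_weights(reduced)
    return intra @ inter

# Toy block-diagonal covariance (an assumption): two clusters with no cross-covariance.
block_a = np.array([[0.04, 0.01],
                    [0.01, 0.09]])
block_b = np.array([[0.16, 0.02, 0.00],
                    [0.02, 0.25, 0.03],
                    [0.00, 0.03, 0.36]])
cov = np.zeros((5, 5))
cov[:2, :2] = block_a
cov[2:, 2:] = block_b

w_nco = nco_min_variance(cov, labels=[0, 0, 1, 1, 1])
w_full = min_variance_weights(cov)
print(np.round(w_nco, 4))
```

When the covariance is exactly block-diagonal the nested solution coincides with the full minimum-variance portfolio, while only the small blocks and the $K \times K$ matrix are ever inverted.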

Schur Complementary Allocation. Cotton (arXiv:2411.05807, 2024) gives the cleanest theoretical unification. HRP's recursive bisection treats each cluster as if the inter-cluster covariance were zero. The Schur method augments each sub-covariance with a scaled Schur complement of the off-diagonal block:

\Sigma = \begin{pmatrix} A & B \\ B^\top & C \end{pmatrix}, \qquad A/C := A - \gamma B C^{-1} B^\top.

The notation $A/C$ is borrowed from the standard Schur-complement notation $\Sigma/C = A - B C^{-1} B^\top$ (the slash is a quotient symbol, not division); at $\gamma = 1$ the two coincide. Conceptually, the Schur complement is what the covariance of cluster 1 looks like after conditioning on cluster 2: regress out the linear influence of $L_2$ on $L_1$ and the residual covariance is $A - B C^{-1} B^\top$. The parameter $\gamma$ slides between ignoring inter-cluster information ($\gamma = 0$, full HRP) and using all of it ($\gamma = 1$, full MVP). Cotton proves that portfolio variance decreases monotonically in $\gamma$ over $[0, 1]$: HRP is the worst case of a one-parameter family that interpolates to MVP. This is the right way to read HRP: not as an alternative to Markowitz but as a regularized degenerate limit of it.
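The two endpoints of the $\gamma$ family can be checked numerically. The sketch assumes NumPy and a random positive-definite covariance; it covers only the augmented sub-covariance, not the full allocation scheme from Cotton's paper.

```python
import numpy as np

def schur_augmented(cov, n_left, gamma):
    """A - gamma * B inv(C) B^T for the partition [[A, B], [B^T, C]]."""
    a = cov[:n_left, :n_left]
    b = cov[:n_left, n_left:]
    c = cov[n_left:, n_left:]
    return a - gamma * b @ np.linalg.solve(c, b.T)

rng = np.random.default_rng(5)
factors = rng.standard_normal((4, 8))
cov = factors @ factors.T / 8.0                    # random positive-definite covariance

a_hrp = schur_augmented(cov, n_left=2, gamma=0.0)  # ignores the off-diagonal block
a_mvp = schur_augmented(cov, n_left=2, gamma=1.0)  # full Schur complement
top_left_of_inverse = np.linalg.inv(cov)[:2, :2]
print(np.allclose(np.linalg.inv(a_mvp), top_left_of_inverse))
```

At $\gamma = 1$ the inverse of the augmented block equals the top-left block of $\Sigma^{-1}$, which is how the full-information limit reconnects with the minimum-variance solution.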

Constrained HRP. Pfitzinger and Katzke (Stellenbosch WP 14/2019) project partial weights onto box and group constraint sets at each split, so HRP plays nicely with mandates (no more than 5% per asset, no more than 20% per sector, etc.).

Hierarchical Minimum Variance Portfolios. arXiv:2503.12328 (2025) replaces HRP's inverse-variance step with full minimum-variance per cluster, retaining the dendrogram structure but inverting only well-conditioned local blocks. Empirically competitive with NCO at lower implementation cost.

The decision rule across the family is approximately this. Use plain HRP when you want maximum stability and minimum implementation complexity. Use HERC when the asset universe has no obvious block structure and single-linkage chaining is producing degenerate clusters. Use NCO when you want MVO efficiency inside well-conditioned blocks but HRP-style stability across them. Use Schur when you want a single tunable parameter that interpolates between HRP and minimum-variance. Use Constrained HRP when you have a hard mandate constraint to respect.

The rest of the toolbox

HRP is not the only response to Markowitz's curse. Three parallel programs are worth mentioning.

Risk parity (Roncalli). Maillard, Roncalli, Teïletche (JPM 36(4), 2010) define equal-risk-contribution portfolios: $w$ with $\text{RC}_i(w) = w_i(\Sigma w)_i / \sqrt{w^\top \Sigma w}$ equal across $i$ . The definition is not arbitrary. Portfolio volatility $\sigma(w) = \sqrt{w^\top \Sigma w}$ is first-order homogeneous in $w$ (scaling all weights by $\lambda$ scales $\sigma$ by $\lambda$ ), so Euler's theorem for homogeneous functions guarantees the exact decomposition $\sigma(w) = \sum_i w_i \partial \sigma / \partial w_i = \sum_i \text{RC}_i(w)$ . The risk contributions are the unique additive partition of total volatility induced by Euler's identity — no approximation, no choice of decomposition.
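The Euler decomposition is easy to confirm numerically. The sketch assumes NumPy and a made-up three-asset covariance and weight vector:

```python
import numpy as np

def risk_contributions(weights, cov):
    """RC_i = w_i (cov @ w)_i / sqrt(w' cov w); the contributions sum to portfolio volatility."""
    sigma = np.sqrt(weights @ cov @ weights)
    return weights * (cov @ weights) / sigma

cov = np.array([[0.04, 0.006, 0.00],
                [0.006, 0.09, 0.01],
                [0.00, 0.01, 0.16]])
weights = np.array([0.5, 0.3, 0.2])                # hypothetical portfolio

rc = risk_contributions(weights, cov)
sigma = np.sqrt(weights @ cov @ weights)
print(np.round(rc, 4), round(float(rc.sum()), 6), round(float(sigma), 6))
```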

The non-convex equal-contribution problem has an elegant convex reformulation:

\min_{y > 0}\; \tfrac{1}{2} y^\top \Sigma y - \sum_i \log y_i, \qquad w = y / \mathbf{1}^\top y.

Stationarity reads $(\Sigma y)_i - 1/y_i = 0$ , i.e. $y_i (\Sigma y)_i = 1$ for all $i$ . Normalizing $y$ to $w$ gives $w_i (\Sigma w)_i$ constant across $i$ — exactly the ERC condition. The unique solution sits in the variance interval between minimum-variance and equal-weight, and like HRP requires no expected-return estimate. The institutional version of this is Bridgewater's All-Weather, levered up to a target volatility.
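One way to solve the convex reformulation is a damped Newton iteration. This is a sketch assuming NumPy, with an Armijo backtracking rule and tolerances chosen for illustration; it is not the solver used by Maillard, Roncalli, and Teïletche.

```python
import numpy as np

def erc_weights(cov, tol=1e-8, max_iter=100):
    """Equal-risk-contribution weights via Newton's method on 0.5 y'Sy - sum(log y)."""
    def objective(y):
        return 0.5 * y @ cov @ y - np.log(y).sum()

    y = 1.0 / np.sqrt(np.diag(cov))                # start at inverse volatility
    for _ in range(max_iter):
        grad = cov @ y - 1.0 / y
        if np.abs(grad).max() < tol:
            break
        hess = cov + np.diag(1.0 / y**2)
        step = np.linalg.solve(hess, grad)
        t = 1.0
        for _halving in range(60):                 # backtrack: stay positive and decrease
            trial = y - t * step
            if np.all(trial > 0) and objective(trial) <= objective(y) - 0.25 * t * (grad @ step):
                y = trial
                break
            t *= 0.5
    return y / y.sum()

cov = np.array([[0.04, 0.006, 0.00],
                [0.006, 0.09, 0.01],
                [0.00, 0.01, 0.16]])
w = erc_weights(cov)
rc = w * (cov @ w)                                 # proportional to the risk contributions
print(np.round(w, 4), np.round(rc / rc.sum(), 4))
```

At the solution each asset accounts for the same share of risk, here one third each, without any expected-return input.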

Wasserstein distributionally robust optimization. The Wasserstein-1 distance between two probability distributions $P$ and $Q$ is the minimum expected transportation cost to morph one into the other: $W_1(P, Q) = \inf_\gamma \mathbb{E}_{(X, Y) \sim \gamma}[\|X - Y\|]$ over all couplings $\gamma$ with marginals $P$ and $Q$ . The "earth-mover" image is exact — you literally move probability mass and pay distance per unit moved. Mohajerin Esfahani and Kuhn (Math. Program. 171, 2018, arXiv:1505.05116) solve

\min_w \sup_{\mathbb{Q} \in \mathcal{B}_\varepsilon(\hat{\mathbb{P}}_T)} \mathbb{E}_{\mathbb{Q}}[\ell(w, X)]

where $\mathcal{B}_\varepsilon$ is the Wasserstein-1 ball of radius $\varepsilon$ around the empirical distribution $\hat{\mathbb{P}}_T$: every distribution you could reach from your data by moving each observation at most $\varepsilon$ on average. The parameter $\varepsilon$ is the radius of adversarial scenarios you hedge against, and it scales like $T^{-1/2}$ with the number of observations $T$: more data, smaller realistic worst-case ball. For convex losses (mean-CVaR, log-utility) the inner sup reduces to a finite convex program by Kantorovich duality. As $\varepsilon \to 0$ you recover sample-Markowitz; as $\varepsilon \to \infty$ you recover $1/N$ equal-weight (the minimax-optimal portfolio when you know nothing about the true distribution). The intermediate regime is where the work happens.
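To make the radius concrete, the Wasserstein-1 distance between two one-dimensional samples of equal size is the mean absolute difference of their sorted values, because the sorted matching is the optimal coupling in that case. The sketch assumes NumPy and covers only this special case, not the robust optimization itself.

```python
import numpy as np

def wasserstein1_1d(x, y):
    """W1 between two equal-size, equally weighted 1-D samples."""
    x, y = np.sort(np.asarray(x, dtype=float)), np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this shortcut needs samples of equal size")
    return float(np.mean(np.abs(x - y)))

rng = np.random.default_rng(11)
sample = rng.standard_normal(1000)
shifted = sample + 0.3                      # every observation moved by 0.3

print(wasserstein1_1d(sample, shifted))
```

Shifting every observation by 0.3 produces a distribution at distance 0.3, so it sits on the boundary of a ball with $\varepsilon = 0.3$.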

CVaR optimization. Rockafellar and Uryasev (J. Risk 2, 2000) collapse tail-risk minimization to a linear program via the dual representation

\text{CVaR}_\alpha(w) = \min_\zeta \left\{ \zeta + \tfrac{1}{1 - \alpha} \mathbb{E}[(L(w, X) - \zeta)^+] \right\}.

This is what made tail-risk optimization computationally tractable. Tail risk parity — equalizing CVaR contributions instead of variance contributions — is the natural HRP-style extension.
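On an empirical sample the minimization formula can be evaluated directly and compared with the plain tail average. The sketch assumes NumPy, a fixed loss vector rather than a portfolio, and a sample size for which $(1-\alpha)T$ is an integer:

```python
import numpy as np

def ru_objective(zeta, losses, alpha):
    """zeta + E[(L - zeta)^+] / (1 - alpha) on an empirical sample."""
    return zeta + np.mean(np.maximum(losses - zeta, 0.0)) / (1.0 - alpha)

def cvar_rockafellar_uryasev(losses, alpha):
    """The objective is piecewise linear and convex, so its minimum sits at a sample point."""
    return min(ru_objective(z, losses, alpha) for z in losses)

def cvar_tail_average(losses, alpha):
    """Average of the worst (1 - alpha) share of losses."""
    k = int(round((1.0 - alpha) * len(losses)))
    return float(np.sort(losses)[-k:].mean())

losses = np.arange(1.0, 101.0)              # toy losses 1, 2, ..., 100
print(cvar_rockafellar_uryasev(losses, 0.95), cvar_tail_average(losses, 0.95))
```

Both routes give the mean of the five largest losses. Once the losses depend linearly on $w$, the positive part is replaced by auxiliary variables and the same objective becomes a linear program.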

The frontier

Three threads dominate current research. Each one is a different attempt to step beyond what HRP can already do.

Signature methods in stochastic portfolio theory. Cuchiero and Möller (arXiv:2310.02322, 2023). The path signature is the universal feature representation for continuous paths from rough path theory: every tensor coefficient is an iterated integral of the path against itself. Replacing the static covariance summary $\hat\Sigma$ with a richer signature-based summary lets you build linear path-functional portfolios $w_t = f(\text{Sig}(X)_{[0, t]})$ that are dense in the space of continuous path-dependent portfolios (Lyons universality), and that reduce both log-wealth maximization and mean-variance optimization to convex quadratic programs with globally optimal solutions. The same machinery, applied to execution, gives Cartea–Jaimungal-style signature execution algorithms (arXiv:1905.00728). Recent work (arXiv:2510.10728, 2025) couples truncated log-signatures with neural rough differential equations for high-dimensional BSDE-based control.

Deep learning portfolios. Where HRP relies on a hand-specified clustering pipeline, deep methods learn the allocation function end-to-end. Zhang, Zohren, Roberts (arXiv:2005.13665, 2020) train an LSTM-plus-head to maximize realized Sharpe over a rolling window — no forecasting step, the network outputs weights directly. Reinforcement-learning portfolio agents (Jiang, Xu, Liang, arXiv:1706.10059) train policies that output a simplex action $w_{t+1}$ with an explicit log-return reward. Graph attention networks on correlation graphs (Korangi, Mues, Bravo, arXiv:2407.15532, 2024) — effectively a learned, attention-weighted version of Mantegna's correlation network — outperform classical MVO on S&P 500 universes at annualized return 16.8%, Sharpe 1.34, and 8.2% max drawdown, against an MVO baseline that typically lands at Sharpe well below 1.0 in the same regime.

Quantum portfolio optimization. Discrete Markowitz with integer position sizes (lot constraints, no fractional shares) is NP-hard; the QUBO encoding maps it to an Ising Hamiltonian that can be attacked on NISQ devices via QAOA or on quantum annealers. Benchmarking studies (arXiv:2509.17876) compare QAOA, quantum annealing, mixed-integer solvers, and metaheuristics on universes up to 1000 assets. Meaningful quantum advantage at scale remains an open question: current results are competitive but not yet dominant against classical solvers.

How to read the landscape

The HRP cumulative-return landscape is doing three jobs at once. It is collapsing $\Sigma$ , the return path, and the asset identity into a single surface that shows:

Cluster structure. Continuous ridges marked by sharp drops at cluster boundaries. Three smooth bands = three sectors well separated by correlation. A jagged surface with no visible blocks = your data has no usable structure or your clustering choice is wrong.
Factor regimes. A long monochromatic ridge running across the time axis = persistent factor exposure. An abrupt color shift halfway through = regime change.
Crash events. A synchronous downstroke across all clusters = the market mode firing. A trough confined to one cluster = idiosyncratic sector pain.

What the picture is not showing: the weights. The allocation step happens after the visualization and is invisible in the surface. The 3D plot is diagnostic of the clustering; the weights are a downstream byproduct.

The practical question for any allocator is the same one that drives every approach in this article: which estimator of $\Sigma$ is wrong in the way that matters least for the trade you are sizing? Naive Markowitz is wrong because it inverts a noisy matrix. RIE-cleaned Markowitz is wrong because it assumes static covariance. Risk parity is wrong because it ignores expected returns entirely. HRP is wrong because single linkage chains noise into clusters. NCO is wrong because the cluster count is a hyperparameter. Schur is wrong because $\gamma$ is a hyperparameter. Deep portfolios are wrong because they overfit.

Each method is wrong in a different way. The HRP visualization is one of the few ways of seeing, at a glance, which of those wrong assumptions is most wrong for your data. A clean three-cluster landscape says: the block structure is real, single linkage caught it, HRP will be stable. A muddy speckled surface says: cluster the data with Ward instead, or shrink first, or use NCO with $K$ chosen by silhouette.

The landscape is the model made visible. The wavy mountain encodes more diagnostic information about your assumptions than any single number a backtest would give you.
