If you scroll quantitative finance Twitter for any length of time you will see them: 3D mountain landscapes labeled with asset tickers and days, surfaces that wave and ridge along one axis with sharp drops between ridges. The plot title is almost always some variant of "Hierarchical Risk Parity — Sorted Cumulative % Change."
That image is not arbitrary. It is the single most recognizable visualization in a genre that formed around 2018, when Marcos López de Prado's HRP paper hit the Python tutorial ecosystem and the mlfinlab / Riskfolio-Lib / PyPortfolioOpt libraries shipped the plotting functions out of the box. The genre has deeper roots — Markowitz 1952, Mantegna 1999, Laloux–Bouchaud 1999 — but the "render-a-3D-mountain-out-of-a-covariance-matrix" aesthetic is genuinely a post-2016 phenomenon, downstream of HRP specifically.
This article is about what those landscapes actually show, why they exist, and what mathematics turns a sample covariance matrix into a clean block of cluster-sorted assets that you can stare at and recognize industries in.
Markowitz's curse
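Markowitz (1952) framed portfolio choice as a quadratic program: given a covariance matrix $\Sigma$ and an expected-return vector $\mu$, find weights $w$ that minimize variance subject to a target return $\mu_p$:
$$\min_{w}\ \tfrac{1}{2}\, w^\top \Sigma\, w \quad \text{subject to} \quad w^\top \mu = \mu_p, \quad w^\top \mathbf{1} = 1.$$
The unconstrained solution is closed-form: $w^\star \propto \Sigma^{-1}\mu$. The constrained version (Critical Line Algorithm, Markowitz 1956) is a finite combinatorial walk on the efficient frontier.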
In sample, this works. Out of sample, it falls over.
The failure has a name — the Markowitz curse — and three coupled mechanisms:
- Condition-number explosion. The sample covariance $\hat{\Sigma}$ from $T$ observations of $N$ assets has rank at most $\min(N, T)$. For $N > T$ the smallest eigenvalue collapses to zero and $\kappa \to \infty$. The condition number $\kappa = \lambda_{\max}/\lambda_{\min}$ controls how much numerical noise the inverse amplifies: a $\kappa$ of 1000, routine for a 200-asset book on a single year of daily data, can turn a 0.1% estimation error in $\hat{\Sigma}$ into a 100% error in $\hat{\Sigma}^{-1}$, which is why the optimal portfolio routinely produces individual weights above 100% of capital.
- Error maximization. Michaud (1989) diagnosed mean-variance optimization as a literal error-maximizer: it loads on assets whose estimated $\hat{\mu}$ or $\hat{\Sigma}^{-1}$ entry is a noisy upward draw. The optimizer cannot tell signal from noise; it can only tell large numbers from small ones.
- Weight concentration. Without bounds, optimal weights routinely reach extreme long and short values. With long-only constraints, they peg most assets to zero and concentrate on a handful of corners.
López de Prado ("The 10 Reasons Most Machine Learning Funds Fail", JPM 44(6), 2018, SSRN 3104816) tightens the diagnosis: the weight error from a perturbation $\delta\Sigma$ scales as $\kappa^2$, quadratic in the condition number. Every approach in this article is, in one way or another, a way of not inverting $\hat{\Sigma}$ directly.
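The condition-number mechanism is easy to reproduce numerically. The sketch below is an illustration under stated assumptions, not a reproduction of any benchmark in the literature: it draws pure-noise returns whose true covariance is the identity (so the true $\kappa$ is 1) and measures the condition number of the sample covariance for a small and a large universe. The sizes, the seed, and the helper name `sample_condition_number` are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 252  # one year of daily observations

def sample_condition_number(n_assets: int) -> float:
    """Condition number of the sample covariance of i.i.d. unit-variance noise."""
    returns = rng.standard_normal((T, n_assets))
    cov = np.cov(returns, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)      # returned in ascending order
    return float(eigvals[-1] / eigvals[0])

# The true covariance is the identity (kappa = 1) in both cases;
# everything above 1 is estimation noise.
kappa_small = sample_condition_number(20)
kappa_large = sample_condition_number(200)
print(f"N = 20:  kappa = {kappa_small:.1f}")
print(f"N = 200: kappa = {kappa_large:.1f}")
```

Nothing about the assets changed between the two calls; only the ratio of assets to observations did, and that alone is enough to inflate the condition number by a large factor.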
The pre-HRP toolkit was three things. Linear shrinkage (Ledoit & Wolf 2004) blends $\hat{\Sigma}$ with a structured target. Resampled efficient frontier (Michaud 1998) bootstraps the inputs and averages the outputs. Black–Litterman (1991) replaces $\hat{\mu}$ with the Bayesian posterior of a CAPM equilibrium prior plus subjective views. All three help. None of them avoid $\hat{\Sigma}^{-1}$.
The Marchenko–Pastur condition-number bound . Inversion noise scales as and weight error as , so the curve enters its "lethal" regime (, shaded amber) at — well before the rank-deficiency wall at . A retail book with already faces .
How much does Markowitz actually cost? Out-of-sample variance is the standard yardstick. López de Prado's Monte Carlo benchmark (2016) runs assets, days, block-correlation structure: the Critical Line Algorithm, the literal minimum-variance optimizer, comes in at variance . Inverse-variance allocation: . Equal-weight 1/N: . The dedicated optimizer is the worst of the three. The bar to clear is low.
Marchenko–Pastur and the noise spectrum
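Random matrix theory tells you, quantitatively, how much of your sample covariance is signal versus noise. Take a $T \times N$ matrix $X$ of standardized returns drawn from independent $\mathcal{N}(0, 1)$: pure noise, no structure. The eigenvalue density of $C = \tfrac{1}{T} X^\top X$, in the limit $N, T \to \infty$ with $q = N/T$ fixed, converges to the Marchenko–Pastur distribution:
$$\rho_{\mathrm{MP}}(\lambda) = \frac{\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{2\pi q \lambda}, \qquad \lambda_\pm = \left(1 \pm \sqrt{q}\right)^2, \qquad \lambda \in [\lambda_-, \lambda_+].$$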
For — a year of daily returns () on assets — the bulk is supported on .
Financial returns are not iid Gaussian — they are fat-tailed, autocorrelated, and conditionally heteroskedastic. The MP density was derived under iid Gaussian assumptions, so the natural objection is that the result should not transfer. The answer is universality: the limiting bulk distribution depends only on the fact that entries have finite variance, not on their specific distribution. This is the same phenomenon that makes the central limit theorem work for non-Gaussian summands. For heavy-tailed returns the bulk position shifts mildly, but the band structure remains a reliable noise/signal separator.
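The bulk edges of the Marchenko–Pastur law can be checked against simulated noise. The following is a minimal sketch assuming unit-variance Gaussian entries and aspect ratio $q = N/T \le 1$, for which the edges are $\lambda_\pm = (1 \pm \sqrt{q})^2$. The matrix sizes, the seed, the tolerance `tol`, and the helper name `marchenko_pastur_edges` are illustrative assumptions.

```python
import numpy as np

def marchenko_pastur_edges(q: float):
    """Bulk edges (lambda_minus, lambda_plus) for aspect ratio q = N / T <= 1."""
    return (1.0 - np.sqrt(q)) ** 2, (1.0 + np.sqrt(q)) ** 2

rng = np.random.default_rng(1)
T, N = 2000, 400                       # hypothetical sizes, q = 0.2
X = rng.standard_normal((T, N))        # pure noise with unit variance
C = X.T @ X / T                        # sample covariance (no demeaning needed here)
eigvals = np.linalg.eigvalsh(C)

lam_minus, lam_plus = marchenko_pastur_edges(N / T)
tol = 0.1                              # slack for finite-size edge fluctuations
inside = np.mean((eigvals > lam_minus - tol) & (eigvals < lam_plus + tol))
print(f"MP support: [{lam_minus:.3f}, {lam_plus:.3f}]")
print(f"fraction of eigenvalues inside the bulk: {inside:.3f}")
```

On real return data, the same comparison is what separates the outliers (candidate signal) from the bulk (candidate noise); the threshold is $\lambda_+$ scaled by the noise variance.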
Laloux, Cizeau, Bouchaud, and Potters (1999, Phys. Rev. Lett. 83:1467, arXiv:cond-mat/9810255) measured the spectrum of the S&P 500 sample correlation matrix and found that the bulk of the empirical eigenvalues sits squarely inside this distribution. The signal is in a handful of outliers above $\lambda_+$:
- The single largest eigenvalue — an order of magnitude above the MP edge — with an eigenvector that has all-positive entries roughly proportional to market cap. This is the market mode.
- The next 5–10 outliers correspond to sector modes (energy, financials, tech).
- The remaining 90% of the spectrum is statistically indistinguishable from random.
Eigenvalue spectrum of a simulated sample correlation matrix with against the Marchenko–Pastur density . The MP bulk is shaded and contains 96 of 100 empirical eigenvalues — statistically indistinguishable from random. Four outliers carry signal: the all-positive market mode and three sector modes. Log -axis is necessary because the market mode sits an order of magnitude above the noise edge.
The practical consequence: if you naively invert $\hat{\Sigma}$ to get a minimum-variance portfolio, you are inverting a matrix that is 90% noise, and the inversion blows up the noise quadratically. The fix is to clean $\hat{\Sigma}$ before optimizing.
Eigenvalue clipping (Laloux et al. 1999) keeps the outlier eigenvalues, replaces the bulk with a constant chosen to preserve the trace, and reconstructs. The Rotationally Invariant Estimator (Bun, Bouchaud, and Potters, arXiv:1610.08104, 2017) is the provably optimal nonlinear shrinkage among estimators that preserve the eigenvectors of $\hat{\Sigma}$: each eigenvalue $\lambda_i$ gets replaced by a function $\xi(\lambda_i)$ derived from the Stieltjes transform of the empirical spectral density.
Both methods buy orders of magnitude in out-of-sample portfolio variance over the naive approach.
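Eigenvalue clipping fits in a few lines. The sketch below is one common way to write it, not necessarily the exact procedure of Laloux et al.: eigenvalues below the Marchenko–Pastur upper edge $(1 + \sqrt{q})^2$ are replaced by their average (which preserves the trace), and the result is rescaled to unit diagonal, an extra convention assumed here. The one-factor synthetic data, the sizes, and the function name `clip_eigenvalues` are illustrative assumptions.

```python
import numpy as np

def clip_eigenvalues(corr: np.ndarray, q: float) -> np.ndarray:
    """Flatten the noise bulk of a correlation matrix; keep the outliers."""
    lam_plus = (1.0 + np.sqrt(q)) ** 2           # MP upper edge for q = N / T
    eigvals, eigvecs = np.linalg.eigh(corr)
    noise = eigvals < lam_plus
    cleaned = eigvals.copy()
    if noise.any():
        cleaned[noise] = eigvals[noise].mean()   # averaging preserves the trace
    clipped = (eigvecs * cleaned) @ eigvecs.T
    d = np.sqrt(np.diag(clipped))                # rescale back to unit diagonal
    return clipped / np.outer(d, d)

rng = np.random.default_rng(0)
T, N = 500, 100                                  # hypothetical sizes, q = 0.2
market = rng.standard_normal((T, 1))             # a single common factor
returns = 0.5 * market + rng.standard_normal((T, N))
raw = np.corrcoef(returns, rowvar=False)
cleaned = clip_eigenvalues(raw, q=N / T)

print(f"condition number, raw:     {np.linalg.cond(raw):.1f}")
print(f"condition number, clipped: {np.linalg.cond(cleaned):.1f}")
```

The market-mode eigenvalue survives untouched while the smallest eigenvalues are lifted to the bulk average, which is what lowers the condition number.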
This is the foundation. Now we change strategy: instead of cleaning $\hat{\Sigma}$ and inverting the result, what if we never invert at all?
Mantegna and the hidden hierarchy
Mantegna ("Hierarchical structure in financial markets", Eur. Phys. J. B 11 (1999), arXiv:cond-mat/9802256) introduced the trick that everything since rests on. Define a distance between assets directly from their correlation $\rho_{ij}$:
$$d_{ij} = \sqrt{2\,(1 - \rho_{ij})}.$$
Mantegna then computed the minimum spanning tree of the resulting weighted graph: the $N - 1$ smallest-distance edges that connect all $N$ assets without forming a cycle. The MST extracts the strongest correlations and discards the rest. On equity universes the MST is recognizable: megacap names are hubs, sector members cluster around their hubs, inter-sector edges hop hub-to-hub.
Tumminello, Aste, Di Matteo, and Mantegna (PNAS 102(30):10421, 2005) generalized this to the Planar Maximally Filtered Graph, the densest correlation subgraph embeddable on a sphere, with $3(N - 2)$ edges. These information-filtering networks are the conceptual ancestors of HRP. The intuition: an $N \times N$ correlation matrix has $N(N-1)/2$ distinct entries, but only $O(N)$ of them carry signal. The rest is noise dressing, and the filtering operation is how you tell them apart.
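Mantegna's construction can be sketched with SciPy's `minimum_spanning_tree`. The four-asset correlation matrix and the ticker labels below are made up for illustration; the distance is $d_{ij} = \sqrt{2(1 - \rho_{ij})}$.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

tickers = ["AAPL", "MSFT", "JPM", "GS"]          # illustrative labels only
corr = np.array([
    [1.0, 0.8, 0.3, 0.2],
    [0.8, 1.0, 0.4, 0.3],
    [0.3, 0.4, 1.0, 0.7],
    [0.2, 0.3, 0.7, 1.0],
])

dist = np.sqrt(2.0 * (1.0 - corr))               # Mantegna distance, zero on the diagonal
mst = minimum_spanning_tree(dist).toarray()      # nonzero entries are the kept edges

rows, cols = np.nonzero(mst)
edges = sorted((tickers[min(i, j)], tickers[max(i, j)]) for i, j in zip(rows, cols))
for a, b in edges:
    print(f"{a} - {b}")
```

Of the six pairwise correlations only three survive: the two within-sector links and the single strongest bridge between the sectors.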
Hierarchical Risk Parity, in full
López de Prado ("Building Diversified Portfolios that Outperform Out-of-Sample", JPM 42(4):59–69, 2016, SSRN 2708678) packaged Mantegna's metric, Markowitz's quadratic structure, and a recursive bisection into one algorithm. Four stages.
Stage 1 — Distance. Start from Mantegna's distance $\sqrt{2\,(1 - \rho_{ij})}$. López de Prado rescales it to live on the unit interval,
$$d_{ij} = \sqrt{\tfrac{1}{2}\,(1 - \rho_{ij})},$$
which leaves the clustering output unchanged but produces a tidier distance matrix. He then defines a second-order distance, the Euclidean distance between columns $i$ and $j$ of $d$, which compares assets by their full distance profile rather than just their pairwise distance:
$$\tilde{d}_{ij} = \sqrt{\sum_{k=1}^{N} \left(d_{ki} - d_{kj}\right)^2}.$$
The point of the second-order step is robustness. Two assets in different sectors that are linked only through their shared market beta will have a non-trivial pairwise $\rho_{ij}$ that mostly reflects market-mode noise. But their distance profiles across the entire universe will look very similar: both will be close to every large-cap stock and far from defensive assets in the same way. The $\tilde{d}$ transform measures that profile similarity directly, so the resulting clusters group assets by how they relate to everything else, not by raw pairwise correlation.
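Both distances are one-liners in NumPy. This sketch assumes the standard HRP definitions, $d_{ij} = \sqrt{\tfrac12 (1 - \rho_{ij})}$ and $\tilde{d}_{ij}$ equal to the Euclidean distance between columns $i$ and $j$ of $d$; the small correlation matrix and the function name `hrp_distances` are illustrative.

```python
import numpy as np

def hrp_distances(corr: np.ndarray):
    """Return the first-order distance d and the second-order distance d_tilde."""
    d = np.sqrt(np.clip(0.5 * (1.0 - corr), 0.0, None))
    # diff[k, i, j] = d[k, i] - d[k, j]; summing squares over k compares columns i and j.
    diff = d[:, :, None] - d[:, None, :]
    d_tilde = np.sqrt((diff ** 2).sum(axis=0))
    return d, d_tilde

corr = np.array([
    [1.0, 0.8, 0.3, 0.2],
    [0.8, 1.0, 0.4, 0.3],
    [0.3, 0.4, 1.0, 0.7],
    [0.2, 0.3, 0.7, 1.0],
])
d, d_tilde = hrp_distances(corr)
print(np.round(d, 3))
print(np.round(d_tilde, 3))
```

The broadcasted difference costs $O(N^3)$ memory, which is fine for a sketch; for large universes `scipy.spatial.distance.pdist` applied to `d` computes the same quantity in condensed form.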
Stage 2 — Tree clustering. Run single-linkage agglomerative clustering on $\tilde{d}$: recursively merge the two clusters with the smallest $\tilde{d}$ between them. The output is a binary dendrogram, a tree whose leaves are the assets and whose internal nodes are merge events, each at a height equal to the merge distance. Single linkage is the choice here because it preserves the MST topology: the merges follow the same backbone Mantegna extracted.
Single-linkage dendrogram on 8 assets with HRP distance . The leaf ordering — AAPL, MSFT, NVDA, JPM, GS, BAC, XOM, CVX — is the permutation used in quasi-diagonalization. Each horizontal bar sits at its merge height; the dashed cut at falls in the gap between within-cluster merges () and inter-cluster merges (), producing the natural three-cluster grouping. The vertical separation between these regimes is what makes the partition meaningful — a flat dendrogram would be a noise dendrogram.
Stage 3 — Quasi-diagonalization. Walk the dendrogram in tree order to produce a permutation $\pi$. The permuted correlation matrix $\rho_{\pi(i)\pi(j)}$ has its large entries concentrated near the diagonal, a block-diagonal-ish structure where each block is a cluster.
The same 24-asset correlation matrix — intra-cluster , inter-cluster — rendered before and after the HRP permutation . Left: assets in alphabetical order; the block structure is washed out by the interleaved indexing. Right: groups correlated assets adjacent and three diagonal blocks (amber dashed) emerge. No matrix entry has changed — only the labels on the axes.
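Stages 2 and 3 map directly onto SciPy's hierarchical-clustering routines. The sketch below assumes a toy six-asset universe with two interleaved clusters (the correlation levels 0.8 and 0.1 are made up), runs single linkage on the second-order distance, and reads the leaf order off the dendrogram with `leaves_list`.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import pdist

labels = np.array([0, 1, 0, 1, 0, 1])            # hypothetical interleaved clusters
same = labels[:, None] == labels[None, :]
corr = np.where(same, 0.8, 0.1)                  # made-up intra / inter correlations
np.fill_diagonal(corr, 1.0)

d = np.sqrt(0.5 * (1.0 - corr))                  # first-order HRP distance
d_tilde = pdist(d)                               # Euclidean distance between rows of d
                                                 # (rows equal columns: d is symmetric)
Z = linkage(d_tilde, method="single")            # single-linkage dendrogram
order = leaves_list(Z)                           # leaf order = the permutation
sorted_corr = corr[np.ix_(order, order)]         # quasi-diagonalized matrix

print("leaf order:", order)
print(np.round(sorted_corr, 1))
```

After the permutation the two clusters occupy contiguous index ranges, so the high-correlation entries form two blocks on the diagonal. No entry of the matrix has been altered, only its row and column order.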
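Stage 4 — Recursive bisection. Initialize $w_i = 1$ for all $i$. Walk the dendrogram top-down, at each internal node bisecting the cluster into two halves $L$ and $R$. Within each half, compute inverse-variance weights
$$\tilde{w}_L = \frac{\operatorname{diag}(\Sigma_L)^{-1}}{\operatorname{tr}\!\left(\operatorname{diag}(\Sigma_L)^{-1}\right)}$$
and the resulting sub-portfolio variance $\sigma_L^2 = \tilde{w}_L^\top \Sigma_L\, \tilde{w}_L$ (and likewise $\tilde{w}_R$ and $\sigma_R^2$). The split factor
$$\alpha = 1 - \frac{\sigma_L^2}{\sigma_L^2 + \sigma_R^2}$$
is the fraction of capital assigned to $L$; with $\sigma_L^2 < \sigma_R^2$ we get $\alpha > \tfrac{1}{2}$, so more capital flows to the lower-variance half. Multiply all weights in $L$ by $\alpha$ and all weights in $R$ by $1 - \alpha$. Recurse on each half. Terminate when every cluster contains a single asset; the final $w_i$ for asset $i$ is the product of split factors along the path from the dendrogram root down to leaf $i$.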
HRP recursive bisection on four assets. Sub-portfolio variances and yield : more capital to the lower-variance left half. Each edge multiplies a running weight; the final allocations are products along the root-to-leaf path. Two scalar multiplications per asset, no matrix inversion anywhere in the tree. Compare with — the four-fold spread reflects real variance differences, not optimizer noise.
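The bisection itself is short enough to write out. The sketch below follows the index-halving variant, in which each cluster of the sorted list is split at its midpoint; the covariance matrix, the ordering, and the function names are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def cluster_variance(cov: np.ndarray, idx: list) -> float:
    """Variance of the inverse-variance portfolio restricted to the assets in idx."""
    sub = cov[np.ix_(idx, idx)]
    ivp = 1.0 / np.diag(sub)
    ivp /= ivp.sum()
    return float(ivp @ sub @ ivp)

def recursive_bisection(cov: np.ndarray, order: list) -> np.ndarray:
    """HRP weights for assets listed in quasi-diagonal order."""
    weights = np.ones(cov.shape[0])
    clusters = [list(order)]
    while clusters:
        next_clusters = []
        for cluster in clusters:
            if len(cluster) < 2:
                continue                          # single asset: nothing left to split
            mid = len(cluster) // 2               # midpoint split of the sorted list
            left, right = cluster[:mid], cluster[mid:]
            var_l = cluster_variance(cov, left)
            var_r = cluster_variance(cov, right)
            alpha = 1.0 - var_l / (var_l + var_r)
            weights[left] *= alpha                # lower-variance half gets more capital
            weights[right] *= 1.0 - alpha
            next_clusters += [left, right]
        clusters = next_clusters
    return weights

vols = np.array([0.10, 0.12, 0.25, 0.30])         # made-up volatilities
corr = np.array([
    [1.0, 0.7, 0.1, 0.1],
    [0.7, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.6],
    [0.1, 0.1, 0.6, 1.0],
])
cov = corr * np.outer(vols, vols)
weights = recursive_bisection(cov, order=[0, 1, 2, 3])
print(np.round(weights, 3), weights.sum())
```

The only inverses taken are reciprocals of diagonal entries. A useful sanity check is that for a diagonal covariance matrix the procedure collapses to plain inverse-variance weighting.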
The algorithm never inverts . It only ever inverts diagonals (1×1 blocks) and compares 2×2 sub-portfolios. The propagated error scales as in the worst case versus for naive mean-variance. Antonov, Lipton, and López de Prado (SSRN 4748151, 2024) give a formal proof: HRP variance is bounded above by a function that grows polynomially in , not exponentially in the eigenvalue dispersion.
Returning to López de Prado's Monte Carlo with this machinery in hand: HRP came in at variance , beating inverse-variance (), equal-weight (), and the Critical Line Algorithm () on the same data and the same out-of-sample objective. The optimizer designed to minimize variance lost to a tree-walk that never inverted its covariance matrix.
Out-of-sample cumulative returns for the same 24-asset synthetic dataset under three allocators. HRP and track the underlying drift smoothly; the naive Markowitz CLA — fed a noisy short-window sample covariance — produces weights with estimation jitter and shorts sprinkled throughout, and the resulting portfolio path is dominated by hedging churn rather than the underlying clusters. Realized is computed from each path at render time.
The cluster-sorted return surface
The HRP sorted-cumulative-return surface: $z(i, t)$ over 24 assets in 3 clusters over 180 days. Adjacent indices are clustered correlated assets, so adjacent rows are similar, producing continuous ridges along the time axis. Cluster boundaries appear as sharp folds on the asset axis. The surface uses only the permutation $\pi$, not the HRP weights.
The mathematical object plotted is
$$z(i, t) = 100 \left( \frac{P_{\pi(i),\,t}}{P_{\pi(i),\,0}} - 1 \right),$$
the cumulative return of the asset at HRP-sorted index $i$ on day $t$, where $P_{a,t}$ denotes the price of asset $a$ on day $t$. The axes are: $x$ (cluster-sorted asset index), $y$ (calendar day), $z$ (cumulative percent return).
What makes the surface wavy in a structured way is precisely the permutation $\pi$. Single-linkage clustering puts the most correlated pairs adjacent in $\pi$, so adjacent columns of the surface have similar trajectories. The eye reads that as smooth ridges running parallel to the time axis. Discontinuities along the asset axis mark cluster boundaries: those are the fold lines where one industry ends and the next begins.
The trick is worth stating plainly. The 3D landscape plot does not use the HRP weights at all. It uses only the permutation $\pi$. The image is a way of staring at what the clustering algorithm produced, before any capital allocation is performed. The smoothness of the surface is diagnostic of clustering quality, not of portfolio optimality.
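Building the surface array takes two steps: compute cumulative percent returns per asset, then reorder the columns by the clustering permutation. The sketch below assumes compounded simple returns, synthetic data, and a hypothetical leaf order `order`; in practice the order would come from the dendrogram.

```python
import numpy as np

rng = np.random.default_rng(2)
n_days, n_assets = 180, 6
returns = rng.normal(0.0005, 0.01, size=(n_days, n_assets))   # synthetic daily returns
order = np.array([0, 2, 4, 1, 3, 5])                           # hypothetical HRP leaf order

# Cumulative percent return of each asset, shape (days, assets).
cum_pct = 100.0 * (np.cumprod(1.0 + returns, axis=0) - 1.0)

# Reorder columns so that clustered assets sit next to each other.
surface = cum_pct[:, order]

# Grid coordinates for a 3D surface plot: asset index, day, cumulative return.
asset_idx, day_idx = np.meshgrid(np.arange(n_assets), np.arange(n_days))
print(surface.shape, asset_idx.shape, day_idx.shape)
```

The three arrays `asset_idx`, `day_idx`, and `surface` have matching shapes and can be handed to a 3D plotting routine such as Matplotlib's `Axes3D.plot_surface`. No weights appear anywhere in the computation.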
This is also why the visualization works at all. An $N \times N$ correlation matrix has $N(N-1)/2$ distinct entries, too many to read by hand. A heatmap renders all of them but reads as a 2D blob. The dendrogram captures the merge order but loses the asset positions. The cluster-sorted return landscape collapses correlation, temporal evolution, and asset identity into a single surface, and your visual cortex does the rest. You can pick out a market crash as a synchronous downstroke across all clusters, a sector rotation as a one-cluster trough, and a stable factor exposure as a long persistent ridge. None of that is visible in the correlation matrix alone.
The HRP family
Single algorithms become families. HRP has produced at least five direct descendants worth knowing.
HERC — Hierarchical Equal Risk Contribution. Raffinot (SSRN 3237540, 2018) replaces single linkage with Ward linkage, which merges the two clusters whose combined within-cluster sum of squares grows least. Where single linkage is a pointwise criterion — merge whoever has the closest pair — Ward is a global one, so it produces compact, similarly-sized clusters rather than long thin chains. HERC also replaces HRP's index-halving bisection with a dendrogram-following bisection, and uses the Gap statistic (Tibshirani–Walther–Hastie 2001) to pick the number of clusters. The Gap statistic compares the observed within-cluster dispersion against the dispersion expected under a uniform null reference; the optimal is where the data's clustering structure explains the most that random data would not. At each split, allocate by risk-contribution ratio:
You can plug in variance, CVaR, or Conditional Drawdown-at-Risk as the risk measure.
HERC fixes two specific HRP pathologies. First, single linkage's chaining effect: because the merge criterion is the closest pair across two clusters, a single bridging asset between two otherwise-distant clusters causes them to merge at a height comparable to within-cluster merges. The resulting "chain" cluster mixes assets that share no real economic similarity beyond the bridge. Second, HRP's bisection halves clusters by leaf-index count, which can put highly correlated assets on opposite sides of a cut purely because they happen to sit on different sides of a midpoint. Dendrogram-following bisection respects the merge structure and keeps correlated groups together.
Same nine-asset dataset under two linkage criteria. Left — single linkage: B* (amber) bridges the A- and B-chains via their nearest-pair distances, collapsing everything into one mega-cluster at — barely above the within-cluster merges at –. Right — Ward linkage: the squared-dispersion penalty makes bridging too costly; B* is absorbed cleanly into cluster B at , and the A–B merge is deferred to — a threefold gap that makes the cluster boundary unambiguous. HERC uses Ward precisely because this gap is the structure worth preserving.
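The chaining effect shows up even on a toy one-dimensional example. The sketch below is an illustration, not Raffinot's procedure: it places two tight groups of points on a line with a single bridging point between them and compares the last two merge heights under single and Ward linkage. The point positions and the helper name `last_gap` are assumptions, and Ward is applied here to raw coordinates rather than to a correlation-derived distance.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Two tight groups (0, 1, 2) and (6, 7, 8) joined by one bridging point at 4.
points = np.array([0.0, 1.0, 2.0, 4.0, 6.0, 7.0, 8.0]).reshape(-1, 1)

single = linkage(points, method="single")
ward = linkage(points, method="ward")

def last_gap(Z: np.ndarray) -> float:
    """Ratio of the final merge height to the merge height just before it."""
    return float(Z[-1, 2] / Z[-2, 2])

print(f"single linkage, final / previous merge height: {last_gap(single):.2f}")
print(f"Ward linkage,   final / previous merge height: {last_gap(ward):.2f}")
```

Under single linkage the bridge lets the two groups join at the same height at which the bridge itself was absorbed, so the dendrogram shows no gap to cut at. Ward linkage penalizes the growth in within-cluster dispersion and leaves a clear jump before the final merge.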
NCO — Nested Clustered Optimization. López de Prado (SSRN 3469961, 2019). Two-stage Markowitz: cluster into blocks, run MVO within each block (small per-block condition number), then run MVO across blocks using a reduced inter-cluster covariance. Under reasonable assumptions about block sparsity, the per-stage condition number is bounded by , so the propagated error scales as rather than . NCO recovers most of MVO's in-sample optimality while inheriting HRP's stability.
Schur Complementary Allocation. Cotton (arXiv:2411.05807, 2024) gives the cleanest theoretical unification. HRP's recursive bisection treats each cluster as if the inter-cluster covariance were zero. The Schur method augments each sub-covariance with a scaled Schur complement of the off-diagonal block:
The notation $\Sigma/\Sigma_{22}$ (the slash is a quotient symbol, not division) is the standard name for the Schur complement $\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$. Conceptually, $\Sigma/\Sigma_{22}$ is what the variance of cluster 1 looks like after conditioning on cluster 2: regress out the linear influence of $x_2$ on $x_1$ and the residual variance is $\Sigma/\Sigma_{22}$. The parameter $\gamma$ slides between ignoring inter-cluster information ($\gamma = 0$, full HRP) and using all of it ($\gamma = 1$, full MVP). Cotton proves that portfolio variance decreases monotonically in $\gamma$ over $[0, 1]$: HRP is the worst case of a one-parameter family that interpolates to MVP. This is the right way to read HRP: not as an alternative to Markowitz but as a regularized degenerate limit of it.
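The Schur complement and its conditioning interpretation can be verified numerically. The sketch below covers only the complement $\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$ itself, not Cotton's full allocation scheme. It checks the block-inverse identity, which says that the top-left block of $\Sigma^{-1}$ is the inverse of the Schur complement. The random covariance and the split into `i1` and `i2` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 5))
sigma = A @ A.T + 5.0 * np.eye(5)            # synthetic positive-definite covariance

i1, i2 = [0, 1], [2, 3, 4]                   # hypothetical cluster 1 and cluster 2
s11 = sigma[np.ix_(i1, i1)]
s12 = sigma[np.ix_(i1, i2)]
s22 = sigma[np.ix_(i2, i2)]

# Schur complement: covariance of cluster 1 after regressing out cluster 2.
schur = s11 - s12 @ np.linalg.solve(s22, s12.T)

# Block-inverse identity: top-left block of the full precision matrix.
precision_block = np.linalg.inv(sigma)[np.ix_(i1, i1)]

print(np.round(schur, 3))
print(np.allclose(np.linalg.inv(schur), precision_block))
```

The difference `s11 - schur` is positive semidefinite: conditioning on cluster 2 can only remove variance from cluster 1, never add it.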
Constrained HRP. Pfitzinger and Katzke (Stellenbosch WP 14/2019) project partial weights onto box and group constraint sets at each split, so HRP plays nicely with mandates (no more than 5% per asset, no more than 20% per sector, etc.).
Hierarchical Minimum Variance Portfolios. arXiv:2503.12328 (2025) replaces HRP's inverse-variance step with full minimum-variance per cluster, retaining the dendrogram structure but inverting only well-conditioned local blocks. Empirically competitive with NCO at lower implementation cost.
The decision rule across the family is approximately this. Use plain HRP when you want maximum stability and minimum implementation complexity. Use HERC when the asset universe has no obvious block structure and single-linkage chaining is producing degenerate clusters. Use NCO when you want MVO efficiency inside well-conditioned blocks but HRP-style stability across them. Use Schur when you want a single tunable parameter that interpolates between HRP and minimum-variance. Use Constrained HRP when you have a hard mandate constraint to respect.
The rest of the toolbox
HRP is not the only response to Markowitz's curse. Two parallel programs are worth mentioning.
Risk parity (Roncalli). Maillard, Roncalli, Teïletche (JPM 36(4), 2010) define equal-risk-contribution portfolios: $RC_i = w_i\, \partial\sigma/\partial w_i = w_i (\Sigma w)_i / \sqrt{w^\top \Sigma w}$ with $RC_i$ equal across $i$. The definition is not arbitrary. Portfolio volatility $\sigma(w) = \sqrt{w^\top \Sigma w}$ is first-order homogeneous in $w$ (scaling all weights by $c$ scales $\sigma$ by $c$), so Euler's theorem for homogeneous functions guarantees the exact decomposition $\sigma(w) = \sum_i w_i\, \partial\sigma/\partial w_i = \sum_i RC_i$. The risk contributions are the unique additive partition of total volatility induced by Euler's identity: no approximation, no choice of decomposition.
The non-convex equal-contribution problem has an elegant convex reformulation:
$$\min_{y > 0}\ \tfrac{1}{2}\, y^\top \Sigma\, y - c \sum_{i=1}^{N} \ln y_i, \qquad c > 0.$$
Stationarity reads $(\Sigma y)_i = c / y_i$, i.e. $y_i (\Sigma y)_i = c$ for all $i$. Normalizing to $w = y / (\mathbf{1}^\top y)$ gives $w_i (\Sigma w)_i$ constant across $i$, which is exactly the ERC condition. The unique solution sits in the variance interval between minimum-variance and equal-weight, and like HRP requires no expected-return estimate. The institutional version of this is Bridgewater's All-Weather, levered up to a target volatility.
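The stationarity condition $y_i (\Sigma y)_i = c$ suggests a simple solver: hold all coordinates but one fixed and solve the resulting quadratic in $y_i$ in closed form. The sketch below uses this cyclical coordinate descent, a swapped-in technique chosen for brevity rather than the method of the cited paper. The covariance matrix, the sweep count, and the function name `erc_weights` are illustrative assumptions.

```python
import numpy as np

def erc_weights(cov: np.ndarray, c: float = 1.0, sweeps: int = 500) -> np.ndarray:
    """Equal-risk-contribution weights via coordinate descent on the log-barrier problem."""
    n = cov.shape[0]
    y = np.ones(n)
    for _ in range(sweeps):
        for i in range(n):
            # b = sum over j != i of cov[i, j] * y[j]
            b = cov[i] @ y - cov[i, i] * y[i]
            # Positive root of cov[i, i] * y_i**2 + b * y_i - c = 0.
            y[i] = (-b + np.sqrt(b * b + 4.0 * cov[i, i] * c)) / (2.0 * cov[i, i])
    return y / y.sum()

vols = np.array([0.10, 0.20, 0.30])              # made-up volatilities
corr = np.full((3, 3), 0.3)
np.fill_diagonal(corr, 1.0)
cov = corr * np.outer(vols, vols)

w = erc_weights(cov)
risk_contrib = w * (cov @ w) / np.sqrt(w @ cov @ w)
print("weights:           ", np.round(w, 4))
print("risk contributions:", np.round(risk_contrib, 6))
```

The printed risk contributions agree to numerical precision and sum to the portfolio volatility, which is Euler's decomposition in action. The constant `c` only rescales `y` and drops out after normalization.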
Wasserstein distributionally robust optimization. The Wasserstein-1 distance between two probability distributions $P$ and $Q$ is the minimum expected transportation cost to morph one into the other: $W_1(P, Q) = \inf_{\Pi}\, \mathbb{E}_{(\xi, \xi') \sim \Pi}\, \lVert \xi - \xi' \rVert$ over all couplings $\Pi$ with marginals $P$ and $Q$. The "earth-mover" image is exact: you literally move probability mass and pay distance per unit moved. Mohajerin Esfahani and Kuhn (Math. Program. 171, 2018, arXiv:1505.05116) solve
$$\min_{w}\ \sup_{Q \in \mathcal{B}_\varepsilon(\hat{P})}\ \mathbb{E}_{Q}\!\left[\ell(w, \xi)\right],$$
where $\mathcal{B}_\varepsilon(\hat{P})$ is the Wasserstein-1 ball of radius $\varepsilon$ around the empirical distribution $\hat{P}$: every distribution you could reach from your data by moving the observations a distance of at most $\varepsilon$ on average. The parameter $\varepsilon$ is the radius of adversarial scenarios you hedge against, and it shrinks as the sample size grows: more data, smaller realistic worst-case ball. For convex losses (mean-CVaR, log-utility) the inner sup reduces to a finite convex program by Kantorovich duality. As $\varepsilon \to 0$ you recover sample-Markowitz; as $\varepsilon \to \infty$ you recover equal-weight (the minimax-optimal portfolio when you know nothing about the true distribution). The intermediate regime is where the work happens.
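CVaR optimization. Rockafellar and Uryasev (J. Risk 2, 2000) collapse tail-risk minimization to a linear program via the dual representation
$$\mathrm{CVaR}_\alpha(X) = \min_{\tau \in \mathbb{R}} \left\{ \tau + \frac{1}{1 - \alpha}\, \mathbb{E}\!\left[(X - \tau)^+\right] \right\},$$
where $X$ is the portfolio loss and $(\cdot)^+ = \max(\cdot, 0)$.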
This is what made tail-risk optimization computationally tractable. Tail risk parity — equalizing CVaR contributions instead of variance contributions — is the natural HRP-style extension.
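The Rockafellar–Uryasev formula can be checked on a sample of loss scenarios. For an empirical distribution in which $(1 - \alpha)$ times the sample size is an integer $k$, the objective $\tau + \mathbb{E}[(X - \tau)^+] / (1 - \alpha)$ is minimized at the $k$-th largest loss, and its minimum equals the average of the $k$ largest losses. The sketch below verifies this on synthetic Gaussian losses; the sample size, seed, and function name `ru_objective` are illustrative.

```python
import numpy as np

def ru_objective(losses: np.ndarray, tau: float, alpha: float) -> float:
    """Rockafellar-Uryasev objective: tau + E[(loss - tau)^+] / (1 - alpha)."""
    return float(tau + np.maximum(losses - tau, 0.0).mean() / (1.0 - alpha))

rng = np.random.default_rng(4)
losses = np.sort(rng.standard_normal(10_000))    # synthetic loss scenarios, ascending
alpha = 0.95
k = int(round((1.0 - alpha) * losses.size))      # number of tail scenarios (500)

tail_mean = float(losses[-k:].mean())            # average of the worst 5% of losses
tau_star = float(losses[-k])                     # a minimizer: the empirical VaR level
cvar = ru_objective(losses, tau_star, alpha)

print(f"tail average:          {tail_mean:.4f}")
print(f"RU objective at tau*:  {cvar:.4f}")
print(f"RU objective at tau*+0.2: {ru_objective(losses, tau_star + 0.2, alpha):.4f}")
```

Because the objective is piecewise linear and convex in $\tau$ and in the portfolio weights, introducing one auxiliary variable per scenario for the positive part turns the minimization into a linear program.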
The frontier
Three threads dominate current research. Each one is a different attempt to step beyond what HRP can already do.
Signature methods in stochastic portfolio theory. Cuchiero, Schmocker, Teichmann (arXiv:2310.02322, 2023). The path signature is the universal feature representation for continuous paths from rough path theory — every tensor coefficient is an iterated integral of the path against itself. Replacing the static covariance summary with a richer signature-based summary lets you build linear path-functional portfolios that are dense in the space of continuous path-dependent portfolios (Lyons universality), and that reduce both log-wealth maximization and mean-variance optimization to convex quadratic programs with globally optimal solutions. The same machinery, applied to execution, gives Cartea–Jaimungal-style signature execution algorithms (arXiv:1905.00728). Recent work (arXiv:2510.10728, 2025) couples truncated log-signatures with neural rough differential equations for high-dimensional BSDE-based control.
Deep learning portfolios. Where HRP relies on a hand-specified clustering pipeline, deep methods learn the allocation function end-to-end. Zhang, Zohren, Roberts (arXiv:2005.13665, 2020) train an LSTM-plus-head to maximize realized Sharpe over a rolling window — no forecasting step, the network outputs weights directly. Reinforcement-learning portfolio agents (Jiang, Xu, Liang, arXiv:1706.10059) train policies that output a simplex action with an explicit log-return reward. Graph attention networks on correlation graphs (Korangi, Mues, Bravo, arXiv:2407.15532, 2024) — effectively a learned, attention-weighted version of Mantegna's correlation network — outperform classical MVO on S&P 500 universes at annualized return 16.8%, Sharpe 1.34, and 8.2% max drawdown, against an MVO baseline that typically lands at Sharpe well below 1.0 in the same regime.
Quantum portfolio optimization. Discrete Markowitz with integer position sizes (lot constraints, no fractional shares) is NP-complete; the QUBO encoding maps it to a transverse-field Ising Hamiltonian solvable on NISQ devices via QAOA. Benchmarking studies (arXiv:2509.17876) compare QAOA, quantum annealing, mixed-integer solvers, and metaheuristics on universes up to 1000 assets. Meaningful quantum advantage at scale remains an open question — current results are competitive but not yet dominant against classical solvers.
How to read the landscape
The HRP cumulative-return landscape is doing three jobs at once. It is collapsing the correlation structure, the return path, and the asset identity into a single surface that shows:
- Cluster structure. Continuous ridges marked by sharp drops at cluster boundaries. Three smooth bands = three sectors well separated by correlation. A jagged surface with no visible blocks = your data has no usable structure or your clustering choice is wrong.
- Factor regimes. A long monochromatic ridge running across the time axis = persistent factor exposure. An abrupt color shift halfway through = regime change.
- Crash events. A synchronous downstroke across all clusters = the market mode firing. A trough confined to one cluster = idiosyncratic sector pain.
What the picture is not showing: the weights. The allocation step happens after the visualization and is invisible in the surface. The 3D plot is diagnostic of the clustering; the weights are a downstream byproduct.
The practical question for any allocator is the same one that drives every approach in this article: which estimator of $\Sigma$ is wrong in the way that matters least for the trade you are sizing? Naive Markowitz is wrong because it inverts a noisy matrix. RIE-cleaned Markowitz is wrong because it assumes static covariance. Risk parity is wrong because it ignores expected returns entirely. HRP is wrong because single linkage chains noise into clusters. NCO is wrong because the cluster count is a hyperparameter. Schur is wrong because $\gamma$ is a hyperparameter. Deep portfolios are wrong because they overfit.
Each method is wrong in a different way. The HRP visualization is one of the few ways of seeing, at a glance, which of those wrong assumptions is most wrong for your data. A clean three-cluster landscape says: the block structure is real, single linkage caught it, HRP will be stable. A muddy speckled surface says: cluster the data with Ward instead, or shrink first, or use NCO with the cluster count $k$ chosen by silhouette.
The landscape is the model made visible. The wavy mountain encodes more diagnostic information about your assumptions than any single number a backtest would give you.