Math Primitives¶

This document outlines the core mathematical primitives implemented in src/maths/primitives.py. These primitives are rigorously designed for numerical stability and form the fundamental backbone of the Variational Linear Attention (VLA) system.

1. Inner-Product Score¶

The attention score $s_t$ at timestep $t$ is computed as the scaled dot product of a key vector $k_t \in \mathbb{R}^d$ and a query vector $q_t \in \mathbb{R}^d$:

\[ s_t = \frac{k_t^\top q_t}{\sqrt{d}} \]

Implementation Details:

Returns: A scalar value indicating query-key relevance.
Scaling: Includes the standard $1/\sqrt{d}$ scaling factor to mathematically constrain gradients from vanishing or exploding, completely analogous to standard scaled dot-product attention.
Safety Measures: Validates that all values are strictly finite (void of NaN or Inf), which is crucially required before undertaking subsequent inverse operations.

2. Sherman–Morrison Rank-1 Inverse Update¶

The core bottleneck of stably solving a linear system at every timestep is the matrix inversion. The Sherman-Morrison formula empowers VLA to surgically update the inverse of the penalty matrix $M_t$ when it is mathematically perturbed by a rank-1 update $u_t u_t^\top$.

Theoretical Foundation¶

Given:

$M_{t-1} \in \mathbb{R}^{d \times d}$: The prior penalty matrix (symmetric positive definite).
$A_{t-1} = M_{t-1}^{-1}$: The exact inverse of the prior penalty matrix.
$u_t \in \mathbb{R}^d$: The new update vector proposed by the PenaltyBuilder.

The updated penalty matrix naturally takes the form $M_t = M_{t-1} + u_t u_t^\top$. Computing $(M_t)^{-1}$ directly scales at $\mathcal{O}(d^3)$. However, by leveraging the Sherman-Morrison identify, we reconstruct the inverse update securely in $\mathcal{O}(d^2)$:

\[ A_t = A_{t-1} - \frac{A_{t-1} u_t u_t^\top A_{t-1}}{1 + u_t^\top A_{t-1} u_t} \]

Algorithmic Execution Steps¶

Compute Denominator (Scalar): $$ \delta = 1 + u_t^\top \left(A_{t-1} u_t\right) $$
Numerical Safety Enforcement: If the absolute value $|\delta| < \epsilon$, a strict fallback is triggered (adding $\epsilon I$) stopping division by zero or gradient detonation. (Default $\epsilon \approx 10^{-6}$).
Compute Intermediate Vector ($z \in \mathbb{R}^d$): $$ z = A_{t-1} u_t $$
Compute Outer Product ($O \in \mathbb{R}^{d \times d}$): $$ O = z z^\top $$
Final Rank-1 Update: $$ A_t = A_{t-1} - \frac{O}{\delta} $$
Periodic Stabilization: Every $K$ steps, add $\epsilon I$ onto the diagonal tensor of $A_t$ to correct accumulating floating-point drift. This represents a critical fix for extremely long context sequence regimes.

3. Multiple Rank-1 Updates (Woodbury Generalization)¶

To securely support higher-rank context parameterizations, we sequence iterating rank-1 updates $\{u_1, u_2, \dots, u_r\}$. This explicitly mirrors the mathematical equivalence of the Woodbury matrix identity for a rank-$r$ update—but iterating sequentially minimizes dangerous intermediate memory spikes.

\[\begin{split} \begin{aligned} A^{(0)} &= A_{t-1} \\ A^{(i)} &= \text{ShermanMorrison}\left(A^{(i-1)}, \ u_i\right) \quad \text{for } i \in \{1, \dots, r\} \\ A_t &= A^{(r)} \end{aligned} \end{split}\]

4. Recovering Optimal Coefficients $\alpha^*$¶

In standard Linear Attention, the global memory matrix $S_t$ is linearly updated using static $v_t k_t^\top$. In sharp contrast, VLA computes a globally optimal scaling vector $\alpha_t$ that completely minimizes the associative reconstruction error of the value vector $v_t$ bounded strictly by the active penalty $M_t$.

As VLA naturally tracks the inverted formulation $A_t = M_t^{-1}$, generating the theoretical ground-truth optimum $\alpha^* = M_t^{-1} s_t$ simplifies exponentially into a single matrix-vector hardware product:

\[ \alpha_t = A_t s_t \]

5. Memory Matrix Update¶

Having securely arrived at the optimal coefficient map $\alpha_t$, the global memory matrix $S_t$ updates to ingest new contextual information:

\[ S_t = S_{t-1} + \alpha_t \otimes \left(v_t k_t^\top\right) \]