# Advancing mathematics by guiding human intuition with AI

### Framework

#### Supervised learning

In the supervised learning stage, the mathematician proposes a hypothesis that there exists a relationship between X(z) and Y(z). In this work we assume that there is no known function mapping from X(z) to Y(z), which in turn implies that X is not invertible (otherwise there would exist a known function Y ° X−1). While there may still be value to this process when the function is known, we leave this for future work. To test the hypothesis that X and Y are related, we generate a dataset of X(z), Y(z) pairs, where z is sampled from a distribution PZ. The results of the subsequent stages will hold true only for the distribution PZ, and not the whole space Z. Initially, sensible choices for PZ would be, for example, uniformly over the first N items for Z with a notion of ordering, or uniformly at random where possible. In subsequent iterations, PZ may be chosen to understand the behaviour on different parts of the space Z (for example, regions of Z that may be more likely to provide counterexamples to a particular hypothesis). To first test whether a relation between X(z) and Y(z) can be found, we use supervised learning to train a function (hat{f}) that approximately maps X(z) to Y(z). In this work we use neural networks as the supervised learning method, in part because they can be easily adapted to many different types of X and Y and knowledge of any inherent geometry (in terms of invariances and symmetries) of the input domain X can be incorporated into the architecture of the network37. We consider a relationship between X(z) and Y(z) to be found if the accuracy of the learned function (hat{f}) is statistically above chance on further samples from PZ on which the model was not trained. The converse is not true; namely, if the model cannot predict the relationship better than chance, it may mean that a pattern exists, but is sufficiently complicated that it cannot be captured by the given model and training procedure. If it does indeed exist, this can give a mathematician confidence to pursue a particular line of enquiry in a problem that may otherwise be only speculative.

If a relationship is found, the attribution stage is to probe the learned function (hat{f}) with attribution techniques to further understand the nature of the relationship. These techniques attempt to explain what features or structures are relevant to the predictions made by (hat{f}), which can be used to understand what parts of the problem are relevant to explore further. There are many attribution techniques in the body of literature on machine learning and statistics, including stepwise forward feature selection38, feature occlusion and attention weights39. In this work we use gradient-based techniques40, broadly similar to sensitivity analysis in classical statistics and sometimes referred to as saliency maps. These techniques attribute importance to the elements of X(z), by calculating how much (hat{f}) changes in predictions of Y(z) given small changes in X(z). We believe these are a particularly useful class of attribution techniques as they are conceptually simple, flexible and easy to calculate with machine learning libraries that support automatic differentiation41,42,43. Information extracted via attribution techniques can then be useful to guide the next steps of mathematical reasoning, such as conjecturing closed-form candidates f′, altering the sampling distribution PZ or generating new hypotheses about the object of interest z, as shown in Fig. 1. This can then lead to an improved or corrected version of the conjectured relationship between these quantities.

### Topology

#### Problem framing

Not all knots admit a hyperbolic geometry; however, most do, and all knots can be constructed from hyperbolic and torus knots using satellite operations44. In this work we focus only on hyperbolic knots. We characterize the hyperbolic structure of the knot complement by a number of easily computable invariants. These invariants do not fully define the hyperbolic structure, but they are representative of the most commonly interesting properties of the geometry. Our initial general hypothesis was that the hyperbolic invariants would be predictive of algebraic invariants. The specific hypothesis we investigated was that the geometry is predictive of the signature. The signature is an ideal candidate as it is a well-understood and common invariant, it is easy to calculate for large knots and it is an integer, which makes the prediction task particularly straightforward (compared to, for example, a polynomial).

#### Data generation

We generated a number of datasets from different distributions PZ on the set of knots using the SnapPy software package45, as follows.

1. 1.

All knots up to 16 crossings (1.7 × 106 knots), taken from the Regina census46.

2. 2.

Random knot diagrams of 80 crossings generated by SnapPy’s random_link function (106 knots). As random knot generation can potentially lead to duplicates, we calculate a large number of invariants for each knot diagram and remove any samples that have identical invariants to a previous sample, as they are likely to represent that same knot with very high probability.

3. 3.

Knots obtained as the closures of certain braids. Unlike the previous two datasets, the knots that were produced here are not, in any sense, generic. Instead, they were specifically constructed to disprove Conjecture 1. The braids that we used were 4-braids (n = 11,756), 5-braids (n = 13,217) and 6-braids (n = 10,897). In terms of the standard generators σi for these braid groups, the braids were chosen to be (({sigma }_{{i}_{1}}^{{n}_{1}}{sigma }_{{i}_{2}}^{{n}_{2}}…{sigma }_{{i}_{k}}^{{n}_{k}}{)}^{N}) . The integers ij were chosen uniformly at random for the appropriate braid group. The powers nj were chosen uniformly at random in the ranges [−10, −3] and [3, 10]. The final power N was chosen uniformly between 1 and 10. The quantity ∑ |ni| was restricted to be at most 15 for 5-braids and 6-braids and 12 for 4-braids, and the total number of crossings N ∑ |ni| was restricted to lie in the range between 10 and 60. The rationale for these restrictions was to ensure a rich set of examples that were small enough to avoid an excessive number of failures in the invariant computations.

For the above datasets, we computed a number of algebraic and geometric knot invariants. Different datasets involved computing different subsets of these, depending on their role in forming and examining the main conjecture. Each of the datasets contains a subset of the following list of invariants: signature, slope, volume, meridional translation, longitudinal translation, injectivity radius, positivity, Chern–Simons invariant, symmetry group, hyperbolic torsion, hyperbolic adjoint torsion, invariant trace field, normal boundary slopes and length spectrum including the linking numbers of the short geodesics.

The computation of the canonical triangulation of randomly generated knots fails in SnapPy in our data generation process in between 0.6% and 1.7% of the cases, across datasets. The computation of the injectivity radius fails between 2.8% of the time on smaller knots up to 7.8% of the time on datasets of knots with a higher number of crossings. On knots up to 16 crossings from the Regina dataset, the injectivity radius computation failed in 5.2% of the cases. Occasional failures can occur in most of the invariant computations, in which case the computations continue for the knot in question for the remaining invariants in the requested set. Additionally, as the computational complexity of some invariants is high, operations can time out if they take more than 5 min for an invariant. This is a flexible bound and ultimately a trade-off that we have used only for the invariants that were not critical for our analysis, to avoid biasing the results.

#### Data encoding

The following encoding scheme was used for converting the different types of features into real valued inputs for the network: reals directly encoded; complex numbers as two reals corresponding to the real and imaginary parts; categoricals as one-hot vectors.

All features are normalized by subtracting the mean and dividing by the variance. For simplicity, in Fig. 3a, the salience values of categoricals are aggregated by taking the maximum value of the saliencies of their encoded features.

#### Model and training procedure

The model architecture used for the experiments was a fully connected, feed-forward neural network, with hidden unit sizes [300, 300, 300] and sigmoid activations. The task was framed as a multi-class classification problem, with the distinct values of the signature as classes, cross-entropy loss as an optimizable loss function and test classification accuracy as a metric of performance. It is trained for a fixed number of steps using a standard optimizer (Adam). All settings were chosen as a priori reasonable values and did not need to be optimized.

#### Process

First, to assess whether there may be a relationship between the geometry and algebra of a knot, we trained a feed-forward neural network to predict the signature from measurements of the geometry on a dataset of randomly sampled knots. The model was able to achieve an accuracy of 78% on a held-out test set, with no errors larger than ±2. This is substantially higher than chance (a baseline accuracy of 25%), which gave us strong confidence that a relationship may exist.

To understand how this prediction is being made by the network, we used gradient-based attribution to determine which measurements of the geometry are most relevant to the signature. We do this using a simple sensitivity measure ri that averages the gradient of the loss L with respect to a given input feature xi over all of the examples x in a dataset ({mathscr{X}}):

$${{bf{r}}}_{i}=frac{1}{|{mathscr{X}}|}{sum }_{{bf{x}}in {mathscr{X}}}|frac{partial L}{partial {{bf{x}}}_{i}}|$$

(3)

This quantity for each input feature is shown in Fig. 3a, where we can determine that the relevant measurements of the geometry appear to be what is known as the cusp shape: the meridional translation, which we will denote μ, and the longitudinal translation, which we will denote λ. This was confirmed by training a new model to predict the signature from only these three measurements, which was able to achieve the same level of performance as the original model.

To confirm that the slope is a sufficient aspect of the geometry to focus on, we trained a model to predict the signature from the slope alone. Visual inspection of the slope and signature in Extended Data Fig. 1a, b shows a clear linear trend, and training a linear model on this data results in a test accuracy of 78%, which is equivalent to the predictive power of the original model. This implies that the slope linearly captures all of the information about the signature that the original model had extracted from the geometry.

#### Evaluation

The confidence intervals on the feature saliencies were calculated by retraining the model 10 times with a different train/test split and a different random seed initializing both the network weights and training procedure.

### Representation theory

#### Data generation

For our main dataset we consider the symmetric groups up to S9. The first symmetric group that contains a non-trivial Bruhat interval whose KL polynomial is not simply 1 is S5, and the largest interval in S9 contains 9! ≈ 3.6 × 105 nodes, which starts to pose computational issues when used as inputs to networks. The number of intervals in a symmetric group SN is O(N!2), which results in many billions of intervals in S9. The distribution of coefficients of the KL polynomials uniformly across intervals is very imbalanced, as higher coefficients are especially rare and associated with unknown complex structure. To adjust for this and simplify the learning problem, we take advantage of equivalence classes of Bruhat intervals that eliminate many redundant small polynomials47. This has the added benefit of reducing the number of intervals per symmetric group (for example, to ~2.9 million intervals in S9). We further reduce the dataset by including a single interval for each distinct KL polynomial for all graphs with the same number of nodes, resulting in 24,322 non-isomorphic graphs for S9. We split the intervals randomly into train/test partitions at 80%/20%.

#### Data encoding

The Bruhat interval of a pair of permutations is a partially ordered set of the elements of the group, and it can be represented as a directed acyclic graph where each node is labelled by a permutation, and each edge is labelled by a reflection. We add two features at each node representing the in-degree and out-degree of that node.

#### Model and training procedure

For modelling the Bruhat intervals, we used a particular GraphNet architecture called a message-passing neural network (MPNN)48. The design of the model architecture (in terms of activation functions and directionality) was motivated by the algorithms for computing KL polynomials from labelled Bruhat intervals. While labelled Bruhat intervals contain privileged information, these algorithms hinted at the kind of computation that may be useful for computing KL polynomial coefficients. Accordingly, we designed our MPNN to algorithmically align to this computation49. The model is bi-directional, with a hidden layer width of 128, four propagation steps and skip connections. We treat the prediction of each coefficient of the KL polynomial as a separate classification problem.

#### Process

First, to gain confidence that the conjecture is correct, we trained a model to predict coefficients of the KL polynomial from the unlabelled Bruhat interval. We were able to do so across the different coefficients with reasonable accuracy (Extended Data Table 1) giving some evidence that a general function may exist, as a four-step MPNN is a relatively simple function class. We trained a GraphNet model on the basis of a newly hypothesized representation and could achieve significantly better performance, lending evidence that it is a sufficient and helpful representation to understand the KL polynomial.

To understand how the predictions were being made by the learned function (hat{f}), we used gradient-based attribution to define a salient subgraph SG for each example interval G, induced by a subset of nodes in that interval, where L is the loss and xv is the feature for vertex v:

$${S}_{G}={vin G||frac{partial L}{partial {x}_{v}}| > {C}_{k}}$$

(4)

We then aggregated the edges by their edge type (each is a reflection) and compared the frequency of their occurrence to the overall dataset. The effect on extremal edges was present in the salient subgraphs for predictions of the higher-order terms (q3, q4), which are the more complicated and less well-understood terms.

#### Evaluation

The threshold Ck for salient nodes was chosen a priori as the 99th percentile of attribution values across the dataset, although the results are present for different values of Ck in the range [95,  99.5]. In Fig. 5a, we visualize a measure of edge attribution for a particular snapshot of a trained model for expository purposes. This view will change across time and random seeds, but we can confirm that the pattern remains by looking at aggregate statistics over many runs of training the model, as in Fig. 5b. In this diagram, the two-sample two-sided t-test statistics are as follows—simple edges: t = 25.7, P = 4.0 × 10−10; extremal edges: t = −13.8, P = 1.1 × 10−7; other edges: t = −3.2, P = 0.01. These significance results are robust to different settings of the hyper-parameters of the model.