Neural heterogeneity promotes robust learning

Nicolas Perez-Nieves\(^{1\ast}\), Vincent C. H. Leung\(^{1}\), Pier Luigi Dragotti\(^{1}\), Dan F. M. Goodman\(^{1}\)

\(^{1}\)Department of Electrical and Electronic Engineering, Imperial College London, UK

\(^\ast\)To whom correspondence should be addressed; E-mail: nicolas.perez14@imperial.ac.uk

Abstract.

The brain has a hugely diverse, heterogeneous structure. Whether or not heterogeneity at the neural level plays a functional role remains unclear, and has been relatively little explored in models which are often highly homogeneous. We compared the performance of spiking neural networks trained to carry out tasks of real-world difficulty, with varying degrees of heterogeneity, and found that it substantially improved task performance. Learning was more stable and robust, particularly for tasks with a rich temporal structure. In addition, the distribution of neuronal parameters in the trained networks closely matches those observed experimentally. We suggest that the heterogeneity observed in the brain may be more than just the byproduct of noisy processes, but rather may serve an active and important role in allowing animals to learn in changing environments.

Summary. Neural heterogeneity is metabolically efficient for learning, and the optimal parameter distributions match experimental data.

Introduction

The brain is known to be deeply heterogeneous at all scales (Koch and Laurent 1999), but it is still not known whether this heterogeneity plays an important functional role or whether it is just a byproduct of noisy developmental processes and contingent evolutionary history. A number of hypothetical roles have been suggested (reviewed in Gjorgjieva, Drion, and Marder 2016), including roles in efficient coding (Shamir and Sompolinsky 2006; Chelaru and Dragoi 2008; Osborne et al. 2008; Marsat and Maler 2010; Padmanabhan and Urban 2010; Hunsberger, Scott, and Eliasmith 2014; Zeldenrust, Gutkin, and Denève 2019), reliability (Lengler, Jug, and Steger 2013), working memory (Kilpatrick, Ermentrout, and Doiron 2013), and functional specialisation (Duarte and Morrison 2019). However, previous studies have largely used simplified tasks or networks, and it remains unknown whether heterogeneity can help animals solve complex information processing tasks in natural environments. Recent work has allowed us, for the first time, to train biologically realistic spiking neural networks to carry out these tasks at a high level of performance, using methods derived from machine learning. We used two different learning models (Nicola and Clopath 2017; Neftci, Mostafa, and Zenke 2019) to investigate the effect of introducing heterogeneity in the time scales of neurons when performing tasks with realistic and complex temporal structure. We found that it improves overall performance, makes learning more stable and robust, and leads to neural parameter distributions that match experimental observations, suggesting that the heterogeneity observed in the brain may be a vital component of its ability to adapt to new environments.

Results

Time scale heterogeneity improves learning on tasks with rich temporal structure

We investigated the role of neural heterogeneity in task performance by training recurrent spiking neural networks to classify visual and auditory stimuli with varying degrees of temporal structure. The model used three layers of spiking neurons: an input layer, a recurrently connected layer, and a readout layer used to generate predictions (Figure 1A), a widely used minimal architecture (e.g. Maass, Natschläger, and Markram 2002; Neftci, Mostafa, and Zenke 2019). Heterogeneity was introduced by giving each neuron an individual membrane and synaptic time constant. We compared four different conditions: initial values could be either homogeneous or heterogeneous, and training could be either standard or heterogeneous (Figure 1B). Time constants were either initialised with a single value (homogeneous initialisation) or randomly according to a gamma distribution (heterogeneous initialisation). The models were trained using surrogate gradient descent (Neftci, Mostafa, and Zenke 2019). Synaptic weights were always plastic, while time constants were either held fixed at their initial values in the standard training regime or could be modified in the heterogeneous training regime.
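As a concrete illustration, the sketch below shows one way the per-neuron time constants and the corresponding trainable forgetting factors might be set up in PyTorch. The helper name `init_time_constants` and the specific constants are ours, not taken from the released code.

```python
import torch

def init_time_constants(n_neurons, tau_mean=20e-3, heterogeneous=True, shape=3.0):
    """Hypothetical helper: draw per-neuron time constants.

    Heterogeneous initialisation samples from Gamma(3, tau_mean/3)
    (mean tau_mean, as in Table 2); homogeneous initialisation gives
    every neuron the same value.
    """
    if heterogeneous:
        gamma = torch.distributions.Gamma(shape, shape / tau_mean)
        return gamma.sample((n_neurons,))
    return torch.full((n_neurons,), tau_mean)

dt = 0.5e-3                                       # simulation time step (Table 3)
tau_m = init_time_constants(128, tau_mean=20e-3)  # membrane time constants
tau_s = init_time_constants(128, tau_mean=10e-3)  # synaptic time constants

# Forgetting factors. Wrapping them in nn.Parameter makes them trainable
# (heterogeneous training); plain tensors correspond to standard training,
# where only synaptic weights are learned.
beta = torch.nn.Parameter(torch.exp(-dt / tau_m))
alpha = torch.nn.Parameter(torch.exp(-dt / tau_s))
```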

We used five different datasets with varying degrees of temporal structure. Neuromorphic MNIST (N-MNIST; Orchard et al. 2015), Fashion-MNIST (F-MNIST; Xiao, Rasul, and Vollgraf 2017), and the DVS128 Gesture dataset (Amir et al. 2017) feature visual stimuli, while the Spiking Heidelberg Digits (SHD) and Spiking Speech Commands (SSC) datasets (Cramer et al. 2020) are auditory. N-MNIST and DVS128 use a neuromorphic vision sensor to generate spiking activity, either by moving the sensor in front of a static image of a handwritten digit (N-MNIST) or by recording humans making hand gestures (DVS128). F-MNIST is a dataset of static images that is widely used in machine learning, which we converted into spike times by treating the image intensities as input currents to model neurons, so that higher intensity pixels lead to earlier spikes and lower intensities to later spikes. Both SHD and SSC use a detailed model of the activity of bushy cells in the cochlear nucleus, in response to spoken digits (SHD) or commands (SSC). Of these datasets, N-MNIST and F-MNIST have minimal temporal structure, as they are generated from static images. DVS128 has some temporal structure as it is recorded motion, but it is possible to perform well at this task by discarding the temporal information. The auditory tasks SHD and SSC, by contrast, have very rich temporal structure.
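For intuition, here is a minimal sketch of an intensity-to-latency conversion of the kind described for F-MNIST: a constant current equal to the pixel intensity drives a simple LIF membrane, and the threshold-crossing time gives the spike latency. The exact conversion used in the paper may differ; the threshold and time constant here are illustrative.

```python
import numpy as np

def intensity_to_latency(image, tau=20e-3, threshold=0.2, t_max=100e-3):
    """Map pixel intensities to first-spike times (illustrative values).

    A constant current I charges a LIF membrane, U(t) = I*(1 - exp(-t/tau)),
    which crosses `threshold` at t = tau * log(I / (I - threshold)),
    so brighter pixels spike earlier.
    """
    I = image.astype(float).ravel() / image.max()
    times = np.full(I.shape, np.inf)   # inf = pixel never spikes
    driven = I > threshold             # subthreshold pixels stay silent
    times[driven] = tau * np.log(I[driven] / (I[driven] - threshold))
    times[times > t_max] = np.inf      # discard very late spikes
    return times
```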

We found that heterogeneity in time constants had a profound impact on performance on those datasets where information was encoded in the precise timing of input spikes (Table 1, Figure 2A). On the most temporally complex auditory tasks, accuracy improved by around 15-20%, while for the least temporally complex task, N-MNIST, we saw no improvement at all. For the gesture dataset DVS128 we can identify the source of the (intermediate) improvement as the heterogeneous models being better able to distinguish between spatially similar but temporally different gestures, such as clockwise and anticlockwise versions of the same gesture (see Supp. 6A-B). This suggests that we might see greater improvements for a richer version of this dataset in which temporal structure was more important.

Table 1: Testing accuracy (%) across datasets and training methods. Effect of initialisation and training configuration on performance, on datasets of increasing temporal complexity. Initialisation can be homogeneous (all time constants the same) or heterogeneous (random initialisation), and training can be standard (only synaptic weights learned) or heterogeneous (time constants can also be learned). N-MNIST and F-MNIST are static image datasets with little temporal structure, DVS128 is video gestures, and SHD and SSC are temporally complex auditory datasets.

| Initialisation | Training | N-MNIST | F-MNIST | DVS128 | SHD | SSC |
|---|---|---|---|---|---|---|
| Homog. | Standard | \(97.4 \!\pm\! 0.0\) | \(80.1 \!\pm\! 7.4\) | \(76.9 \!\pm\! 0.8\) | \(71.7 \!\pm\! 1.0\) | \(49.7 \!\pm\! 0.4\) |
| Heterog. | Standard | \(97.5 \!\pm\! 0.0\) | \(87.9 \!\pm\! 0.1\) | \(79.5 \!\pm\! 1.0\) | \(73.7 \!\pm\! 1.1\) | \(53.7 \!\pm\! 0.7\) |
| Homog. | Heterog. | \(96.6 \!\pm\! 0.2\) | \(79.7 \!\pm\! 7.4\) | \(81.2 \!\pm\! 0.8\) | \(82.7 \!\pm\! 0.8\) | \(56.6 \!\pm\! 0.7\) |
| Heterog. | Heterog. | \(97.3 \!\pm\! 0.1\) | \(87.5 \!\pm\! 0.1\) | \(82.1 \!\pm\! 0.8\) | \(81.7 \!\pm\! 0.8\) | \(60.1 \!\pm\! 0.7\) |
| Chance level | | 10.0 | 10.0 | 10.0 | 5.0 | 2.9 |

We verified that our results were due to heterogeneity and not simply to a better tuning of time constants in two ways. Firstly, we performed a grid search across homogeneous time constants for the SHD dataset and used the best values for our comparison. Secondly, we observed that the distribution of time constants after training was very similar and heterogeneous regardless of whether it was initialised with a homogeneous or heterogeneous distribution (Figure 2B), indicating that the heterogeneous distribution is optimal.

Introducing heterogeneity allows for a large increase in performance at the cost of only a very small increase in the number of parameters (0.23% for SHD, because the vast majority of parameters are synaptic weights), and without using any additional neurons or synapses. Heterogeneity is therefore a metabolically efficient strategy. It is also a computationally efficient strategy of interest to neuromorphic computing, because adding heterogeneity adds \(O(n)\) to memory use and computation time, while adding more neurons adds \(O(n^2)\). Further, in some neuromorphic systems like BrainScaleS this heterogeneity is already present as part of the manufacturing process (Schmitt et al. 2017).

Note that it is possible to obtain better performance using a larger number of neurons. For example, Neftci, Mostafa, and Zenke (2019) obtained a performance of 83.2% on the SHD dataset without heterogeneity using 1024 neurons and data augmentation techniques, whereas we obtained 82.7% using 128 neurons and no data augmentation. We focus on smaller networks here for two reasons. Firstly, we wanted to systematically investigate the effect of different training regimes, and current limitations of surrogate gradient descent mean that each training session takes several days. Secondly, with larger numbers of neurons, performance even without heterogeneity approaches the ceiling on these tasks (which are still simple in comparison to those faced by animals in real environments), making it more difficult to see the effect of different architectures. Even with this limitation to small networks, heterogeneity confers such an advantage that our results for the SSC dataset are state of the art (for spiking neural networks) by a large margin.

We also tested the effect of introducing heterogeneity of other neuron parameters, such as the firing threshold and reset potential, but found that it made no appreciable difference. This was because for our model, changing these is almost equivalent to a simple scaling of the membrane potential variable. By contrast, Bellec et al. (2019) found that introducing an adaptive threshold did improve performance, presumably because it allows for much richer temporal dynamics.

Predicted time constant distributions match experimental data

In all our tasks, the distributions of time constants after training approximately, but not exactly, fit a log-normal or gamma distribution (with different parameters for each task), and are consistent across different training runs (Supplementary Materials 4 and 5), suggesting that the learned distributions may be optimal. Using publicly available datasets of time constants recorded in large numbers of neurons in different animals and brain regions (P. B. Manis, Kasten, and Xie 2019; P. Manis, Kasten, and Xie 2019; Lein et al. 2007; Hawrylycz et al. 2012), we found very similar distributions to those we predicted (Figure 2C). The parameters of these distributions are different for each animal and region, just as for different tasks in our simulations. Interestingly, the distribution parameters are also different for each cell type in the experimental data, a feature not replicated in our simulations, in which all cells are identical. This suggests that introducing further diversity in terms of different cell types may lead to even better performance.

Heterogeneity improves speech learning across time scales

Sensory signals such as speech and motion can be recognised across a range of speeds. We tested the role of heterogeneity in learning a circuit that can function across a wide range of speeds. We augmented the SHD spoken digits dataset to include faster or slower versions of the samples, multiplying all spike times by a temporal scale, as an extremely simplified model that captures a part of the difficulty of this task (Figure 2D). During training, temporal scales were randomly selected from a distribution roughly matching human syllabic rate distributions (Lerner et al. 2014). The heterogeneous configurations performed as well as or better than the homogeneous ones at all time scales (Figure 2E), and in particular were able to generalise better to time scales outside the training distribution (e.g. accuracy of 47% close to a time scale of 4, around 7% higher than the homogeneous network, where chance performance would be 5%). Heterogeneous initialisation alone was sufficient to achieve this better generalisation, while fine-tuning the distribution of time constants with heterogeneous training improved peak performance but gave no additional ability to generalise.
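A minimal sketch of this augmentation, assuming each sample is stored as an array of spike times; the log-normal scale distribution here is only a stand-in for the syllabic-rate distribution used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def rescale_sample(spike_times, scale):
    """Slow a sample down (scale > 1) or speed it up (scale < 1) by
    multiplying every spike time by a temporal scale factor."""
    return spike_times * scale

def augmented_batch(samples, sigma=0.3):
    """Draw one random scale per sample; the log-normal here is a
    stand-in for the syllabic-rate distribution used in the paper."""
    scales = rng.lognormal(mean=0.0, sigma=sigma, size=len(samples))
    return [rescale_sample(s, k) for s, k in zip(samples, scales)]
```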


Heterogeneity improves robustness against mistuned learning

We tested the hypothesis that heterogeneity can provide robustness in two experiments in which the hyperparameters were mistuned: that is, the initial distributions and learning parameters were chosen to give the best performance for one distribution, but the actual training and testing were done on a different distribution.

In the first experiment, we took the augmented SHD spoken digits dataset from the previous section, selected the hyperparameters to give the best performance at a time scale of 1, but then trained and tested the network at a different time scale (Figure 2F). We trained the weights only, since allowing retraining of the time constants would let the network cancel out the effect of changing the time scale. With a homogeneous or narrow heterogeneous initial distribution, performance falls off for time scales far from the optimal one, particularly for larger time scales (slower stimuli). However, a wide heterogeneous initial distribution allows for good performance across all time scales tested, at the cost of slightly lower peak performance at the best time scale. We tested whether this was solely due to the presence of a few neurons with long time constants by introducing an intermediate distribution in which the majority of time constants were the same as in the homogeneous case, with a small minority of much longer time constants. The performance of this distribution was intermediate between the homogeneous and heterogeneous distributions, a pattern that is repeated in our second experiment below.

In the second experiment, we switched to a very different learning paradigm, FORCE training of spiking neural networks (Nicola and Clopath 2017), to replay a learned signal, in this case a recording of a zebra finch call (Figure 3A; from Blättler and Hahnloser 2011). This method does not allow for heterogeneous training (tuning time constants), so we only tested the role of untrained heterogeneous neuron parameters. We tested three configurations: fully homogeneous (a single 20ms time constant, as in the original paper); intermediate (each neuron randomly assigned a fixed fast 20ms or slow 100ms time constant); and fully heterogeneous (each neuron randomly assigned a time constant drawn from a gamma distribution).

Nicola and Clopath (2017) showed that network performance is highly dependent on two hyperparameters (\(G\) and \(Q\) in their paper). We therefore tuned these hyperparameters for a network of a fixed size (\(N\!=\!1000\) neurons) and ran the training and testing for networks of different sizes (Figure 3B). As the network size diverged from the tuned value, the homogeneous network began to make large errors, while the fully heterogeneous network was able to give low errors for all network sizes. The intermediate network functioned well across a wider range of network sizes than the fully homogeneous network, but still eventually failed for the largest network sizes. At these large network sizes, the homogeneous network becomes saturated, leading to poor performance (Figure 3C). The robustness of the heterogeneous version of the network can be measured by the area of the hyperparameter space that leads to good performance (Figure 3D). Adding partial or full heterogeneity leads to an improvement in learning for all points in the hyperparameter space, again suggesting that it can improve the robustness of learning in a wide range of situations.

Discussion

We trained spiking neural networks on difficult classification tasks, either forcing all time constants to be the same (homogeneous) or allowing them to be different (heterogeneous). We found that introducing heterogeneity improved overall performance across a range of tasks and training methods, but particularly so on tasks with richer intrinsic temporal structure. Learning was more robust, in that the networks were able to learn across a range of different environments, and when the hyperparameters of learning were mistuned. When the learning rule was allowed to tune the time constants as well as the synaptic weights, a consistent distribution of time constants was found, akin to a log-normal or gamma distribution, and this qualitatively matched time constants measured in experimental data. Note that we do not claim that the nervous system tunes time constants during its lifetime to optimise task performance; we only use our methods to find the optimal distribution of time constants.

We conclude from this that neural heterogeneity is a metabolically efficient strategy for the brain. Heterogeneous networks have no additional cost in terms of the number of neurons or synapses, and perform as well as homogeneous networks which have an order of magnitude more neurons. This gain also extends to neuromorphic computing systems, as adding heterogeneity to the neuron model adds an additional time and memory cost of only \(O(n)\), while adding more neurons has a cost of \(O(n^2)\). In addition to their overall performance being better, heterogeneous networks are more robust and able to learn across a wider range of environments, which is clearly ethologically advantageous. Again, this has a corresponding benefit to neuromorphic computing and potentially machine learning more generally, in that it reduces the cost of hyperparameter tuning, which is often one of the largest costs for developing these models.

The question remains as to the extent of time constant tuning in real nervous systems. It could be the case that the heterogeneous distribution of time constants observed in different animals and brain regions (Figure 2C) is simply a product of noisy developmental processes. This may be so, but our results show that these distributions closely match the optimal ones found by simulation, which confer a substantial computational advantage, and it therefore seems likely that the brain makes use of this advantage. We found that any degree of heterogeneity improves performance, but that the best performance is found by tuning the distribution of time constants to match the task. Without a more detailed model of these specific brain regions and the tasks they solve, it is difficult to conclude whether or not the precise distributions observed are tuned to those tasks, and indeed a less precisely tuned distribution may lead to greater robustness in uncertain environments.

A number of studies have used heterogeneous or tunable time constants (Fang et al. 2020; Quax, D’Asaro, and Gerven 2020; Yin, Corradi, and Bohté 2020), but these have generally focussed on maximising performance for neuromorphic applications, without considering the potential role in real nervous systems. In particular, we have shown that: heterogeneity is particularly important for the type of temporally complex tasks faced in real environments, as compared to the static ones often considered in machine learning; heterogeneity confers robustness, allowing for learning in a wide range of environments; optimal distributions of time constants are consistent across training runs and match experimental data; and our results are not specific to a particular task or training method.

The methods used here are very computationally demanding, and this has limited us to investigating very small networks (hundreds of neurons). Indeed, we estimate that in the preparation of this paper we used approximately 2 years of GPU computing. Finding new algorithms to allow us to scale these methods to larger networks will be a critical task for the field.

Beyond this, it would be interesting to see to what extent different forms of heterogeneity confer other advantages, such as spatial heterogeneity as well as temporal. We observed that in the brain, different cell types have different stereotyped distributions of time constants, and it would be interesting to extend our methods to networks with multiple cell types, including more biophysically detailed cell models.

Our computational results show a compelling advantage for heterogeneity, and this makes intuitive sense. Having heterogeneous time constants in a layer allows the network to integrate incoming spikes at different time scales, corresponding to shorter or longer memory traces, thus allowing the readout layer to capture information at several time scales and represent a richer set of functions. It would be very valuable to extend this line of thought and find a rigorous theoretical explanation of the advantage of heterogeneity.

Code availability

The code will be made publicly available after publication at https://github.com/npvoid/neural_heterogeneity.

Data availability

All datasets used in this work can be publicly accessed via the following links:

Spiking datasets

Neural data

Audio files

Competing Interests Declaration

We declare that none of the authors have competing financial or non-financial interests as defined by Nature Research.

Acknowledgments

We are immensely grateful to the Allen Institute and Paul Manis for publicly sharing their databases that allowed us to estimate time constant distributions in the brain. Releasing this data is not only immensely generous and essential for this work, but more generally it accelerates the pace of science and represents an optimistic vision of the future.

Materials and Methods

Neuron and synaptic models

We use the Leaky Integrate and Fire (LIF) neuron model in all our simulations. In this model the membrane potential of the \(i\)-th neuron in the \(l\)-th layer \(U_i^{(l)}(t)\) varies over time following [eq:mem].

\[\label{eq:mem} \tau_m \dot{U}_i^{(l)} = -(U_i^{(l)}-U_{0}) + I_i^{(l)}\]

Here, \(\tau_m\) is the membrane time constant, \(U_{0}\) is the resting potential and \(I_i^{(l)}\) is the input current. When the membrane potential reaches the threshold value \(U_{th}\), a spike is emitted, \(U_i^{(l)}(t)\) resets to the reset potential \(U_{r}\), and the neuron enters a refractory period lasting \(t_{ref}\) seconds during which it cannot spike.

Spikes emitted by the \(j\)-th neuron in layer \(l\!-\!1\) at a finite set of times \(\{t_j^{(k)}\}\) can be formalised as a spike train \(S_j^{(l-1)}(t)\), defined as in [eq:spiketrain]

\[\label{eq:spiketrain} S_j^{(l-1)}(t) = \sum_k \delta(t-t_j^{(k)})\]

The input current \(I_i^{(l)}\) is obtained from the spike trains of all presynaptic neurons \(j\) connected to neuron \(i\) following [eq:syn]

\[\label{eq:syn} \tau_{s}\dot{I}_i^{(l)} = -I_i^{(l)}(t) + \sum_j W_{ij}^{(l)}S_j^{(l-1)}(t) + \sum_j V_{ij}^{(l)}S_j^{(l)}(t)\]

Here \(\tau_s\) is the synaptic time constant, \(W_{ij}^{(l)}\) is the feed-forward synaptic weight from neuron \(j\) in layer \(l\!-\!1\) to neuron \(i\) in layer \(l\) and \(V_{ij}^{(l)}\) is the recurrent weight from neuron \(j\) in layer \(l\) to neuron \(i\) in layer \(l\).

Thus, a LIF neuron is fully defined by six parameters \(\tau_m, \tau_s, U_{th}, U_{0}, U_{r}, t_{ref}\) plus its synaptic weights \(W_{ij}^{(l)}\) and \(V_{ij}^{(l)}\). We refer to these as the neuron parameters and weights respectively.

Since we consider cases where these parameters may be different for each neuron in the population, we should strictly write \(\tau_{m, i}\), \(\tau_{s, i}\), \(U_{th, i}\), \(U_{0, i}\), \(U_{r, i}\), \(t_{ref, i}\). However, for notational simplicity we drop the \(i\) subscript, with the understanding that these parameters can differ across neurons in a population.

Neural and synaptic model discretisation

In order to implement the LIF model in a computer it is necessary to discretise it. Assuming a very small simulation time step \(\Delta t\), [eq:syn] can be discretised to

\[\label{eq:synd} I_i^{(l)}[t+1] = \alpha I_i^{(l)}[t] + \sum_j W_{ij}S_j^{(l-1)}[t] + \sum_j V_{ij} S_j^{(l)}[t]\]

with \(\alpha=\exp(-\Delta t / \tau_{s})\). Similarly, [eq:mem] becomes

\[\label{eq:memd} U_i^{(l)}[t+1] = \beta (U_i^{(l)}[t]-U_{0}) + U_{0} + (1-\beta)I_i^{(l)}[t] - (U_{th}-U_{r})S_i^{(l)}[t]\]

with \(\beta=\exp(-\Delta t / \tau_{m})\). Finally, the spiking mechanism is given by

\[\label{eq:spk} S_i^{(l)}[t]= \begin{cases} 1 & \text{if } U_i^{(l)}[t]-U_{th}\geq 0\\ 0 & \text{if } U_i^{(l)}[t]-U_{th}<0 \end{cases}\]

Notice how the last term in [eq:memd] implements the membrane potential reset. Strictly, this is only exact if the membrane potential at spiking time equals \(U_{th}\): an update that crosses the threshold may result in \(U_i^{(l)}\!>\!U_{th}\), in which case the reset does not return the membrane potential exactly to \(U_r\). However, we found that this has a negligible effect in our simulations.
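The following sketch implements one update step of the discretised model ([eq:synd], [eq:memd] and [eq:spk]); variable names are ours and the code is illustrative rather than the released implementation.

```python
import torch

def lif_step(S_in, S, I, U, W, V, alpha, beta, U0=0.0, Uth=1.0, Ur=0.0):
    """One discrete update of a recurrent LIF layer.

    Shapes: S_in (batch, n_in); S, I, U (batch, n); W (n, n_in); V (n, n);
    alpha, beta (n,). Following [eq:memd], the membrane update uses the
    current and spikes from the previous time step.
    """
    # Membrane sub-layer: leak towards U0, integrate I[t], reset on S[t]
    U_new = beta * (U - U0) + U0 + (1 - beta) * I - (Uth - Ur) * S
    # Current sub-layer: leak plus feed-forward and recurrent spike input
    I_new = alpha * I + S_in @ W.T + S @ V.T
    # Spike sub-layer: hard threshold of [eq:spk]
    S_new = (U_new - Uth >= 0).float()
    return S_new, I_new, U_new
```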

Surrogate Gradient Descent Training

With the discretisation introduced in the previous section, a spiking layer consists of three cascaded sub-layers: current [eq:synd], membrane [eq:memd] and spike [eq:spk]. The current and membrane sub-layers have access to their previous states and can thus be seen as a particular case of a recurrent neural network (RNN). Note that while each neuron is a recurrent unit, since it has access to its own previous state, different neurons in the same spiking layer are only connected if any of the off-diagonal elements of \(\pmb{V}^{(l)}\) are non-zero. In other words, all SNNs built using this model are RNNs, but not all of them are recurrently connected SNNs (RSNNs).

We can cascade \(L\) spiking layers to form a deep spiking neural network, analogous to a conventional deep neural network, and train it using gradient descent. However, since equation [eq:spk] is non-differentiable, we modify the backward pass as in (Neftci, Mostafa, and Zenke 2019), using the surrogate function below, so that the backpropagation-through-time (BPTT) algorithm can be used to update the network parameters.

\[\label{eq:surr} \sigma(U_i^{(l)}) = \frac{U_i^{(l)}}{1+\rho|U_i^{(l)}|}\]

This means that while in the forward pass the network follows a step function as in [eq:spk], in the backwards pass it follows a sigmoidal function [eq:surr], with steepness set by \(\rho\).
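A minimal PyTorch sketch of this surrogate: the forward pass applies the hard threshold of [eq:spk], while the backward pass uses the derivative of [eq:surr], which works out to \(1/(1+\rho|U|)^2\). The class name is ours.

```python
import torch

class SurrGradSpike(torch.autograd.Function):
    """Step function forward ([eq:spk]); surrogate gradient backward.

    The backward pass uses the derivative of [eq:surr],
    d/dx [x / (1 + rho*|x|)] = 1 / (1 + rho*|x|)**2.
    """
    rho = 100.0  # steepness (Table 3)

    @staticmethod
    def forward(ctx, x):                  # x = U - U_th
        ctx.save_for_backward(x)
        return (x >= 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output / (1.0 + SurrGradSpike.rho * x.abs()) ** 2

spike_fn = SurrGradSpike.apply  # drop-in replacement for the hard threshold
```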

We can now use gradient descent to optimise the synaptic weights \(\pmb{W}^{(l)}\) and \(\pmb{V}^{(l)}\) as in conventional deep learning. We can also optimise the spiking neuron specific parameters \(U_{th}, U_{0}, U_{r}\) since they can be seen as bias terms. The time constants can also be indirectly optimised by training \(\alpha\) and \(\beta\) which can be seen as forgetting factors.

We apply a clipping function to \(\alpha\) and \(\beta\) after every update.

\[\label{eq:clipp} \mathrm{clip}(x) = \begin{cases} e^{-1/3} & \text{if } x < e^{-1/3}\\ 0.995 & \text{if } x > 0.995\\ x & \text{otherwise} \end{cases}\]

In order to ensure stability, the forgetting factors have to be less than \(1\); otherwise, the current and membrane potential would grow exponentially. Secondly, to make the system causal these factors cannot be negative. On its own, however, this would allow arbitrarily small time constants, which would be meaningless given a finite time resolution \(\Delta t\). Thus, we constrain the time constants to be at least \(3\Delta t\), which corresponds to the lower clipping limit \(e^{-1/3}\). We also set clipping limits for \(U_{th}, U_{0}, U_{r}\) such that they always remain within the ranges specified in Table 2.
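As a sketch, the clipping of [eq:clipp] applied after each optimiser step might look as follows (the helper name is ours):

```python
import math
import torch

E_THIRD = math.exp(-1.0 / 3.0)  # alpha >= e^(-1/3)  <=>  tau >= 3*dt

def clip_factors(alpha, beta):
    """Apply [eq:clipp] to the forgetting factors after every update."""
    with torch.no_grad():
        alpha.clamp_(E_THIRD, 0.995)  # 0.995 caps tau at roughly 200*dt
        beta.clamp_(E_THIRD, 0.995)
```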

There are several ways in which the neuron parameters may be trained. One possibility is to make all neurons in a layer share the same neuron parameters, that is, to train a single value of \(U_{th}, U_{0}, U_{r}, \alpha, \beta\) shared by all neurons in a layer. Another possibility, which we used in our experiments, is to optimise each of these parameters in each neuron individually. We also always trained the weight matrices \(\pmb{W}^{(l)}\) and \(\pmb{V}^{(l)}\). Training used automatic differentiation in PyTorch (Paszke et al. 2019) and the Adam optimiser with learning rate \(10^{-3}\) and betas \((0.9, 0.999)\).

In all surrogate gradient descent experiments, a single recurrent spiking layer with 128 neurons received all input spikes. This recurrent layer was followed by a feedforward readout layer with \(U_{th}\) set to infinity and with as many neurons as there are classes in the dataset.

For the loss function, we follow the max-over-time loss in (Cramer et al. 2020), taking the maximum membrane potential of each readout neuron over the entire stimulus duration and computing the cross-entropy loss on these potentials

\[\mathcal{L}= -\log\left(\frac{\exp(\max_t U_{class}^{(L)}[t])}{\sum_j \exp(\max_t U_{j}^{(L)}[t])}\right)\]

where \(class\) corresponds to the readout neuron index of the correct label for a given sample. The loss is averaged over \(N_{batch}\) training samples, and training was repeated for a total of \(N_{epochs}\) epochs.
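A sketch of this loss, assuming the readout membrane potentials are collected into a (batch, time, classes) tensor:

```python
import torch
import torch.nn.functional as F

def max_over_time_loss(U_readout, labels):
    """Cross-entropy on the maximum membrane potential of each readout
    neuron over the stimulus. U_readout: (batch, time, n_classes)."""
    logits, _ = U_readout.max(dim=1)        # max over time, per readout neuron
    return F.cross_entropy(logits, labels)  # averaged over the batch
```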

In order to improve generalisation we added noise to the input by adding spikes following a Poisson process with rate 1.2 Hz and deleting spikes with probability \(0.001\).
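A possible implementation of this input noise on a dense binary spike tensor; approximating the Poisson process by an independent Bernoulli draw per time bin is our simplification:

```python
import torch

def add_input_noise(spikes, dt=0.5e-3, rate=1.2, p_delete=0.001):
    """Insert spurious spikes (Poisson, `rate` in Hz, approximated by a
    Bernoulli draw per bin) and delete real spikes with a small
    probability. `spikes`: dense binary tensor (batch, time, n_inputs)."""
    insert = torch.rand(spikes.shape) < rate * dt
    delete = torch.rand(spikes.shape) < p_delete
    return ((spikes.bool() & ~delete) | insert).float()
```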

The parameters used for the network are given in Tables 2 and 3 unless otherwise specified. In Figure 2F we used a log-normal distribution whose mode we matched to that of the gamma distribution used for the other experiments (see Table 2), but whose standard deviation we scaled to be \(f\) times that of the original gamma distribution. We used \(f\!=\!4\) for the heterogeneous case. For the intermediate case, all neurons were initialised as in the homogeneous configuration, but a randomly selected \(5\%\) of them were given the largest allowed time constant value of \(100\)ms.

Table 2: Parameter initialisation for the different configurations

| Parameter | HomInit | HetInit |
|---|---|---|
| \(\tau_m\) | \(\bar{\tau}_m\) | \(\Gamma(3, \bar{\tau}_m/3)\) |
| \(\tau_s\) | \(\bar{\tau}_s\) | \(\Gamma(3, \bar{\tau}_s/3)\) |
| \(U_{th}\) | \(U_{th}\) | \(\mathcal{U}(0.5, 1.5)\) |
| \(U_{0}\) | \(U_{0}\) | \(\mathcal{U}(-0.5, 0.5)\) |
| \(U_{r}\) | \(U_{r}\) | \(\mathcal{U}(-0.5, 0.5)\) |
Table 3: Surrogate gradient descent network parameters

| Parameter | Value |
|---|---|
| \(\Delta t\) | 0.5 ms |
| \(M\) | 500 |
| \(\bar{\tau}_m\) | 20 ms |
| \(\bar{\tau}_s\) | 10 ms |
| \(U_{th}\) | 1 V |
| \(U_{0}\) | 0 mV |
| \(U_{r}\) | 0 mV |
| \(t_{ref}\) | 0 ms |
| \(\rho\) | 100 |

All states \(I_i^{(l)}[0]\) and \(U_i^{(l)}[0]\) are initialised to \(0\). For the weights \(\pmb{W}\) and \(\pmb{V}\), we independently sampled from a uniform distribution \(\mathcal{U}(-k^{-1/2}, k^{-1/2})\), with \(k\) being the number of afferent connections (LeCun et al. 1998).

FORCE Training

The FORCE method is used to train a network consisting of a single recurrent layer of LIF neurons as in (Nicola and Clopath 2017). In this method, there are no feedforward weights and only the recurrent weights \(\pmb{V}\) are trained. We can express these weights as

\[\label{eq:force} \pmb{V} = G\pmb{v}^0 + Q \pmb{\eta} \pmb{\phi}^T\]

The first term in [eq:force], namely \(G\pmb{v}^0\), remains static during training and is initialised so as to set the network into a chaotic spiking regime. The learned component of the weights \(\pmb{\phi}^T\!\in\! \mathbb{R}^{K\times N}\) is updated using the Recursive Least Squares (RLS) algorithm. The vector \(\pmb{\eta}\!\in\! \mathbb{R}^{N\times K}\) serves as a decoder and remains static during learning. The constants \(G\) and \(Q\) govern the ratio between chaotic and learned weights.

With this definition of \(\pmb{V}\) we can write the currents into the neurons as the sum \(\pmb{I}[t]\!=\!\pmb{I}_G[t]\!+\!\pmb{I}_Q[t]\) (we dropped the layer \(l\) superscript since we only have a single layer) where we define

\[\begin{aligned} \label{eq:force_currents} \pmb{I}_G[t+1] &= \alpha \pmb{I}_G[t]+G\pmb{v}^0\pmb{S}[t] \\ \pmb{I}_Q[t+1] &= Q \pmb{\eta} \pmb{\phi}^T \pmb{r}[t] \\ \pmb{r}[t+1] &= \alpha \pmb{r}[t]+\pmb{S}[t]\end{aligned}\]

In order to stabilise the network dynamics we add a High Dimensional Temporal Signal (HDTS) as in (Nicola and Clopath 2017). This is an \(M\)-dimensional periodic signal \(\pmb{z}[t]\). Given the HDTS period \(T\), we split the interval \([0, T]\) into \(M\) subintervals \(I_m,\; m\!=\!1, \ldots, M\), such that each component of \(\pmb{z}[t]\) is given by

\[\label{eq:hdts} z_m[t]= \begin{cases} \left| A \sin\left(\frac{M\pi t}{T}\right)\right| & \text{if } t \in I_m \\ 0 & \text{otherwise} \end{cases}\]

This signal is then projected onto the neurons, modifying equation [eq:synd] to

\[\label{eq:synd_force} \pmb{I}[t]\!=\!\pmb{I}_G[t]\!+\!\pmb{I}_Q[t] + \pmb{\mu} \pmb{z}[t]\]

where the vector \(\pmb{\mu}\!\in\! \mathbb{R}^{N\times M}\) is a decoder similar to \(\pmb{\eta}\) in [eq:force].
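A minimal numpy sketch of the HDTS of [eq:hdts]; the function name and discretisation details are ours:

```python
import numpy as np

def hdts(T, M, dt, A=80.0):
    """Build the HDTS of [eq:hdts]: component m carries one rectified
    sine bump on the m-th of M equal subintervals of [0, T]."""
    t = np.arange(0.0, T, dt)
    z = np.zeros((M, t.size))
    envelope = np.abs(A * np.sin(M * np.pi * t / T))
    m = np.minimum((t * M / T).astype(int), M - 1)  # subinterval index of each t
    z[m, np.arange(t.size)] = envelope
    return z
```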

The aim of FORCE learning is to approximate a \(K\)-dimensional time-varying teaching signal \(\pmb{x}[t]\). The vector \(\pmb{r}[t]\) is used to obtain an approximant of the desired signal

\[\hat{\pmb{x}}[t] = \pmb{\phi}^T \pmb{r}[t]\]

The weights are updated using the RLS learning rule according to:

\[\pmb{\phi}(t) = \pmb{\phi}(t-\Delta t) - \pmb{e}(t) \pmb{P}(t)\pmb{r}(t)\] \[\pmb{P}(t) = \pmb{P}(t-\Delta t) - \frac{\pmb{P}(t-\Delta t)\pmb{r}(t)\pmb{r}(t)^T\pmb{P}(t-\Delta t)}{1+\pmb{r}(t)^T\pmb{P}(t-\Delta t)\pmb{r}(t)}\]

where \(\pmb{e}(t) = \hat{\pmb{x}}(t) - \pmb{x}(t)\) is the error between the approximant and the teaching signal.

During the training phase, the teaching signal \(\pmb{x}(t)\) is used to perform the RLS update. Then, during the testing phase, the teaching signal is removed.
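A minimal numpy sketch of one RLS step as written above (with \(\pmb{\phi}\in\mathbb{R}^{N\times K}\)); it is illustrative, not the released implementation:

```python
import numpy as np

def rls_step(phi, P, r, x_target):
    """One RLS update of the decoder phi (N, K), given the inverse
    correlation estimate P (N, N), filtered spikes r (N,) and the
    teaching signal x_target (K,)."""
    e = phi.T @ r - x_target                # error e(t) = x_hat(t) - x(t)
    Pr = P @ r
    P -= np.outer(Pr, Pr) / (1.0 + r @ Pr)  # update P first...
    phi -= np.outer(P @ r, e)               # ...then phi, using the new P
    return phi, P, e
```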

We used the parameters given in Table 4 for all FORCE experiments unless otherwise specified.

Table 4: FORCE network parameters

| Parameter | Value |
|---|---|
| \(\Delta t\) | 0.04 ms |
| \(N\) | 1000 |
| \(M\) | 500 |
| \(\tau_m\) | 10 ms |
| \(\tau_s\) | 20 ms |
| \(U_{th}\) | -40 mV |
| \(U_{0}\) | 0 mV |
| \(U_{r}\) | -65 mV |
| \(t_{ref}\) | 2 ms |
| \(Q\) | 10 |
| \(G\) | 0.04 |
| \(A\) | 80 |

The period \(T\) was chosen to be equal to the length of the teaching signal \(\pmb{x}[t]\). The membrane potentials were randomly initialised following a uniform distribution \(\mathcal{U}(U_{r}, U_{th})\). Vectors \(\pmb{\eta}\) and \(\pmb{\mu}\) were randomly drawn from \(\mathcal{U}(-1, 1)\). The static weights \(\pmb{v}^0\) were drawn from a normal distribution \(\mathcal{N}(0, 1/(Np^2))\) and then set to \(0\) with probability \(1\!-\!p\), with \(p\!=\!0.1\). All other variables are initialised to zero unless otherwise specified.

S1. Heterogeneity on other spiking neuron hyperparameters

Table S1: Performance comparison among different RSNN configurations on the SHD test set. The left three columns use HomInit-StdTr for \(\tau_m\) and \(\tau_s\); the right three use HetInit-HetTr. The initialisation and training scheme in the first column was applied only to the parameter heading its column.

| | \(U_{th}\) | \(U_{0}\) | \(U_{r}\) | \(U_{th}\) | \(U_{0}\) | \(U_{r}\) |
|---|---|---|---|---|---|---|
| HomInit-StdTr | \(72.8 \!\pm\! 3.4\) | \(72.8 \!\pm\! 3.4\) | \(72.8 \!\pm\! 3.4\) | \(78.9 \!\pm\! 2.0\) | \(78.9 \!\pm\! 2.0\) | \(78.9 \!\pm\! 2.0\) |
| HetInit-StdTr | \(71.1 \!\pm\! 2.3\) | \(69.9 \!\pm\! 2.3\) | \(70.8 \!\pm\! 3.3\) | \(79.2 \!\pm\! 2.9\) | \(74.8 \!\pm\! 4.9\) | \(79.1 \!\pm\! 2.7\) |
| HomInit-HetTr | \(73.7 \!\pm\! 2.8\) | \(71.6 \!\pm\! 3.1\) | \(70.5 \!\pm\! 3.7\) | \(79.1 \!\pm\! 1.8\) | \(76.1 \!\pm\! 3.3\) | \(76.4 \!\pm\! 2.7\) |
| HetInit-HetTr | \(73.2 \!\pm\! 2.8\) | \(72.0 \!\pm\! 2.6\) | \(69.4 \!\pm\! 3.8\) | \(79.2 \!\pm\! 2.8\) | \(74.9 \!\pm\! 6.6\) | \(75.4 \!\pm\! 6.6\) |

S2. Membrane time constant distribution consistency across trials

S3. Synaptic time constant distribution consistency across trials

S4. Confusion matrices for the DVS128 gesture dataset

The classes correspond to

  1. Hand clapping

  2. Right hand wave

  3. Left hand wave

  4. Right arm clockwise

  5. Right arm counter-clockwise

  6. Left arm clockwise

  7. Left arm counter-clockwise

  8. Arm roll

  9. Air drum

  10. Air guitar

S5. Grid search of parameters

S6. All trials Fashion-MNIST

S7. Robustness under other time constant distributions

S8. Visualisation of temporal structure of DVS128 gesture dataset samples

Amir, Arnon, Brian Taba, David Berg, Timothy Melano, Jeffrey McKinstry, Carmelo Di Nolfo, Tapan Nayak, et al. 2017. “A Low Power, Fully Event-Based Gesture Recognition System.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7243–52.
Bellec, Guillaume, Franz Scherr, Elias Hajek, Darjan Salaj, Robert Legenstein, and Wolfgang Maass. 2019. “Biologically Inspired Alternatives to Backpropagation Through Time for Learning in Recurrent Neural Nets.” https://arxiv.org/abs/1901.09049.
Blättler, F., and R. H. Hahnloser. 2011. “An Efficient Coding Hypothesis Links Sparsity and Selectivity of Neural Responses.” PLoS ONE 6 (10): e25506.
Chelaru, Mircea I., and Valentin Dragoi. 2008. “Efficient Coding in Heterogeneous Neuronal Populations.” Proceedings of the National Academy of Sciences 105 (October): 16344–49. https://doi.org/10.1073/PNAS.0807744105.
Cramer, Benjamin, Yannik Stradmann, Johannes Schemmel, and Friedemann Zenke. 2020. “The Heidelberg Spiking Datasets for the Systematic Evaluation of Spiking Neural Networks.” IEEE Transactions on Neural Networks and Learning Systems, 1–14. https://doi.org/10.1109/TNNLS.2020.3044364.
Duarte, Renato, and Abigail Morrison. 2019. “Leveraging Heterogeneity for Neural Computation with Fading Memory in Layer 2/3 Cortical Microcircuits.” PLoS Computational Biology 15 (4): e1006781.
Fang, Wei, Zhaofei Yu, Yanqi Chen, Timothee Masquelier, Tiejun Huang, and Yonghong Tian. 2020. “Incorporating Learnable Membrane Time Constant to Enhance Learning of Spiking Neural Networks,” July. http://arxiv.org/abs/2007.05785.
Gjorgjieva, Julijana, Guillaume Drion, and Eve Marder. 2016. “Computational Implications of Biophysical Diversity and Multiple Timescales in Neurons and Synapses for Circuit Performance.” Current Opinion in Neurobiology 37: 44–52. https://doi.org/10.1016/j.conb.2015.12.008.
Hawrylycz, Michael J., Ed S. Lein, Angela L. Guillozet-Bongaarts, Elaine H. Shen, Lydia Ng, Jeremy A. Miller, Louie N. van de Lagemaat, et al. 2012. “An Anatomically Comprehensive Atlas of the Adult Human Brain Transcriptome.” Nature 489 (September): 391–99. https://doi.org/10.1038/nature11405.
Hunsberger, Eric, Matthew Scott, and Chris Eliasmith. 2014. “The Competing Benefits of Noise and Heterogeneity in Neural Coding.” Neural Computation 26 (8): 1600–1623.
Kilpatrick, Zachary P., Bard Ermentrout, and Brent Doiron. 2013. “Optimizing Working Memory with Heterogeneity of Recurrent Cortical Excitation.” Journal of Neuroscience 33 (48): 18999–9011. https://doi.org/10.1523/JNEUROSCI.1641-13.2013.
Koch, Christof, and Gilles Laurent. 1999. “Complexity and the Nervous System.” Science 284 (5411): 96–98. https://doi.org/10.1126/science.284.5411.96.
LeCun, Yann, Leon Bottou, G Orr, and Klaus-Robert Muller. 1998. “Efficient Backprop.” Neural Networks: Tricks of the Trade. New York: Springer.
Lein, Ed S., Michael J. Hawrylycz, Nancy Ao, Mikael Ayres, Amy Bensinger, Amy Bernard, Andrew F. Boe, et al. 2007. “Genome-Wide Atlas of Gene Expression in the Adult Mouse Brain.” Nature 445 (January): 168–76. https://doi.org/10.1038/nature05453.
Lengler, Johannes, Florian Jug, and Angelika Steger. 2013. “Reliable Neuronal Systems: The Importance of Heterogeneity.” PLOS ONE 8 (12): 1–10. https://doi.org/10.1371/journal.pone.0080694.
Lerner, Y., C. J. Honey, M. Katkov, and U. Hasson. 2014. “Temporal Scaling of Neural Responses to Compressed and Dilated Natural Speech.” Journal of Neurophysiology 111 (12): 2433–44. https://doi.org/10.1152/jn.00497.2013.
Maass, Wolfgang, Thomas Natschläger, and Henry Markram. 2002. “Real-Time Computing Without Stable States: A New Framework for Neural Computation Based on Perturbations.” Neural Computation 14 (11): 2531–60. https://doi.org/10.1162/089976602760407955.
Manis, Paul B., Michael R. Kasten, and Ruili Xie. 2019. “Classification of Neurons in the Adult Mouse Cochlear Nucleus: Linear Discriminant Analysis.” Edited by Manuel S. Malmierca. PLOS ONE 14 (October): e0223137. https://doi.org/10.1371/journal.pone.0223137.
Manis, Paul, Michael R. Kasten, and Ruili Xie. 2019. “Raw Voltage and Current Traces for Current-Voltage (IV) Relationships for Cochlear Nucleus Neurons.” figshare. https://doi.org/10.6084/m9.figshare.8854352.v1.
Marsat, Gary, and Leonard Maler. 2010. “Neural Heterogeneity and Efficient Population Codes for Communication Signals.” Journal of Neurophysiology 104 (5): 2543–55. https://doi.org/10.1152/jn.00256.2010.
Neftci, E. O., H. Mostafa, and F. Zenke. 2019. “Surrogate Gradient Learning in Spiking Neural Networks: Bringing the Power of Gradient-Based Optimization to Spiking Neural Networks.” IEEE Signal Processing Magazine 36 (6): 51–63. https://doi.org/10.1109/MSP.2019.2931595.
Nicola, Wilten, and Claudia Clopath. 2017. “Supervised Learning in Spiking Neural Networks with FORCE Training.” Nature Communications 8 (1): 1–15. https://doi.org/10.1038/s41467-017-01827-3.
Orchard, Garrick, Ajinkya Jayawant, Gregory K. Cohen, and Nitish Thakor. 2015. “Converting Static Image Datasets to Spiking Neuromorphic Datasets Using Saccades.” Frontiers in Neuroscience 9: 437. https://doi.org/10.3389/fnins.2015.00437.
Osborne, Leslie C., Stephanie E. Palmer, Stephen G. Lisberger, and William Bialek. 2008. “The Neural Basis for Combinatorial Coding in a Cortical Population Response.” The Journal of Neuroscience 28: 13522. https://doi.org/10.1523/JNEUROSCI.4390-08.2008.
Padmanabhan, Krishnan, and Nathaniel N Urban. 2010. “Intrinsic Biophysical Diversity Decorrelates Neuronal Firing While Increasing Information Content.” Nature Neuroscience 13 (October): 1276–82. https://doi.org/10.1038/nn.2630.
Paszke, Adam, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, et al. 2019. “PyTorch: An Imperative Style, High-Performance Deep Learning Library.” In Advances in Neural Information Processing Systems 32, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett, 8024–35. Curran Associates, Inc.
Quax, Silvan C., Michele D’Asaro, and Marcel A. J. van Gerven. 2020. “Adaptive Time Scales in Recurrent Neural Networks.” Scientific Reports 10 (December): 11360. https://doi.org/10.1038/s41598-020-68169-x.
Schmitt, Sebastian, Johann Klahn, Guillaume Bellec, Andreas Grubl, Maurice Guttler, Andreas Hartel, Stephan Hartmann, et al. 2017. “Neuromorphic Hardware in the Loop: Training a Deep Spiking Network on the BrainScaleS Wafer-Scale System.” In 2017 International Joint Conference on Neural Networks (IJCNN), 2227–34. IEEE. https://doi.org/10.1109/IJCNN.2017.7966125.
Shamir, Maoz, and Haim Sompolinsky. 2006. “Implications of Neuronal Diversity on Population Coding.” Neural Computation 18 (August): 1951–86. https://doi.org/10.1162/neco.2006.18.8.1951.
Xiao, Han, Kashif Rasul, and Roland Vollgraf. 2017. “Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms.” CoRR abs/1708.07747. https://arxiv.org/abs/1708.07747.
Yin, Bojian, Federico Corradi, and Sander M. Bohté. 2020. “Effective and Efficient Computation with Multiple-Timescale Spiking Recurrent Neural Networks.” In International Conference on Neuromorphic Systems 2020. ICONS 2020. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3407197.3407225.
Zeldenrust, Fleur, Boris Gutkin, and Sophie Denève. 2019. “Efficient and Robust Coding in Heterogeneous Recurrent Networks.” bioRxiv, October, 804864. https://doi.org/10.1101/804864.