Spawned from :: [[SBFA Paper]]
Time mechanisms are underdeveloped in the ML field: the focus has been mainly on the “spatial” dimensions of hyperparameter value tuning, and the field has comparatively failed to implement or account for time properties in network activity.
- This isn’t altogether true, find the exceptions
- LSTMs don’t exactly track time so much as they feed back historical states
Time mechanisms in biological neural assemblies are not completely understood but offer a potential map of complexity-rich and multi-dimensional dynamics. By understanding and modeling known properties of neuronal spiking communication, we may be able to draw out some emergent macro-properties of timing at the level of a biological agent. I.e. small, simplistic time-mechanisms (such as STDP) may accumulate at scale to compose more complicated modes of timing in an assembly.
Assuming biological systems are composed of these simple, self-adjusting mechanisms as a means of adapting to varied and changing environments, it is beneficial to study circuits with distributed and multi-role functions.
One such model of time in neurobiology is the striatal beat frequency (SBF) model, which encodes events separated in time through a distribution of activity across subsidiary micro-functional circuits. This well-studied model [REF] provides an explanation for flexible and distributed encodings of time information at multiple scales.
This model offers potential for implementation in an STDP-based SNN, and can easily be implemented in an automata framework, which is the foundation we set here. We attempt to adapt SBF to a naïve automata model and test the model’s ability to learn a target interval in a reinforcement learning task.
This tests the SBF-Automata model’s ability to learn an interval of time based on the availability of a reward at the correct interval in the time-space. It does so by adjusting attention weights between an executive “action circuit”, responsible for deciding the action taken, and multiple ancillary “time cell” circuits, each of which maintains a small oscillatory sequence and informs the action circuit of timing information through the precession of its cycle.
- Comparison to other methods
- See what Ines used
- PA model
- Considered one of the better models
- Distributes multiple encodings of time intervals across circuits
- Does not rely on “cold storage” (in our interpretation)
- Used as a method
- Being explained in methods
- Will move Abstract description here
- Oscillators would reset after each trial
- Haven’t actually implemented this yet
- How will we set the osc-set
- Perhaps use the first 10 primes
- No reset of oscillators over trial (experiment) timespace
- We set timesteps $t$ from $0$ to $T$
    - $T$ is the end timestep
- We set a reward phase of $R$ (also written $RP$ below)
    - $R$ is the cyclic number of timesteps that pass before the reward is available
    - So for every $t$ that is divisible by $R$, there is a reward available
- We have agents (or “oscillators”) $n$ which contribute to the action vote
    - Each has a unique phase $n_t$, a timestep cycle, at which it can contribute to the action vote
    - If the current timestep is not in phase with the agent, i.e. the current timestep is not divisible by the agent’s unique cyclic phase, the agent cannot contribute to the action vote
- Each agent’s vote has some weight $w_n$
    - The initial weight is $w_{n} = \frac{1}{n}$
- Some learning control parameter(s): $\alpha$, $\beta$, $\epsilon$
- At each timestep, an action vote occurs in which the collective agents may decide to act on the environment to check for a reward
- The collective weighted vote of the agents decides the probability that an action takes place at that timestep (a sketch of this vote follows this list)
- Further actions are determined by the individual algorithm used (see below).
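A minimal sketch of this setup and the per-timestep vote, in Python. All names (`osc_set`, `weights`, `act_prob`, `RP`) and the exact vote normalization are illustrative assumptions rather than the actual implementation; the first 10 primes are used for the osc-set as floated above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical osc-set: the first 10 primes, as floated above.
osc_set = np.array([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])

# Initial weights w_n = 1/n for each oscillator's unique cycle length n.
weights = 1.0 / osc_set

T = 1000   # end timestep of a trial
RP = 10    # reward phase: a reward is available whenever t % RP == 0

for t in range(1, T + 1):
    # An oscillator is "in phase" (active) when t is divisible by its cycle.
    active = (t % osc_set) == 0

    # One plausible reading of the collective weighted vote: the probability
    # of acting is the active agents' share of the total weight.
    act_prob = weights[active].sum() / weights.sum()

    if rng.random() < act_prob:
        reward_available = (t % RP == 0)
        # ...apply the algorithm-specific weight update here (see below)
```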
See [[Algorithm Outline]] and standardize the formatting between pseudocode for methods sections and formal algo description for the model description.
For each timestep $t$ from $0$ to $T$:
- If the action vote fails:
    - No further actions occur and the environment moves to the next timestep
- If the action vote succeeds:
    - The environment is checked for a reward:
        - If the reward is unavailable ($t \bmod RP \neq 0$):
            - No further actions occur and the environment moves to the next timestep
        - If the reward is available ($t \bmod RP = 0$):
            - The weights of agents in phase with the rewarded timestep are rewarded, by an evenly distributed amount of weight taken from the sum of inactive weights (scaled by $\alpha$):
              $$w_{a} = w_{a} + \alpha \frac{\sum w_{i}}{n_{a}}$$
            - The weights of inactive agents are reduced proportionally to their contribution:
              $$w_{i} = w_{i} \cdot (1 - \alpha)$$
            - No further actions occur and the environment moves to the next timestep
- Loop to next timestep
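A minimal sketch of this reward update, under the same assumptions as the snippet above (`weights` and `active` as NumPy arrays, `alpha` the learning parameter):

```python
def reward_update(weights, active, alpha):
    """Reward the in-phase (active) agents with weight taken evenly from the
    inactive agents, scaled by alpha; inactive agents lose proportionally."""
    inactive = ~active
    n_active = active.sum()
    if n_active == 0 or inactive.sum() == 0:
        return weights
    new = weights.copy()
    transfer = alpha * weights[inactive].sum() / n_active
    new[active] += transfer            # w_a = w_a + alpha * sum(w_i) / n_a
    new[inactive] *= (1.0 - alpha)     # w_i = w_i * (1 - alpha)
    return new
```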
The variant we actually use most:
If the environment is checked for a reward and the reward is unavailable ($t \bmod RP \neq 0$):
- Invert the update scheme to take from the active agents and distribute to the inactive, such that:
    - The weights of inactive agents are rewarded by an evenly distributed amount of weight taken from the sum of active weights (scaled by $\beta$):
      $$w_{i} = w_{i} + \beta \frac{\sum w_{a}}{n_{i}}$$
    - The weights of active agents are reduced proportionally to their contribution:
      $$w_{a} = w_{a} \cdot (1 - \beta)$$
    - No further actions occur and the environment moves to the next timestep
- Loop to next timestep
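A minimal sketch of this inverted update, again under the assumptions of the earlier snippets, with `beta` as the punishment-side learning parameter:

```python
def punish_update(weights, active, beta):
    """On a failed pull (reward unavailable), take weight from the active
    agents and spread it evenly over the inactive agents."""
    inactive = ~active
    n_inactive = inactive.sum()
    if n_inactive == 0 or active.sum() == 0:
        return weights
    new = weights.copy()
    transfer = beta * weights[active].sum() / n_inactive
    new[inactive] += transfer          # w_i = w_i + beta * sum(w_a) / n_i
    new[active] *= (1.0 - beta)        # w_a = w_a * (1 - beta)
    return new
```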
The issue with this is that the punishment was evenly distributed to the inactive agents, when it should scale directly with each agent’s contribution (i.e. its weight) to the vote.
[This of course means that the even distribution of reward to active may also be incorrect]
The weights of inactive agents are reduced:
We want to test for:
- Adaptability
- Sorta reversed on this. See below
- Reward over time
- We should see rewards increase over time
- I’m not sure I look for that
- Anis is most interested in this
- Energy Expenditure
- Reward-to-pulls ratio
- He was trying to tell me something about even 1% pull rate is too much, I’m not sure what he meant
- He thinks we should add decay (at least for first paper) despite that destroying adaptability
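A small sketch of how these two measures could be computed from a trial’s logs; the log names (`pulls`, `rewards_found`) are hypothetical, not taken from the implementation:

```python
import numpy as np

def trial_metrics(pulls, rewards_found):
    """Hypothetical per-timestep logs: pulls[t] = 1 if an action was taken at
    timestep t, rewards_found[t] = 1 if that action found a reward."""
    pulls = np.asarray(pulls)
    rewards_found = np.asarray(rewards_found)
    cumulative_reward = np.cumsum(rewards_found)   # should rise over a learning trial
    pull_rate = pulls.mean()                       # fraction of timesteps with an action
    reward_to_pulls = rewards_found.sum() / max(pulls.sum(), 1)  # efficiency of actions
    return cumulative_reward, pull_rate, reward_to_pulls
```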
- Node responsible for interacting with the environment and making the decision to do so
- One of a set of nodes that have a cyclic basis of activity
- Biologically analogous to a Time Cell
- The set of osc-nodes informing the action node
- In case we test using different sets of oscs
- Length of the time cycle on an osc-node
- Measured in terms of timesteps
- The series of timesteps over an entire trial
- From timestep 0 to T
Or “experiments”, though I am using that term differently
- A single implementation of one experiment
- With set experimental conditions
- From timestep 0 to T
- Multiple trials are then averaged over for one “Experiment”
- Single set of experimental conditions for which multiple instances are run
- The time cycle at which a reward is available
- Intervals over the whole time space (lifetime?) of the trial at which a reward was available, i.e. timesteps which were divisible by the RP
- The most basic learning parameter
    - Used to update the weights
    - Typically applied when the model is rewarded
    - But we can also use it as the “punishment” when no reward was available, by setting $\alpha = \beta$
- The learning parameter for when the model searched for a reward when none was available
    - Typically set lower than the learning parameter $\alpha$
        - Done by scaling alpha by some value, e.g. $\beta = 0.01 \alpha$
- These should by definition stay between 0 and 1, but some strategies allow for unbounded growth

A few possible strategies for setting these:
- Static: unchanging values, basic form
- Decay: Not implemented. RPE adds a basic form of this.
- RPE based: RPE is used to change or update over time
- Our implementation
- An implementation drawn from [[@littlestoneWeightedMajorityAlgorithm1994|N. Littlestone, M.K. Warmuth (1994)]]
- [[Weighted Majority Algorithm]], [[Weighted Majority Vote Version 27.01.23]], [[Weighted Majority Algorithm - A beautiful algorithm for Learning from Experts]]
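A trivial sketch of the static strategy with the scaled $\beta$ described above; the numbers are placeholders, not experimental values:

```python
# Static strategy with a scaled beta; the values are placeholders,
# not the ones used in the experiments.
alpha = 0.1           # learning parameter for rewarded updates
beta = 0.01 * alpha   # "punishment" parameter, scaled down from alpha
assert 0.0 < beta <= alpha <= 1.0   # by definition both should stay in (0, 1]
```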
- RPE: Reward prediction error
    - $\epsilon$ for RPE: $\epsilon \in [0, 1]$
- Direct: The RPE value (modified or not) is directly used to represent the L-params at an update step
    - This means that the “memory” is stored directly in the osc-weights, which would be ideal
- Independent (L-params): The L-params exist as an independent additional parameter
    - This allows for “memory” to be stored in their values, but requires the extra variable instead of passing it directly to the oscillator states
    - Indirect: Not really used, and would likely mean making some change to the L-params after the update step. But this would lead to confusing terminology.
        - This allows for independent L-params to be updated outside their update steps
        - E.g. the potential reward for when a correct choice is made could be increased during failed attempts
- Scaling: The L-params are independent values and are scaled by the RPE, usually by a multiplicative method, though a more complicated log-scaling method has been made
    - These have a bad habit of running away
    - But they seem to be ok if the RPE-tuning parameter is low enough
- Additive: The L-params are independent values and are modified by the RPE via additive methods
    - These require bounding to stay within 0-1
    - Not necessarily bad results when unbounded, but obviously they shouldn’t go to 0
- Bounded: Artificially bounded to 0-1. Useful in the additive regime.
    - There may be a basis for a scheme like this in the biological regime, where metabolic limits prevent hyperactivity
- No-Action (N-A) Update Step: An update step (modification of the weights) is made when no action (no pull) is taken
    - More interesting in zero-osc regimes
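A sketch of how these RPE schemes might look in code. The RPE form (outcome minus action probability), the function names, and the use of $\epsilon$ as a tuning knob are all assumptions for illustration rather than the actual implementation:

```python
import numpy as np

def rpe(outcome, act_prob):
    """A simple reward prediction error: the observed outcome (1 = reward
    found, 0 = not) minus the expected value, here taken to be the action
    probability. This particular form is an assumption for illustration."""
    return outcome - act_prob

# Direct: the (clipped) magnitude of the RPE itself is used as the L-param.
def direct_lparam(delta):
    return float(np.clip(abs(delta), 0.0, 1.0))

# Scaling: an independent L-param is multiplicatively scaled by the RPE;
# epsilon is the RPE-tuning parameter (large values let alpha run away).
def scaled_lparam(alpha, delta, epsilon):
    return alpha * (1.0 + epsilon * delta)

# Additive + Bounded: an independent L-param is nudged by the RPE and
# artificially clipped back into [0, 1].
def additive_bounded_lparam(alpha, delta, epsilon):
    return float(np.clip(alpha + epsilon * delta, 0.0, 1.0))
```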
[[@littlestoneWeightedMajorityAlgorithm1994|N. Littlestone, M.K. Warmuth (1994)]] [[@yinOscillationCoincidenceDetectionModels2022|B. Yin, Z. Shi, Y. Wang, W.H. Meck (2022)]]