Outline 2.md

Spawned from :: [[SBFA Paper]]

Highlights

Outline

Rough Intro

Time mechanisms are underdeveloped in the ML field: the focus has been mainly on the “spatial” dimensions of hyperparameter tuning, while the field has comparatively failed to implement or account for time properties in network activity.

  • This isn’t altogether true; find the exceptions
  • LSTMs don’t exactly track time so much as they feed back historical states

Time mechanisms in biological neural assemblies are not completely understood, but they offer a potential map of complexity-rich and multi-dimensional dynamics. By understanding and modeling known properties of neuronal spiking communication, we may be able to draw out some emergent macro-properties of timing at the level of a biological agent. That is, small, simple time mechanisms (such as STDP) may accumulate at scale to compose more complicated modes of timing in an assembly.

Assuming biological models are composed of these simple and self-adjusting mechanisms as a means of adapting in varied and changing environments, it is beneficial to study circuits with distributed and multi-role functions.

One such model of time in neurobiology is the striatal beat frequency (SBF) model, which encodes events separated in time through a distribution of activity on subsidiary micro-functional circuits. This well-studied model [REF] provides an explanation for flexible and distributed encodings of time information at multiple scales.

This model offers potential for implementation in an STDP-based spiking neural network (SNN), and can easily be implemented in an automata framework, which is the foundation we set here. We attempt to adapt SBF to a naïve automata model and test the model’s ability to learn a target interval in a reinforcement learning task.

This tests the SBF-Automata model’s ability to learn an interval of time based on the availability of a reward at the correct interval in the time-space. This is done by adjusting weights of attention between an executive “action circuit”, responsible for deciding which action is taken, and multiple ancillary “time cell” circuits, each of which maintains a small oscillatory sequence and informs the action circuit of timing information through the precession of its cycle.

Current Methods and Theory

Striatal Beat Frequency Model

  • Comparison to other methods
    • See what Ines used
    • PA model
  • Considered one of the better models
  • Distributes multiple encodings of time intervals across circuits
  • Does not rely on “cold storage” (in our interpretation)

Peak-Interval Timing Task (PITT)

  • Used as a method

Weighted Majority Algorithm (WMA)

  • Being explained in methods
  • Will move Abstract description here

Normalized Automata Vote Algorithm (NAVA)

  • Being explained in methods
  • Will move Abstract description here

Methods

Experiments

Strict SBF interpretation in PITT

  • Oscillators would reset after each trial
  • Haven’t actually implemented this yet
  • How will we set the osc-set?
    • Perhaps use the first 10 primes

Agent lifetime learning

  • No reset of oscillators over trial (experiment) timespace

Using the WMA

Using the NAVA

Without Punishment

With Punishment

Experimental Task

  • We set timesteps $t$ from $0$ to $T$
    • $T$ is the end timestep
  • We set a reward phase of $R$
    • Which is the cyclic number of timesteps that pass before the reward is available
    • So for every $t$ that is divisible by $R$ (i.e. $t \bmod R = 0$), a reward is available

  • We have agents (or “oscillators”) $n$ which contribute to the action vote:
    • each has a unique phase $n_t$, a timestep cycle, at which it can contribute to the action vote
    • if the current timestep is not in phase with the agent (that is, the current timestep is not divisible by the agent’s unique cyclic phase), the agent cannot contribute to the action vote
    • Each agent’s vote has some weight $w_n$
      • This initial weight is $w_{n}= \frac{1}{n}$
  • Some learning control parameter(s): $\alpha$, $\beta$, $\epsilon$

  • At each timestep, there occurs an action vote where the collective agents may decide to act on the environment to check for a reward
  • The collective weighted vote of the agents decides the probability that an action will take place at each timestep
  • Further actions are determined by the individual algorithm used (see below); a rough code sketch of the task setup follows this list.
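
A minimal sketch of the task setup described above. The oscillator set, reward period, and number of timesteps below are illustrative placeholders, not values fixed by this outline:

```python
import numpy as np

rng = np.random.default_rng(0)

n_steps = 1000                     # T: the end timestep
reward_period = 7                  # R (RP): reward available every R timesteps
osc_set = [2, 3, 5, 7, 11]         # oscillator phases, e.g. the first primes

n = len(osc_set)
weights = np.full(n, 1.0 / n)      # initial weights w_n = 1/n

for t in range(1, n_steps + 1):
    # an agent is "active" when the timestep is in phase with its cycle
    active = np.array([t % phase == 0 for phase in osc_set])

    # the summed weight of the active agents gives the probability of acting
    w_t = weights[active].sum()
    act = rng.random() < w_t

    reward_available = (t % reward_period == 0)
    if act and reward_available:
        pass  # the weight update depends on the algorithm used (WMA or NAVA, below)
```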

Experimental Implementation(s)

[[Weighted Majority Algorithm]]

See [[Algorithm Outline]] and standardize the formatting between the pseudocode for the methods section and the formal algorithm description for the model description. A rough code sketch follows the pseudocode below.

For $t = 0, \dots, T$ do:

  • Check the “active” agents, those whose check cycles are in phase with the current timestep, i.e. $t \bmod t_c = 0$
  • Take the sum of the weights of the active agents, $w_{t} = \sum w_{a}$
  • Take probability $P$ of polling the environment for reward
  • If $P \geq w_t$: do nothing and move to the next timestep
  • If $P \leq w_t$:
    • If $t \bmod R = 0$, the timestep is on a reward interval:
      • Decrease the weights of the inactive agents: $$w_{i} = \frac{1}{2} w_{i}$$ Active agents are not updated.
      • Divide all weights by the sum of all weights: $$w_{n} = \frac{w_{n}}{\sum\limits w_{n}}$$
    • Else, the timestep is not on a reward interval: do nothing and move to the next timestep

End for
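
A rough sketch of one WMA timestep, using the same assumed names as the task sketch above (`osc_set`, `weights`, `reward_period` are illustrative, not fixed by the outline):

```python
import numpy as np

def wma_step(t, weights, osc_set, reward_period, rng):
    """One timestep of the weighted-majority-style update sketched above."""
    active = np.array([t % phase == 0 for phase in osc_set])
    w_t = weights[active].sum()

    # poll the environment with probability equal to the summed active weight
    if rng.random() >= w_t:
        return weights                  # do nothing this timestep

    if t % reward_period == 0:          # the timestep is on a reward interval
        weights = weights.copy()
        weights[~active] *= 0.5         # halve the weights of the inactive agents
        weights /= weights.sum()        # renormalize by the sum of all weights
    return weights
```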

[[SBFA - Normalized Automata Vote|Normalized Automata Vote]]

No punishment variant

If the action vote fails:

  • No further actions occur and the environment moves to the next timestep

If the action vote succeeds:

  • The environment is checked for a reward:
    • If the reward is unavailable ($t \bmod RP \neq 0$):
      • No further actions occur and the environment moves to the next timestep
    • If the reward is available ($t \bmod RP = 0$), then (sketched in code after this list):
      • The weights of the agents in phase with the rewarded timestep are rewarded by an evenly distributed amount of weight taken from the sum of the inactive weights (adjusted by $\alpha$):
        • $$w_{a} = w_{a} + \alpha \frac{\sum w_{i}}{n_{a}}$$
      • The weights of the inactive agents are reduced proportionally to their contribution:
        • $$w_{i} = w_{i} (1 - \alpha)$$
      • No further actions occur and the environment moves to the next timestep
  • Loop to the next timestep
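
A minimal sketch of the no-punishment reward update, assuming `weights` is a NumPy array and `active` is the boolean in-phase mask from the earlier sketches (both names are assumptions):

```python
import numpy as np

def nava_reward_update(weights, active, alpha):
    """Reward update: shift weight from the inactive agents to the active ones."""
    w = np.asarray(weights, dtype=float).copy()
    active = np.asarray(active, dtype=bool)
    pool = alpha * w[~active].sum()    # total weight taken from the inactive agents
    w[active] += pool / active.sum()   # evenly redistributed to the active agents
    w[~active] *= (1.0 - alpha)        # inactive agents reduced proportionally
    return w
```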

With Punishment Variant

The one we actually use most. If the environment is checked for a reward and the reward is unavailable ($t \bmod RP \neq 0$):

  • Invert the update scheme to take from the active agents and distribute to the inactive ones (sketched in code after this list), such that:
    • The weights of the inactive agents are rewarded by an evenly distributed amount of weight taken from the sum of the active weights (adjusted by $\beta$):
      • $$w_{i} = w_{i} + \beta \frac{\sum w_{a}}{n_{i}}$$
    • The weights of the active agents are reduced proportionally to their contribution:
      • $$w_{a} = w_{a} (1 - \beta)$$
    • No further actions occur and the environment moves to the next timestep
  • Loop to the next timestep
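
A matching sketch of the punishment update (the inverted scheme), under the same assumptions as the reward-update sketch above:

```python
import numpy as np

def nava_punish_update(weights, active, beta):
    """Punishment update: invert the scheme, shifting weight from active to inactive agents."""
    w = np.asarray(weights, dtype=float).copy()
    active = np.asarray(active, dtype=bool)
    pool = beta * w[active].sum()          # total weight taken from the active agents
    w[~active] += pool / (~active).sum()   # evenly redistributed to the inactive agents
    w[active] *= (1.0 - beta)              # active agents reduced proportionally
    return w
```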

Older Variant? [Remove in next draft]

The issue with this is that the punishment was evenly distributed to the inactive agents, when it should scale directly with the amount of contribution (i.e. the weight) to the vote. [This of course means that the even distribution of reward to the active agents may also be incorrect.]

  • The weights of the inactive agents are reduced:
    • $$w_{i} = w_{i} - \alpha \frac{\sum w_{i}}{n_{i}}$$
  • The weights of the agents in phase with the rewarded timestep are rewarded:
    • $$w_{a} = w_{a} + \alpha \frac{\sum w_{i}}{n_{a}}$$

Experimental Metrics

We want to test for:

  • Adaptability
    • Sorta reversed on this. See below
  • Reward over time
    • We should see rewards increase over time
    • I’m not sure I look for that
    • Anis is most interested in this
  • Energy Expenditure
    • reward-to-pulls ratio
    • He was trying to tell me something about how even a 1% pull rate is too much; I’m not sure what he meant
    • He thinks we should add decay (at least for the first paper), despite that destroying adaptability

GLOSSARY

Action Node

  • Node responsible for interacting with the environment and making the decision to do so

Oscillatory Node

  • One of a set of nodes that have a cyclic basis of activity
  • Biologically analogous to a Time Cell

Action Vote

Oscillator Set

  • The set of osc-nodes informing the action node
  • In case we test using different sets of oscs

Oscillator Phase

  • Length of the time cycle on an osc-node
  • Measured in terms of timesteps

Lifetime or timespace of Trial

  • The series of timesteps over an entire trial
  • From timestep 0 to T

Trials

Or “experiments”, though I am using that term differently

  • A single implementation of one experiment
    • With set experimental conditions
    • From timestep 0 to T
  • Multiple trials are then averaged over for one “Experiment”

Experiment

  • Single set of experimental conditions for which multiple instances are run

Reward Period (RP)

  • The time cycle at which a reward is available

Reward Interval

  • Intervals over the whole time space (lifetime?) of the trial at which a reward was available, i.e. timesteps which were divisible by the RP

Basic Learning parameter : $\alpha$

  • The most basic learning parameter
  • Used to update the weights
  • Typically for when the model is rewarded
  • But we can also use it as the “punishment” when no reward was available by setting $\alpha = \beta$

Punishment parameter : $\beta$

  • The learning parameter for when the model searches for a reward when none is available
  • Typically set lower than the learning parameter $\alpha$
    • Done so by scaling alpha by some value e.g. $\beta = 0.01 * \alpha$

L-params: Control parameters $(\alpha,\beta)$ (or sometimes learning parameters)

  • These by definition should stay between 0 and 1, but some strategies allow for unbounded growth. A few possible strategies for setting these:
  • Static: unchanging values, basic form
  • Decay: Not implemented. RPE adds a basic form of this.
  • RPE based: RPE is used to change or update over time

Normalized Automata Algorithm

  • Our implementation

Weighted Majority Algorithm

  • An implementation drawn from [[@littlestoneWeightedMajorityAlgorithm1994|N. Littlestone, M.K. Warmuth (1994)]]
  • [[Weighted Majority Algorithm]], [[Weighted Majority Vote Version 27.01.23]], [[Weighted Majority Algorithm - A beautiful algorithm for Learning from Experts]]

Adaptability

  • Sorta reversed on this. See below

Reward over time

  • We should see rewards increase over time
  • I’m not sure I look for that
  • Anis is most interested in this

Energy Expenditure

  • reward-to-pulls ratio
  • He was trying to tell me something about how even a 1% pull rate is too much; I’m not sure what he meant
  • He thinks we should add decay (at least for the first paper), despite that destroying adaptability

For RPE Methods: see [[SBFA Development Log#30.05.23]]

  • RPE: Reward prediction error. The symbol used for the RPE is $\epsilon$, with $\epsilon \in [0, 1]$
  • Direct: the RPE value (modified or not) is directly used to represent the L-params at an update step
    • This means that the “memory” is stored directly in the osc-weights, which would be ideal
  • Independent (L-params): The l-params exist as an independent additional parameter
    • This allows for “memory” to be stored in their values, but requires the extra variable instead of passing it directly to the oscillator states
  • Indirect: Not really used; it would likely mean making some change to the L-params after the update step, but this would lead to confusing terminology.
    • This allows for independent L-params to be updated outside their update steps
    • E.g. the potential reward for when a correct choice is made could be increased during failed attempts
  • Scaling: The L-params are independent values and are scaled by the RPE, usually by a multiplicative method, though a complicated log-scaling method has also been made (see the sketch after this list)
    • These have a bad habit of running away
    • But seem to be ok if the RPE-tuning parameter is low enough
  • Additive: L-params are independent values, and are modified by the RPE via additive methods.
    • These require bounding to stay within 0-1
    • Not necessarily bad results when unbounded, but obviously shouldn’t go to 0
  • Bounded: Artificially bounded to 0-1. Useful in additive regime.
    • There may be basis for a schema like this in the biological regime where metabolic limits prevent hyperactivity
  • No-Action (N-A) Update Step: Update step (modification of the weights) is made when no action (no pull) is taken
    • More interesting in zero-osc regimes
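
A rough sketch contrasting the scaling and additive RPE strategies for an L-param; `rpe_gain` and the clipping bounds are illustrative assumptions, not values from the outline:

```python
def scale_lparam(alpha, rpe, rpe_gain=0.1):
    """Multiplicative scaling by the RPE; can run away if rpe_gain is too large."""
    return alpha * (1.0 + rpe_gain * rpe)

def additive_lparam(alpha, rpe, rpe_gain=0.1, bounded=True):
    """Additive update by the RPE, optionally bounded to stay within [0, 1]."""
    alpha = alpha + rpe_gain * rpe
    if bounded:
        alpha = min(max(alpha, 0.0), 1.0)
    return alpha
```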

Bibliography

  • [[@littlestoneWeightedMajorityAlgorithm1994|N. Littlestone, M.K. Warmuth (1994)]]
  • [[@yinOscillationCoincidenceDetectionModels2022|B. Yin, Z. Shi, Y. Wang, W.H. Meck (2022)]]
