Skip to content

Instantly share code, notes, and snippets.

@salotz
Created January 9, 2020 18:16
Show Gist options
  • Save salotz/0158a99a75078b47538452111ec0faa2 to your computer and use it in GitHub Desktop.
Save salotz/0158a99a75078b47538452111ec0faa2 to your computer and use it in GitHub Desktop.
Clean numpy-only implementation of the Shimazaki-Shinimoto histogram binning method as a function.
def shimazaki_shinimoto_binning(x, min_bins, max_bins):
"""The Shimazaki-Shinimoto histogram binning algorithm for choosing an
optimal constant number of bins.
Parameters
----------
x : array of int or float
The data you want to bin
min_bins : int
The minimum number of bins to consider for optimization. Must
be at least 1.
max_bins : int
The maximum number of bins to consider for optimization.
Returns
-------
optimal_bin_size : float
The optimal bin size calculated
bin_edges : array of float
The bin edges for the data using the optimal bin size
Notes
-----
Created on Thu Oct 25 11:32:47 2012
Histogram Binwidth Optimization Method
Shimazaki and Shinomoto, Neural Comput 19 1503-1527, 2007
2006 Author Hideaki Shimazaki, Matlab
Department of Physics, Kyoto University
shimazaki at ton.scphys.kyoto-u.ac.jp
Please feel free to use/modify this program.
Version in python adapted Érbet Almeida Costa
Bugfix by Takuma Torii 2.24.2013
Adapted as a library function and commented etc. by Samuel D. Lotz
2020-1-9 from:
http://176.32.89.45/~hideaki/res/histogram.html#Python1D
archived at:
https://web.archive.org/save/http://176.32.89.45/~hideaki/res/histogram.html#Python1D
References
----------
@article{Shimazaki_2007,
doi = {10.1162/neco.2007.19.6.1503},
url = {https://doi.org/10.1162%2Fneco.2007.19.6.1503},
year = 2007,
month = {jun},
publisher = {{MIT} Press - Journals},
volume = {19},
number = {6},
pages = {1503--1527},
author = {Hideaki Shimazaki and Shigeru Shinomoto},
title = {A Method for Selecting the Bin Size of a Time Histogram},
journal = {Neural Computation}
}
"""
import numpy as np
assert min_bins > 1, "must be more than 1 bin"
x_max = np.max(x)
x_min = np.min(x)
# The minimum and maximum number of bins will determine the number
# of trials in between them. This will go into the construction of
# the cost function to determine the optimal bin size
trials_num_bins = np.arange(min_bins, max_bins)
num_trials = len(trials_num_bins)
# compute the bin size for each trial based on the min and max of
# the data
trials_bin_sizes = (x_max - x_min) / trials_num_bins
# then for each trial we will have a cost value associated with it
cost_vector = []
# Computation of the cost values for each bin size trial
for trial_idx, num_bins in enumerate(trials_num_bins):
num_edges = num_bins[trial_idx] + 1
# bin edges are linearly spaced for this range
edges = np.linspace(x_min, x_max, num_edges)
# count number of events in bins
trial_hist, _ = np.histogram(x, bins=edges)
# mean of event count
mean_count = np.mean(trial_hist)
# variance of event count
var_count = sum((trial_hist - mean_count)**2) / num_bins[trial_idx]
# compute the cost for this trial
cost = (2 * mean_count - var_count) /\
(trials_bin_sizes[trial_idx] ** 2)
cost_vector.append(cost)
# Optimal Bin Size Selection
# just find the minimal cost
min_idx = np.argmin(cost_vector)
min_cost = cost_vector[min_idx]
# then choose the number of bins associated with that
optimal_bin_size = trials_bin_sizes[min_idx]
edges = np.arange(x_min, x_max, num_bins[min_idx]+1)
return optimal_bin_size, edges
@gcpeixoto
Copy link

Hello! I stumbled over here while looking for this method... The code above seemed to have some inconsistencies as @AstroPierre pointed out. @salotz and @saravanansaminathan, find below a cleaner Python 3 reproduction from the archived version that may work for you. I hope this helps.

import numpy as np
import matplotlib.pyplot as plt

# Generate n pseudo-random numbers with(mu,sigma,n)
x = np.random.normal(0, 100, 100) 

x_min, x_max = np.min(x), np.max(x)

N_MIN = 4  # Min number of bins (integer), must be > 1
N_MAX = 50 # Max number of bins (integer)

N = np.arange(N_MIN,N_MAX) # number of bins
D = (x_max-x_min)/N        # Bin size vector
C = np.zeros(np.size(D))

# Computation of the cost function
for i in range(np.size(N)):
    edges = np.linspace(x_min,x_max,N[i]+1) # Bin edges
    ki = plt.hist(x,edges,alpha=0.5)[0]     # Count number of events in bins
    k = np.mean(ki)                         # Mean of event count
    v = np.sum((ki-k)**2)/N[i]              # Variance of event count
    C[i] = (2*k-v)/((D[i])**2)              # Cost Function

# Optimal bin size Selection
cmin = np.min(C)
idx  = np.where(C == cmin)[0][0]
optD = D[idx] 

# Plotting
edges = np.linspace(x_min,x_max,N[idx]+1)
plt.figure(figsize=(12,4))
plt.subplot(121)
plt.hist(x,edges)
plt.title("Histogram")
plt.ylabel("Frequency")
plt.xlabel("Value")
#plt.savefig('Hist.png')

plt.subplot(122)
plt.plot(D,C,'.',alpha=0.5)
plt.plot(optD,cmin,'*r');
#plt.savefig('Fobj.png')

@AstroPierre
Copy link

AstroPierre commented Mar 12, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment