Skip to content

Instantly share code, notes, and snippets.

@grenkoca
Created March 22, 2023 14:00
Show Gist options
  • Select an option

  • Save grenkoca/8a306bb81e1f059fcedc85ddbcc48b71 to your computer and use it in GitHub Desktop.

Select an option

Save grenkoca/8a306bb81e1f059fcedc85ddbcc48b71 to your computer and use it in GitHub Desktop.
Estimate the optimal K for partitioning a dataset in python, as described in SC3 (Kiselev et al., 2017)
# See methods section of SC3 for details (Kiselev et al., 2017)
import numpy as np
import pandas as pd
from scipy.linalg import eigvalsh
import sys
def estkTW(dataset):
p = dataset.shape[1]
n = dataset.shape[0]
# compute Tracy-Widom bound
x = (dataset - dataset.mean(axis=0)) / dataset.std(axis=0, ddof=1)
muTW = (np.sqrt(n - 1) + np.sqrt(p)) ** 2
sigmaTW = (np.sqrt(n - 1) + np.sqrt(p)) * ((1 / np.sqrt(n - 1)) + (1 / np.sqrt(p))) ** (1 / 3)
sigmaHatNaive = x.T @ x
bd = 3.273 * sigmaTW + muTW # 3.2730 is the p=0.001 percentile point for the Tracy-Widom distribution
# compute eigenvalues and return the amount which falls above the bound
evals = eigvalsh(sigmaHatNaive)
k = np.sum(evals > bd)
return k
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment