schaunwheeler / pyspark_minhash_jaccard.py
Last active September 18, 2024 11:49
Use MinHash to get Jaccard Similarity in Pyspark
from numpy.random import RandomState
import pyspark.sql.functions as f
from pyspark import StorageLevel


def hashmin_jaccard_spark(
        sdf, node_col, edge_basis_col, suffixes=('A', 'B'),
        n_draws=100, storage_level=None, seed=42, verbose=False):
    """
    Calculate a sparse Jaccard similarity matrix using MinHash.