Skip to content

Instantly share code, notes, and snippets.

@vankesteren
Created July 13, 2023 18:38
Show Gist options
  • Save vankesteren/3743f5b5c5d6c395d50f69c92c6406f1 to your computer and use it in GitHub Desktop.
Save vankesteren/3743f5b5c5d6c395d50f69c92c6406f1 to your computer and use it in GitHub Desktop.
Normalized compression distance in python. Inspired by https://aclanthology.org/2023.findings-acl.426.pdf
import gzip
x1 = "Japan’s Seiko Epson Corp. has developed a 12-gram flying microrobot."
x2 = "The latest tiny flying robot has been unveiled in Japan."
x3 = "Michael Phelps won the gold medal in the 400 individual medley."
def ncd(x: str, y: str) -> float:
'''
Returns the normalized compression distance between two strings.
Parameters:
x (str): A text string
y (str): Another text string
Returns:
ncp (float): normalized compression distance (number between 0 and 1)
'''
cx = len(gzip.compress(x.encode()))
cy = len(gzip.compress(y.encode()))
xy = " ".join([x, y])
cxy = len(gzip.compress(xy.encode()))
return (cxy - min(cx, cy)) / max(cx, cy)
# x3 is further from x1 than x2
ncd(x1, x2)
ncd(x1, x3)
# ncd is symmetric
ncd(x1, x2)
ncd(x2, x1)
# ncd does not go to exactly 0
ncd(x1, x1)
ncd(x3, x3)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment