@minhlab
Last active June 30, 2017 12:53
from collections import Counter
import numpy as np
from scipy.stats import pearsonr
train = 'output/dep/penntree.jk.sd/train.mrg.dep'
ref = 'output/dep/penntree.jk.sd/valid.mrg.dep'
sys = 'output/dep/sd_parse-published-model_valid.conll'
out_path = 'output/occurrence-performance.npy'
count_path = 'output/occurrence-counts.npy'
punctuations = { "``", "''", ".", ",", ":" }

def count_occurrences(path):
    """Count how often each word form (column 2) occurs in a tab-separated file."""
    print('Counting occurrences...')
    c = Counter()
    with open(path) as f:
        for line_no, line in enumerate(f):
            fields = line.strip().split('\t')
            # Token lines have more than 5 columns; blank separator lines do not.
            if len(fields) > 5:
                c[fields[1]] += 1
            if (line_no+1) % 100000 == 0:
                print('%s...' %(line_no+1))
    print('Counting occurrences... Done.')
    return c
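
def iter_sents(path):
    """Yield sentences as lists of [form, head, deprel, line_no] rows.

    Sentences are separated by blank lines.
    """
    with open(path) as f:
        sent = []
        for line_no, line in enumerate(f):
            if line.strip():
                fields = line.split('\t')
                sent.append([fields[1]] + fields[6:8] + [line_no])
            else:
                if sent:
                    yield sent
                sent = []
        # Emit the last sentence if the file does not end with a blank line.
        if sent:
            yield sent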

if __name__ == '__main__':
    occ_counts = count_occurrences(train)
    occ_counts_arr = np.array([occ_counts[k] for k in occ_counts])
    with open(count_path, 'wb') as f:
        np.save(f, occ_counts_arr)
    print('Occurrence counts written to %s' %count_path)

    # One row per sentence:
    # [rare_count, occ_count, uas, las, tok_count, line_no of last token]
    data = []
    for sent1, sent2 in zip(iter_sents(ref), iter_sents(sys)):
        tok_count = 0
        rare_count = 0.0
        occ_count = 0.0
        uas = 0.0
        las = 0.0
        assert len(sent1) == len(sent2)
        for row1, row2 in zip(sent1, sent2):
            # Reference and system output must be aligned token by token.
            assert row1[0] == row2[0] and row1[3] == row2[3]
            if row1[0] not in punctuations:
                tok_count += 1
                occ_count += occ_counts[row1[0]]
                # "Rare" means seen 6 to 20 times in the training data.
                rare_count += (5 < occ_counts[row1[0]] <= 20)
                uas += (row1[1] == row2[1])      # head matches
                las += (row1[1:3] == row2[1:3])  # head and label match
        uas /= tok_count
        las /= tok_count
        data.append([rare_count, occ_count, uas, las, tok_count, row1[3]])
    data = np.array(data)
    rare_counts, occ_counts, uas, las, lens = data[:,0], data[:,1], data[:,2], data[:,3], data[:,4]

    print('Correlation between rare (but not UNKN) counts and UAS: %f' %pearsonr(rare_counts, uas)[0])
    print('Correlation between rare (but not UNKN) counts and LAS: %f' %pearsonr(rare_counts, las)[0])
    print('Correlation between occurrence counts and UAS: %f' %pearsonr(occ_counts, uas)[0])
    print('Correlation between occurrence counts and LAS: %f' %pearsonr(occ_counts, las)[0])
    print('Correlation between lengths and UAS: %f' %pearsonr(lens, uas)[0])
    print('Correlation between lengths and LAS: %f' %pearsonr(lens, las)[0])

    mask0 = (uas < 0.9)
    print('After filtering out "easy" sentences:')
    print('Correlation between occurrence counts and UAS: %f'
          %pearsonr(occ_counts[mask0], uas[mask0])[0])
    for min_len in range(0, 40, 10):
        # np.logical_and takes only two inputs (a third positional argument
        # is `out`), so the three conditions are combined with & instead.
        mask = mask0 & (lens >= min_len) & (lens < min_len+10)
        print('Correlation between occurrence counts and UAS (%d <= len < %d): %f'
              %(min_len, min_len+10, pearsonr(occ_counts[mask], uas[mask])[0]))
    mask = np.logical_and(mask0, lens >= 40)
    print('Correlation between occurrence counts and UAS (len >= 40): %f'
          %(pearsonr(occ_counts[mask], uas[mask])[0]))

    print('Sample data:')
    print(data[:10])
    with open(out_path, 'wb') as f:
        np.save(f, data)
    print('Data written to %s' %out_path)
minhlab commented Jun 30, 2017

Counting occurrences...
100000...
200000...
300000...
400000...
500000...
600000...
700000...
800000...
900000...
Counting occurrences... Done.
Occurence counts written to output/occurrence-counts.npy
Correlation between rare (but not UNKN) counts and UAS: -0.106059
Correlation between rare (but not UNKN) counts and LAS: -0.087350
Correlation between occurrence counts and UAS: -0.075913
Correlation between occurrence counts and LAS: -0.051445
Correlation between lengths and UAS: -0.116341
Correlation between lengths and LAS: -0.097247
After filtering out "easy" sentences:
Correlation between occurrence counts and UAS: 0.291020
Correlation between occurrence counts and UAS (0 <= len < 10): 0.291020
Correlation between occurrence counts and UAS (10 <= len < 20): 0.232327
Correlation between occurrence counts and UAS (20 <= len < 30): 0.209121
Correlation between occurrence counts and UAS (30 <= len < 40): 0.095825
Correlation between occurrence counts and UAS (len >= 40): 0.188923
Sample data:
[[  6.00000000e+00   2.33169000e+05   9.71428571e-01   9.71428571e-01
    3.50000000e+01   3.60000000e+01]
 [  3.00000000e+00   1.35523000e+05   9.75609756e-01   9.26829268e-01
    4.10000000e+01   8.20000000e+01]
 [  5.00000000e+00   1.35641000e+05   8.88888889e-01   8.88888889e-01
    1.80000000e+01   1.03000000e+02]
 [  2.00000000e+00   8.82080000e+04   7.66666667e-01   7.66666667e-01
    3.00000000e+01   1.41000000e+02]
 [  1.00000000e+00   1.24599000e+05   9.54545455e-01   9.54545455e-01
    2.20000000e+01   1.65000000e+02]
 [  0.00000000e+00   1.87807000e+05   9.04761905e-01   8.09523810e-01
    2.10000000e+01   1.88000000e+02]
 [  0.00000000e+00   2.09830000e+04   9.33333333e-01   8.00000000e-01
    1.50000000e+01   2.06000000e+02]
 [  1.00000000e+00   1.90347000e+05   8.33333333e-01   8.33333333e-01
    2.40000000e+01   2.35000000e+02]
 [  0.00000000e+00   1.76770000e+05   9.67741935e-01   9.67741935e-01
    3.10000000e+01   2.70000000e+02]
 [  1.00000000e+00   1.37953000e+05   1.00000000e+00   9.44444444e-01
    1.80000000e+01   2.94000000e+02]]
Data written to output/occurrence-performance.npy
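
For anyone re-checking the per-length numbers, here is a minimal sketch of the binning step on toy data, using only the standard library. The helper names `pearson` and `bin_correlation`, the three-column row layout `(occ_count, uas, length)`, and all values in `rows` are illustrative assumptions, not taken from the script or its output. A check like this is a cheap way to confirm that each bin applies both its lower and upper length bound.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either sequence is constant.
    return cov / sqrt(var_x * var_y)


def bin_correlation(rows, min_len, max_len=None, max_uas=0.9):
    """Correlate occurrence count with UAS inside one length bin.

    Keeps rows with uas < max_uas and min_len <= length < max_len
    (no upper bound when max_len is None). Returns (rows kept, r).
    """
    kept = [r for r in rows
            if r[1] < max_uas
            and r[2] >= min_len
            and (max_len is None or r[2] < max_len)]
    return len(kept), pearson([r[0] for r in kept], [r[1] for r in kept])


# Hypothetical rows: (occ_count, uas, length)
rows = [
    (100.0, 0.50, 5),
    (200.0, 0.60, 7),
    (300.0, 0.70, 9),
    (400.0, 0.95, 8),    # "easy" sentence, dropped by the UAS filter
    (500.0, 0.80, 15),
    (300.0, 0.60, 12),
    (100.0, 0.70, 18),
]

n_short, r_short = bin_correlation(rows, 0, 10)
n_mid, r_mid = bin_correlation(rows, 10, 20)
print('0 <= len < 10: n=%d r=%f' % (n_short, r_short))
print('10 <= len < 20: n=%d r=%f' % (n_mid, r_mid))
```

Returning the number of rows kept next to the coefficient makes it visible when two bins select the same sentences.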