@halfak
Last active March 23, 2020 19:34
Extract count of idioms for Alan Turing
import time
import mwapi
from revscoring.dependencies import solve
from revscoring.languages import english
from articlequality.feature_lists import enwiki
# Fetch the current wikitext of the "Alan Turing" article from English Wikipedia.
session = mwapi.Session("https://en.wikipedia.org")
doc = session.get(action='query', prop='revisions', rvprop='content', titles='Alan Turing', formatversion=2)
text = doc['query']['pages'][0]['revisions'][0]['content']
start = time.time()
print(english.idioms.revision.matches, solve(english.idioms.revision.matches, cache={'datasource.revision.text': text}))
print("Extracting idioms took {0} seconds".format(time.time() - start))
start = time.time()
print("Features", list(solve(enwiki.wp10, cache={'datasource.revision.text': text})))
print("Extracting features took {0} seconds".format(time.time() - start))
# Split the wp10 feature list by whether the feature name mentions idioms.
features_wo_idioms = [f for f in enwiki.wp10 if "idiom" not in str(f)]
idiom_features = [f for f in enwiki.wp10 if "idiom" in str(f)]
start = time.time()
print("Features w/o idioms", list(solve(features_wo_idioms, cache={'datasource.revision.text': text})))
print("Extracting features w/o idioms took {0} seconds".format(time.time() - start))
start = time.time()
print("Idiom features", list(solve(idiom_features, cache={'datasource.revision.text': text})))
print("Extracting idiom features took {0} seconds".format(time.time() - start))
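
The script repeats the same start/print timing pattern four times. A minimal sketch of one way to factor it out, using only the standard library: the `timed` helper and its `results` argument are hypothetical names that are not part of the script or of revscoring, and `time.perf_counter()` is swapped in for `time.time()` because it is a monotonic clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Time the enclosed block and print the elapsed seconds.

    `timed` and `results` are illustrative names, not part of the
    original script. If `results` is a dict, the elapsed time is also
    stored under `label` so runs can be compared afterwards.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print("{0} took {1} seconds".format(label, elapsed))


if __name__ == "__main__":
    timings = {}
    # Stand-in workload; in the script this would wrap a solve(...) call.
    with timed("Summing squares", timings):
        total = sum(i * i for i in range(100000))
```

In the script, each `solve(...)` call would go inside its own `with timed(...)` block, which would keep the measurement code separate from the extraction code.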
$ python demo_idioms_performance.py
feature.len(<datasource.english.idioms.revision.matches>) 7.0
Extracting idioms took 0.7575492858886719 seconds
Features [130094.0, 50640.0, 209.0, 0.004127172195892575, 463.0, 0.009142969984202212, 235.0, 0.004640600315955766, 7.0, 0.0001382306477093207, 20.0, 0.00039494470774091627, 12.0, 0.00023696682464454977, 54.0, 0.0010663507109004739, 209.0, 0.004127172195892575, 206.0, 0.004067930489731438, 0.9856459330143541, 3.0, 5.924170616113744e-05, 84.0, 0.0016587677725118483, 1.0, 2.0, 1.9747235387045812e-05, 2.0, 1.9747235387045812e-05, 3.0, 5.924170616113744e-05, 1.317654028436019, 7.045776576879511, 124.0, 0.00849780701754386, 7.0, 0.00047971491228070173, 131.0, 0.008977521929824562]
Extracting features took 1.490027904510498 seconds
Features w/o idioms [130094.0, 50640.0, 209.0, 0.004127172195892575, 463.0, 0.009142969984202212, 235.0, 0.004640600315955766, 7.0, 0.0001382306477093207, 20.0, 0.00039494470774091627, 12.0, 0.00023696682464454977, 54.0, 0.0010663507109004739, 209.0, 0.004127172195892575, 206.0, 0.004067930489731438, 0.9856459330143541, 3.0, 5.924170616113744e-05, 84.0, 0.0016587677725118483, 1.0, 2.0, 1.9747235387045812e-05, 2.0, 1.9747235387045812e-05, 3.0, 5.924170616113744e-05, 1.317654028436019, 7.045776576879511, 124.0, 0.00849780701754386]
Extracting features w/o idioms took 0.6989080905914307 seconds
Idiom features [7.0, 0.00047971491228070173, 131.0, 0.008977521929824562]
Extracting idiom features took 0.9881906509399414 seconds
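
To make the comparison explicit, the arithmetic on the timings printed above can be sketched as follows. The three constants are copied from the output; the variable names are illustrative. Each figure comes from a single run, so these are rough indications and not a benchmark.

```python
# Timings copied from the output above, in seconds (one run each).
all_features = 1.490027904510498
without_idioms = 0.6989080905914307
idioms_only = 0.9881906509399414

# Time saved by dropping the idiom features from the full feature list.
saved = all_features - without_idioms
saved_share = saved / all_features

# The two partial runs add up to more than the full run. One plausible
# reading (an assumption, not something the output proves) is that both
# partial runs repeat shared work on the revision text.
overlap = (without_idioms + idioms_only) - all_features

print("Saved by dropping idiom features: {0:.3f} s ({1:.0%} of the full run)".format(saved, saved_share))
print("Work apparently repeated across the two partial runs: {0:.3f} s".format(overlap))
```

On these numbers, removing the idiom features roughly halves extraction time for this article.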