Aaron Halfaker halfak

$ python
Python 3.5.1+ (default, Mar 30 2016, 22:46:26)
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from revscoring.dependencies import solve
>>> from revscoring.languages import english
>>> from revscoring.datasources import revision_oriented as ro
>>> solve(english.idioms.revision.datasources.matches, cache={ro.revision.text: "This is some text. I don't want to beat around the bush."})
['beat around the bush']
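The session above asks revscoring for idiom matches in a revision's text. As a rough illustration of what such a datasource returns, here is a minimal stand-in built on the standard `re` module. It does not use revscoring, and the `IDIOMS` list and `idiom_matches` helper are hypothetical names invented for this sketch.

```python
import re

# Hypothetical idiom list for illustration only; revscoring ships its own.
IDIOMS = ["beat around the bush", "piece of cake", "under the weather"]

# Build one case-insensitive alternation, longest idiom first, bounded by
# word boundaries so idioms are not matched inside longer words.
IDIOM_RE = re.compile(
    r"\b(?:"
    + "|".join(re.escape(i) for i in sorted(IDIOMS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def idiom_matches(text):
    """Return every idiom occurrence found in `text`, in order of appearance."""
    return [m.group(0) for m in IDIOM_RE.finditer(text)]


print(idiom_matches("This is some text. I don't want to beat around the bush."))
# ['beat around the bush']
```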
halfak / arwiki.txt
Last active February 14, 2020 12:33
ORES Thresholds
$ python get_thresholds.py arwiki
-------------------------------------------  --------  ---------  ---------  ------
label                                        pop rate  threshold  precision  recall
Culture.Biography.Biography*                 0.123     0.338      0.7        0.975
Culture.Biography.Women                      0.015     0.617      0.5        0.661
Culture.Food and drink                       0.002     0.792      0.7        0.61
Culture.Internet culture                     0.004     0.818      0.7        0.702
Culture.Linguistics                          0.007     0.251      0.7        0.739
Culture.Literature                           0.016     0.707      0.7        0.636
Culture.Media.Books                          0.004     0.583      0.7        0.727
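The table above pairs each label with a score threshold and the precision and recall observed at that threshold. The sketch below shows one way such numbers could be derived from scored examples. It is an assumption-laden illustration, not the logic of `get_thresholds.py` or ORES: the `SCORED` data is made up, the function names are hypothetical, and the population-rate column is ignored.

```python
def precision_recall(scored, threshold):
    """Precision and recall when predicting positive for score >= threshold.

    `scored` is a list of (score, is_positive) pairs.
    """
    tp = sum(1 for s, y in scored if s >= threshold and y)
    fp = sum(1 for s, y in scored if s >= threshold and not y)
    fn = sum(1 for s, y in scored if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


def lowest_threshold_for_precision(scored, target):
    """Return (threshold, precision, recall) for the lowest observed score
    that reaches the target precision, or None if no threshold does."""
    for t in sorted({s for s, _ in scored}):
        precision, recall = precision_recall(scored, t)
        if precision is not None and precision >= target:
            return t, precision, recall
    return None


# Made-up scores for a single label: (model score, true label).
SCORED = [(0.9, True), (0.8, True), (0.7, False),
          (0.6, True), (0.4, False), (0.2, False)]

print(lowest_threshold_for_precision(SCORED, 0.7))
# (0.6, 0.75, 1.0)
```

Picking the lowest qualifying threshold keeps recall as high as possible for the chosen precision target.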
$ python
Python 3.5.3 (default, Sep 27 2018, 17:25:39)
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from mwxml import Dump
>>> import mwtypes.files
>>> d = Dump.from_file(mwtypes.files.reader("/mnt/data/xmldatadumps/public/eswiki/latest/eswiki-latest-pages-logging.xml.gz"))
>>> for l in d.log_items:
...     print(l.type, l.action)
...
w2v = aggregators.mean(
    revision_text_vectors,
    vector=True,
    name="revision.text.google_news_vector_mean"
)
# Define pronoun features
# ... preamble to defining features
female_pronouns_count = aggregators.len(female_pronouns)
halfak / pronoun_features.py
Created January 9, 2020 16:07
Define features for: number_of_female_pronouns, number_of_male_pronouns, prop_of_female_pronouns, total_pronouns
$ python
Python 3.5.1+ (default, Mar 30 2016, 22:46:26)
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> # This revision references https://en.wikipedia.org/wiki/Ann_Bishop_(biologist)
>>> rev_id = 931384270
>>> from revscoring.extractors import api
>>> from revscoring.features import wikitext
>>> import mwapi
>>> extractor = api.Extractor(mwapi.Session("https://en.wikipedia.org"))
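The gist description lists four features: number_of_female_pronouns, number_of_male_pronouns, prop_of_female_pronouns and total_pronouns. A plain-Python sketch of what those values could look like for a piece of text follows. It does not use revscoring; the pronoun sets, the tokenizer, and the choice to report a proportion of 0.0 when no pronouns are found are all assumptions made for illustration.

```python
import re

# Assumed pronoun lists; the gist's own definitions may differ.
FEMALE_PRONOUNS = {"she", "her", "hers", "herself"}
MALE_PRONOUNS = {"he", "him", "his", "himself"}


def pronoun_features(text):
    """Count gendered pronouns in `text` and return the four feature values."""
    tokens = re.findall(r"[a-z']+", text.lower())
    female = sum(1 for t in tokens if t in FEMALE_PRONOUNS)
    male = sum(1 for t in tokens if t in MALE_PRONOUNS)
    total = female + male
    return {
        "number_of_female_pronouns": female,
        "number_of_male_pronouns": male,
        "total_pronouns": total,
        # Assumption: report 0.0 rather than dividing by zero.
        "prop_of_female_pronouns": female / total if total else 0.0,
    }


print(pronoun_features("She published her thesis; he reviewed it himself."))
```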
class FakeVectors(dict):
    pass

test_vectors = FakeVectors({
    'a': [1] * 300,
    'b': [1] * 300,
    'c': [1] * 300})
test_vectors.vector_size = 300
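The `FakeVectors` fixture above stands in for a word-vector model so that a mean-vector feature can be tested without loading real embeddings. The sketch below shows, in plain Python, the kind of element-wise mean such a feature would compute over the fixture. The `mean_vector` helper is hypothetical, and returning a zero vector when no word is known is an assumption, not necessarily what revscoring's aggregator does.

```python
class FakeVectors(dict):
    """A dict of word -> vector that also carries a `vector_size` attribute."""
    pass


test_vectors = FakeVectors({
    'a': [1] * 300,
    'b': [1] * 300,
    'c': [1] * 300})
test_vectors.vector_size = 300


def mean_vector(words, vectors):
    """Element-wise mean of the vectors for the words present in `vectors`."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        # Assumption: fall back to a zero vector when no word is known.
        return [0.0] * vectors.vector_size
    return [sum(column) / len(known) for column in zip(*known)]


mean = mean_vector(['a', 'b', 'z'], test_vectors)
print(len(mean), mean[0])
# 300 1.0
```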
$ python
Python 3.5.1+ (default, Mar 30 2016, 22:46:26)
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from drafttopic.feature_lists.wordvectors import w2v
>>> from revscoring.dependencies import solve
>>> help(solve)
>>> from revscoring.languages import english
>>> english.stopwords.revision.datasources.non_stopwords
  • Culture
    • Biography
      • Biography*
      • Women
    • Food and drink
    • Internet culture
    • Linguistics
    • Literature
    • Media
      • Books
halfak / screen.md
Created December 13, 2019 14:53
Basic Screen Tutorial

Screen lets you run a terminal session on a remote server that keeps running after you disconnect.

Let's say I have a long-running job. I execute it by running: $ cat big_dataset.json | my_processing_script > output.json

1. SSH to the remote server

$ ssh stat1007.eqiad.wmnet
...
[stat1007]$