Aaron Halfaker halfak

$ pip install mwcites
<SNIP>
Cleaning up...
$ python
Python 3.4.1 (default, May 26 2014, 01:12:52)
[GCC 4.8.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from mwcites.extractors import doi
>>> list(doi.extract("Foobar 10.1000/282lasnd<foo>[bar].24 hats pants 10.0023/banana"))
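For comparison, here is a minimal regex-based sketch of DOI candidate extraction. It is not the `mwcites` implementation; the `DOI_RE` pattern and the `extract_dois` helper are assumptions made for illustration. It splits on whitespace only, so wiki markup such as `<foo>` or `[bar]` stays inside the candidate.

```python
import re

# Naive pattern (an assumption, not the mwcites rule): the "10." prefix,
# a 4-9 digit registrant code, a slash, then everything up to the next
# whitespace character.
DOI_RE = re.compile(r"\b10\.\d{4,9}/\S+")


def extract_dois(text):
    """Return candidate DOI strings found in text (hypothetical helper)."""
    return DOI_RE.findall(text)


candidates = extract_dois(
    "Foobar 10.1000/282lasnd<foo>[bar].24 hats pants 10.0023/banana"
)
```

On the sample string this yields two whitespace-delimited candidates. A real extractor would also need to decide where a DOI ends when it is followed by markup or punctuation, which this sketch does not attempt.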
> SELECT LEFT(rev_timestamp, 8), COUNT(*) FROM revision WHERE rev_timestamp > "20150201" GROUP BY 1;
+------------------------+----------+
| LEFT(rev_timestamp, 8) | COUNT(*) |
+------------------------+----------+
| 20150201 | 161300 |
| 20150202 | 144059 |
| 20150203 | 143067 |
| 20150204 | 146833 |
| 20150205 | 139978 |
| 20150206 | 140813 |
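The query above groups revisions by the first eight characters of `rev_timestamp`, which is the `YYYYMMDD` part of a MediaWiki timestamp. The same grouping can be sketched in Python. The `revisions_per_day` helper and the sample timestamps are made up for illustration and are not taken from the database.

```python
from collections import Counter


def revisions_per_day(timestamps):
    """Count timestamps by their first eight characters (YYYYMMDD)."""
    return Counter(ts[:8] for ts in timestamps)


# Made-up 14-character timestamps, for illustration only.
sample = ["20150201000102", "20150201235959", "20150202120000"]
daily = revisions_per_day(sample)
```

Here `daily` maps each day prefix to the number of timestamps that start with it, mirroring `LEFT(rev_timestamp, 8)` with `GROUP BY 1`.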
(3.4)[halfak@ores-test: ~/projects/ores]
$ python
Python 3.4.0 (default, Apr 11 2014, 13:05:11)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from revscoring.scorers import MLScorerModel
>>> MLScorerModel.load(open("models/enwiki.reverted.linear_svc.model", 'rb'))
<revscoring.scorers.svc.LinearSVCModel object at 0x7fea7257c4a8>
>>> MLScorerModel.load(open("models/ptwiki.reverted.linear_svc.model", 'rb'))
<revscoring.scorers.svc.LinearSVCModel object at 0x7fea5f942f98>
SELECT rev_id, COUNT(*) FROM mwcites_enwiki_20150112 GROUP BY 1 ORDER BY COUNT(*) DESC LIMIT 10;
+-----------+----------+
| rev_id | COUNT(*) |
+-----------+----------+
| 208356562 | 375 |
| 597498735 | 352 |
| 557642221 | 242 |
| 209522143 | 231 |
| 303096827 | 230 |
| 303100944 | 225 |
[halfak@graphite: ~/projects/wikimetrics]
$ grep -r httplib2 .
./requirements.txt:httplib2==0.9
- AI is important to quality control
  - Story of Huggle/ClueBot in enwiki
- AI is hard
  - Discussion of skills needed
- AI as a service --> Ecosystem
  - Discuss all the tools that do / might use quality scores
import cProfile as profile
import pprint
import re
import time
from hashlib import sha1
from mw import api
from more_itertools import peekable
jQuery111208480086917140102_1428331100186({
"task": {
"campaign_id": 1,
"data": {
"rev_id": 101
},
"id": 1,
"labels": [
{
"data": {
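The response above is JSONP: a JSON document wrapped in a jQuery-generated callback. A minimal sketch of unwrapping such a payload in Python follows. The `parse_jsonp` helper and its regex are assumptions, and the sample payload is a cut-down stand-in that reuses only the fields visible above, not the full response.

```python
import json
import re

# Callback name, opening parenthesis, JSON body, closing parenthesis,
# optional trailing semicolon. DOTALL lets the body span several lines.
JSONP_RE = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def parse_jsonp(payload):
    """Strip the callback wrapper and decode the JSON body (hypothetical helper)."""
    match = JSONP_RE.match(payload)
    if match is None:
        raise ValueError("payload is not wrapped in a callback")
    return json.loads(match.group(1))


# Cut-down sample for illustration; the real response carries more fields.
sample = (
    'jQuery111208480086917140102_1428331100186('
    '{"task": {"campaign_id": 1, "data": {"rev_id": 101}, "id": 1}})'
)
doc = parse_jsonp(sample)
```

After unwrapping, `doc["task"]["data"]["rev_id"]` gives the revision the task refers to.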
$ python
Python 3.4.0 (default, Apr 11 2014, 13:05:11)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from revscoring.features import parent_revision
>>> parent_revision.markup_chars.returns
<class 'int'>
>>> from revscoring.dependent import draw
>>> draw(parent_revision.markup_chars)
- <parent_revision.markup_chars>
$ python
Python 3.4.0 (default, Apr 11 2014, 13:05:11)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from revscoring.datasources import diff
>>> from revscoring.extractors import APIExtractor
>>> from mw.api import Session
>>> extractor = APIExtractor(Session("https://en.wikipedia.org/w/api.php"))
Sending requests with default User-Agent. Set 'user_agent' on api.Session to quiet this message.
>>> list(extractor.extract(4567890, [diff.added_words]))[0]