Created September 11, 2019 17:53
== What are the options for onboarding projects? What about mixing some research questions with the task of processing data (for example, finding patterns among those who open an account on Wikipedia)? @Joseph
* wikidump text analysis?
** category analysis?
Take Tiziano's code and use Hadoop instead of wikidump text.
** (1st, 2nd) historical redirect analysis, add to mediawiki_history (very useful for Analytics)
Please see: https://phabricator.wikimedia.org/T232123
Intro to this in the 1st month; he can start when he wants, as part of the 1st or 2nd one.
We provide info about whether a page is a redirect or not, but we don't have historical info about it.
The dumps do have this info.
Parse the dumps and extract the info needed to build the redirect history across revisions.
The trick is that MediaWiki is multilingual: depending on the language, MediaWiki uses different codes.
This needs to be done cross-project.
Martin to collaborate with Diego on parameter alignment across languages.
Skills: learning a bit about MediaWiki, Spark; writing it in Scala is even better (PySpark is fine, too)
** auto generation of diffs from dumps (eventually do Aaron's WikiCredit https://meta.wikimedia.org/wiki/Research:WikiCredit on this in Hadoop)
* mediawiki_history analysis
** (1st) new user creation patterns
** generate ML embeddings of user patterns and cluster over the embeddings, to get clusters of action types. One user can have multiple roles.
** in Hive
** in (Py)Spark
* linking wikidata items to pages/revisions
** edits side and pageview side
* analysis of ORES events
** otto: should really fix ORES event format first
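
For the historical redirect analysis above, the core step (deciding per revision whether the text is a redirect, given language-dependent keywords) could be sketched roughly as follows. This is only a minimal illustration, not the planned implementation: the `REDIRECT_ALIASES` table is a small, incomplete, hand-written assumption (the real per-language aliases would have to come from MediaWiki's localisation data), and the function names and the `(timestamp, wikitext)` input shape are hypothetical.

```python
import re
from typing import Iterable, List, Optional, Tuple

# ASSUMPTION: illustrative and incomplete alias table. The real list of
# localized redirect keywords must be taken from MediaWiki itself.
REDIRECT_ALIASES = {
    "en": ["#REDIRECT"],
    "de": ["#WEITERLEITUNG", "#REDIRECT"],
    "fr": ["#REDIRECTION", "#REDIRECT"],
}


def redirect_pattern(lang: str) -> re.Pattern:
    """Build a case-insensitive regex for the redirect keywords of one language."""
    aliases = REDIRECT_ALIASES.get(lang, ["#REDIRECT"])
    alternatives = "|".join(re.escape(alias) for alias in aliases)
    # Keyword, optional colon, then the link target up to ']', '|' or '#'.
    return re.compile(
        r"^\s*(?:" + alternatives + r")\s*:?\s*\[\[([^\]|#]+)",
        re.IGNORECASE,
    )


def redirect_target(wikitext: str, lang: str) -> Optional[str]:
    """Return the redirect target title, or None if the text is not a redirect."""
    match = redirect_pattern(lang).match(wikitext)
    return match.group(1).strip() if match else None


def redirect_history(
    revisions: Iterable[Tuple[str, str]], lang: str
) -> List[Tuple[str, Optional[str]]]:
    """Given (timestamp, wikitext) pairs sorted by timestamp, return one
    (timestamp, target) entry for the first revision and for every later
    revision whose redirect target differs from the previous revision's."""
    changes: List[Tuple[str, Optional[str]]] = []
    previous: Optional[str] = None
    first = True
    for timestamp, text in revisions:
        target = redirect_target(text, lang)
        if first or target != previous:
            changes.append((timestamp, target))
        previous = target
        first = False
    return changes


revisions = [
    ("2019-01-01", "Some article text"),
    ("2019-02-01", "#REDIRECT [[Foo]]"),
    ("2019-03-01", "#redirect [[Foo]]"),
    ("2019-04-01", "#REDIRECT [[Bar|label]]"),
    ("2019-05-01", "Article text again"),
]
history = redirect_history(revisions, "en")
```

Because the per-revision logic is a plain function, it could presumably be applied per page over revisions parsed from the dumps in Spark (for example inside a UDF or a grouped map), but that wiring is left out here.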