Last active November 10, 2022 22:07
Spearman correlation: Wikidata QRank and Wikidata PageRank (danker)
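#!/usr/bin/env bash
# Spearman correlation between Wikidata QRank and Wikidata PageRank (danker).

# Byte-wise collation so that sort and join agree on the key order.
export LC_ALL=C

# Fetch QRank once: skip the first line (presumably the CSV header),
# turn the first comma into a tab, and sort by entity ID.
if [ ! -f "qrank_sorted.tsv" ]; then
  wget -O - https://qrank.toolforge.org/download/qrank.csv.gz | \
    gunzip -c | \
    tail -n+2 | \
    sed "s/,/\t/" | \
    sort -k1,1 \
    > qrank_sorted.tsv
fi

# Fetch the danker PageRank dump of 2021-11-15 once and sort by entity ID.
if [ ! -f "pr_202111_sorted.tsv" ]; then
  wget -O - https://danker.s3.amazonaws.com/2021-11-15.allwiki.links.rank.bz2 | \
    bunzip2 -c | \
    sort -k1,1 \
    > pr_202111_sorted.tsv
fi

# Keep only entities present in both files; join writes space-separated fields.
join qrank_sorted.tsv pr_202111_sorted.tsv > qrank_pr_joined.tsv

# Line counts of both inputs and of the joined file.
wc -l qrank_sorted.tsv pr_202111_sorted.tsv qrank_pr_joined.tsv

# Spearman correlation between column 2 (QRank) and column 3 (PageRank).
Rscript <(printf "qpr <- read.table(file = 'qrank_pr_joined.tsv', sep = ' ')\ncor(qpr[2],qpr[3], method='spearman')")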
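The sort and join steps can be tried on a toy sample without downloading anything. The following is a minimal sketch, not part of the original script: the entity IDs, the scores, and the file names `a.tsv` and `b.tsv` are made up for illustration.

```shell
#!/usr/bin/env bash
# Toy illustration of the sort + join step; all IDs and values are invented.
export LC_ALL=C

workdir=$(mktemp -d)

# Hypothetical QRank-like sample (entity ID, score), sorted by entity ID.
printf 'Q5\t300\nQ1\t100\nQ42\t200\n' | sort -k1,1 > "$workdir/a.tsv"

# Hypothetical PageRank-like sample; Q7 has no counterpart in the first file.
printf 'Q42\t0.5\nQ7\t0.1\nQ1\t0.9\n' | sort -k1,1 > "$workdir/b.tsv"

# join keeps only keys found in both inputs and separates fields with a space.
joined=$(join "$workdir/a.tsv" "$workdir/b.tsv")
printf '%s\n' "$joined"

rm -r "$workdir"
```

Only Q1 and Q42 appear in both samples, so the joined output has two lines; Q5 and Q7 are dropped. This is why the script prints the `wc -l` counts: they show how many entities the correlation is actually computed over.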
https://qrank.toolforge.org/download/qrank.csv.gz redirects to https://qrank.wmcloud.org/download/qrank.csv.gz, which has been returning error 502 for at least a week.
Do you know of any other URL that provides the same data?