\newcount\index
\newcount\sum
\def\esum#1{
\index=#1
\sum=0
\loop
\advance\sum by \index
\ifnum\index>2
\advance\index by -2
@dginev
dginev / arxiv_headings_report.csv
Last active August 1, 2019 22:00
Most common headings from 1.2 million arXiv documents (up to 08.2018)
heading frequency
proof 2464628
lemma 1380622
theorem 1254064
references 1213025
abstract 1057178
introduction 955218
proposition 876742
remark 694222
definition 686827
dginev / subject_metadata.md
Created May 1, 2019 18:48
arXMLiv 08.2018 dataset, subject classification frequencies
Subject Document count
math 334932
astro-ph 223437
cond-mat 212384
cs 132338
hep-ph 130788
hep-th 116499
physics 99881
quant-ph 80888
dginev / arxiv_metadata_packer.rs
Last active April 24, 2019 16:33
Extracting arXiv category metadata from an OAI-PMH v2.0 XML harvest
//! Convert arXiv's OAI harvested XML files into a lookup table for classification labels
// Step 0. Prerequisite: download all needed arXiv metadata via OAI, e.g.
//```
// $ pip install git+http://github.com/bloomonkey/oai-harvest.git#egg=oaiharvest
// $ mkdir metadata/arxiv; cd metadata/arxiv
// $ oai-reg add arxiv http://export.arxiv.org/oai2?verb=Identify
// $ oai-harvest arxiv --until 2018-09-09
//```
// endpoint documentation at: https://arxiv.org/help/oa
use jwalk::WalkDir;
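The preview above stops before the conversion logic, so the following is only a minimal sketch of the idea stated in the doc comment: turning harvested XML records into a lookup table from a document identifier to its classification labels. It makes several assumptions that are not in the original: the helper names `element_texts` and `add_record` are hypothetical, the tag names are passed in by the caller because the harvested schema is not shown here, and naive string scanning stands in for a real XML parser (no attributes, entities, or nesting are handled).

```rust
use std::collections::HashMap;

/// Collects the trimmed text of every `<tag>...</tag>` element in `xml`.
/// Assumption: tags carry no attributes and are not nested; this is plain
/// substring scanning, not XML parsing.
fn element_texts<'a>(xml: &'a str, tag: &str) -> Vec<&'a str> {
    let open = format!("<{}>", tag);
    let close = format!("</{}>", tag);
    let mut texts = Vec::new();
    let mut rest = xml;
    while let Some(start) = rest.find(&open) {
        let after_open = &rest[start + open.len()..];
        match after_open.find(&close) {
            Some(end) => {
                texts.push(after_open[..end].trim());
                rest = &after_open[end + close.len()..];
            }
            // Unterminated element: stop scanning rather than guess.
            None => break,
        }
    }
    texts
}

/// Adds one record to the lookup table, keyed by the first `id_tag` element.
/// Records without an identifier are skipped. `id_tag` and `label_tag` are
/// caller-supplied because the real element names are an assumption here.
fn add_record(
    table: &mut HashMap<String, Vec<String>>,
    record_xml: &str,
    id_tag: &str,
    label_tag: &str,
) {
    let ids = element_texts(record_xml, id_tag);
    if let Some(id) = ids.first() {
        let labels = element_texts(record_xml, label_tag)
            .into_iter()
            .map(String::from)
            .collect();
        table.insert(id.to_string(), labels);
    }
}
```

In a full tool, one would presumably call `add_record` once per harvested file while walking the metadata directory, then serialize the table. A proper XML parser would be the safer choice for real harvest data.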
dginev / corpus_statistics_ref.csv
Created March 30, 2019 13:48
"Words prior \ref", arXMLiv 08.2018
word frequency
figure 3290488
theorem 3052607
section 2802295
lemma 2408488
table 1544961
proposition 1334759
and 1031640
corollary 476062
appendix 416964
dginev / apply_cutoffs.pl
Last active March 24, 2019 03:34
arXMLiv 08.2018, MathML element report
#!/usr/bin/env perl
# Applies cutoffs to the very noisy 250 MB mathml_statistics.txt
# which was generated by llamapun over arXMLiv 08.2018.
#
# It rewrites to a CSV file, throwing out all known erroneous markup, including:
# - discard all SVG-associated markup (wrongly in MathML)
# - discard all (non-math) HTML-associated markup (wrongly in MathML)
# - discard all XMath-associated markup (wrongly in MathML)
# - reduce noise from uninteresting values (numbers with known units, hex colors, open-ended id schemes, etc.)
#
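The filtering rules listed in the header comment can be illustrated with a short sketch. This is not the Perl script itself, whose body is cut off above. It assumes the input uses the `name@attr[value] frequency` row layout seen in the MathML reports on this page. The element-name lists and the `min_frequency` threshold are hypothetical placeholders, and all function names are invented for illustration.

```rust
// Hypothetical name lists; the real script's lists are not shown in the preview.
const SVG_NAMES: &[&str] = &["svg", "g", "path", "rect", "circle"];
const HTML_NAMES: &[&str] = &["div", "span", "table", "tr", "td", "img"];
const XMATH_NAMES: &[&str] = &["XMath", "XMApp", "XMTok", "XMDual", "XMRef"];

/// Returns the element part of a `name@attr[value]` key.
fn element_name(key: &str) -> &str {
    key.split('@').next().unwrap_or(key)
}

/// True when the key belongs to markup that should not appear in MathML.
fn is_known_noise(key: &str) -> bool {
    let name = element_name(key);
    SVG_NAMES.contains(&name) || HTML_NAMES.contains(&name) || XMATH_NAMES.contains(&name)
}

/// Quotes a CSV field when it contains a comma or a double quote.
fn csv_field(value: &str) -> String {
    if value.contains(',') || value.contains('"') {
        format!("\"{}\"", value.replace('"', "\"\""))
    } else {
        value.to_string()
    }
}

/// Converts a whitespace-separated report into CSV rows, dropping noise keys,
/// rows below `min_frequency`, and rows that do not end in a number.
fn apply_cutoffs(report: &str, min_frequency: u64) -> Vec<String> {
    let mut rows = vec!["name@attr[value],frequency".to_string()];
    for line in report.lines() {
        // Split on the last whitespace so keys containing spaces stay intact.
        let (key, count) = match line.trim().rsplit_once(char::is_whitespace) {
            Some(parts) => parts,
            None => continue,
        };
        let count: u64 = match count.parse() {
            Ok(n) => n,
            Err(_) => continue,
        };
        let key = key.trim_end();
        if count < min_frequency || is_known_noise(key) {
            continue;
        }
        rows.push(format!("{},{}", csv_field(key), count));
    }
    rows
}
```

Calling `apply_cutoffs(&contents, 10)` on the text of a report file would return the header row followed by the surviving rows, ready to be joined with newlines and written out. The value-level noise reduction mentioned in the last bullet (units, hex colors, id schemes) is not attempted here.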
dginev / dlmf_mathml_report.csv
Created March 23, 2019 15:59
DLMF v0.1.20 MathML element report
name@attr[value] frequency
mo 390704
mi 317263
mrow 265247
mi@href 230061
math@display 108952
math@class 108952
math 108952
math@alttext 108952
math@class[ltx_Math] 108944
dginev / rustc.log
Created January 30, 2019 04:07
rtx_package$ time cargo rustc -- -Z time-passes
time: 0.026; rss: 58MB parsing
time: 0.000; rss: 58MB attributes injection
time: 0.000; rss: 58MB garbage collect incremental cache directory
time: 0.000; rss: 58MB recursion limit
time: 0.000; rss: 58MB crate injection
time: 0.000; rss: 58MB plugin loading
time: 0.000; rss: 58MB plugin registration
time: 0.000; rss: 58MB background load prev dep-graph
time: 0.003; rss: 58MB pre ast expansion lint checks
time: 1.662; rss: 237MB expand crate
dginev / custom_derive_lib.rs
Last active January 24, 2019 01:08
Contextual variable capture in Rust, via Custom Derive
static mut CONTEXT_DEPTH: u32 = 0;
#[proc_macro_derive(BoundState)]
pub fn bound_state(_input: TokenStream) -> TokenStream {
    let state_declaration = if unsafe {CONTEXT_DEPTH == 0} {
        quote!(
            macro_rules! state {
                () => {
                    outer_state!()
                };
dginev / annual_dependency_status.csv
Last active September 19, 2018 00:22
arXiv 08.2018, LaTeX dependencies report
00,-4,amsbsy.sty.ltxml,1
00,-4,amsfonts.sty.ltxml,4
00,-4,amsmath.sty.ltxml,1
00,-4,amsopn.sty.ltxml,1
00,-4,amssymb.sty.ltxml,3
00,-4,amstext.sty.ltxml,1
00,-4,amsthm.sty.ltxml,1
00,-4,array.sty.ltxml,5
00,-4,article.cls.ltxml,13
00,-4,color.sty.ltxml,2