@dginev
Last active February 28, 2020 18:00
Top textual 4-grams within 15 words of an inline citation from arXiv (arXMLiv 08.2019)
4-gram frequency
see e g [cite] 340651
can be found in 197421
be found in [cite] 130873
see for example [cite] 93473
in the case of 86786
in the context of 80782
is given by [cite] 73337
shown in fig ref 65890
with respect to the 63965
we refer to [cite] 60033
the results of [cite] 59107
on the other hand 58902
as a function of 58393
in terms of the 56654
refer the reader to 56415
see for instance [cite] 52874
in the sense of 51688
we refer the reader 51363
NUM of ref [cite] 51155
the proof of [cite] 47736
as shown in [cite] 46642
was shown in [cite] 46442
in the literature [cite] 44990
the proof of theorem 44929
as well as the 44485
it was shown in 44075
in the proof of 43169
is consistent with the 41887
found in ref [cite] 40783
in agreement with the 37872
e g ref [cite] 37504
the reader to [cite] 37372
in the presence of 37266
in this paper we 36716
given in ref [cite] 36013
is based on the 35873
discussed in ref [cite] 34797
as described in [cite] 33364
it has been shown 33312
is similar to the 33170
if and only if 32776
as discussed in [cite] 31685
see e g ref 31349
can be used to 31318
[cite] and references therein 30404
proof of theorem ref 29978
it is well known 29902
as in ref [cite] 29860
we have the following 29705
shown in figure ref 29666
state of the art 29328
shown in ref [cite] 29272
is shown in [cite] 28927
is given in [cite] 28501
the proof of the 28324
in this section we 28083
the case of the 27814
is given by the 27663
in the next section 27370
NUM in ref [cite] 27344
can be written as 27273
the authors of [cite] 27155
it is shown in 27081
described in ref [cite] 26624
be found in ref 26505
in good agreement with 26494
the sense of [cite] 26262
is the same as 26255
in detail in [cite] 24601
[cite] for more details 24232
theorem NUM in [cite] 23784
for the case of 23774
the results in [cite] 23665
in the framework of 23566
is proved in [cite] 23154
proof of theorem NUM 23056
pointed out in [cite] 23037
taken from ref [cite] 22999
was proved in [cite] 22550
of the order of 22463
see [cite] for a 21893
for example in [cite] 21788
the work of [cite] 21787
presented in ref [cite] 21531
to the case of 21498
is in agreement with 21416
reported in ref [cite] 21356
with the results of 21269
a function of the 21136
it follows from [cite] 21055
be found in the 21021
is related to the 21016
e g refs [cite] 20833
the size of the 20822
results of ref [cite] 20515
section NUM of [cite] 20356
similar to the one 20279
good agreement with the 20187
is shown in fig 20068
be written as [cite] 19963
e g in [cite] 19928
as shown in fig 19666
eqs ref and ref 19407
in section ref we 19402
has been studied in 19285
proposed in ref [cite] 19271
a special case of 19248
similar to that of 19158
obtained in ref [cite] 18879
used in ref [cite] 18775
our previous work [cite] 18686
the proof of lemma 18605
refer to [cite] for 18600
is equivalent to the 18353
the presence of a 18352
has been shown in 18077
is one of the 18000
in contrast to the 17965
a factor of NUM 17956
is well known [cite] 17951
by a factor of 17920
as pointed out in 17916
in the form of 17909
the result of [cite] 17907
the same as in 17848
between NUM and NUM 17567
the main result of 17530
is similar to that 17461
theorem NUM of [cite] 17366
in the absence of 17338
the context of the 17030
see e g refs 16942
as explained in [cite] 16562
see [cite] for details 16443
at the end of 16336
reader is referred to 16330
which is consistent with 16290
see also ref [cite] 16272
the same as the 16243
are given in [cite] 16243
it is possible to 16168
are given by [cite] 16142
in the same way 16119
in the spirit of 15941
by means of the 15939
fig NUM of [cite] 15849
a generalization of the 15814
the results of the 15812
the case of a 15670
the existence of a 15637
in the study of 15628
ref ref and ref 15603
with the help of 15549
on the basis of 15531
is known to be 15499
a wide range of 15478
studied in ref [cite] 15331
the order of NUM 15288
we will use the 15224
from NUM to NUM 15187
the properties of the 15155
listed in table ref 15071
are shown in fig 15052
this completes the proof 15006
was introduced in [cite] 14978
in this case the 14939
of the proof of 14908
a consequence of the 14864
been studied in [cite] 14796
described in detail in 14715
in the paper [cite] 14626
beyond the scope of 14543
an important role in 14529
as in the proof 14523
fig NUM of ref 14347
the authors in [cite] 14304
to that of the 14254
the same way as 14242
see [cite] for more 14212
the definition of the 14190
as well as in 14188
the results of ref 14168
the details of the 14135
for more details see 14110
the reader is referred 14054
may be found in 14034
given in table ref 14027
is a consequence of 13990
are consistent with the 13990
is expected to be 13963
as defined in [cite] 13948
one of the most 13928
the rest of the 13919
it can be shown 13914
the framework of the 13896
shown in fig NUM 13895
is well known see 13854
this is consistent with 13755
@dginev
Author

dginev commented Feb 28, 2020

  • Extracted from a token model containing 23,297,235 inline citations.
  • The top entry "see e.g. [cite]" is thus seen in only about 1.5% of inline citation cases (340,651 of 23,297,235).
  • The counts of these 200 4-grams add up to 5.6 million, or 24% of the total data (in fact less, since overlaps are possible).
  • Generated with llamapun's citation_ngrams example, written for this purpose.
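
For readers who want to experiment with the idea, here is a minimal Python sketch of this kind of counting. It is not the citation_ngrams code from llamapun (which is written in Rust), and several things in it are assumptions: the token conventions are inferred from the table entries (lowercased words, punctuation dropped, numbers replaced by NUM, citations marked as [cite]), "within 15 words" is read as "the whole 4-gram lies at most 15 tokens from a citation", and each 4-gram position is counted once even when two citation windows overlap. The names `normalize`, `count_cite_ngrams` and `share` are hypothetical, and the \ref substitution is left out.

```python
import re
from collections import Counter

CITE = "[cite]"
# Citation marker, number, or lowercase word; everything else is dropped.
TOKEN_RE = re.compile(r"\[cite\]|\d+(?:\.\d+)?|[a-z]+")


def normalize(text):
    """Lowercase, drop punctuation, and replace numeric literals with NUM."""
    tokens = []
    for tok in TOKEN_RE.findall(text.lower()):
        tokens.append("NUM" if tok[0].isdigit() else tok)
    return tokens


def count_cite_ngrams(tokens, n=4, window=15):
    """Count n-grams lying fully within `window` tokens of a citation marker."""
    counts = Counter()
    seen_starts = set()  # avoid double counting where windows overlap
    for pos, tok in enumerate(tokens):
        if tok != CITE:
            continue
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for start in range(lo, hi - n + 1):
            if start not in seen_starts:
                seen_starts.add(start)
                counts[" ".join(tokens[start:start + n])] += 1
    return counts


def share(count, total):
    """Percentage of `total` accounted for by `count`."""
    return 100.0 * count / total
```

Calling `count_cite_ngrams(normalize("See, e.g., [cite] and Theorem 2 in [cite]."))` yields entries such as "see e g [cite]" and "theorem NUM in [cite]", matching the shape of the rows in the table, and `share(340651, 23297235)` gives the top entry's percentage of all inline citations.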

@VladimirAlexiev

  • What's NUM, e.g. in "from NUM to NUM"?
  • Very few of those indicate any polarity, e.g.:
    • are consistent with the
    • is a consequence of
    • in the spirit of
    • in the same way

@dginev
Author

dginev commented Feb 28, 2020

  • All numeric literals are replaced with NUM in my tokenization; similarly, ref stands in for LaTeX \ref numbers.
  • I agree that most do not indicate polarity,
    • but a range of them indicate "kinds of certainty", even when phrased neutrally.
      • is well known - old & tried result, well-accepted
      • was proved in, it is shown in - certain by virtue of carrying its own proof (common to math)
      • our previous work - explicitly claiming credit & disclosing personal bias

I stand by my conclusion that n-grams are just the wrong tool to study inline citations, but there's a lot one could imagine doing at the sentence level.
