@Mahdisadjadi
Created June 9, 2017 06:37
Make a Markdown table from the categories and subcategories of arXiv.org
from bs4 import BeautifulSoup
import requests

# Pages to scrape: the home page (top-level categories) and the
# archive pages (subcategories of each category).
cat_url = 'https://arxiv.org/'
subcat_url = 'https://arxiv.org/archive/'

def return_soup(url):
    """Fetch a page and return its parsed HTML."""
    html = requests.get(url).content
    return BeautifulSoup(html, "html.parser")

def get_namespace(x):
    """Extract a category's name (link text) and code (bold text) from a list item."""
    name = x.find('a').text
    tag = x.find('b').text
    return name, tag

def convert_to_markdown(*args):
    """Join the given cells into a single Markdown table row."""
    print('| ' + ' | '.join(args) + ' |')

# Each top-level category on the home page is a <li> element.
main_page = return_soup(cat_url).find_all('li')

print('| Category | Code | Subcategories | Subcode |\n| --- | --- | --- | --- |')
for x in main_page:
    try:
        xname, xtag = get_namespace(x)
        print('| ' + xname + ' | `' + xtag + '` | ')
        # The last <ul> of a category's archive page lists its subcategories in bold.
        subcat_page = return_soup(subcat_url + xtag).find_all('ul')[-1].find_all('b')
        for y in subcat_page:
            # Entries look like "cs.AI - Artificial Intelligence".
            sub_code, sub_name = y.text.split(' - ', 1)
            print('| | | ' + sub_name + ' | `' + sub_code + '`')
    except Exception:
        # Skip list items that are not category entries.
        pass
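
To regenerate the table, run the script and redirect its output to a Markdown file. A minimal usage example, assuming the gist is saved as arxiv_table.py (a hypothetical filename):

# Writes the full category/subcategory table to arxiv_categories.md
python arxiv_table.py > arxiv_categories.md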
ghost commented Sep 15, 2020

Not working anymore ^^ did you ever update it?

@Mahdisadjadi (Author) commented:
Thank you for raising this issue. The home page design has changed, and I don't think scraping the homepage is sustainable long-term. I might eventually change this to get the data from the https://arxiv.org/category_taxonomy page instead.
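
A minimal sketch of that alternative, assuming the taxonomy page marks each category as an <h4> heading containing the code followed by the name in a <span> (the selectors here are assumptions and may need adjusting if arXiv changes the markup):

from bs4 import BeautifulSoup
import requests

taxonomy_url = 'https://arxiv.org/category_taxonomy'
soup = BeautifulSoup(requests.get(taxonomy_url).content, 'html.parser')

print('| Subcategory | Code |')
print('| --- | --- |')
for h4 in soup.find_all('h4'):
    span = h4.find('span')
    if span is None:
        continue
    # Assumed markup: <h4>cs.AI <span>(Artificial Intelligence)</span></h4>
    code = h4.get_text().replace(span.get_text(), '').strip()
    name = span.get_text().strip('() ')
    print('| ' + name + ' | `' + code + '` |')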

@Mahdisadjadi (Author) commented:
For an updated list as of March 2025, see this file.
