rhine3/xeno-canto.md

Last active November 5, 2023 11:09

Downloading files from Xeno-Canto

This script is no longer supported.

Over the years since I posted this script, it has become more and more common to scrape audio files off of Xeno-Canto.org. This has resulted in an overwhelming amount of traffic to their servers.

Please do not scrape Xeno-Canto without contacting the organizers first to ask for permission and for more information. They will be able to advise you on the best time of day to download data from their servers, or any alternative download options that are available.

Richterskala101 commented Sep 20, 2023

Hi,
Great script and idea!
I had two issues. First, in:

    # Make wget input file
    url_list = []
    for file in record_df['file'].tolist():
        url_list.append('https:{}'.format(file))
    with open('xc-noca-urls.txt', 'w+') as f:
        for item in url_list:
            f.write("{}\n".format(item))

I needed to change 'file' in "in record_df['file'].tolist():" to 'URL'. Otherwise, an additional unwanted "https" would have been added to each URL.

Second, the download seems to work for me, but the files appear to be corrupted. They have no filename extension, and when I add .wav or .mp3, the files cannot be opened with Raven.

I would be happy if you could point me in the right direction.

Thanks, Dominik

Richterskala101 commented Sep 21, 2023

Heya, sorry for spamming, but thought I'd share how it worked out for me.

The previous comment was misleading...
When I just delete the 'https:' in the append() call, like so:

    url_list = []
    for file in record_df['file'].tolist():
        url_list.append('{}'.format(file))
    with open('xc-noca-urls.txt', 'w+') as f:
        for item in url_list:
            f.write("{}\n".format(item))

the URL list text file works out great. That being said, I am completely inexperienced with Python, so there are definitely more elegant ways.
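One possible way to make this step robust, sketched under assumptions: the 'file' column may hold either protocol-relative values (starting with "//") or full URLs, depending on when the data was fetched. The helper names `normalize_url` and `write_url_list` are hypothetical and not part of the original script; the idea is to add the scheme only when it is missing, so the same code works in both cases.

```python
def normalize_url(raw, default_scheme="https"):
    """Return an absolute URL, adding a scheme only when one is missing."""
    raw = raw.strip()
    if raw.startswith("//"):
        # Protocol-relative form, e.g. //host/path
        return "{}:{}".format(default_scheme, raw)
    if raw.startswith(("http://", "https://")):
        # Already a full URL: leave it unchanged
        return raw
    # Assumption: anything else is a bare host/path without a scheme
    return "{}://{}".format(default_scheme, raw)


def write_url_list(urls, path):
    """Write one normalized URL per line, skipping blanks and non-strings (e.g. NaN)."""
    cleaned = [normalize_url(u) for u in urls if isinstance(u, str) and u.strip()]
    with open(path, "w") as f:
        for url in cleaned:
            f.write(url + "\n")
    return cleaned
```

With the data frame from the script, this would be called as `write_url_list(record_df['file'].tolist(), 'xc-noca-urls.txt')`.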

Another thing I stumbled upon was that the downloaded recordings were saved under arbitrary names.
Replacing "--trust-server-names" with "--content-disposition" preserved the original XC filenames.

Maybe that's helpful for someone...
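As a rough illustration of the flag swap described in this comment, the sketch below only builds and prints a wget command line; it does not run it. The original script's exact wget invocation is not shown here, so everything apart from `--content-disposition` is an assumption: the helper name `build_wget_command`, the output directory `recordings`, and the `--wait` pause between requests, which is added as a courtesy given the server-load concern stated at the top of this gist. Any actual download should only happen after getting permission from Xeno-Canto.

```python
import shlex


def build_wget_command(url_file, out_dir, wait_seconds=2):
    """Build (but do not run) a wget command that keeps server-provided filenames."""
    return [
        "wget",
        "--content-disposition",      # name files from the Content-Disposition header
        "--wait", str(wait_seconds),  # assumption: pause between requests
        "-P", out_dir,                # hypothetical output directory
        "-i", url_file,               # read URLs from the list file, one per line
    ]


cmd = build_wget_command("xc-noca-urls.txt", "recordings")
print(shlex.join(cmd))
# prints: wget --content-disposition --wait 2 -P recordings -i xc-noca-urls.txt
```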

Author

rhine3 commented Oct 9, 2023

Hi Dominik! To be honest, I should probably make this script private. It's from a while ago. Since I created it, it has become much more common to scrape Xeno-Canto and it is overwhelming their servers. So, the Xeno-Canto folks ask that you refrain from scraping it to the extent possible.