@tayfie
Created March 14, 2017 04:36
How To Scrape Images from 4chan Using Wget

This guide is to save other sorry plebs from needing to RTFM to figure out how to use wget to scrape images from 4chan and other imageboards. There are lots of image downloaders in existence, but they are usually outdated or broken. Following this guide will save you time and teach you a powerful, general-purpose tool instead.

What Is Wget?

Wget is a command-line file downloader that can handle just about any download task normal and power users will ever need. Versions are available for Windows, Mac, and Linux. If it is not already installed on your machine, install it now.
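
A quick way to check whether wget is already on your PATH, as a minimal sketch in POSIX shell. The helper name have_cmd is made up for this example and is not part of wget.

```shell
#!/bin/sh
# have_cmd is a hypothetical helper: it succeeds if the named command
# can be found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd wget; then
    echo "wget found: $(command -v wget)"
else
    echo "wget not found; install it with your package manager first"
fi
```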

Basic syntax

wget [options] [urls]

Useful Options for Image Scraping

There are tons more, but these are the most useful ones for this guide.

  • -r downloads files recursively, following links found in already downloaded documents. This is essential because the common case is a single URL (the thread page) that links to all the image files.
  • -l [n] sets the maximum recursion depth. For image scraping, n will practically always be 1.
  • -H allows downloads from hosts other than the one in the original URL. This is needed because many sites serve images from a different domain.
  • -D [domains] limits which hosts wget will follow links to once -H is on. You will probably have to 'View Source' in your browser to know for sure what to put here. domains is a comma-separated list of domain names.
  • -P [prefix directory] tells wget where to save the downloaded files. The default is the current directory.
  • -nd saves every file directly into the target directory instead of recreating the site's directory hierarchy.
  • -A [extensions] is a comma-separated list of file extensions to keep. Files with other extensions are not kept.

Putting It Together

To download images from 4chan:

wget -P pictures -nd -r -l 1 -H -D i.4cdn.org -A png,gif,jpg,jpeg,webm [thread-url]

To download images from 8chan:

wget -P pictures -nd -r -l 1 -H -D media.8ch.net -A png,gif,jpg,jpeg,webm [thread-url]
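
The two commands above differ only in the domain passed to -D. As a rough sketch, that one difference can be turned into a parameter. The helper name scrape_cmd and the placeholder URL are assumptions for illustration. The function only prints the command so you can inspect it before running anything.

```shell
#!/bin/sh
# scrape_cmd is a hypothetical helper: it prints the wget command for a
# given domain list and thread URL instead of running it.
scrape_cmd() {
    domains=$1
    thread_url=$2
    printf 'wget -P pictures -nd -r -l 1 -H -D %s -A png,gif,jpg,jpeg,webm %s\n' \
        "$domains" "$thread_url"
}

# Example with a placeholder URL (not a real thread):
scrape_cmd i.4cdn.org 'https://example.invalid/thread/1'
```

Once the printed command looks right, paste it into your terminal to start the download.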

@himanshuxd

@ryankrage77 thanks, I was getting low-quality wallpapers and that fixed it!

@vitezfh

vitezfh commented Apr 15, 2020

ryankrage77's answer doesn't seem to work for me. But keeping i.4cdn.org and adding -R '?????????????s.*' to match and reject the numbered thumbnail images (e.g. "1586456902053s.jpg") works perfectly for me, at least on the /wg/ board:

wget -P save_folder -nd -r -l 1 -H -D i.4cdn.org -A png,gif,jpg,jpeg,webm -R '?????????????s.*' 4chan_url

@ryankrage77

@vitezfh, agreed, my earlier solution no longer seems to work, whereas i.4cdn.org now does. You can use -R '*s.*' as well. Works for me on /w/ at least.
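
To see which file names the two reject patterns catch, you can try them out with the shell's own `case` matching, which uses the same kind of wildcards (`?` for one character, `*` for any run of characters). This is only an approximation of how wget applies -R. The helper name matches is made up, and the second sample file name is an assumed full-size counterpart of the thumbnail name quoted above.

```shell
#!/bin/sh
# matches is a hypothetical helper: it succeeds if the file name ($2)
# matches the glob pattern ($1). The pattern is left unquoted inside
# `case` on purpose so that its wildcards stay active.
matches() {
    case $2 in
        $1) return 0 ;;
        *)  return 1 ;;
    esac
}

# Try both patterns against a thumbnail name and a full-size name.
for name in 1586456902053s.jpg 1586456902053.jpg; do
    for pat in '?????????????s.*' '*s.*'; do
        if matches "$pat" "$name"; then
            echo "$pat rejects $name"
        else
            echo "$pat keeps $name"
        fi
    done
done
```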

@kk-Chiron

kk-Chiron commented Jun 9, 2020

Is it possible to rename the save folder as the thread title/topic with wget? (on win10 so without grep or something)

@ryankrage77

> Is it possible to rename the save folder as the thread title/topic with wget? (on win10 so without grep or something)

You could write a script that takes the thread link as input, gets the title/topic from the page itself, then scrapes the images. I'm not sure exactly how you'd get the topic; it's in the content of a meta tag, so you should be able to get it with grep and some regex.

wget alone can't do this. You could set the url/post number as the folder name (which is what I do), but wget can't pull info out of the stuff it downloads.
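
Here is a rough sketch of both ideas: a folder name built from the board and post number in the thread URL, and a title pulled out of an already downloaded page. The helper names thread_dir and page_title are hypothetical. The URL shape and the single-line title element are assumptions, and this reads the title element rather than a meta tag for simplicity.

```shell
#!/bin/sh
# thread_dir (hypothetical): turn ".../<board>/thread/<number>" into
# "<board>_<number>". The URL shape is an assumption about thread links.
thread_dir() {
    printf '%s\n' "$1" |
        sed -n 's|^.*/\([^/]*\)/thread/\([0-9][0-9]*\).*$|\1_\2|p'
}

# page_title (hypothetical): print the text of the first <title> element
# read from standard input, assuming it sits on a single line.
page_title() {
    sed -n 's|.*<title>\(.*\)</title>.*|\1|p' | head -n 1
}

thread_dir 'https://boards.4chan.org/wg/thread/1234567'   # prints wg_1234567
printf '<html><head><title>Example thread</title></head></html>\n' |
    page_title                                             # prints Example thread
```

The result can then be passed to -P, for example -P "$(thread_dir "$url")". A page title may contain characters that are not valid in folder names, so it would need cleaning up first.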

@DannyParker0001

Some were on i.4cdn.org, some were on is2.4chan.org. This is what worked for me:

wget -P save_folder -nd -r -l 1 -H -D i.4cdn.org,is2.4chan.org -A png,gif,jpg,jpeg,webm -R '?????????????s.*' 4chan_url

@eallder

eallder commented Mar 8, 2021

This is what is working for me to not download the duplicate thumbnail files (the ones that end in "s"):

wget -P pictures -nd -r -l 1 -H -D i.4cdn.org -A png,gif,jpg,jpeg,webm -R '*?????????????s*' [thread-url]

@Ruberald

Just an update, this is what seems to work for me now:

wget -P pictures -nd -r -l 1 -H -D is2.4chan.org -A png,gif,jpg,jpeg,webm [thread-url]

@eallder

eallder commented Jun 21, 2023

Can confirm that the i.4cdn.org address no longer works. Just tried with the following command and it works fine:

wget -P pictures -nd -r -l 1 -H -D is2.4chan.org -A png,gif,jpg,jpeg,webm -R '*?????????????s*' [thread-url]

@vitezfh

vitezfh commented Jun 26, 2023

The i.4cdn.org address is working fine, @eallder
It just depends on which board you're downloading from. The /x/ board, for example, still serves images from that address.
So the answer from DannyParker0001 is probably best:

> Some were on i.4cdn.org, some were on is2.4chan.org
> wget -P save_folder -nd -r -l 1 -H -D i.4cdn.org,is2.4chan.org -A png,gif,jpg,jpeg,webm -R '?????????????s.*' 4chan_url
> Is what worked for me

@vitezfh

vitezfh commented Jun 26, 2023

Here is a more maintainable script for it:

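#!/bin/bash
# Download media from one or more 4chan threads given as arguments.

# Hosts that serve media; which one is used depends on the board.
CHAN_DOMAINS=${CHAN_DOMAINS-i.4cdn.org,is2.4chan.org}
# Folder to save the files in.
directory=${directory-4chan_media}

# --reject skips the thumbnail copies, whose names end in "s".
# "$@" passes every thread URL to wget as its own argument.
wget --directory-prefix="$directory" \
	--no-directories \
	--recursive \
	--level 1 \
	--span-hosts \
	-D "$CHAN_DOMAINS" \
	--accept png,gif,jpg,jpeg,webm \
	--reject '?????????????s*' \
	"$@"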
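
The ${VAR-default} expansions at the top of the script mean both settings can be overridden from the environment without editing the file. The sketch below only demonstrates that defaulting and downloads nothing. The helper name show_settings and the override values are made up for illustration.

```shell
#!/bin/sh
# Same defaulting as in the script above: use the existing value if the
# variable is set, otherwise fall back to the default after the dash.
show_settings() {
    CHAN_DOMAINS=${CHAN_DOMAINS-i.4cdn.org,is2.4chan.org}
    directory=${directory-4chan_media}
    echo "domains=$CHAN_DOMAINS directory=$directory"
}

# Run in subshells so each call starts from a known state.
( unset CHAN_DOMAINS directory; show_settings )
( CHAN_DOMAINS=is2.4chan.org; directory=wallpapers; show_settings )
```

With the script saved under a name of your choice (scrape.sh here is hypothetical), the same override looks like: CHAN_DOMAINS=is2.4chan.org directory=wallpapers ./scrape.sh [thread-url]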
