A simple web crawler that recursively crawls all links on a specified domain and outputs them hierarchically along with the header tags (h1, h2, h3, h4, h5, h6) in each page. The crawler only follows links that are HTTP or HTTPS, within the same domain, and have not been crawled before.
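The link-following rule described above could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in crawler.py: the function name `should_follow` and its parameters are hypothetical, and treating subdomains as outside the domain is an assumption.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def should_follow(link, base_url, domain, visited):
    """Return the absolute URL to crawl next, or None if the link is skipped.

    Hypothetical helper: `link` is an href found on the page at `base_url`,
    `domain` is the domain being crawled, `visited` is a set of crawled URLs.
    """
    # Resolve relative links against the current page and drop any #fragment,
    # so that /page and /page#section count as the same URL.
    absolute, _fragment = urldefrag(urljoin(base_url, link))
    parsed = urlparse(absolute)

    if parsed.scheme not in ("http", "https"):
        return None  # skips mailto:, javascript:, ftp:, and similar
    if parsed.hostname != domain:
        return None  # different domain (assumption: subdomains are excluded too)
    if absolute in visited:
        return None  # already crawled
    return absolute
```

A crawler built on this would add each returned URL to `visited` before fetching it, which keeps the recursion from looping on pages that link to each other.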
To run the script, use the following command:
python crawler.py <domain>
Replace <domain> with the domain you want to crawl. For example:
python crawler.py example.com
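For illustration, the command-line handling behind such a call might look like the sketch below. The helper names `parse_args` and `start_url_for` are hypothetical, and defaulting to HTTPS when the argument has no scheme is an assumption, not documented behaviour of crawler.py.

```python
import argparse


def start_url_for(domain):
    """Turn a bare domain such as 'example.com' into a URL to start from."""
    # Assumption: default to HTTPS when the argument carries no scheme.
    if domain.startswith(("http://", "https://")):
        return domain
    return "https://" + domain


def parse_args(argv=None):
    """Parse the single positional <domain> argument."""
    parser = argparse.ArgumentParser(description="Crawl all links on a domain.")
    parser.add_argument("domain", help="domain to crawl, e.g. example.com")
    return parser.parse_args(argv)


# An explicit list is passed here for demonstration; calling parse_args()
# with no argument reads the real command line from sys.argv instead.
args = parse_args(["example.com"])
start_url = start_url_for(args.domain)
```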
To build the Docker image, run the following command:
docker build -t web-crawler .
To run the Docker container, use the following command:
docker run web-crawler <domain>
Replace <domain> with the domain you want to crawl. For example:
docker run web-crawler example.com
This script requires the following dependencies to be installed:
- requests
- beautifulsoup4
To install the dependencies, run the following command:
pip install -r requirements.txt