Free Google's Robots.txt Parser Tool - Official
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "authorship_tag": "ABX9TyNS6EqThYLVmcWV5KwpAo1+",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/natzir/87474e21a1b52a12deea9f487017d2e1/free-google-s-robots-txt-parser-tool-official.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Google's Robots.txt Parser Tool\n",
        "\n",
        "\n",
        "---\n",
        "\n",
        "**Author:** Natzir, Technical SEO / Data Scientist\n",
        "<br>**X/Twitter:** [@natzir9](https://twitter.com/natzir9)\n",
        "\n",
        "---\n",
        "\n",
"### If you've ever tried to test a robots.txt file before deploying it, you've probably run into a common issue: most publicly available robots parsers don't follow Google's standard. This has become especially relevant since Google removed the robots.txt tester from Search Console. With this Colab notebook, you can test a robots.txt file as if you were Google because it uses the [open-source library that Googlebot uses in production](https://github.com/google/robotstxt).\n", | |
"\n", | |
"### Before you can use this notebook, make sure to **create a copy** of it in your Google Drive (File > Save a copy in Drive)\n" | |
      ],
      "metadata": {
        "id": "6tH7Un4dQMjA"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "SJC4jO6cN0Ip"
      },
      "outputs": [],
      "source": [
"#@markdown # Step 1/4: \"Play\" this cell to install the the required libraries\n", | |
"\n", | |
"# @markdown ### Please be patient.\n", | |
"\n", | |
"\n", | |
"#@markdown ---\n", | |
"!sudo apt-get install apt-transport-https curl gnupg\n", | |
"!curl -fsSL https://bazel.build/bazel-release.pub.gpg | gpg --dearmor > bazel.gpg\n", | |
"!sudo mv bazel.gpg /etc/apt/trusted.gpg.d/\n", | |
"!echo \"deb [arch=amd64] https://storage.googleapis.com/bazel-apt stable jdk1.8\" | sudo tee /etc/apt/sources.list.d/bazel.list\n", | |
"!sudo apt-get update && sudo apt-get install bazel\n", | |
"!git clone https://github.com/google/robotstxt.git\n", | |
"%cd robotstxt/\n", | |
"!bazel build :robots_main" | |
] | |
}, | |
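    {
      "cell_type": "markdown",
      "source": [
        "### What the build produces: Step 1 compiles `robots_main`, the command-line front end of Google's parser. Steps 2-4 below simply feed it a robots.txt file, a user agent, and a URL, and it prints one verdict per URL. A sketch of the invocation (the example URL here is hypothetical; the real call happens in Step 4):\n",
        "\n",
        "```\n",
        "bazel-bin/robots_main robots.txt Googlebot https://example.com/page\n",
        "user-agent 'Googlebot' with URI 'https://example.com/page': ALLOWED\n",
        "```\n"
      ],
      "metadata": {}
    },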
    {
      "cell_type": "markdown",
      "source": [
        "# Step 2/4: Write the robots.txt below\n",
        "\n",
"### Please write the content of your `robots.txt` file in the input field below and then **run the cell** by clicking the following playing button to process it.\n", | |
"### **Don't touch** the first line `%%writefile robots.txt`\n" | |
      ],
      "metadata": {
        "id": "QyL3seR-Xq49"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "%%writefile robots.txt\n",
        "\n",
        "# Google Search Engine Robot\n",
        "# ==========================\n",
        "User-agent: Googlebot\n",
        "\n",
        "Allow: /*?lang=\n",
        "Allow: /hashtag/*?src=\n",
        "Allow: /search?q=%23\n",
        "Allow: /i/api/\n",
        "Disallow: /search/realtime\n",
        "Disallow: /search/users\n",
        "Disallow: /search/*/grid\n",
        "\n",
        "Disallow: /*?\n",
        "Disallow: /*/followers\n",
        "Disallow: /*/following\n",
        "\n",
        "Disallow: /account/deactivated\n",
        "Disallow: /settings/deactivated\n",
        "\n",
        "Disallow: /[_0-9a-zA-Z]+/status/[0-9]+/likes\n",
        "Disallow: /[_0-9a-zA-Z]+/status/[0-9]+/retweets\n",
        "Disallow: /[_0-9a-zA-Z]+/likes\n",
        "Disallow: /[_0-9a-zA-Z]+/media\n",
        "Disallow: /[_0-9a-zA-Z]+/photo\n",
        "\n",
        "\n",
        "User-Agent: FacebookBot\n",
        "Disallow: *\n",
        "\n",
        "User-agent: facebookexternalhit\n",
        "Disallow: *\n",
        "\n",
        "User-agent: Discordbot\n",
        "Disallow: *\n",
        "\n",
        "User-agent: Bingbot\n",
        "Disallow: *\n",
        "\n",
        "# Every bot that might possibly read and respect this file\n",
        "# ========================================================\n",
        "User-agent: *\n",
        "Disallow: /\n",
        "\n",
        "\n",
        "# WHAT-4882 - Block indexing of links in notification emails. This applies to all bots.\n",
        "# =====================================================================================\n",
        "Disallow: /i/u\n",
        "Noindex: /i/u\n",
        "\n",
        "# Wait 1 second between successive requests. See ONBOARD-2698 for details.\n",
        "Crawl-delay: 1\n",
        "\n",
        "# Independent of user agent. Links in the sitemap are full URLs using https:// and need to match\n",
        "# the protocol of the sitemap.\n",
        "Sitemap: https://twitter.com/sitemap.xml"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "qF67QFnNOkQi",
        "outputId": "461ae667-3c69-4438-c45d-4960b96ff092"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Writing robots.txt\n"
          ]
        }
      ]
    },
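    {
      "cell_type": "markdown",
      "source": [
        "### Optional: if you'd rather test a site's live `robots.txt` than paste one by hand, the cell below is a minimal sketch that downloads it with Python's standard library and overwrites the local `robots.txt` written above. It assumes the site serves the file at the standard `/robots.txt` path; replace the example URL with your own site. (Some servers block Python's default user agent, in which case pasting the file manually is the safer route.)\n"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import urllib.request\n",
        "\n",
        "# Assumption: example URL; point this at the site you want to test.\n",
        "robots_url = 'https://twitter.com/robots.txt'\n",
        "\n",
        "# Download the live file and overwrite the robots.txt from Step 2.\n",
        "with urllib.request.urlopen(robots_url) as resp:\n",
        "    content = resp.read().decode('utf-8')\n",
        "\n",
        "with open('robots.txt', 'w') as f:\n",
        "    f.write(content)\n",
        "\n",
        "print(content[:500])  # preview the first 500 characters\n"
      ]
    },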
    {
      "cell_type": "markdown",
      "source": [
        "# Step 3/4: Enter the URLs to test against your robots.txt\n",
        "### Add the URLs you want to check, one per line, then run the cell by clicking its Play button.\n",
        "### **Don't touch** the first line, `%%writefile urls.txt`.\n"
      ],
      "metadata": {
        "id": "Rbxjll36a-Ie"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "%%writefile urls.txt\n",
        "\n",
        "https://twitter.com/natzir9/status/1760592136649494873/likes\n",
        "https://twitter.com/natzir9/status/1760592136649494873/likes?utm_source=test\n",
        "https://twitter.com/natzir"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "CqbFTHG9a-UI",
        "outputId": "fd7ad122-fe04-4fa0-92a9-9a240c79290f"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Writing urls.txt\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# @markdown # Step 4/4: Set the User Agent\n",
        "#@markdown ---\n",
        "user_agent = \"Googlebot\" #@param {type: \"string\"}\n",
        "\n",
        "with open('urls.txt', 'r') as file:\n",
        "    urls = file.read().splitlines()\n",
        "\n",
        "for url in urls:\n",
        "    if url and not url.startswith('#'):\n",
        "        !bazel-bin/robots_main robots.txt \"$user_agent\" \"$url\"\n",
        "\n",
        "# @markdown ### After setting the User Agent, run this cell by clicking the \"Play\" button."
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "cellView": "form",
        "id": "rTWOnIRYT7-t",
        "outputId": "7e69084f-4fa6-4d41-a40a-d11ebd6c631d"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "user-agent 'Googlebot' with URI 'https://twitter.com/natzir9/status/1760592136649494873/likes': ALLOWED\n",
            "user-agent 'Googlebot' with URI 'https://twitter.com/natzir9/status/1760592136649494873/likes?utm_source=test': DISALLOWED\n",
            "user-agent 'Googlebot' with URI 'https://twitter.com/natzir': ALLOWED\n"
          ]
        }
      ]
    },
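    {
      "cell_type": "markdown",
      "source": [
        "### Optional: as noted at the top, most generic parsers don't follow Google's standard. The sketch below re-checks the same URLs with Python's standard-library `urllib.robotparser`, which predates Google's robots.txt specification; any URL where its verdict differs from Step 4's is exactly why the official parser matters. It assumes Steps 2-4 have been run, since it reuses `robots.txt`, `urls.txt`, and the `user_agent` variable.\n"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import urllib.robotparser\n",
        "\n",
        "# Parse the same robots.txt with the standard-library parser.\n",
        "rp = urllib.robotparser.RobotFileParser()\n",
        "with open('robots.txt') as f:\n",
        "    rp.parse(f.read().splitlines())\n",
        "\n",
        "# Re-read the URLs from Step 3, skipping blanks and comments.\n",
        "with open('urls.txt') as f:\n",
        "    urls = [u for u in f.read().splitlines() if u and not u.startswith('#')]\n",
        "\n",
        "# user_agent was set in the Step 4 form above.\n",
        "for url in urls:\n",
        "    verdict = 'ALLOWED' if rp.can_fetch(user_agent, url) else 'DISALLOWED'\n",
        "    print(f\"urllib.robotparser: '{user_agent}' with URI '{url}': {verdict}\")\n"
      ]
    }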
  ]
}