Free Google's Robots.txt Parser Tool - Official
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"authorship_tag": "ABX9TyNS6EqThYLVmcWV5KwpAo1+",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/natzir/87474e21a1b52a12deea9f487017d2e1/free-google-s-robots-txt-parser-tool-official.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"# Google's Robots.txt Parser Tool\n",
"\n",
"\n",
"---\n",
"\n",
"**Author:** Natzir, Technical SEO / Data Scientist\n",
"<br>**X/Twitter:** [@natzir9](https://twitter.com/natzir9)\n",
"\n",
"---\n",
"\n",
"### If you've ever tried to test a robots.txt file before deploying it, you've probably run into a common issue: most publicly available robots.txt parsers don't follow Google's standard. This has become especially relevant since Google removed the robots.txt tester from Search Console. With this Colab notebook, you can test a robots.txt file as if you were Google, because it uses the [open-source library that Googlebot uses in production](https://github.com/google/robotstxt).\n",
"\n",
"### Before you can use this notebook, make sure to **create a copy** of it in your Google Drive (File > Save a copy in Drive)\n"
],
"metadata": {
"id": "6tH7Un4dQMjA"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "SJC4jO6cN0Ip"
},
"outputs": [],
"source": [
"#@markdown # Step 1/4: \"Play\" this cell to install the required libraries\n",
"\n",
"# @markdown ### Please be patient.\n",
"\n",
"\n",
"#@markdown ---\n",
"!sudo apt-get install -y apt-transport-https curl gnupg\n",
"!curl -fsSL https://bazel.build/bazel-release.pub.gpg | gpg --dearmor > bazel.gpg\n",
"!sudo mv bazel.gpg /etc/apt/trusted.gpg.d/\n",
"!echo \"deb [arch=amd64] https://storage.googleapis.com/bazel-apt stable jdk1.8\" | sudo tee /etc/apt/sources.list.d/bazel.list\n",
"!sudo apt-get update && sudo apt-get install -y bazel\n",
"!git clone https://github.com/google/robotstxt.git\n",
"%cd robotstxt/\n",
"!bazel build :robots_main"
]
},
{
"cell_type": "markdown",
"source": [
"# Step 2/4: Write the robots.txt below\n",
"\n",
"### Please write the content of your `robots.txt` file in the cell below, then **run the cell** by clicking its play button.\n",
"### **Don't touch** the first line `%%writefile robots.txt`\n"
],
"metadata": {
"id": "QyL3seR-Xq49"
}
},
{
"cell_type": "code",
"source": [
"%%writefile robots.txt\n",
"\n",
"# Google Search Engine Robot\n",
"# ==========================\n",
"User-agent: Googlebot\n",
"\n",
"Allow: /*?lang=\n",
"Allow: /hashtag/*?src=\n",
"Allow: /search?q=%23\n",
"Allow: /i/api/\n",
"Disallow: /search/realtime\n",
"Disallow: /search/users\n",
"Disallow: /search/*/grid\n",
"\n",
"Disallow: /*?\n",
"Disallow: /*/followers\n",
"Disallow: /*/following\n",
"\n",
"Disallow: /account/deactivated\n",
"Disallow: /settings/deactivated\n",
"\n",
"Disallow: /[_0-9a-zA-Z]+/status/[0-9]+/likes\n",
"Disallow: /[_0-9a-zA-Z]+/status/[0-9]+/retweets\n",
"Disallow: /[_0-9a-zA-Z]+/likes\n",
"Disallow: /[_0-9a-zA-Z]+/media\n",
"Disallow: /[_0-9a-zA-Z]+/photo\n",
"\n",
"\n",
"User-Agent: FacebookBot\n",
"Disallow: *\n",
"\n",
"User-agent: facebookexternalhit\n",
"Disallow: *\n",
"\n",
"User-agent: Discordbot\n",
"Disallow: *\n",
"\n",
"User-agent: Bingbot\n",
"Disallow: *\n",
"\n",
"# Every bot that might possibly read and respect this file\n",
"# ========================================================\n",
"User-agent: *\n",
"Disallow: /\n",
"\n",
"\n",
"# WHAT-4882 - Block indexing of links in notification emails. This applies to all bots.\n",
"# =====================================================================================\n",
"Disallow: /i/u\n",
"Noindex: /i/u\n",
"\n",
"# Wait 1 second between successive requests. See ONBOARD-2698 for details.\n",
"Crawl-delay: 1\n",
"\n",
"# Independent of user agent. Links in the sitemap are full URLs using https:// and need to match\n",
"# the protocol of the sitemap.\n",
"Sitemap: https://twitter.com/sitemap.xml"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "qF67QFnNOkQi",
"outputId": "461ae667-3c69-4438-c45d-4960b96ff092"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Writing robots.txt\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"# Step 3/4: Enter the URLs to test if they are blocked by your robots.txt file\n",
"### Please add the URLs, one per line, then run the cell by clicking its play button.\n",
"### **Don't touch** the first line `%%writefile urls.txt`\n"
],
"metadata": {
"id": "Rbxjll36a-Ie"
}
},
{
"cell_type": "code",
"source": [
"%%writefile urls.txt\n",
"\n",
"https://twitter.com/natzir9/status/1760592136649494873/likes\n",
"https://twitter.com/natzir9/status/1760592136649494873/likes?utm_source=test\n",
"https://twitter.com/natzir"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "CqbFTHG9a-UI",
"outputId": "fd7ad122-fe04-4fa0-92a9-9a240c79290f"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Writing urls.txt\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"# @markdown # Step 4/4: Set the User Agent\n",
"#@markdown ---\n",
"user_agent = \"Googlebot\" #@param {type: \"string\"}\n",
"\n",
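"# Read the URLs written in Step 3, one per line\n",
"with open('urls.txt', 'r') as file:\n",
"    urls = file.read().splitlines()\n",
"\n",
"for url in urls:\n",
"    # Skip blank lines and lines commented out with '#'\n",
"    if url and not url.startswith('#'):\n",
"        !bazel-bin/robots_main robots.txt \"$user_agent\" \"$url\"\n",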
"\n",
"# @markdown ### After setting the User Agent, run this cell by clicking the \"Play\" button."
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"cellView": "form",
"id": "rTWOnIRYT7-t",
"outputId": "7e69084f-4fa6-4d41-a40a-d11ebd6c631d"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"user-agent 'Googlebot' with URI 'https://twitter.com/natzir9/status/1760592136649494873/likes': ALLOWED\n",
"user-agent 'Googlebot' with URI 'https://twitter.com/natzir9/status/1760592136649494873/likes?utm_source=test': DISALLOWED\n",
"user-agent 'Googlebot' with URI 'https://twitter.com/natzir': ALLOWED\n"
]
}
]
}
]
}
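
The Step 4 output can look surprising: the `/likes` URL is reported as ALLOWED even though the sample file contains `Disallow: /[_0-9a-zA-Z]+/status/[0-9]+/likes`. A likely explanation is that robots.txt patterns are not regular expressions; only `*` (any run of characters) and a trailing `$` (end of URL) are special, so the bracketed rule is compared literally. The sketch below illustrates that idea in plain Python. It is a simplified approximation, not the google/robotstxt implementation: it assumes the Googlebot group has already been selected, ignores percent-encoding normalization, and the names `pattern_to_regex`, `is_allowed` and `GOOGLEBOT_RULES` are hypothetical helpers introduced here for illustration.

```python
import re
from urllib.parse import urlsplit

# Rules copied from the "User-agent: Googlebot" group of the sample robots.txt.
GOOGLEBOT_RULES = [
    ("allow", "/*?lang="),
    ("allow", "/hashtag/*?src="),
    ("allow", "/search?q=%23"),
    ("allow", "/i/api/"),
    ("disallow", "/search/realtime"),
    ("disallow", "/search/users"),
    ("disallow", "/search/*/grid"),
    ("disallow", "/*?"),
    ("disallow", "/*/followers"),
    ("disallow", "/*/following"),
    ("disallow", "/account/deactivated"),
    ("disallow", "/settings/deactivated"),
    ("disallow", "/[_0-9a-zA-Z]+/status/[0-9]+/likes"),
    ("disallow", "/[_0-9a-zA-Z]+/status/[0-9]+/retweets"),
    ("disallow", "/[_0-9a-zA-Z]+/likes"),
    ("disallow", "/[_0-9a-zA-Z]+/media"),
    ("disallow", "/[_0-9a-zA-Z]+/photo"),
]


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a regex string.

    Only '*' and a trailing '$' are treated as special (assumption of this
    sketch); every other character, including '[', ']' and '+', is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # Without a trailing '$' the pattern is a prefix match.
    return body if anchored else body + ".*"


def is_allowed(rules, url):
    """Return True if the URL is allowed under the given (kind, pattern) rules.

    Simplified precedence: the longest matching pattern wins, and an Allow
    rule wins a tie. A URL that matches no rule is allowed.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query

    best_allow = -1
    best_disallow = -1
    for kind, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing in this sketch
        if re.fullmatch(pattern_to_regex(pattern), path):
            if kind == "allow":
                best_allow = max(best_allow, len(pattern))
            else:
                best_disallow = max(best_disallow, len(pattern))
    return best_disallow <= best_allow


if __name__ == "__main__":
    test_urls = [
        "https://twitter.com/natzir9/status/1760592136649494873/likes",
        "https://twitter.com/natzir9/status/1760592136649494873/likes?utm_source=test",
        "https://twitter.com/natzir",
    ]
    for url in test_urls:
        verdict = "ALLOWED" if is_allowed(GOOGLEBOT_RULES, url) else "DISALLOWED"
        print(f"{url}: {verdict}")
```

Under these assumptions the sketch gives the same three verdicts as the Step 4 output: the bracketed rules never match a real path, while `Disallow: /*?` blocks the URL that carries a query string. For real decisions, rely on the `robots_main` binary built in Step 1 rather than on this approximation.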