@malev
Last active April 21, 2017 21:39
Test amp errors
  • Can you verify 5555 amp links?
  • sure I can!

What do we need?

What do we have?

The links are in a CSV file with this format:

ID,BRAND,AMP Error Type,AMP URL,Last detected

Research

I need to send each URL, one by one, through amphtml-validator:

$ node_modules/.bin/amphtml-validator http://www.bonappetit.com/uncategorized/article/the-linkery-03-24-10/amp --format json
{"http://www.bonappetit.com/uncategorized/article/the-linkery-03-24-10/amp":{"status":"FAIL","errors":[{"severity":"ERROR","line":33,"col":3,"message":"Invalid URL protocol 'foodhttp:' for attribute 'href' in tag 'a'.","specUrl":"https://www.ampproject.org/docs/reference/spec#links","category":"DISALLOWED_HTML","code":"INVALID_URL_PROTOCOL","params":["href","a","foodhttp"]}]}}
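The validator output above is a JSON object keyed by the URL that was checked. As a minimal sketch (assuming Python 3; `summarize` is a hypothetical helper name, and the sample is the output above trimmed to the fields used), it can be reduced to a status and a first error message like this:

```python
import json

# Sample copied from the validator output above, trimmed to the fields used here.
SAMPLE = '''
{"http://www.bonappetit.com/uncategorized/article/the-linkery-03-24-10/amp":
  {"status": "FAIL",
   "errors": [{"severity": "ERROR", "line": 33, "col": 3,
               "message": "Invalid URL protocol 'foodhttp:' for attribute 'href' in tag 'a'.",
               "code": "INVALID_URL_PROTOCOL"}]}}
'''


def summarize(raw):
    """Return (url, status, first_error_message) for one validator result."""
    parsed = json.loads(raw)
    # The top-level object has a single key: the URL that was validated.
    url, result = next(iter(parsed.items()))
    errors = result.get("errors", [])
    first_message = errors[0]["message"] if errors else ""
    return url, result["status"], first_message


url, status, message = summarize(SAMPLE)
print(status, message)
```

Only the first error is kept here, which matches what the notebook further down does.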

We can use head and tail to select a specific row:

$ head -n 2 errors.csv | tail -n 1
1,Bon Appetit,Prohibited or invalid use of HTML Tag (Critical issue),http://www.bonappetit.com/recipe/spicy-italian-sausage/amp,4/3/17
$ head -n 3 errors.csv | tail -n 1
2,Bon Appetit,Prohibited or invalid use of HTML Tag (Critical issue),http://www.bonappetit.com/entertaining-style/gift-guides/article/the-7-best-culinary-bookstores-in-america/amp,4/3/17

From each row we need to select the URL, which is the fourth column. csvcut (from csvkit) does this:

$ head -n 3 errors.csv | tail -n 1 | csvcut -c 4
http://www.bonappetit.com/entertaining-style/gift-guides/article/the-7-best-culinary-bookstores-in-america/amp
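The head, tail and csvcut pipeline re-reads the file once per row. As an alternative sketch (assuming Python 3 and the header row shown earlier; `read_amp_urls` is a hypothetical helper, not part of any tool used here), the URL column can be read in a single pass:

```python
import csv
import io


def read_amp_urls(csv_file):
    """Yield the 'AMP URL' column from an open CSV file, skipping the header."""
    for row in csv.DictReader(csv_file):
        yield row["AMP URL"]


# Two rows copied from the examples above, standing in for errors.csv.
SAMPLE = io.StringIO(
    "ID,BRAND,AMP Error Type,AMP URL,Last detected\n"
    "1,Bon Appetit,Prohibited or invalid use of HTML Tag (Critical issue),"
    "http://www.bonappetit.com/recipe/spicy-italian-sausage/amp,4/3/17\n"
    "2,Bon Appetit,Prohibited or invalid use of HTML Tag (Critical issue),"
    "http://www.bonappetit.com/entertaining-style/gift-guides/article/"
    "the-7-best-culinary-bookstores-in-america/amp,4/3/17\n"
)

urls = list(read_amp_urls(SAMPLE))
for u in urls:
    print(u)
```

With the real file, `open("errors.csv", newline="")` would take the place of `SAMPLE`.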

From there we need to vary the line number passed to head, starting at 2 (line 1 is the header row) and ending at 5555. We can use seq for this:

$ seq 2 10
2
3
4
5
6
7
8
9
10

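And we are going to use parallel to speed up the process, running 10 jobs at a time. The output directory must already exist, and xargs appends each URL as the last argument to the validator:

seq 2 5555 | parallel -j10 "head -n {} errors.csv | tail -n 1 | csvcut -c 4 | xargs node_modules/.bin/amphtml-validator --format json > output/{}.json"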
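The same fan-out can be sketched in Python with a thread pool. This is only an illustration, assuming Python 3 and the validator path used above; `run_validator` and `validate_all` are hypothetical names, and the validator is invoked with the same `--format json` flag as in the shell command:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Path taken from the commands above.
VALIDATOR = "node_modules/.bin/amphtml-validator"


def run_validator(url):
    """Run the validator on one URL and return its JSON output as text."""
    proc = subprocess.run(
        [VALIDATOR, "--format", "json", url],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    # The exit code is deliberately not checked: a page that fails
    # validation may make the tool exit non-zero while still printing JSON.
    return proc.stdout.strip()


def validate_all(urls, run_one=run_validator, workers=10):
    """Validate URLs concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, urls))
```

Calling `validate_all(urls)` returns one JSON string per URL. Threads are enough here because each worker mostly waits on an external process.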

Now we have a bunch of JSON files inside ./output. From inside that directory, we can merge them into a single file with one result per line:

for f in *.json; do (cat "${f}"; echo) >> output.dat; done

Finally, we need to remove the occasional empty lines from our file:

sed '/^\s*$/d' output/output.dat > clean.dat

Our new file clean.dat is ready to go!
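The merge and clean-up steps can also be done in one pass. A rough sketch, assuming Python 3 (`merge_results` is a hypothetical helper, and the demo runs on a temporary directory rather than the real ./output):

```python
import glob
import os
import tempfile


def merge_results(output_dir, merged_path):
    """Copy every non-empty line of output_dir/*.json into merged_path.

    Returns the number of lines written.
    """
    count = 0
    with open(merged_path, "w") as merged:
        for path in sorted(glob.glob(os.path.join(output_dir, "*.json"))):
            with open(path) as f:
                for line in f:
                    if line.strip():  # skip blank lines and empty files
                        merged.write(line.strip() + "\n")
                        count += 1
    return count


# Demo on a throwaway directory with made-up file contents.
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("2.json", '{"a": 1}\n'), ("3.json", ""), ("4.json", '\n{"b": 2}')]:
        with open(os.path.join(tmp, name), "w") as f:
            f.write(text)
    target = os.path.join(tmp, "clean.dat")
    written = merge_results(tmp, target)
    with open(target) as f:
        merged_text = f.read()

print(written)
```

With the real data this would be called as `merge_results("output", "clean.dat")`.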

{
"cells": [
{
"cell_type": "code",
"execution_count": 119,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import json\n",
"import re\n",
"from urlparse import urlparse"
]
},
{
"cell_type": "code",
"execution_count": 120,
"metadata": {},
"outputs": [],
"source": [
"data = open('./clean.dat')"
]
},
{
"cell_type": "code",
"execution_count": 121,
"metadata": {},
"outputs": [],
"source": [
"to_csv = []\n",
"\n",
"# Each line of clean.dat is one validator result keyed by its URL\n",
"for row in data:\n",
"    parsed = json.loads(row)\n",
"    url = parsed.keys()[0]\n",
"    temp = {\n",
"        'brand': re.sub(r'www\\.', '', urlparse(url).netloc),\n",
"        'status': parsed[url]['status'],\n",
"        'url': url\n",
"    }\n",
"    if temp['status'] == 'FAIL':\n",
"        # Keep only the first reported error, stripped down to ASCII\n",
"        temp['message'] = parsed[url]['errors'][0]['message'].encode('ascii', 'ignore')\n",
"        temp['severity'] = parsed[url]['errors'][0]['severity']\n",
"    else:\n",
"        temp['message'] = ''\n",
"        temp['severity'] = ''\n",
"    to_csv.append(temp)\n"
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {},
"outputs": [],
"source": [
"import csv\n",
"\n",
"# Python 2: open in binary mode so the csv module controls line endings\n",
"with open('file.csv', 'wb') as csvfile:\n",
"    fieldnames = ['brand', 'url', 'status', 'message', 'severity']\n",
"    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)\n",
"    writer.writeheader()\n",
"    for row in to_csv:\n",
"        try:\n",
"            writer.writerow(row)\n",
"        except Exception:\n",
"            # Print rows that cannot be written instead of aborting the export\n",
"            print(row)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
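The notebook above targets Python 2 (`urlparse` module, `parsed.keys()[0]`, binary-mode CSV). For reference, a rough Python 3 sketch of the same summarizing step could look like the following. `to_row` and `write_csv` are hypothetical names, and unlike the notebook's `re.sub` this version strips only a leading `www.`:

```python
import csv
import io
import json
from urllib.parse import urlparse

FIELDNAMES = ["brand", "url", "status", "message", "severity"]


def to_row(line):
    """Turn one line of clean.dat into a flat dict for the CSV."""
    parsed = json.loads(line)
    url, result = next(iter(parsed.items()))
    netloc = urlparse(url).netloc
    brand = netloc[4:] if netloc.startswith("www.") else netloc
    errors = result.get("errors", [])
    # Keep only the first error, and only for failing pages.
    first = errors[0] if result["status"] == "FAIL" and errors else {}
    return {
        "brand": brand,
        "url": url,
        "status": result["status"],
        "message": first.get("message", ""),
        "severity": first.get("severity", ""),
    }


def write_csv(lines, out):
    """Write one CSV row per non-empty input line to the open file `out`."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    for line in lines:
        if line.strip():
            writer.writerow(to_row(line))


# Sample line based on the validator output shown earlier, trimmed.
SAMPLE_LINE = (
    '{"http://www.bonappetit.com/uncategorized/article/the-linkery-03-24-10/amp":'
    '{"status":"FAIL","errors":[{"severity":"ERROR",'
    '"message":"Invalid URL protocol \'foodhttp:\' for attribute \'href\' in tag \'a\'."}]}}'
)

buf = io.StringIO()
write_csv([SAMPLE_LINE, "\n"], buf)
print(buf.getvalue())
```

For the real file, `write_csv(open("clean.dat"), open("file.csv", "w", newline=""))` would be the equivalent call.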
@rafaelhbarros

This is incredible. I'd suggest merging this doc into github.com/CondeNast/autopilot-services-validation
