What to do if users tell us they sometimes get a 404 for content that should be there?
For example, we're seeing this problem on staging on 2019-11-14.
This is the production server cluster diagram taken from the (private) cnx repo:
For non-legacy requests, you can see from the diagram that the request goes:

- from `haproxy` on `prod00`,
- to `varnish` on `prod01` and `prod02`,
- and then to `archive` instances on `prod07` and `prod08`.

The `archive` instances on `prod07` use the secondary database on `prod10`, and the ones on `prod08` use the primary database on `prod09`.
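Before digging into logs on each box, a quick external check can tell you which layer answered a given request. This is only a sketch: it assumes staging mirrors the production layout above, that `archive-staging.cnx.org` (the `Host` header seen in the varnish log further down) is reachable from your machine, and that the `X-Varnish-*` response headers are passed through to clients.

```
# Hit the archive URL once and show only the status line and the X-Varnish-* headers.
curl -s -o /dev/null -D - \
  "https://archive-staging.cnx.org/contents/b8d5607f-7db7-47ed-8ea7-8cedb24284f0.json" \
  | grep -iE '^HTTP/|^x-varnish'
```

If the 404 is intermittent, running this a few times may show it flipping between backends or between a cache HIT and MISS.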
For internal server errors, I would usually just check the archive logs for the traceback, but for intermittent errors you probably want to check all the logs.
- `haproxy` logs on `staging00`: `ssh` into `staging00` and do something like:

  ```
  sudo grep b8d5607f-7db7-47ed-8ea7-8cedb24284f0 /var/log/haproxy.log
  ```

  to search for a particular URL, and it'll show something like:

  ```
  Nov 14 08:32:54 1573741973 127.0.0.1 info.local0 haproxy[24269]: 73.32.240.215:49712 [14/Nov/2019:08:32:54.750] http-proxy~ nodes/staging02.cnx.org 18/0/1/51/70 404 770 - - ---- 7/7/1/0/0 0/0 "GET /contents/b8d5607f-7db7-47ed-8ea7-8cedb24284f0.json HTTP/1.1"
  ```

  It's showing that the request went to varnish on `staging02` and got a `404`. (For intermittent errors, see the sketches just after this list for a way to count responses per backend.)

- `varnish` logs on `staging02`: `ssh` into `staging02` and do something like:

  ```
  sudo varnishlog -r /var/log/varnish/varnish.log -q 'ReqUrl ~ "b8d5607f-7db7-47ed-8ea7-8cedb24284f0"'
  ```

  to search for a particular URL, and you'll get something like:
  ```
  *   << Request  >> 974019
  -   Begin          req 974018 rxreq
  -   Timestamp      Start: 1573741974.770680 0.000000 0.000000
  -   Timestamp      Req: 1573741974.770680 0.000000 0.000000
  -   ReqStart       128.42.169.22 56544
  -   ReqMethod      GET
  -   ReqURL         /contents/b8d5607f-7db7-47ed-8ea7-8cedb24284f0.json
  -   ReqProtocol    HTTP/1.1
  -   ReqHeader      Host: archive-staging.cnx.org
  -   ReqHeader      Accept: application/json, text/javascript, */*; q=0.01
  -   ReqHeader      Origin: https://staging.cnx.org
  -   ReqHeader      User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
  -   ReqHeader      Sec-Fetch-Site: same-site
  -   ReqHeader      Sec-Fetch-Mode: cors
  -   ReqHeader      Referer: https://staging.cnx.org/contents/b8d5607f-7db7-47ed-8ea7-8cedb24284f0
  -   ReqHeader      Accept-Encoding: gzip, deflate, br
  -   ReqHeader      Accept-Language: en-US,en;q=0.9
  -   ReqHeader      X-Secure: true
  -   ReqHeader      X-Forwarded-Proto: https
  -   ReqHeader      X-Forwarded-For: 73.32.240.215
  -   ReqHeader      Connection: close
  -   ReqUnset       X-Forwarded-For: 73.32.240.215
  -   ReqHeader      X-Forwarded-For: 73.32.240.215, 128.42.169.22
  -   VCL_call       RECV
  -   VCL_acl        NO_MATCH block
  -   ReqUnset       Accept-Encoding: gzip, deflate, br
  -   ReqHeader      Accept-Encoding: gzip
  -   VCL_return     hash
  -   VCL_call       HASH
  -   VCL_return     lookup
  -   Hit            904782
  -   VCL_call       HIT
  -   VCL_return     miss
  -   VCL_call       MISS
  -   VCL_return     fetch
  -   Link           bereq 974020 fetch
  -   Timestamp      Fetch: 1573741974.821335 0.050655 0.050655
  -   RespProtocol   HTTP/1.1
  -   RespStatus     404
  -   RespReason     Not Found
  -   RespHeader     Access-Control-Allow-Headers: origin,dnt,accept-encoding,accept-language,keep-alive,user-agent,x-requested-with,if-modified-since,cache-control,content-type
  -   RespHeader     Access-Control-Allow-Methods: GET, OPTIONS
  -   RespHeader     Access-Control-Allow-Origin: *
  -   RespHeader     Content-Length: 104
  -   RespHeader     Content-Type: application/json
  -   RespHeader     Date: Thu, 14 Nov 2019 14:32:54 GMT
  -   RespHeader     Etag: "1B2M2Y8AsgTpgAmY7PhCfg"
  -   RespHeader     Server: waitress
  -   RespHeader     X-Varnish-Status: cacheable (override) - no Cache-Control in backend response
  -   RespHeader     Cache-Control: public, max-age=60
  -   RespHeader     X-Varnish-Backend: staging07_archive5
  -   RespHeader     X-Varnish-Ttl: 60.000
  -   RespHeader     X-Varnish: 974019
  -   RespHeader     Age: 0
  -   RespHeader     Via: 1.1 varnish-v4
  -   VCL_call       DELIVER
  -   RespHeader     X-Varnish-Cache: MISS
  -   VCL_return     deliver
  -   Timestamp      Process: 1573741974.821365 0.050685 0.000031
  -   Debug          "RES_MODE 2"
  -   RespHeader     Connection: close
  -   Timestamp      Resp: 1573741974.821443 0.050763 0.000078
  -   ReqAcct        613 0 613 666 104 770
  -   End
  ```
  Look at the `RespHeader` entries: there's one called `X-Varnish-Backend: staging07_archive5`, so it's using `archive` instance 5 on `staging07`. (The sketches after this list also show a way to trim this output down to the interesting tags.)

- `archive` logs on `staging07`: `ssh` into `staging07` and do something like:

  ```
  sudo less /var/log/supervisor/archive5-std\*.log
  ```

  I didn't see anything special in the archive logs.
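Because the 404s are intermittent, it helps to see how many of the requests for that URL failed and whether they all went through the same varnish node. This is only a sketch against the haproxy log line shown above, where the backend is the 11th whitespace-separated field and the status code the 13th; adjust the field numbers if your log format differs.

```
# Count (backend, status) pairs for this URL in the haproxy log.
sudo grep b8d5607f-7db7-47ed-8ea7-8cedb24284f0 /var/log/haproxy.log \
  | awk '{print $11, $13}' | sort | uniq -c
```

If the 404s only show up behind one of the varnish nodes, that narrows things down a lot.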
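The full varnishlog record is quite noisy. If you only care about the URL, the status, the backend, and whether it was a cache hit, something like this can cut it down (the tag and header names are the ones that appear in the record above):

```
# Show only the interesting lines from the varnish log for this URL.
sudo varnishlog -r /var/log/varnish/varnish.log \
  -q 'ReqUrl ~ "b8d5607f-7db7-47ed-8ea7-8cedb24284f0"' \
  | grep -E 'ReqURL|RespStatus|X-Varnish-Backend|X-Varnish-Cache'
```

Note that in the record above the 404 itself is cacheable (`Cache-Control: public, max-age=60`), so a cached 404 (`X-Varnish-Cache: HIT`) could keep users seeing the error for a short while even after the underlying problem is fixed.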
All the archive instances on `staging07` talk to the database on `staging10`.
You can double-check the archive config file by doing:

```
grep postgres /etc/cnx/archive/app.ini
```

on `staging07`, which gives you something like:

```
db-connection-string = postgresql://rhaptos:<password>@staging10.cnx.org:5432/repository
```

so it is indeed using the database on `staging10`.
So now `ssh` into `staging10` and do:

```
PGHOST=localhost psql -U rhaptos repository
```
You can check whether the book exists by doing:

```
SELECT count(*) FROM modules WHERE uuid = 'b8d5607f-7db7-47ed-8ea7-8cedb24284f0';
```

If it's `0`, then the book definitely doesn't exist.
You can do the same thing on `staging09`, and if that returns something else, you know that the databases on `staging09` and `staging10` are not the same (and they need to be the same).
The alternative is to just check the number of rows in the `modules` table on `staging09` and `staging10` and compare the results:

```
SELECT count(*) FROM modules;
```
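Rather than eyeballing two separate psql sessions, a small loop can run both queries on each host and print the counts. This is only a sketch: it assumes you can `ssh` to `staging09` and `staging10` by those names and that `psql` works there exactly as shown above.

```
# Run the same two queries on both database hosts and compare the output.
for host in staging09 staging10; do
  echo "== $host =="
  ssh "$host" "PGHOST=localhost psql -t -A -U rhaptos repository" <<'SQL'
SELECT count(*) FROM modules WHERE uuid = 'b8d5607f-7db7-47ed-8ea7-8cedb24284f0';
SELECT count(*) FROM modules;
SQL
done
```

If the counts differ, the databases are out of sync.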
While I was checking the postgresql logs on `staging10`, I was seeing a `No space left on device` error; it happened, for example, when I was using tab completion for filenames. You can check the available space on the server by doing:

```
df -h
```

and it returns something like:

```
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       275G  261G     0 100% /
```
I showed the output to devops and asked them to fix it.
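If you want a rough idea yourself of what is using the space, something like this can help (only a sketch; it can take a while, and which paths are interesting will vary):

```
# Show the 20 largest directories on the root filesystem (-x stays on one filesystem).
sudo du -xh / 2>/dev/null | sort -rh | head -20
```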
Devops first fixed the space issue on `staging09` and `staging10`; the postgresql databases on `staging09` and `staging10` were still not the same, so devops is currently resyncing the databases.