What to do if users tell us they sometimes get a 404 for content that should be there?
For example, we're seeing this problem on staging on 2019-11-14.
This is the production server cluster diagram taken from the (private) cnx repo:
For non-legacy requests, you can see from the diagram that the request goes:

- from `haproxy` on `prod00`,
- to `varnish` on `prod01` and `prod02`,
- and then to `archive` instances on `prod07` and `prod08`.

The `archive` instances on `prod07` use the secondary database on `prod10`, and the ones on `prod08` use the primary database on `prod09`.
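Before digging into logs on each box, a quick external check can tell you which layer answered a given request. This is only a sketch: it assumes staging mirrors the production layout above, that `archive-staging.cnx.org` (the `Host` header seen in the varnish log further down) is reachable from your machine, and that the `X-Varnish-*` response headers are passed through to clients.

```
# Hit the archive URL once and show only the status line and the X-Varnish-* headers.
curl -s -o /dev/null -D - \
  "https://archive-staging.cnx.org/contents/b8d5607f-7db7-47ed-8ea7-8cedb24284f0.json" \
  | grep -iE '^HTTP/|^x-varnish'
```

If the 404 is intermittent, running this a few times may show it flipping between backends or between a cache HIT and MISS.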
For internal server errors, I would usually just check the archive logs for the traceback, but for intermittent errors you probably want to check all the logs.
- `haproxy` logs on `staging00`: `ssh` into `staging00` and do something like:

  ```
  sudo grep b8d5607f-7db7-47ed-8ea7-8cedb24284f0 /var/log/haproxy.log
  ```

  to search for a particular URL, and it'll show something like:

  ```
  Nov 14 08:32:54 1573741973 127.0.0.1 info.local0 haproxy[24269]: 73.32.240.215:49712 [14/Nov/2019:08:32:54.750] http-proxy~ nodes/staging02.cnx.org 18/0/1/51/70 404 770 - - ---- 7/7/1/0/0 0/0 "GET /contents/b8d5607f-7db7-47ed-8ea7-8cedb24284f0.json HTTP/1.1"
  ```

  It's showing that the request went to varnish on `staging02` and got a `404`. (For intermittent errors, see the sketches just after this list for a way to count responses per backend.)

- `varnish` logs on `staging02`: `ssh` into `staging02` and do something like:

  ```
  sudo varnishlog -r /var/log/varnish/varnish.log -q 'ReqUrl ~ "b8d5607f-7db7-47ed-8ea7-8cedb24284f0"'
  ```

  to search for a particular URL, and you'll get something like:
  ```
  *   << Request  >> 974019
  -   Begin          req 974018 rxreq
  -   Timestamp      Start: 1573741974.770680 0.000000 0.000000
  -   Timestamp      Req: 1573741974.770680 0.000000 0.000000
  -   ReqStart       128.42.169.22 56544
  -   ReqMethod      GET
  -   ReqURL         /contents/b8d5607f-7db7-47ed-8ea7-8cedb24284f0.json
  -   ReqProtocol    HTTP/1.1
  -   ReqHeader      Host: archive-staging.cnx.org
  -   ReqHeader      Accept: application/json, text/javascript, */*; q=0.01
  -   ReqHeader      Origin: https://staging.cnx.org
  -   ReqHeader      User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
  -   ReqHeader      Sec-Fetch-Site: same-site
  -   ReqHeader      Sec-Fetch-Mode: cors
  -   ReqHeader      Referer: https://staging.cnx.org/contents/b8d5607f-7db7-47ed-8ea7-8cedb24284f0
  -   ReqHeader      Accept-Encoding: gzip, deflate, br
  -   ReqHeader      Accept-Language: en-US,en;q=0.9
  -   ReqHeader      X-Secure: true
  -   ReqHeader      X-Forwarded-Proto: https
  -   ReqHeader      X-Forwarded-For: 73.32.240.215
  -   ReqHeader      Connection: close
  -   ReqUnset       X-Forwarded-For: 73.32.240.215
  -   ReqHeader      X-Forwarded-For: 73.32.240.215, 128.42.169.22
  -   VCL_call       RECV
  -   VCL_acl        NO_MATCH block
  -   ReqUnset       Accept-Encoding: gzip, deflate, br
  -   ReqHeader      Accept-Encoding: gzip
  -   VCL_return     hash
  -   VCL_call       HASH
  -   VCL_return     lookup
  -   Hit            904782
  -   VCL_call       HIT
  -   VCL_return     miss
  -   VCL_call       MISS
  -   VCL_return     fetch
  -   Link           bereq 974020 fetch
  -   Timestamp      Fetch: 1573741974.821335 0.050655 0.050655
  -   RespProtocol   HTTP/1.1
  -   RespStatus     404
  -   RespReason     Not Found
  -   RespHeader     Access-Control-Allow-Headers: origin,dnt,accept-encoding,accept-language,keep-alive,user-agent,x-requested-with,if-modified-since,cache-control,content-type
  -   RespHeader     Access-Control-Allow-Methods: GET, OPTIONS
  -   RespHeader     Access-Control-Allow-Origin: *
  -   RespHeader     Content-Length: 104
  -   RespHeader     Content-Type: application/json
  -   RespHeader     Date: Thu, 14 Nov 2019 14:32:54 GMT
  -   RespHeader     Etag: "1B2M2Y8AsgTpgAmY7PhCfg"
  -   RespHeader     Server: waitress
  -   RespHeader     X-Varnish-Status: cacheable (override) - no Cache-Control in backend response
  -   RespHeader     Cache-Control: public, max-age=60
  -   RespHeader     X-Varnish-Backend: staging07_archive5
  -   RespHeader     X-Varnish-Ttl: 60.000
  -   RespHeader     X-Varnish: 974019
  -   RespHeader     Age: 0
  -   RespHeader     Via: 1.1 varnish-v4
  -   VCL_call       DELIVER
  -   RespHeader     X-Varnish-Cache: MISS
  -   VCL_return     deliver
  -   Timestamp      Process: 1573741974.821365 0.050685 0.000031
  -   Debug          "RES_MODE 2"
  -   RespHeader     Connection: close
  -   Timestamp      Resp: 1573741974.821443 0.050763 0.000078
  -   ReqAcct        613 0 613 666 104 770
  -   End
  ```
  Look at the `RespHeader` entries: there's one called `X-Varnish-Backend: staging07_archive5`, so it's using `archive` instance 5 on `staging07`. (The sketches after this list also show a way to trim this output down to the interesting tags.)

- `archive` logs on `staging07`: `ssh` into `staging07` and do something like:

  ```
  sudo less /var/log/supervisor/archive5-std\*.log
  ```

  I didn't see anything special in the archive logs.
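Because the 404s are intermittent, it helps to see how many of the requests for that URL failed and whether they all went through the same varnish node. This is only a sketch against the haproxy log line shown above, where the backend is the 11th whitespace-separated field and the status code the 13th; adjust the field numbers if your log format differs.

```
# Count (backend, status) pairs for this URL in the haproxy log.
sudo grep b8d5607f-7db7-47ed-8ea7-8cedb24284f0 /var/log/haproxy.log \
  | awk '{print $11, $13}' | sort | uniq -c
```

If the 404s only show up behind one of the varnish nodes, that narrows things down a lot.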
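The full varnishlog record is quite noisy. If you only care about the URL, the status, the backend, and whether it was a cache hit, something like this can cut it down (the tag and header names are the ones that appear in the record above):

```
# Show only the interesting lines from the varnish log for this URL.
sudo varnishlog -r /var/log/varnish/varnish.log \
  -q 'ReqUrl ~ "b8d5607f-7db7-47ed-8ea7-8cedb24284f0"' \
  | grep -E 'ReqURL|RespStatus|X-Varnish-Backend|X-Varnish-Cache'
```

Note that in the record above the 404 itself is cacheable (`Cache-Control: public, max-age=60`), so a cached 404 (`X-Varnish-Cache: HIT`) could keep users seeing the error for a short while even after the underlying problem is fixed.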
All the archive instances on `staging07` talk to the database on `staging10`.
You can double-check the archive config file by doing:

```
grep postgres /etc/cnx/archive/app.ini
```

on `staging07`, which gives you something like:

```
db-connection-string = postgresql://rhaptos:<password>@staging10.cnx.org:5432/repository
```

so it is indeed using the database on `staging10`.
So now `ssh` into `staging10` and do:

```
PGHOST=localhost psql -U rhaptos repository
```
You can check whether the book exists by doing:

```
SELECT count(*) FROM modules WHERE uuid = 'b8d5607f-7db7-47ed-8ea7-8cedb24284f0';
```

If it's `0`, then the book definitely doesn't exist.
You can do the same thing on `staging09`, and if that returns something else, you know that the databases on `staging09` and `staging10` are not the same (and they need to be the same).
The alternative is to just check the number of rows in the `modules` table on `staging09` and `staging10` and compare the results:

```
SELECT count(*) FROM modules;
```
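Rather than eyeballing two separate psql sessions, a small loop can run both queries on each host and print the counts. This is only a sketch: it assumes you can `ssh` to `staging09` and `staging10` by those names and that `psql` works there exactly as shown above.

```
# Run the same two queries on both database hosts and compare the output.
for host in staging09 staging10; do
  echo "== $host =="
  ssh "$host" "PGHOST=localhost psql -t -A -U rhaptos repository" <<'SQL'
SELECT count(*) FROM modules WHERE uuid = 'b8d5607f-7db7-47ed-8ea7-8cedb24284f0';
SELECT count(*) FROM modules;
SQL
done
```

If the counts differ, the databases are out of sync.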
While I was checking the postgresql logs on `staging10`, I was seeing a `No space left on device` error; it happened, for example, when I was using tab completion for filenames. You can check the available space on the server by doing:

```
df -h
```

and it returns something like:

```
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       275G  261G     0 100% /
```
I showed the output to devops and asked them to fix it.
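If you want a rough idea yourself of what is using the space, something like this can help (only a sketch; it can take a while, and which paths are interesting will vary):

```
# Show the 20 largest directories on the root filesystem (-x stays on one filesystem).
sudo du -xh / 2>/dev/null | sort -rh | head -20
```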
Devops first fixed the space issue on `staging09` and `staging10`; the postgresql databases on `staging09` and `staging10` were still not the same, so devops is currently resyncing the databases.