Metrics in commercial product

Objective

The incoming log stream from Hasura instances can be analyzed to derive metrics about the instance and its performance.

These metrics can support features such as real-time requests per second, success/error rates, query whitelisting, regression tests, and field-level analytics.

These features can be broadly classified into four buckets:

For the first release, we’ll focus on some basic features that could span these categories.

Legend:

✅ Is this instance alive?
- We will periodically poll the instance’s health endpoint and also see if we’re receiving logs from the instance.
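
A minimal sketch of how the two liveness signals could be combined. The names `is_instance_alive`, `probe_health` and `max_log_silence` are hypothetical, the five-minute threshold is an arbitrary placeholder, and requiring both signals for "alive" is an assumption rather than a decided rule. The health poll is passed in as a callable so the sketch does not depend on any particular HTTP client.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


def is_instance_alive(
    probe_health: Callable[[], bool],
    last_log_at: Optional[datetime],
    now: datetime,
    max_log_silence: timedelta = timedelta(minutes=5),  # assumed threshold
) -> dict:
    """Combine the health poll result with the freshness of the log stream."""
    try:
        healthy = bool(probe_health())
    except Exception:
        # A failed or timed-out poll is treated as unhealthy.
        healthy = False
    receiving_logs = (
        last_log_at is not None and (now - last_log_at) <= max_log_silence
    )
    return {
        "healthy": healthy,
        "receiving_logs": receiving_logs,
        # Assumption: the instance counts as alive only if both signals agree.
        "alive": healthy and receiving_logs,
    }
```

Returning both signals separately, rather than only the combined flag, lets the UI distinguish "instance down" from "instance up but logs not arriving".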

Request ID
- [query-log].detail.request_id
- [http-log].detail.operation.request_id
Status Code: [http-log].detail.http_info.status
URL: [http-log].detail.http_info.url
IP address: [http-log].detail.http_info.ip
Method: [http-log].detail.http_info.method
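
The field paths above can be read out of a parsed http-log entry roughly as follows. This is a sketch that assumes each log line has already been decoded from JSON into a dict. The helper names `dig` and `request_metadata` are hypothetical.

```python
from typing import Any, Optional


def dig(entry: Any, *path: str) -> Optional[Any]:
    """Walk a nested dict along `path`; return None if any key is missing."""
    node = entry
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def request_metadata(http_log: dict) -> dict:
    """Collect the per-request fields listed above from one http-log entry."""
    return {
        "request_id": dig(http_log, "detail", "operation", "request_id"),
        "status": dig(http_log, "detail", "http_info", "status"),
        "url": dig(http_log, "detail", "http_info", "url"),
        "ip": dig(http_log, "detail", "http_info", "ip"),
        "method": dig(http_log, "detail", "http_info", "method"),
    }
```

Missing fields come back as `None` instead of raising, so a partially populated entry does not abort the rest of the pipeline.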

✅ [http-log].detail.http_info.url == "v1/graphql" or "v1alpha1/graphql"
❓ [query-log] – only for GraphQL?
What was queried?
- ✅ Operation name: [query-log].detail.query.operationName
- ✅ Raw query: [query-log].detail.query.query
- ✅ Query variables: [query-log].detail.query.variables
- ✅ (on error): [http-log].detail.operation.query
- ✅ Generated SQL: [query-log].detail.generated_sql
- 🕒 Parsed tables/columns/fields
- ❓ Is it from Hasura or a remote schema?
Who made the query?
- ❓ Client ID (something a client would generate, like a mobile app vs website)
- ✅ (X-Hasura-Role [, X-Hasura-x]) tuple that uniquely identifies the user
  - [http-log].detail.operation.user_vars
Was it successful?
- ✅ Boolean status: failed if [http-log].level == "error"
- ✅ Error indicators: [http-log].detail.operation.error is not null
What was the error?
- ✅ Error body: [http-log].detail.operation.error
- ✅ Query: [http-log].detail.operation.query
- ✅ Error classification and categorisation
  - [http-log].detail.operation.error.code
🕒 What was the response?
- GraphQL JSON response body
- Did any field return NULL?
How much time did it take?
- ✅ Query execution time [http-log].detail.operation.query_execution_time
- ❓ Total time taken to respond
How do I make it faster?
- Analyze the query from console
- Show the generated SQL from logs?
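
As a rough sketch, the GraphQL fields above could be merged into one record per request by joining query-log and http-log entries on the request ID. It assumes each parsed entry carries a top-level `type` field with the value `"query-log"` or `"http-log"`. That field, and the names `graphql_records` and `_get`, are assumptions for illustration. Errors are read from `detail.operation.error`.

```python
from typing import Any, Dict, Iterable, Optional


def _get(node: Any, *path: str) -> Optional[Any]:
    """Walk a nested dict along `path`; return None if any key is missing."""
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def graphql_records(entries: Iterable[dict]) -> Dict[str, dict]:
    """Merge query-log and http-log entries that share a request ID."""
    records: Dict[str, dict] = {}
    for e in entries:
        kind = e.get("type")  # assumption: entries are tagged with their log type
        if kind == "query-log":
            rid = _get(e, "detail", "request_id")
            if rid is None:
                continue
            rec = records.setdefault(rid, {})
            rec["operation_name"] = _get(e, "detail", "query", "operationName")
            rec["query"] = _get(e, "detail", "query", "query")
            rec["variables"] = _get(e, "detail", "query", "variables")
            rec["generated_sql"] = _get(e, "detail", "generated_sql")
        elif kind == "http-log":
            rid = _get(e, "detail", "operation", "request_id")
            if rid is None:
                continue
            rec = records.setdefault(rid, {})
            error = _get(e, "detail", "operation", "error")
            # A request is treated as failed if either indicator fires.
            rec["success"] = e.get("level") != "error" and error is None
            rec["error_code"] = _get(error, "code")
            rec["user_vars"] = _get(e, "detail", "operation", "user_vars")
            rec["query_execution_time"] = _get(
                e, "detail", "operation", "query_execution_time"
            )
    return records
```

Keying by request ID means the two entries can arrive in either order. A record that only ever receives one half stays partially filled.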
Non-GraphQL errors

[http-log].detail.http_info.url == "v1/query"
What was the action?
- ✅ on error [http-log].detail.operation.query
- ❓ on success
Who executed it?
- ❓ User information parsed from Collaborator-Token
- ❓ [http-log].detail.operation.user_vars
Was it successful?
- ✅ Boolean status: [http-log].level
What was the error?
- ✅ Error body: [http-log].detail.operation.error
- ✅ Error classification: [http-log].detail.operation.error.code
How much time did it take?
- ✅ Execution time: [http-log].detail.operation.query_execution_time
- ❓ Request latency
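
The URL checks used to separate GraphQL from non-GraphQL requests could look like the sketch below. The function name `request_kind` and the labels it returns are hypothetical. Stripping slashes and query strings before comparing is an assumption about how the URL may appear in real logs.

```python
GRAPHQL_URLS = {"v1/graphql", "v1alpha1/graphql"}
NON_GRAPHQL_URLS = {"v1/query"}


def request_kind(http_log: dict) -> str:
    """Classify an http-log entry by its detail.http_info.url value."""
    detail = http_log.get("detail") or {}
    http_info = detail.get("http_info") or {}
    url = http_info.get("url") or ""
    # Assumption: tolerate a leading slash and a query string in the URL.
    path = url.split("?", 1)[0].strip("/")
    if path in GRAPHQL_URLS:
        return "graphql"
    if path in NON_GRAPHQL_URLS:
        return "non_graphql"
    return "other"
```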

The items below might not be available in the first release.

Number of API calls over time
- Error/Success rates
- Preferably real-time rps
- Usage per client
List of GraphQL queries over time
- With user metadata
- Group by user metadata
Metadata API access audit log
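
For the aggregate views, a sketch of bucketing requests over time to get requests per second, error rate and per-client usage. The input shape is hypothetical: each record is assumed to be a dict with `ts` (epoch seconds), `success` (bool) and an optional `role`. The role stands in for the client here only because a client ID is not yet available.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


def summarize(records: Iterable[dict], bucket_seconds: int = 60) -> Dict[int, dict]:
    """Group records into fixed-width time buckets and compute basic rates."""
    buckets = defaultdict(lambda: {"total": 0, "errors": 0, "by_role": Counter()})
    for r in records:
        # Start of the bucket this request falls into, in epoch seconds.
        start = int(r["ts"] // bucket_seconds) * bucket_seconds
        b = buckets[start]
        b["total"] += 1
        if not r["success"]:
            b["errors"] += 1
        b["by_role"][r.get("role", "unknown")] += 1

    summary: Dict[int, dict] = {}
    for start in sorted(buckets):
        b = buckets[start]
        summary[start] = {
            "rps": b["total"] / bucket_seconds,       # average over the bucket
            "error_rate": b["errors"] / b["total"],   # total is always >= 1 here
            "by_role": dict(b["by_role"]),
        }
    return summary
```

Fixed-width buckets give an average rate per window. A real-time view would more likely use a short sliding window over the live stream.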