This document is meant to be a soup-to-nuts explanation of the Panoptes->Caesar->MAST pipeline for the TESS project.
When a classification is made in the new PFE Light Curve Viewer, the classification is streamed to Caesar, where processing begins.
Two primary paths exist in parallel, depending on the type of subject being classified:
- classification on a training subject
- classification on a real subject
A user annotates a real subject and that classification is streamed to Caesar.
```
key: points_extractor
type: ExternalExtractor
```

- The classification is sent to the aggregation-for-caesar TESS column extractor. This is currently an open PR (zooniverse/aggregation-for-caesar#103) but will be available once it is merged and deployed.
```
key: subject_probability
type: ExternalReducer
mode: running_reduction
reducer_keys: ["points_extractor"]
user_reducer_keys: ["user_skill"]
```
- Inputs:
  - all the subject's `points_extractor` extracts
  - for each contributing user, their skill level
- This uses relevant reductions, which allow reductions from a different source reducer to be passed to an external reducer. For this external reducer, the relevant reductions are the users' previously calculated skill values.
- The `user_reducer_keys` value matches the reducer to the reductions to be included. This value should match the required reduction's `reducer_key`.
- Based on what the users have annotated for this subject (the extracts) and each classifying user's skill level, the candidate probability can be calculated.
- If the sector that this subject represents has been viewed and uploaded previously as a different subject, the previous subject's id should be included in the new subject's metadata when it is uploaded, as `previous_subject_ids: []`. The previous subject's extracts will be included in the new subject's reduction.
- The reducer would return JSON consisting of the calculated probability and potentially other data, something like:

```json
{ "probability": 0.8, "highest_skill": 0.7, "classifications_count": 10 }
```
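As a rough illustration of what this external reducer could do, a skill-weighted vote is sketched below. The actual algorithm is TBD (pending the Oxford reducer code); the function name, extract shape, and weighting are all hypothetical.

```python
# Hypothetical sketch of the external subject_probability reducer.
# The real algorithm is TBD (pending the Oxford reducer code);
# all names and the extract shape here are illustrative only.

def reduce_subject(extracts, user_skills):
    """extracts: list of dicts like {"user_id": 1, "transit": True};
    user_skills: dict mapping user_id -> skill in [0, 1]."""
    total_weight = 0.0
    positive_weight = 0.0
    for extract in extracts:
        # unseen users get a neutral default skill
        skill = user_skills.get(extract["user_id"], 0.5)
        total_weight += skill
        if extract["transit"]:
            positive_weight += skill
    probability = positive_weight / total_weight if total_weight else 0.0
    return {
        "probability": round(probability, 2),
        "highest_skill": max(user_skills.values(), default=0.0),
        "classifications_count": len(extracts),
    }
```

The returned dict mirrors the example reduction above: skilled users' votes count for more, and the highest contributing skill travels with the reduction.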
- A SubjectRule is created to evaluate the reduction and determine whether to send it along.
- Potential options to include would be threshold values for `classifications_count` and/or `probability` (or whatever it will be called in the data JSON) that the reduction must pass.
- If the rule passes, a job is enqueued to send the reduction to MAST.
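A sketch of what the rule's condition might look like, using thresholds on both fields (the exact lookup keys and threshold values are placeholders and would need to be confirmed against Caesar's rule grammar):

```json
["and",
  ["gte", ["lookup", "subject_probability.classifications_count", 0], ["const", 10]],
  ["gte", ["lookup", "subject_probability.probability", 0], ["const", 0.8]]
]
```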
- The reduction created by the external subject probability reducer is sent to MAST as an ExternalEffect.
- This request will be a POST to a MAST API endpoint, with a body formatted like so:
```json
{
  "id": 4094,
  "reducible": {
    "id": 21853,
    "type": "Workflow"
  },
  "data": {
    [TBD, pending Oxford external reducer code]
  },
  "user_ids": [1, 2, 3],
  "subject": {
    "id": 13626,
    "metadata": {
      "TIC_ID": "123456789",
      [etc]
    },
    "created_at": "2019-01-30T20:18:00.239Z",
    "updated_at": "2019-01-30T20:18:00.239Z"
  },
  "created_at": "2019-01-30T20:18:00.248Z",
  "updated_at": "2019-01-30T20:18:00.248Z"
}
```
`user_ids` is a list of the ids of the Zooniverse users whose extracts contributed to this reduction.
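To make the shape of the request concrete, here is a sketch of how such a POST could be assembled. The endpoint URL is a placeholder (the real MAST API endpoint is not specified here), and the `data` payload is stubbed pending the reducer code.

```python
# Sketch of assembling the ExternalEffect's POST to MAST.
# The endpoint URL is a placeholder; the real MAST API endpoint is TBD.
import json
import urllib.request

def build_mast_request(reduction, endpoint="https://mast.example.edu/tess/reductions"):
    body = json.dumps(reduction).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

reduction = {
    "id": 4094,
    "reducible": {"id": 21853, "type": "Workflow"},
    "data": {"probability": 0.8},  # final shape TBD, pending reducer code
    "user_ids": [1, 2, 3],
    "subject": {"id": 13626, "metadata": {"TIC_ID": "123456789"}},
}
request = build_mast_request(reduction)
# urllib.request.urlopen(request) would actually send it
```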
Training subjects will allow the user's individual skill to be measured. This skill level is then applied during the reduction of real subjects and used to help determine a candidate's probability.
Details of Nora's user weighting system are required before this section can be confirmed.
The following system should be considered a proposal for a Caesar pipeline based on what we know about other projects (SWAP, Gravity Spy). This is currently being discussed in #tess on Slack and on GitHub.
Possible requirements for the user weighting algorithm:
- the "correctness" of annotations, as determined by PFE's Feedback system
- the subject's "training accuracy", the percentage of users who correctly classified that subject
- the user's "skill", which is calculated externally to Caesar using the above two values
```
key: training_correctness
type: PluckFieldExtractor
plucks: classification.metadata.feedback[taskKey][success]
```
This extractor could pluck one of several things, depending on requirements:
- The whole-subject success boolean from the feedback. This is pending an FE update.
- The `feedback[taskKey]` object, which includes the correctness of each transit and other subject metadata, but no annotation info.
- The whole annotation directly, but then Caesar doesn't know, and cannot include, whether it was correct or incorrect.
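Purely for illustration, the feedback metadata being plucked might be shaped roughly like the fragment below. The actual shape depends on the PFE Feedback implementation and the pending FE update, so the field names here are assumptions:

```json
{
  "metadata": {
    "feedback": {
      "T0": {
        "success": false,
        "transits": [
          { "x": 512.3, "success": true },
          { "x": 1024.7, "success": false }
        ]
      }
    }
  }
}
```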
```
key: training_accuracy
type: StatsReducer
```

- Inputs: the subject's `training_correctness` extracts
- Calculates the percentage of people who got this training subject right.
- Built-in Caesar reducer that would do a simple `correct / total` calculation.
- This is only possible if the "correctness" is plucked from the classification metadata (and not the annotation).
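The `correct / total` calculation above amounts to the following, assuming each extract carries a plucked `success` boolean:

```python
# Sketch of the correct / total calculation the built-in StatsReducer would
# perform for training_accuracy. The extract shape {"success": bool} is assumed.

def training_accuracy(extracts):
    """Fraction of classifiers who got this training subject right."""
    if not extracts:
        return 0.0
    correct = sum(1 for extract in extracts if extract["success"])
    return correct / len(extracts)
```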
```
key: user_skill
type: ExternalReducer
subject_reducer_keys: ["training_accuracy"]
```

- Inputs:
  - Extracts: the specific user's "training correctness" extracts.
    - As mentioned, this could be the whole annotation instead of just the correctness, but then the `training_accuracy` reduction could not be calculated by Caesar and included.
  - Reductions: these are also relevant reductions: for each subject that the user has classified, the percentage of people who got that subject correct (only if the correctness is extracted).
- Return value should be JSON that includes the user's updated skill value, stored in the `data` field of the reduction:

```json
{ "skill": 0.4 }
```
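To show how the two inputs could combine, here is a hypothetical sketch of the external `user_skill` reducer. The real weighting scheme is pending Nora's design; the names, default values, and weights are illustrative only.

```python
# Hypothetical sketch of the external user_skill reducer. The real weighting
# scheme is pending Nora's design; names and weights here are illustrative.

def reduce_user_skill(correctness_extracts, training_accuracies):
    """correctness_extracts: list of {"subject_id": ..., "success": bool};
    training_accuracies: dict of subject_id -> fraction of users correct."""
    if not correctness_extracts:
        return {"skill": 0.5}  # assumed neutral prior for a new user
    score = 0.0
    for extract in correctness_extracts:
        # harder subjects (low average accuracy) count for more when answered correctly
        difficulty = 1.0 - training_accuracies.get(extract["subject_id"], 0.5)
        if extract["success"]:
            score += 0.5 + 0.5 * difficulty
    return {"skill": round(score / len(correctness_extracts), 2)}
```

The returned `{ "skill": ... }` JSON would then be picked up, via `user_reducer_keys`, as a relevant reduction when real subjects are reduced.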