This document is meant to be a soup-to-nuts explanation of the Panoptes->Caesar->MAST pipeline for the TESS project.
When a classification is made in the new PFE Light Curve Viewer, the classification is streamed to Caesar, where processing begins.
Two primary paths exist in parallel, depending on the type of subject being classified:
- classification on a training subject
- classification on a real subject
A user annotates a real subject and that classification is streamed to Caesar.
```
key: points_extractor
type: ExternalExtractor
```

- The classification is sent to the aggregation-for-caesar TESS column extractor. This is currently an open PR (zooniverse/aggregation-for-caesar#103) but will be available once it is merged and deployed.
```
key: subject_probability
type: ExternalReducer
mode: running_reduction
reducer_keys: ["points_extractor"]
user_reducer_keys: ["user_skill"]
```
- Inputs:
  - all the subject's `points_extractor` extracts
  - for each contributing user, their skill level
- This uses relevant reductions, which allow reductions from a different source reducer to be passed to an external reducer. For this external reducer, the relevant reductions are the users' previously calculated skill values.
- The `user_reducer_keys` value matches the reducer to the reductions to be included. This value should match the required reduction's `reducer_key`.
- Based on what the users have annotated for this subject (the extracts) and each classifying user's skill level, the candidate probability can be calculated.
- If the sector that this subject represents has been viewed and uploaded previously as a different subject, the previous subject's id should be included in the new subject's metadata when it is uploaded, as `previous_subject_ids: []`. The previous subject's extracts will be included in the new subject's reduction.
- The reducer would return JSON consisting of the calculated probability and potentially other data, something like:

```json
{ "probability": 0.8, "highest_skill": 0.7, "classifications_count": 10 }
```
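As a rough illustration of what this external reducer could do, a skill-weighted vote is sketched below. The actual algorithm is TBD (pending the Oxford reducer code); the function name, extract shape, and weighting are all hypothetical.

```python
# Hypothetical sketch of the external subject_probability reducer.
# The real algorithm is TBD (pending the Oxford reducer code);
# all names and the extract shape here are illustrative only.

def reduce_subject(extracts, user_skills):
    """extracts: list of dicts like {"user_id": 1, "transit": True};
    user_skills: dict mapping user_id -> skill in [0, 1]."""
    total_weight = 0.0
    positive_weight = 0.0
    for extract in extracts:
        # unseen users get a neutral default skill
        skill = user_skills.get(extract["user_id"], 0.5)
        total_weight += skill
        if extract["transit"]:
            positive_weight += skill
    probability = positive_weight / total_weight if total_weight else 0.0
    return {
        "probability": round(probability, 2),
        "highest_skill": max(user_skills.values(), default=0.0),
        "classifications_count": len(extracts),
    }
```

The returned dict mirrors the example reduction above: skilled users' votes count for more, and the highest contributing skill travels with the reduction.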
- A SubjectRule is created to evaluate the reduction and determine whether to send it along.
- Potential options to include would be threshold values for `classifications_count` and/or `probability` (or whatever it will be called in the data JSON) that the reduction must pass.
- If the rule passes, a job is enqueued to send the reduction to MAST.
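A sketch of what the rule's condition might look like, using thresholds on both fields (the exact lookup keys and threshold values are placeholders and would need to be confirmed against Caesar's rule grammar):

```json
["and",
  ["gte", ["lookup", "subject_probability.classifications_count", 0], ["const", 10]],
  ["gte", ["lookup", "subject_probability.probability", 0], ["const", 0.8]]
]
```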
- The reduction created by the external subject probability reducer is sent to MAST as an ExternalEffect.
- This request will be a POST to a MAST API endpoint, with a body formatted like so:
```json
{
  "id": 4094,
  "reducible": {
    "id": 21853,
    "type": "Workflow"
  },
  "data": {
    [TBD, pending Oxford external reducer code]
  },
  "user_ids": [1, 2, 3],
  "subject": {
    "id": 13626,
    "metadata": {
      "TIC_ID": "123456789",
      [etc]
    },
    "created_at": "2019-01-30T20:18:00.239Z",
    "updated_at": "2019-01-30T20:18:00.239Z"
  },
  "created_at": "2019-01-30T20:18:00.248Z",
  "updated_at": "2019-01-30T20:18:00.248Z"
}
```
`user_ids` is a list of the ids of the Zooniverse users whose extracts contributed to this reduction.
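To make the shape of the request concrete, here is a sketch of how such a POST could be assembled. The endpoint URL is a placeholder (the real MAST API endpoint is not specified here), and the `data` payload is stubbed pending the reducer code.

```python
# Sketch of assembling the ExternalEffect's POST to MAST.
# The endpoint URL is a placeholder; the real MAST API endpoint is TBD.
import json
import urllib.request

def build_mast_request(reduction, endpoint="https://mast.example.edu/tess/reductions"):
    body = json.dumps(reduction).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

reduction = {
    "id": 4094,
    "reducible": {"id": 21853, "type": "Workflow"},
    "data": {"probability": 0.8},  # final shape TBD, pending reducer code
    "user_ids": [1, 2, 3],
    "subject": {"id": 13626, "metadata": {"TIC_ID": "123456789"}},
}
request = build_mast_request(reduction)
# urllib.request.urlopen(request) would actually send it
```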
Training subjects will allow the user's individual skill to be measured. This skill level is then applied during the reduction of real subjects and used to help determine a candidate's probability.
Details of Nora's user weighting system are required before this section can be confirmed.
The following system should be considered a proposal for a Caesar pipeline based on what we know about other projects (SWAP, Gravity Spy). This is currently being discussed in #tess on Slack and on GitHub.
Possible requirements for the user weighting algorithm:
- the "correctness" of annotations, as determined by PFE's Feedback system
- the subject's "training accuracy", the percentage of users who correctly classified that subject
- the user's "skill", which is calculated externally to Caesar using the above two values
```
key: training_correctness
type: PluckFieldExtractor
plucks: classification.metadata.feedback[taskKey][success]
```
This extractor could pluck one of several things, depending on requirements:
- The whole-subject success boolean from the feedback. This is pending an FE update.
- The `feedback[taskKey]` object, which includes the correctness of each transit and other subject metadata, but no annotation info.
- The whole annotation directly, but then Caesar doesn't know, and cannot include, whether it was correct or incorrect.
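Purely for illustration, the feedback metadata being plucked might be shaped roughly like the fragment below. The actual shape depends on the PFE Feedback implementation and the pending FE update, so the field names here are assumptions:

```json
{
  "metadata": {
    "feedback": {
      "T0": {
        "success": false,
        "transits": [
          { "x": 512.3, "success": true },
          { "x": 1024.7, "success": false }
        ]
      }
    }
  }
}
```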
```
key: training_accuracy
type: StatsReducer
```

- Inputs: the subject's `training_correctness` extracts
- Calculates the percentage of people who got this training subject right.
- Built-in Caesar reducer that would do a simple `correct / total` calculation.
- This is only possible if the "correctness" is plucked from the classification metadata (and not the annotation).
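The `correct / total` calculation above amounts to the following, assuming each extract carries a plucked `success` boolean:

```python
# Sketch of the correct / total calculation the built-in StatsReducer would
# perform for training_accuracy. The extract shape {"success": bool} is assumed.

def training_accuracy(extracts):
    """Fraction of classifiers who got this training subject right."""
    if not extracts:
        return 0.0
    correct = sum(1 for extract in extracts if extract["success"])
    return correct / len(extracts)
```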
```
key: user_skill
type: ExternalReducer
subject_reducer_keys: ["training_accuracy"]
```

- Inputs:
  - Extracts: the specific user's "training correctness" extracts.
    - As mentioned, this could be the whole annotation instead of just the correctness, but then the `training_accuracy` reduction could not be calculated by Caesar and included.
  - Reductions: these are also relevant reductions: for each subject that the user has classified, the percentage of people who got that subject correct (only if the correctness is extracted).
- Return value should be JSON that includes the user's updated skill value, stored in the `data` field of the reduction:

```json
{ "skill": 0.4 }
```
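To show how the two inputs could combine, here is a hypothetical sketch of the external `user_skill` reducer. The real weighting scheme is pending Nora's design; the names, default values, and weights are illustrative only.

```python
# Hypothetical sketch of the external user_skill reducer. The real weighting
# scheme is pending Nora's design; names and weights here are illustrative.

def reduce_user_skill(correctness_extracts, training_accuracies):
    """correctness_extracts: list of {"subject_id": ..., "success": bool};
    training_accuracies: dict of subject_id -> fraction of users correct."""
    if not correctness_extracts:
        return {"skill": 0.5}  # assumed neutral prior for a new user
    score = 0.0
    for extract in correctness_extracts:
        # harder subjects (low average accuracy) count for more when answered correctly
        difficulty = 1.0 - training_accuracies.get(extract["subject_id"], 0.5)
        if extract["success"]:
            score += 0.5 + 0.5 * difficulty
    return {"skill": round(score / len(correctness_extracts), 2)}
```

The returned `{ "skill": ... }` JSON would then be picked up, via `user_reducer_keys`, as a relevant reduction when real subjects are reduced.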