SE + HCI + ML Causal Testing: Understanding Defects' Root Causes by Brittany Johnson
- Understanding software practice
- Improving software practice
- Supporting ethical software practices
(in particular) Debugging and Software Ethics
Software Ethics (interdisciplinary)
- Legal Compliance
- Societal / Global Impact (in particular)
What can you think?
Positive Societal Impact
- Children, Effective Altruism
- EHR Software
Political Meddling
- A healthcare algorithm offered less care to Black patients (discriminatory and unfair). This happens a lot in data-driven software.
what does it mean for software to behave unfairly? (it depends)
Example:
AI loan software - approves or denies applicants (based on income, savings, age, race)
- We don't want applicants denied based on race
Emergent behavior? - race can be inferred from other attributes.
NOTE: Protected attributes, like race, should not affect software behavior.
This definitely sounds like a software requirement, like:
- Security requirement - 3 password attempts
- Fairness requirement - protected attributes must not affect software behavior
- Performance requirement - 0.3 ms
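Each of the three requirement types above can be expressed as a test. A minimal sketch, assuming an invented `Login` class and `loan_decision` function (none of these names come from the talk):

```python
import time

class Login:
    """Hypothetical system under test: locks after 3 failed password attempts."""
    MAX_ATTEMPTS = 3

    def __init__(self, password):
        self._password = password
        self._failures = 0
        self.locked = False

    def attempt(self, guess):
        if self.locked:
            return False
        if guess == self._password:
            self._failures = 0
            return True
        self._failures += 1
        if self._failures >= self.MAX_ATTEMPTS:
            self.locked = True
        return False

def loan_decision(income, savings, age, race):
    """Hypothetical fair decision function: ignores the protected attribute."""
    return income + savings > 50_000

def test_security():
    # Security requirement: account locks after 3 failed attempts.
    acct = Login("s3cret")
    for _ in range(3):
        acct.attempt("wrong")
    assert acct.locked

def test_fairness():
    # Fairness requirement: flipping ONLY the protected attribute
    # must not change the outcome.
    assert loan_decision(40_000, 20_000, 30, "green") == \
           loan_decision(40_000, 20_000, 30, "red")

def test_performance():
    # Performance requirement: a single decision within the 0.3 ms budget.
    start = time.perf_counter()
    loan_decision(40_000, 20_000, 30, "green")
    assert time.perf_counter() - start < 0.0003
```

The fairness requirement reads just like the security and performance ones: a property any execution must satisfy.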
These requirements can be violated by bugs:
- Performance defect
- Security defect
- Fairness defect
How can we help software teams build FAIR data-driven software?
- Help detect and remove fairness defects, the same as any other defect.
How do developers already detect defects? (Performance, Security etc)
- Dynamic analysis (testing) - executes the code - gives fault location and execution information (execution trace)
- Static analysis (compilers) - does not execute the code - gives fault location, relative location, and a possible explanation
BUT - debugging is hard, time-consuming, and frustrating.
BECAUSE - tools only tell us what is correlated with the defect, not what caused it.
How can we help understand - WHAT CAUSED THE DEFECT?
Causal inference debugging -
New Technique - Causal Testing
Experiment driven counterfactual analysis. - ICSE 2020
WHAT? - modify and execute tests to understand causality.
Automated Causal Experiments: -
Failing tests - start and end in different countries.
Passing tests - start and end in the same country.
Look at all possible passing and failing tests, run experiments, and compare: what is different, and what is the same? Goal - find minimally different executions.
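The country example above can be made concrete. A sketch with an invented `compute_fare` function (not from the talk) carrying a planted cross-border bug:

```python
def compute_fare(start_country, end_country, km):
    """Hypothetical system under test with a planted bug:
    cross-border trips zero out the fare."""
    fare = 2.0 + 0.5 * km
    if start_country != end_country:
        fare *= 0  # planted bug: bad currency conversion
    return fare

# Failing test: start and end in different countries.
cross_border = compute_fare("US", "CA", 10)   # 0.0 - violates "fare > 0"

# Minimally different passing test: same trip, same country.
domestic = compute_fare("US", "US", 10)       # 7.0 - satisfies "fare > 0"
```

Because the two executions differ only in the country attribute, the comparison points directly at the cross-border branch as the cause.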
This is really good!
Start from a failing test.
Perturb its inputs and execute the new tests.
Find the most similar executions - similar passing tests and similar failing tests.
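The perturb-and-search loop above can be sketched in a few lines. This is an illustration of the idea, not the Holmes tool; `buggy_abs` and the neighborhood perturbation are invented for the example:

```python
def buggy_abs(x):
    """Hypothetical system under test with a planted bug:
    negative inputs are mishandled."""
    return x if x >= 0 else x  # bug: should be -x

def passes(x):
    """Test oracle: the result must equal the true absolute value."""
    return buggy_abs(x) == abs(x)

def find_minimally_different(failing_input, radius=3):
    """Causal-testing-style search (sketch): systematically perturb the
    failing input, execute each variant, and return the passing variant
    closest to the original failing input."""
    variants = [failing_input + d for d in range(-radius, radius + 1) if d != 0]
    passing = [v for v in variants if passes(v)]
    return min(passing, key=lambda v: abs(v - failing_input)) if passing else None

# find_minimally_different(-2) returns 0: the nearest passing input is
# non-negative, pointing at sign handling as the root cause.
```

The minimally different passing execution (input 0) differs from the failing one (input -2) only in sign, which is exactly the causal information a correlation-based tool would miss.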
Can causal testing help developers debug?
Study:
- Can causal testing help developers identify a defect's root cause?
- Can it help improve ability to repair defects?
- Are the developers finding causal testing useful? What was most useful?
Setup:
- Proof-of-concept causal testing implementation (Holmes)
- 6 real-world defects
What did we find?
- Identifying cause with causal testing (YES!)
Asked developers: what caused this test to fail?
- Identified the cause 86% of the time (Holmes)
- Identified the cause 80% of the time (JUnit)
- Repairing defects (YES!)
What changes did they make to fix the defect?
- 96% with Holmes
- 93% without Holmes
- Usefulness: Causal Testing (YES THEY DO!)
46% more useful than JUnit
What was useful?
- Similar passing tests provided by causal testing
- 41% felt Holmes and JUnit are complementary to existing practice
- 13% found it less useful than JUnit (they liked JUnit's functionality; the Holmes research prototype was still in alpha)
How often can causal testing help?
Read the paper - defect test suite on the buggy code
Applicability of Causal Testing
- Causal Testing works, is useful, and fast.
- Causal Testing works, is useful, and slow.
- Causal Testing produces minimally different test but is not useful.
- Causal Testing will not work
- We could not make a determination
So: causal testing works - it helps developers understand and repair defects.
Now, what about protected attributes?
How do we measure fairness? - Lots of ways - which ways, where, and how?
- Group Fairness
- (read up more on this)
- Causal / Individual Fairness
- Protected attribute should not affect software behavior.
- Green Brittany's loan is approved; Red Brittany's is denied - identical applicants differing only in a protected attribute.
- group and causal fairness
- For each test pair, only one of the attributes is flipped.
- How often is the outcome different because of race alone?
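The flipped-pair check can be sketched directly. A Themis-style causal fairness measurement, with an invented `loan_model` carrying a planted discriminatory branch (none of this is from the talk's implementation):

```python
def loan_model(income, savings, age, race):
    """Hypothetical model with a planted causal-fairness defect.
    (age is kept only for signature parity with the talk's example.)"""
    score = income / 10_000 + savings / 5_000
    if race == "red":
        score -= 1  # discriminatory branch: penalizes one group
    return score >= 6

def causal_fairness_rate(applicants):
    """Fraction of test pairs whose outcome flips when ONLY the
    protected attribute is flipped - everything else held fixed."""
    flips = 0
    for income, savings, age in applicants:
        green = loan_model(income, savings, age, "green")
        red = loan_model(income, savings, age, "red")
        flips += (green != red)
    return flips / len(applicants)

applicants = [(40_000, 10_000, 30), (60_000, 10_000, 45), (20_000, 5_000, 22)]
# Only the first applicant sits close enough to the threshold for the
# penalty to flip the decision, so the rate here is 1/3.
```

A nonzero rate is direct evidence of a causal fairness defect: race alone changed the outcome.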
- How can we help them mitigate?
- An ML model is doing this
- Trained on real-world data that encodes the bias?
- Mitigate - the literature has fairness-aware algorithms
- Mitigate - requires expertise in fairness metrics and tuning them
- Mitigate - problem: how does it affect performance? Can we find the best model while reducing the time and expertise needed for tuning?
- Soln: fairkit-learn
- Other Soln: IBM AI Fairness 360
- How do these tools stack up?
- Most accurate, most fair, or balancing both?
- Result: Fairer models, maintain/improve accuracy, and reduced need for expertise. (promising)
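The "balancing both" idea above amounts to a multi-objective model search. A minimal sketch of the fairkit-learn-style approach (an invented simplification, not the library's API): score each candidate model on accuracy and on a demographic parity gap, then keep the Pareto-optimal ones.

```python
def accuracy(model, data):
    """Fraction of (x, label, group) rows the model predicts correctly."""
    return sum(model(x) == y for x, y, _ in data) / len(data)

def parity_gap(model, data):
    """Demographic parity gap: |P(positive | group a) - P(positive | group b)|."""
    by_group = {"a": [], "b": []}
    for x, _, g in data:
        by_group[g].append(model(x))
    rate = lambda preds: sum(preds) / len(preds)
    return abs(rate(by_group["a"]) - rate(by_group["b"]))

def pareto_front(models, data):
    """Keep models not dominated on (accuracy up, parity gap down)."""
    scored = [(m, accuracy(m, data), parity_gap(m, data)) for m in models]
    front = []
    for m, acc, gap in scored:
        dominated = any(a2 >= acc and g2 <= gap and (a2 > acc or g2 < gap)
                        for _, a2, g2 in scored)
        if not dominated:
            front.append((m, acc, gap))
    return front
```

Presenting the front (rather than a single "best" model) lets the developer pick a fairness/accuracy trade-off without hand-tuning every metric.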
What Next?
- Themis helps detect defects.
- Causal Testing helps debug defects already detected.
- Both use causality and tests.
References:
- Themis (UMass)