SE + HCI + ML Causal Testing: Understanding Defects' Root Causes by Brittany Johnson
- Understanding software practice
- Improving software practice
- Supporting ethical software practices
(in particular) Debugging and Software Ethics
Software Ethics (interdisciplinary)
- Legal Compliance
- Societal / Global Impact (in particular)
What can you think?
Positive Societal Impact
- Children, Effective Altruism
- EHR Software
Political Meddling
- A healthcare algorithm offered less care to Black patients (discriminatory and unfair). This happens a lot in data-driven software.
what does it mean for software to behave unfairly? (it depends)
Example:
AI loan software - approves or denies applicants (based on income, savings, age, race)
- We don't want applicants denied based on race
Emergent behavior? - race can be inferred from other attributes.
NOTE: Protected attributes, like race, should not affect software behavior.
This definitely sounds like a software requirement, like:
- Security requirement - 3 password attempts
- Fairness requirement - protected attributes must not affect software behavior
- Performance requirement - 0.3 ms
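Each of the three requirement types above can be expressed as a test. A minimal sketch, assuming an invented `Login` class and `loan_decision` function (none of these names come from the talk):

```python
import time

class Login:
    """Hypothetical system under test: locks after 3 failed password attempts."""
    MAX_ATTEMPTS = 3

    def __init__(self, password):
        self._password = password
        self._failures = 0
        self.locked = False

    def attempt(self, guess):
        if self.locked:
            return False
        if guess == self._password:
            self._failures = 0
            return True
        self._failures += 1
        if self._failures >= self.MAX_ATTEMPTS:
            self.locked = True
        return False

def loan_decision(income, savings, age, race):
    """Hypothetical fair decision function: ignores the protected attribute."""
    return income + savings > 50_000

def test_security():
    # Security requirement: account locks after 3 failed attempts.
    acct = Login("s3cret")
    for _ in range(3):
        acct.attempt("wrong")
    assert acct.locked

def test_fairness():
    # Fairness requirement: flipping ONLY the protected attribute
    # must not change the outcome.
    assert loan_decision(40_000, 20_000, 30, "green") == \
           loan_decision(40_000, 20_000, 30, "red")

def test_performance():
    # Performance requirement: a single decision within the 0.3 ms budget.
    start = time.perf_counter()
    loan_decision(40_000, 20_000, 30, "green")
    assert time.perf_counter() - start < 0.0003
```

The fairness requirement reads just like the security and performance ones: a property any execution must satisfy.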
These requirements can be violated by bugs:
- Performance defect
- Security defect
- Fairness defect
How can we help software teams build FAIR data-driven software?
- Help detect and remove fairness defects, the same as any other defect.
How do developers already detect defects? (Performance, Security etc)
- Dynamic analysis (testing) - executes the code - gives fault location and execution information (execution trace)
- Static analysis (compilers) - does not execute the code - gives fault location, relative location, and a possible explanation
BUT - debugging is hard, time-consuming, and frustrating.
BECAUSE - tools only tell us what is correlated with the defect, not what caused it.
How can we help understand - WHAT CAUSED THE DEFECT?
Causal inference debugging -
New Technique - Causal Testing
Experiment driven counterfactual analysis. - ICSE 2020
WHAT? - modify and execute tests to understand causality.
Automated Causal Experiments: -
Failing tests - start and end in different countries.
Passing tests - start and end in the same country.
Look at all possible passing and failing tests, run experiments, and compare: what is different, and what is the same? Goal - find minimally different executions.
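The country example above can be made concrete. A sketch with an invented `compute_fare` function (not from the talk) carrying a planted cross-border bug:

```python
def compute_fare(start_country, end_country, km):
    """Hypothetical system under test with a planted bug:
    cross-border trips zero out the fare."""
    fare = 2.0 + 0.5 * km
    if start_country != end_country:
        fare *= 0  # planted bug: bad currency conversion
    return fare

# Failing test: start and end in different countries.
cross_border = compute_fare("US", "CA", 10)   # 0.0 - violates "fare > 0"

# Minimally different passing test: same trip, same country.
domestic = compute_fare("US", "US", 10)       # 7.0 - satisfies "fare > 0"
```

Because the two executions differ only in the country attribute, the comparison points directly at the cross-border branch as the cause.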
This is really good!
Start from a failing test.
Perturb its inputs and execute the new tests.
Find the most similar executions - similar passing tests and similar failing tests.
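The perturb-and-search loop above can be sketched in a few lines. This is an illustration of the idea, not the Holmes tool; `buggy_abs` and the neighborhood perturbation are invented for the example:

```python
def buggy_abs(x):
    """Hypothetical system under test with a planted bug:
    negative inputs are mishandled."""
    return x if x >= 0 else x  # bug: should be -x

def passes(x):
    """Test oracle: the result must equal the true absolute value."""
    return buggy_abs(x) == abs(x)

def find_minimally_different(failing_input, radius=3):
    """Causal-testing-style search (sketch): systematically perturb the
    failing input, execute each variant, and return the passing variant
    closest to the original failing input."""
    variants = [failing_input + d for d in range(-radius, radius + 1) if d != 0]
    passing = [v for v in variants if passes(v)]
    return min(passing, key=lambda v: abs(v - failing_input)) if passing else None

# find_minimally_different(-2) returns 0: the nearest passing input is
# non-negative, pointing at sign handling as the root cause.
```

The minimally different passing execution (input 0) differs from the failing one (input -2) only in sign, which is exactly the causal information a correlation-based tool would miss.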
Can causal testing help developers debug?
Study:
- Can causal testing help developers identify a defect's root cause?
- Can it help improve ability to repair defects?
- Are the developers finding causal testing useful? What was most useful?
Setup:
- Proof-of-concept causal testing implementation (Holmes)
- 6 real-world defects
What did we find?
- Identifying cause with causal testing (YES!)
Asked developers: what caused this test to fail?
- Identified the cause 86% of the time (Holmes)
- Identified the cause 80% of the time (JUnit)
- Repairing defects (YES!)
What changes did they make to fix the defect?
- 96% with Holmes
- 93% without Holmes
- Usefulness: Causal Testing (YES THEY DO!)
46% more useful than JUnit
What was useful?
- Similar passing tests provided by causal testing
- 41% felt Holmes and JUnit are complementary to existing practice
- 13% found it less useful than JUnit (they liked JUnit's functionality; the Holmes research prototype was still in alpha)
How often can causal testing help?
Read the paper - defect test suite on the buggy code
Applicability of Causal Testing
- Causal Testing works, is useful, and fast.
- Causal Testing works, is useful, and slow.
- Causal Testing produces minimally different test but is not useful.
- Causal Testing will not work
- We could not make a determination
So: causal testing works - it helps developers understand and repair defects.
Now, what about protected attributes?
How do we measure fairness? - Lots of ways - which ways, where, and how?
- Group Fairness
- (read up more on this)
- Causal / Individual Fairness
- Protected attribute should not affect software behavior.
- Green Brittany's loan is approved; Red Brittany's is denied - identical applicants differing only in a protected attribute.
- group and causal fairness
- For each test pair, only one of the attributes is flipped.
- How often is the outcome different because of race alone?
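The flipped-pair check can be sketched directly. A Themis-style causal fairness measurement, with an invented `loan_model` carrying a planted discriminatory branch (none of this is from the talk's implementation):

```python
def loan_model(income, savings, age, race):
    """Hypothetical model with a planted causal-fairness defect.
    (age is kept only for signature parity with the talk's example.)"""
    score = income / 10_000 + savings / 5_000
    if race == "red":
        score -= 1  # discriminatory branch: penalizes one group
    return score >= 6

def causal_fairness_rate(applicants):
    """Fraction of test pairs whose outcome flips when ONLY the
    protected attribute is flipped - everything else held fixed."""
    flips = 0
    for income, savings, age in applicants:
        green = loan_model(income, savings, age, "green")
        red = loan_model(income, savings, age, "red")
        flips += (green != red)
    return flips / len(applicants)

applicants = [(40_000, 10_000, 30), (60_000, 10_000, 45), (20_000, 5_000, 22)]
# Only the first applicant sits close enough to the threshold for the
# penalty to flip the decision, so the rate here is 1/3.
```

A nonzero rate is direct evidence of a causal fairness defect: race alone changed the outcome.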
- How can we help them mitigate?
- An ML model is doing this
- Trained on real-world data that encodes the bias?
- Mitigate - the literature has fairness-aware algorithms
- Mitigate - requires expertise in fairness metrics and tuning them
- Mitigate - problem: how does it affect performance? Can we find the best model while reducing the time and expertise needed for tuning?
- Soln: fairkit-learn
- Other Soln: IBM AI Fairness 360
- How do these tools stack up?
- Most accurate, most fair, or balancing both?
- Result: Fairer models, maintain/improve accuracy, and reduced need for expertise. (promising)
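The "balancing both" idea above amounts to a multi-objective model search. A minimal sketch of the fairkit-learn-style approach (an invented simplification, not the library's API): score each candidate model on accuracy and on a demographic parity gap, then keep the Pareto-optimal ones.

```python
def accuracy(model, data):
    """Fraction of (x, label, group) rows the model predicts correctly."""
    return sum(model(x) == y for x, y, _ in data) / len(data)

def parity_gap(model, data):
    """Demographic parity gap: |P(positive | group a) - P(positive | group b)|."""
    by_group = {"a": [], "b": []}
    for x, _, g in data:
        by_group[g].append(model(x))
    rate = lambda preds: sum(preds) / len(preds)
    return abs(rate(by_group["a"]) - rate(by_group["b"]))

def pareto_front(models, data):
    """Keep models not dominated on (accuracy up, parity gap down)."""
    scored = [(m, accuracy(m, data), parity_gap(m, data)) for m in models]
    front = []
    for m, acc, gap in scored:
        dominated = any(a2 >= acc and g2 <= gap and (a2 > acc or g2 < gap)
                        for _, a2, g2 in scored)
        if not dominated:
            front.append((m, acc, gap))
    return front
```

Presenting the front (rather than a single "best" model) lets the developer pick a fairness/accuracy trade-off without hand-tuning every metric.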
What Next?
- Themis helps detect defects.
- Causal Testing helps debug defects already detected.
- Both use causality and tests.
References:
- Themis (UMass)