# Fairness Indicators: Thinking about Fairness Evaluation

Fairness Indicators is a useful tool for evaluating _binary_ and _multi-class_
classifiers for fairness. Eventually, we hope to expand this tool, in

human societies are extremely complex! Understanding people, and their social
identities, social structures and cultural systems are each huge fields of open
research in their own right. Throw in the complexities of cross-cultural
differences around the globe, and getting even a foothold on understanding
societal impact can be challenging. Whenever possible, it is recommended that you
consult with appropriate domain experts, which may include social scientists,
sociolinguists, and cultural anthropologists, as well as with members of the
populations on which the technology will be deployed.

A single model, for example, the toxicity model that we leverage in the
[example colab](https://www.tensorflow.org/responsible_ai/fairness_indicators/tutorials/Fairness_Indicators_Example_Colab),
can be used in many different contexts. A toxicity model deployed on a website
to filter offensive comments, for example, is a very different use case than the
model being deployed in an example web UI where users can type in a sentence and

The questions above are the foundation of what ethical considerations, including
fairness, you may want to take into account when designing and developing your
ML-based product. These questions also motivate which metrics and which groups
of users you should use the tool to evaluate.

Before diving in further, here are three recommended resources for getting
started:

*   **[The People + AI Guidebook](https://pair.withgoogle.com/) for
    Human-centered AI design:** This guidebook is a great resource for the
    questions and aspects to keep in mind when designing a machine-learning
    based product. While we created this guidebook with designers in mind, many
    of the principles will help answer questions like the one posed above.
*   **[Our Fairness Lessons Learned](https://www.youtube.com/watch?v=6CwzDoE8J4M):**
    This talk at Google I/O discusses lessons we have learned in our goal to
    build and design inclusive products.

The sections below walk through some of the aspects to consider.

## Which groups should I slice by?

In general, a good practice is to slice by as many groups as may be affected by
your product, since you never know when performance might differ for one of the

have different experiences? What does that mean for slices you should evaluate?
Collecting feedback from diverse users may also highlight potential slices to
prioritize.
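
To make this concrete, here is a minimal sketch of how candidate slices can be
declared when running Fairness Indicators through TensorFlow Model Analysis
(TFMA). It is illustrative only: the feature names (`gender`, `age_group`) and
the label key are hypothetical placeholders, and your own configuration will
differ.

```python
# Illustrative sketch only: feature names and label key are placeholders.
import tensorflow_model_analysis as tfma

slicing_specs = [
    tfma.SlicingSpec(),                                      # overall (baseline) results
    tfma.SlicingSpec(feature_keys=["gender"]),               # slice by one group feature
    tfma.SlicingSpec(feature_keys=["age_group"]),            # slice by another
    tfma.SlicingSpec(feature_keys=["gender", "age_group"]),  # intersectional slice
]

eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key="label")],
    slicing_specs=slicing_specs,
    metrics_specs=[
        tfma.MetricsSpec(metrics=[
            tfma.MetricConfig(
                class_name="FairnessIndicators",
                config='{"thresholds": [0.25, 0.5, 0.75]}',
            ),
        ]),
    ],
)
```
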

## Which metrics should I choose?

When selecting which metrics to evaluate for your system, consider who will be
experiencing your model, how it will be experienced, and the effects of that

then consider reporting (for each subgroup) the rate at which that label is
predicted. For example, a “good” label would be a label whose prediction grants
a person access to some resource, or enables them to perform some action.
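
As a concrete illustration of reporting the rate at which a label is predicted
per subgroup, here is a small, library-agnostic sketch. The column names,
threshold, and data are made up for illustration; in Fairness Indicators this
corresponds to the positive rate reported per slice.

```python
# Hypothetical data and threshold, used only to illustrate per-subgroup rates.
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "score": [0.9, 0.2, 0.8, 0.4, 0.1],
})

# Apply an (assumed) decision threshold to get the predicted label.
df["predicted_positive"] = df["score"] >= 0.5

# Rate at which the "good" (positive) label is predicted, per subgroup.
print(df.groupby("group")["predicted_positive"].mean())
```
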

## Critical fairness metrics for classification

When thinking about a classification model, think about the effects of _errors_
(the differences between the actual “ground truth” label, and the label from the

**Metrics available today in Fairness Indicators**

Note: There are many valuable fairness metrics that are not currently supported
in the Fairness Indicators beta. As we continue to add more metrics, we will
continue to add guidance for them here. Below, you can access instructions to
add your own metrics to Fairness Indicators. Additionally, please reach out to
[tfx@tensorflow.org](mailto:tfx@tensorflow.org) if there are metrics that you
would like to see. We hope to partner with you to build this out further.

**Positive Rate / Negative Rate**

These are also important for Facial Analysis Technologies such as face
detection or face attributes.

Note: When both “positive” and “negative” mistakes are equally important, the
metric is called “equality of
<span style="text-decoration:underline;">odds</span>”. This can be measured by
evaluating and aiming for equality across both the TNR & FNR, or both the TPR &
FPR. For example, an app that counts how many cars go past a stop sign is
unlikely to be seriously harmed if it occasionally counts something that is not
a car (a false positive) or accidentally excludes a car (a false negative).

Cases where the fraction of correct negative predictions should be equal
across subgroups

Note: When used together, False Discovery Rate and False Omission Rate relate to
Conditional Use Accuracy Equality, which holds when FDR and FOR are both equal
across subgroups. FDR and FOR are also similar to FPR and FNR, where FDR/FOR
compare FP/FN to predicted positive/negative data points, and FPR/FNR compare
FP/FN to ground truth negative/positive data points. FDR/FOR can be used instead
of FPR/FNR when predictive parity is more critical than equality of opportunity.
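
The distinction in the note above comes down to the denominator. Here is a
small sketch with plain NumPy and made-up labels and predictions; in a real
fairness evaluation you would compute these per slice and compare them across
subgroups.

```python
# Hypothetical labels and thresholded predictions, for illustration only.
import numpy as np

y_true = np.array([1, 0, 1, 0, 1, 0, 0, 1])  # ground truth labels (assumed)
y_pred = np.array([1, 1, 0, 0, 1, 0, 1, 1])  # thresholded predictions (assumed)

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))

fpr = fp / (fp + tn)   # false positives over ground-truth negatives
fnr = fn / (fn + tp)   # false negatives over ground-truth positives
fdr = fp / (fp + tp)   # false positives over predicted positives
for_ = fn / (fn + tn)  # false negatives over predicted negatives

print(fpr, fnr, fdr, for_)
```
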

**Overall Flip Rate / Positive to Negative Prediction Flip Rate / Negative to
Positive Prediction Flip Rate**

*   *<span style="text-decoration:underline;">Definition:</span>* The
    probability that the classifier gives a different prediction if the identity
    attribute in a given example were changed.
*   *<span style="text-decoration:underline;">Relates to:</span>* Counterfactual
    fairness
*   *<span style="text-decoration:underline;">When to use this metric:</span>*
    When determining whether the model’s prediction changes when the sensitive
    attributes referenced in the example are removed or replaced. If it does,
    consider using the Counterfactual Logit Pairing technique within the
    TensorFlow Model Remediation library.

**Flip Count / Positive to Negative Prediction Flip Count / Negative to Positive
Prediction Flip Count**

*   *<span style="text-decoration:underline;">Definition:</span>* The number of
    times the classifier gives a different prediction if the identity term in a
    given example were changed.
*   *<span style="text-decoration:underline;">Relates to:</span>* Counterfactual
    fairness
*   *<span style="text-decoration:underline;">When to use this metric:</span>*
    When determining whether the model’s prediction changes when the sensitive
    attributes referenced in the example are removed or replaced. If it does,
    consider using the Counterfactual Logit Pairing technique within the
    TensorFlow Model Remediation library. (A small sketch of computing flip
    counts and flip rates follows below.)
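
Here is a rough sketch of how flip counts and flip rates can be computed, using
hypothetical predictions: each original example is paired with a counterfactual
copy in which the identity term has been swapped or removed, and the two
predictions are compared.

```python
# Hypothetical predictions, used only to illustrate the computation.
import numpy as np

original_pred = np.array([1, 0, 1, 1, 0, 1])        # predictions on original examples
counterfactual_pred = np.array([1, 0, 0, 1, 1, 1])  # predictions after changing the identity term

flipped = original_pred != counterfactual_pred
flip_count = int(flipped.sum())     # 2
overall_flip_rate = flipped.mean()  # 2 / 6

pos_to_neg = int(((original_pred == 1) & (counterfactual_pred == 0)).sum())  # 1
neg_to_pos = int(((original_pred == 0) & (counterfactual_pred == 1)).sum())  # 1

print(flip_count, overall_flip_rate, pos_to_neg, neg_to_pos)
```
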

**Examples of which metrics to select**

Follow the documentation
[here](https://github.com/tensorflow/model-analysis/blob/master/g3doc/post_export_metrics.md)
to add your own custom metric.
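
As a rough sketch of what a custom metric can look like (not taken from the
linked guide, and worth checking against the API of the TFMA version you use):
one common pattern is to implement a Keras-style metric and reference it from
the metrics spec by module and class name. The names `my_metrics` and
`FlaggedRate`, and the 0.8 threshold, are hypothetical.

```python
# Hypothetical custom metric: the fraction of examples scored above a threshold.
# Module name, class name, and threshold are placeholders for illustration.
import tensorflow as tf
import tensorflow_model_analysis as tfma


class FlaggedRate(tf.keras.metrics.Mean):
    """Fraction of examples whose predicted score exceeds a flagging threshold."""

    def __init__(self, threshold=0.8, name="flagged_rate", **kwargs):
        super().__init__(name=name, **kwargs)
        self.threshold = threshold

    def update_state(self, y_true, y_pred, sample_weight=None):
        flagged = tf.cast(y_pred > self.threshold, tf.float32)
        return super().update_state(flagged, sample_weight=sample_weight)


# Referencing the custom metric from a TFMA metrics spec (assumed to live in
# an importable module named "my_metrics").
metrics_specs = [
    tfma.MetricsSpec(metrics=[
        tfma.MetricConfig(class_name="FlaggedRate", module="my_metrics"),
    ]),
]
```
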

## Final notes

**A gap in a metric between two groups can be a sign that your model may have
unfair skews.** You should interpret your results according to your use case.
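
For example, a simple way to act on this is to compare each slice against a
baseline and flag gaps that exceed a tolerance you have chosen for your use
case. The metric values and the 0.05 tolerance below are placeholders.

```python
# Hypothetical per-slice values for one metric (e.g., false negative rate).
fnr_by_slice = {"overall": 0.08, "group_a": 0.07, "group_b": 0.19}

baseline = fnr_by_slice["overall"]
tolerance = 0.05  # what counts as a meaningful gap depends on your use case

for slice_name, value in fnr_by_slice.items():
    gap = value - baseline
    if abs(gap) > tolerance:
        print(f"{slice_name}: FNR {value:.2f} differs from baseline by {gap:+.2f}; investigate")
```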