For a longer version of the paper summary, please refer to the longer objective-summary part.
This paper defines a C-Score (Consistency Score) for each training example $(x, y)$: the expected accuracy of $\hat{y} = y$, where $\hat{y} = \mathcal{A}(x)$ is the prediction of a model trained on the rest of the training set. The score reflects the relative regularity of the example, so that humans can intuitively understand the underlying structure of the whole dataset, e.g. detect adversarial or noisy instances, or distinguish highly regular instances suited to abstraction (the model forgets their details) from exceptional instances that demand memorization (the model remembers all their details).
Since estimating the C-Score requires training a large model from scratch thousands of times, they propose three proxies that need only a single training run. The proxies are based either on a label-aware kernel distance (Eq. 3 in the paper) between i) the input or ii) the hidden representation of $(x, y)$ and those of the remaining training instances $(x_i, y_i)$, or on iii) the learning speed of $(x, y)$. Experiments show that iii) correlates well ($\rho = 0.864$) with the C-Score. Along the way, carefully designed and logically structured experiments yield further insights.
I assume the reader will read this part alongside the paper and compare the notes they would take with mine. Here, I take notes on the technical part of the paper.
Definition (Consistency score or C-score).
The expected accuracy, on a held-out instance, of a particular model architecture trained with a fixed-size training set.
The formal form of the definition is:
\[C_{\mathcal{A}, \mathcal{P}, n}(x, y) = \mathbb{E}_{D \sim_n \mathcal{P}} [\Pr(\hat{y}_{\mathcal{A}} = y \mid x, D)]\]
Two derived questions:
(Estimation problem) How to estimate the score accurately and efficiently?
(Usage problem) How to utilize the score?
\(\Rightarrow\) Debugging the dataset
“we show that the score identifies out-of-distribution and mislabeled examples at one end of the continuum and regular examples at the other end.”
Relationship to Feldman’s memorization score:
“defined relative to a dataset that includes \((x,y)\) and measures the change in the prediction accuracy on \(x\) when \((x, y)\) is removed from the dataset”.
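For reference, Feldman's memorization score can be written roughly as (my paraphrase of Feldman, 2019; notation adapted to this paper):

\[\text{mem}(\mathcal{A}, D, i) = \Pr_{h \sim \mathcal{A}(D)}[h(x_i) = y_i] - \Pr_{h \sim \mathcal{A}(D \setminus \{(x_i, y_i)\})}[h(x_i) = y_i]\]

The second term is essentially the C-score of \((x_i, y_i)\) (up to the leave-one-out vs. subsampling detail), so high memorization corresponds to the low end of the consistency continuum.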
Empirical Estimation of C-Score
In practice, we usually have a fixed dataset \(\mathcal{D}\) consisting of \(N\) i.i.d. samples from the underlying distribution;
Averaging over i.i.d. subsamples of size \(n\)
Empirical C-score
\[\hat{C}_{\mathcal{A}, \mathcal{D}, n}(x, y) = \hat{\mathbb{E}}^r_{D \sim_n \mathcal{D}\setminus\{(x,y)\}} [\Pr(\hat{y}_{\mathcal{A}} = y \mid x, D)]\]
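To make the estimator concrete, here is a minimal NumPy sketch of the subsample-and-hold-out procedure (my own illustration, not the paper's code; `train_fn` is a hypothetical training routine returning a model with a `predict` method):

```python
import numpy as np

def empirical_c_score(train_fn, X, Y, n, n_runs=2000, seed=0):
    """Monte Carlo estimate of the empirical C-score for every example.

    train_fn(X_sub, Y_sub) -> fitted model with .predict(X)  (hypothetical).
    """
    N = len(X)
    rng = np.random.default_rng(seed)
    correct = np.zeros(N)    # times predicted correctly when held out
    held_out = np.zeros(N)   # times held out across the r runs
    for _ in range(n_runs):
        subset = rng.choice(N, size=n, replace=False)
        mask = np.zeros(N, dtype=bool)
        mask[subset] = True
        model = train_fn(X[mask], Y[mask])
        preds = model.predict(X[~mask])
        correct[~mask] += (preds == Y[~mask])
        held_out[~mask] += 1
    # per-example held-out accuracy, i.e. the empirical C-score
    return correct / np.maximum(held_out, 1)
```

Each run trains on a random size-\(n\) subset and scores only the held-out examples, so every \((x, y)\) is evaluated only on runs where it was excluded, matching the \(D \sim_n \mathcal{D}\setminus\{(x,y)\}\) condition on average.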
Proxies of C-Score
Kernel density estimation in input space
\[\hat{C}^{\pm L}(x, y) = \frac{1}{N} \sum^{N}_{i=1} 2 \cdot (\mathbb{I}[y = y_i] - 1/2) \cdot \mathcal{K}(x, x_i)\]
\[\hat{C}^{+L}(x, y) = \frac{1}{N} \sum^{N}_{i=1} \mathbb{I}[y = y_i] \cdot \mathcal{K}(x, x_i)\]
\[\hat{C}(x) = \frac{1}{N} \sum^{N}_{i=1} \mathcal{K}(x, x_i)\]
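A small NumPy sketch of these three proxies, assuming a plain RBF kernel as a placeholder for the paper's label-aware kernel (their Eq. 3); zeroing the diagonal (the example itself) is my own choice, not specified above:

```python
import numpy as np
from scipy.spatial.distance import cdist

def kde_proxies(X, Y, bandwidth=1.0):
    """Three kernel-based proxy scores per example.

    X: (N, d) flattened inputs (or hidden representations); Y: (N,) labels.
    Uses an RBF kernel K(x, x_i) = exp(-||x - x_i||^2 / (2 * bandwidth^2)).
    """
    K = np.exp(-cdist(X, X, "sqeuclidean") / (2.0 * bandwidth ** 2))
    np.fill_diagonal(K, 0.0)            # drop the i = self term (my choice)
    same = (Y[:, None] == Y[None, :])   # label-agreement indicator I[y = y_i]
    N = len(X)
    c_pm = (2.0 * (same - 0.5) * K).sum(axis=1) / N   # signed: +/- labels
    c_plus = (same * K).sum(axis=1) / N               # same-label mass only
    c_dens = K.sum(axis=1) / N                        # label-free density
    return c_pm, c_plus, c_dens
```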
Questions.
- How to decide the sample size or support \(N\)?
Kernel density in hidden space
Same as the above, but \(x\) now denotes the hidden representation of the example.
Questions.
- Which layer's hidden representations should be used? The one just before the softmax classification layer? Or the logits layer?
- Will choosing different layers' representations make a difference?
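For the hidden-space variant, one way to extract the representations and reuse `kde_proxies` from the previous sketch (a hedged PyTorch sketch; which `layer` to pass is exactly the open question above):

```python
import torch

@torch.no_grad()
def hidden_features(model, loader, layer):
    """Collect activations of `layer` for every batch in `loader`,
    flattened to (N, d), for use as the X argument of kde_proxies()."""
    feats = []
    hook = layer.register_forward_hook(
        lambda mod, inp, out: feats.append(out.flatten(1).cpu()))
    model.eval()
    for x, _ in loader:
        model(x)        # forward pass only; the hook records the features
    hook.remove()
    return torch.cat(feats).numpy()
```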
Learning speed
TODO: add more explanations on these possible connections
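As a placeholder until then: the proxy tracks, within a single training run, how early and how consistently \((x, y)\) is predicted correctly; per the summary above, this is the proxy that correlates best with the C-Score (\(\rho = 0.864\)). A minimal PyTorch sketch of that idea (my reading, not the paper's exact implementation; it assumes the loader yields example indices, which is a hypothetical convention):

```python
import numpy as np
import torch

def learning_speed_scores(model, optimizer, loss_fn, loader, n_epochs):
    """Average per-example correctness across epochs of one training run.

    Assumes `loader` yields (x, y, idx) with idx the dataset indices
    (hypothetical), so statistics can be written back per example.
    """
    N = len(loader.dataset)
    correct_sum = np.zeros(N)
    for _ in range(n_epochs):
        for x, y, idx in loader:
            logits = model(x)
            loss = loss_fn(logits, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # correctness of the pre-update prediction for this batch
            hits = (logits.argmax(dim=1) == y)
            correct_sum[idx.numpy()] += hits.cpu().numpy()
    return correct_sum / n_epochs   # high = learned early and stayed correct
```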