References

Belinkov, Yonatan, and James Glass. 2019. “Analysis Methods in Neural Language Processing: A Survey.” Transactions of the Association for Computational Linguistics (TACL) 7: 49–72. https://doi.org/10.1162/tacl_a_00254.

Dvijotham, Krishnamurthy, Sven Gowal, Robert Stanforth, Relja Arandjelovic, Brendan O’Donoghue, Jonathan Uesato, and Pushmeet Kohli. 2018. “Training Verified Learners with Learned Verifiers.” arXiv Preprint arXiv:1805.10265.

Ebrahimi, Javid, Anyi Rao, Daniel Lowd, and Dejing Dou. 2017. “HotFlip: White-Box Adversarial Examples for Text Classification.” arXiv Preprint arXiv:1712.06751.

Hewitt, John, and Percy Liang. 2019. “Designing and Interpreting Probes with Control Tasks.” In EMNLP.

Huang, Po-Sen, Robert Stanforth, Johannes Welbl, Chris Dyer, Dani Yogatama, Sven Gowal, Krishnamurthy Dvijotham, and Pushmeet Kohli. 2019. “Achieving Verified Robustness to Symbol Substitutions via Interval Bound Propagation.” arXiv Preprint arXiv:1909.01492.

Jia, Robin, Aditi Raghunathan, Kerem Göksel, and Percy Liang. 2019. “Certified Robustness to Adversarial Word Substitutions.” arXiv Preprint arXiv:1909.00986.

Koh, Pang Wei, and Percy Liang. 2017. “Understanding Black-Box Predictions via Influence Functions.” In ICML.

Lipton, Zachary C. 2018. “The Mythos of Model Interpretability.” Queue 16 (3): 31–57. https://doi.org/10.1145/3236386.3241340.

Ribeiro, Marco Túlio, Sameer Singh, and Carlos Guestrin. 2018. “Semantically Equivalent Adversarial Rules for Debugging NLP Models.” In ACL.