CANSSI Data Science ARES: Beyond sample-splitting: valid inference while “double-dipping” – Sept 14, 2020, 3pm
Session Description
September 14, 2020 @ 3:00 pm - 4:00 pm
Talk title: Beyond sample-splitting: valid inference while “double-dipping”
Dr. Daniela Witten
Professor of Statistics and Biostatistics
Dorothy Gilford Endowed Chair
University of Washington
Free Event | Registration Required: https://www.eventbrite.com/e/data-science-applied-research-and-education-seminar-daniela-witten-tickets-119647409623
Abstract: As datasets continue to grow in size, in many settings the focus of data collection has shifted away from testing pre-specified hypotheses, and towards hypothesis generation. Researchers are often interested in performing an exploratory data analysis in order to generate hypotheses, and then testing those hypotheses on the same data; I will refer to this as “double dipping”. Unfortunately, double dipping can lead to highly inflated Type I error rates. In this talk, I will consider the special case of hierarchical clustering. First, I will show that sample-splitting does not solve the “double dipping” problem for clustering. Then, I will propose a test for a difference in means between estimated clusters that accounts for the cluster estimation process, using a selective inference framework. I will also show an application of this approach to single-cell RNA-sequencing data. This is joint work with Lucy Gao (University of Waterloo) and Jacob Bien (University of Southern California).
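To make the double-dipping problem in the abstract concrete, here is a minimal simulation sketch. It is not the speakers' method or their selective test. It only illustrates the issue under simplifying assumptions that are ours: one-dimensional data drawn from a single N(0, 1) population (so no true clusters exist), average-linkage hierarchical clustering cut at two clusters, and a naive two-sided z-test with known variance applied to the same data. All function names are hypothetical.

```python
import math
import random


def two_clusters_average_linkage_1d(points):
    """Cut an average-linkage dendrogram of 1-D points at two clusters.

    For 1-D data, the average-linkage distance between two non-overlapping
    clusters equals the difference of their means, and the closest pair is
    always adjacent in sorted order, so only neighbours need comparing.
    """
    clusters = [(x, 1) for x in sorted(points)]  # (sum, size) per cluster
    while len(clusters) > 2:
        gaps = [
            clusters[i + 1][0] / clusters[i + 1][1] - clusters[i][0] / clusters[i][1]
            for i in range(len(clusters) - 1)
        ]
        i = gaps.index(min(gaps))  # closest pair of adjacent clusters
        merged = (clusters[i][0] + clusters[i + 1][0],
                  clusters[i][1] + clusters[i + 1][1])
        clusters[i:i + 2] = [merged]
    return clusters


def naive_p_value(clusters, sigma=1.0):
    """Two-sided z-test for equal cluster means, ignoring how clusters were found."""
    (s1, n1), (s2, n2) = clusters
    z = (s2 / n2 - s1 / n1) / (sigma * math.sqrt(1 / n1 + 1 / n2))
    return math.erfc(abs(z) / math.sqrt(2))


def naive_rejection_rate(n_sims=300, n=30, alpha=0.05, seed=0):
    """Fraction of null datasets on which the naive test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        data = [rng.gauss(0.0, 1.0) for _ in range(n)]  # no true clusters
        if naive_p_value(two_clusters_average_linkage_1d(data)) < alpha:
            rejections += 1
    return rejections / n_sims
```

Calling `naive_rejection_rate()` returns the empirical Type I error rate of the naive test. A valid test would reject on roughly 5% of these null datasets. Because the clusters were estimated from the same data that is then tested, the naive test should reject far more often.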