We present a new algorithm that automatically generates high-performance GPU implementations of complex imaging and machine learning pipelines directly from high-level Halide algorithm code. It is fully automatic, requiring no schedule templates or hand-optimized kernels, and it targets a range of computations significantly broader than that handled by existing autoschedulers. We address the scalability challenge of extending previous approaches to schedule large real-world programs, while enabling a broad set of program rewrites that account for the nested parallelism and memory hierarchy of GPU architectures. We achieve this with two techniques: a hierarchical sampling strategy that groups programs into buckets by structural similarity and then samples representatives from each bucket for evaluation, so that a large search space can be explored by evaluating only a subset of it; and a pre-pass that 'freezes' decisions for the lowest-cost sections of a program, so that more search time is spent on the important stages. We then apply an efficient cost model that combines machine learning, program analysis, and GPU architecture knowledge. Compared to previous work, our method scales combinatorially better with the deeper nested parallelism required by GPUs. We evaluate its performance on a diverse suite of real-world imaging and machine learning pipelines. Our results are on average 1.66X faster than existing automatic solutions (up to 5X), and competitive with what the best human experts achieved in an active effort to beat our automatic results.
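The bucket-then-sample idea can be illustrated with a minimal Python sketch. This is not the implementation described here: the schedule representation, the `structural_key` signature, the loop levels, the tile sizes, and `toy_cost` are all hypothetical stand-ins chosen for illustration, and `toy_cost` merely takes the place of the learned cost model.

```python
import itertools
import random
from collections import defaultdict

LEVELS = ["block", "thread"]  # hypothetical GPU loop levels
TILES = [8, 16, 32]           # hypothetical tile sizes


def make_candidates(num_stages=2):
    """Enumerate every (level, tile) choice for each stage of a toy pipeline."""
    options = [{"level": lv, "tile": t} for lv in LEVELS for t in TILES]
    return [list(combo) for combo in itertools.product(options, repeat=num_stages)]


def structural_key(schedule):
    """Assumed similarity signature: loop level per stage, ignoring tile sizes."""
    return tuple(stage["level"] for stage in schedule)


def bucket_and_sample(candidates, per_bucket, rng):
    """Group candidates by structural key, then draw a few from each bucket."""
    buckets = defaultdict(list)
    for cand in candidates:
        buckets[structural_key(cand)].append(cand)
    reps = []
    for key in sorted(buckets):
        members = buckets[key]
        reps.extend(rng.sample(members, min(per_bucket, len(members))))
    return reps, buckets


def toy_cost(schedule):
    """Stand-in for a cost model: prefers tiles near 16 (arbitrary choice)."""
    return sum(abs(stage["tile"] - 16) for stage in schedule)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
candidates = make_candidates()
reps, buckets = bucket_and_sample(candidates, per_bucket=2, rng=rng)
best = min(reps, key=toy_cost)  # only the sampled representatives are scored
```

In this toy setting, the 36 candidates fall into 4 buckets and only 8 representatives are scored, so the number of evaluations grows with the number of structural classes rather than with the full space. How similarity is defined and how many representatives are drawn per bucket are the design choices that such a sketch leaves open.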