Ligand enrichment among top-ranking hits is a key metric of virtual screening. To avoid bias, decoys should resemble ligands physically, so that enrichment is not attributable to simple differences of gross features. We therefore created a directory of useful decoys (DUD) by selecting decoys that resembled annotated ligands physically but not topologically to benchmark docking performance. DUD has 2950 annotated ligands and 95,316 property-matched decoys for 40 targets. It is by far the largest and most comprehensive public data set for benchmarking virtual screening programs that I am aware of. This paper outlines several ways that DUD can be improved to provide better telemetry to investigators seeking to understand both the strengths and the weaknesses of current docking methods. I also highlight several pitfalls for the unwary: a risk of over-optimization, questions about chemical space, and the proper scope for using DUD. Careful attention to both the composition of benchmarks and how they are used is essential to avoid being misled by overfitting and bias.