Researchers and practitioners have designed and implemented various automated test case generators to support effective software testing. Such generators exist for various languages (e.g., Java, C#, or Python) and for various platforms (e.g., desktop, web, or mobile applications). Such generators exhibit varying effectiveness and efficiency, depending on the testing goals they aim to satisfy (e.g., unit-testing of libraries vs. system-testing of entire applications) and the underlying techniques they implement. In this context, practitioners need to be able to compare different generators to identify the most suited one for their requirements, while researchers seek to identify future research directions. This can be achieved through the systematic execution of large-scale evaluations of different generators. However, the execution of such empirical evaluations is not trivial and requires a substantial effort to collect benchmarks, setup the evaluation infrastructure, and collect and analyse the results. In this paper, we present our JUnit Generation benchmarking infrastructure (JUGE) supporting generators (e.g., search-based, random-based, symbolic execution, etc.) seeking to automate the production of unit tests for various purposes (e.g., validation, regression testing, fault localization, etc.). The primary goal is to reduce the overall effort, ease the comparison of several generators, and enhance the knowledge transfer between academia and industry by standardizing the evaluation and comparison process. Since 2013, eight editions of a unit testing tool competition, co-located with the Search-Based Software Testing Workshop, have taken place and used and updated JUGE. As a result, an increasing amount of tools (over ten) from both academia and industry have been evaluated on JUGE, matured over the years, and allowed the identification of future research directions.