As the Moore's scaling era comes to an end, application specific hardware accelerators appear as an attractive way to improve the performance and power efficiency of our computing systems. A massively heterogeneous system with a large number of hardware accelerators along with multiple general purpose CPUs is a promising direction, but pose several challenges in terms of the run-time scheduling of tasks on the accelerators and design granularity of accelerators. This paper addresses these challenges by developing an example heterogeneous system to enable multiple applications to share the available accelerators. We propose to design accelerators at a lower abstraction to enable applications to be broken down into tasks that can be mapped on several accelerators. We observe that several real-life workloads can be broken down into common primitives that are shared across many workloads. Finally, we propose and design a hardware task scheduler inspired by the hardware schedulers in out-of-order superscalar processors to efficiently utilize the accelerators in the system by scheduling tasks in out-of-order and even speculatively. We evaluate the proposed system on both real-life and synthetic benchmarks based on Digital Signal Processing~(DSP) applications. Compared to executing the benchmark on a system with sequential scheduling, proposed scheduler achieves up to 12x improvement in performance.