AI and Compute – We are publishing an analysis showing that, since 2012, the amount of compute used in the largest AI training runs has grown exponentially, with a doubling time of 3.4 months.
Since 2012, this metric has grown by more than 300,000x (a 2-year doubling period would yield only a 7x increase). Improvements in compute have been a key component of AI progress, so as long as this trend continues, it is worth preparing for the implications of systems far beyond today’s capabilities.
Overview of AI and Compute
Three factors drive AI forward: algorithmic innovation, data (either collected datasets or interactive environments), and the amount of compute available for training. Algorithmic innovation and data are difficult to track, but compute is unusually quantifiable, which provides an opportunity to measure one input to AI progress. Of course, using massive amounts of compute sometimes only exposes the shortcomings of our current algorithms. But in many current domains, more compute predictably leads to better performance and often complements algorithmic advances.
For this analysis, we believe the relevant number is the amount of compute used to train a single model, not the speed of a single GPU or the capacity of the largest data center; this is the number most likely to correlate with how powerful our best models are. Compute per model differs significantly from total bulk compute because limits on parallelism (both hardware and algorithmic) have constrained how large a model can be and how much it can usefully be trained. Of course, important breakthroughs are still made with modest amounts of compute; this analysis covers only compute capability.
The trend represents a roughly 10-fold increase each year. It has been driven in part by specialized hardware (GPUs and TPUs) that performs more operations per second for a given price, but mainly by researchers who repeatedly found ways to use more chips in parallel and were willing to pay the economic cost of doing so.
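The doubling-time arithmetic behind these figures can be checked directly. The sketch below is a minimal illustration rather than part of the original analysis: the function names are hypothetical, and it assumes smooth exponential growth with a constant doubling time.

```python
import math

def growth_factor(elapsed_months: float, doubling_months: float) -> float:
    """Total multiplicative growth after elapsed_months at a fixed doubling time."""
    return 2.0 ** (elapsed_months / doubling_months)

def months_to_grow(factor: float, doubling_months: float) -> float:
    """Months needed to grow by `factor` at a fixed doubling time."""
    return math.log2(factor) * doubling_months

# Growth over one year at a 3.4-month doubling time.
per_year = growth_factor(12, 3.4)

# Years needed for a 300,000x increase at a 3.4-month doubling time.
years_fast = months_to_grow(300_000, 3.4) / 12

# Growth over that same span at a 2-year (24-month) doubling time.
slow_growth = growth_factor(years_fast * 12, 24)

print(f"growth per year at 3.4-month doubling: {per_year:.1f}x")
print(f"years to reach 300,000x: {years_fast:.1f}")
print(f"growth over that span at 2-year doubling: {slow_growth:.1f}x")
```

Under these assumptions, a 3.4-month doubling time compounds to between 11x and 12x per year, consistent with the roughly 10-fold annual figure, and a 300,000x increase takes a little over five years. A 2-year doubling time over that same span gives single-digit growth, in the same range as the 7x quoted earlier; the exact value depends on the start and end dates assumed.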
We can see roughly four different periods:
Before 2012: Using GPUs for machine learning was uncommon, which made any of the results on the chart difficult to achieve.
2012 – 2014: Infrastructure for training on many GPUs was uncommon, so most results used 1-8 GPUs rated at 1-2 TFLOPS, for a total of 0.001-0.1 pfs-days.
2014 – 2016: Large-scale results used 10-100 GPUs rated at 5-10 TFLOPS, resulting in 0.1-10 pfs-days. Diminishing returns on data parallelism meant that larger training runs had limited value.
2016 – 2017: Approaches that enable greater algorithmic parallelism, such as very large batch sizes, architecture search, and expert iteration, along with specialized hardware such as TPUs and faster interconnects, have significantly raised these limits, at least for some applications.
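The pfs-day figures in the list above can be roughly reproduced from GPU count, per-GPU throughput, and training time. The sketch below assumes that one petaflop/s-day (pfs-day) means 10^15 operations per second sustained for one day, about 8.64 x 10^19 operations in total. The function name, the training durations, and the utilization parameter are illustrative assumptions, not values from the analysis.

```python
SECONDS_PER_DAY = 86_400
PFS_DAY_OPS = 1e15 * SECONDS_PER_DAY  # operations in one petaflop/s-day

def pfs_days(num_gpus: int, tflops_per_gpu: float, days: float,
             utilization: float = 1.0) -> float:
    """Estimate training compute in petaflop/s-days.

    utilization is the assumed fraction of peak throughput actually
    achieved; the default of 1.0 gives an upper bound.
    """
    ops_per_second = num_gpus * tflops_per_gpu * 1e12 * utilization
    total_ops = ops_per_second * days * SECONDS_PER_DAY
    return total_ops / PFS_DAY_OPS

# Hypothetical runs at the upper end of each period's hardware range.
# The durations (6 and 10 days) are assumptions chosen for illustration.
small = pfs_days(num_gpus=8, tflops_per_gpu=2, days=6)
large = pfs_days(num_gpus=100, tflops_per_gpu=10, days=10)

print(f"8 GPUs x 2 TFLOPS x 6 days: {small:.3f} pfs-days")
print(f"100 GPUs x 10 TFLOPS x 10 days: {large:.1f} pfs-days")
```

With these assumed durations, the two runs come out near 0.1 and 10 pfs-days, the upper ends of the ranges quoted for 2012 – 2014 and 2014 – 2016. Real utilization is below 1.0, which lowers the totals proportionally.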
We see several reasons to believe that the trend on the chart may continue. Many hardware companies are developing AI-specific chips, and some claim they will achieve a substantial increase in FLOPS/Watt (which is correlated with FLOPS/$) over the next 1-2 years. There may also be gains from simply reconfiguring hardware to perform the same number of operations at a lower economic cost. On the parallelism side, many of the recent algorithmic innovations described above could, in principle, be combined multiplicatively: for example, architecture search and massively parallel SGD.
On the other hand, cost will eventually limit the parallelism side of the trend, and physics will limit the chip efficiency side. For example, today’s largest training runs use hardware that costs in the single-digit millions of dollars to purchase (although the amortized cost is much lower). But most neural network compute today is still spent on inference (deployment), not training, which means that companies can repurpose or afford to buy much larger fleets of chips for training. Therefore, if there are sufficient economic incentives, we may see even more massively parallel training runs, and thus this trend could continue for several more years.