Theoretical Analysis of ImageNet

Rooflines for All Hardware Platforms and CNNs

Combining application requirements with hardware platform characteristics enables performance prediction using UCB's roofline models. Assumptions about where a neural network's weights, activation tensors, and state are stored, combined with the sizes of the datatypes used, allow us to derive the arithmetic intensity of the network during inference. Combined with the roofline for a given hardware platform, this provides insight into whether a neural network will be memory or compute bound, as well as guidance on what is theoretically possible with regard to its throughput.
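A minimal sketch of this calculation is shown below, assuming illustrative operation counts, memory traffic, and peak compute/bandwidth figures (the numbers are placeholders, not the values used for the platforms in this report):

```python
# Minimal roofline sketch: all figures below are illustrative assumptions,
# not measured or vendor-specified values for any platform in this report.

def roofline_throughput(ops_per_input, bytes_per_input,
                        peak_ops_per_s, peak_bytes_per_s):
    """Return (inputs/second bound, arithmetic intensity, limiting resource)."""
    # Arithmetic intensity: operations performed per byte moved to/from memory.
    arithmetic_intensity = ops_per_input / bytes_per_input
    # Compute-bound ceiling and memory-bound ceiling, both in inputs/second.
    compute_bound = peak_ops_per_s / ops_per_input
    memory_bound = peak_bytes_per_s / bytes_per_input
    bound = "compute" if compute_bound < memory_bound else "memory"
    return min(compute_bound, memory_bound), arithmetic_intensity, bound

# Example: a ResNet50-like workload at INT8 (approximate, hypothetical numbers).
ops = 7.7e9        # roughly 2 * 3.86 GMACs per inference
traffic = 25.5e6   # weight bytes at INT8, assuming weights dominate the traffic
throughput, ai, bound = roofline_throughput(ops, traffic,
                                            peak_ops_per_s=4e12,
                                            peak_bytes_per_s=19.2e9)
print(f"AI = {ai:.1f} ops/byte, {bound}-bound, ~{throughput:.0f} inputs/s")
```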

Performance Prediction

The following heatmap shows the theoretical performance of the listed hardware platforms for ImageNet classification. The metric used for theoretical performance is inputs/second. Looking at the plot, it becomes clear that pruning combined with quantization yields some of the best performance results.
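A minimal sketch of how such a heatmap could be produced, assuming a hypothetical pandas DataFrame of predicted throughput indexed by hardware platform with one column per CNN configuration (the values below are placeholders, not the predictions shown in the plot):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical predicted throughput values (inputs/second); the real numbers
# come from the roofline analysis above, these are placeholders only.
predictions = pd.DataFrame(
    {"ResNet50 INT8": [500, 450, 120],
     "ResNet50 INT8 50% pruned": [950, 870, 230],
     "GoogLeNetV1 INT8": [800, 760, 200]},
    index=["ZCU102-DPU", "ZCU104-DPU", "Ultra96-DPU"])

fig, ax = plt.subplots(figsize=(8, 3))
im = ax.imshow(predictions.values, cmap="viridis", aspect="auto")
ax.set_xticks(range(len(predictions.columns)))
ax.set_xticklabels(predictions.columns, rotation=45, ha="right")
ax.set_yticks(range(len(predictions.index)))
ax.set_yticklabels(predictions.index)
fig.colorbar(im, ax=ax, label="inputs/second")
fig.tight_layout()
plt.show()
```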

Experimental Data Analysis

Overview of All Measurements for ImageNet

In the table below, rows correspond to the hardware platforms used for this task, grouped by type (for example FPGA or GPU) and identified by the exact platform name. The columns correspond to the CNN topologies, with a final column listing the sweep of deployment parameters (batch sizes, stream or thread counts, operating modes, etc.) used during experimentation on each platform. When a CNN topology was implemented on a given hardware platform, the corresponding cell shows the precisions (quantization information) and the channel pruning scales; otherwise, "na" indicates that the topology was not executed on that platform. Many combinations of topology and hardware platform are not supported by the vendors' dedicated software environments. INTx denotes a fixed-point integer representation with x bits; FPy denotes a floating-point representation with y bits, for example FP32 is single-precision floating point.

ImageNet Classification
Type | Hardware Platform | ResNet50 | GoogLeNetV1 | MobileNet | Batch/Stream/Thread
FPGA | ZCU102-DPU | [INT8]*[100%,80%,50%,30%] | INT8 | na | [1,2,3,4,5,6,7,8]
FPGA | ZCU104-DPU | INT8 | INT8 | na | [1,2,3,4,5,6,7,8]
FPGA | Ultra96-DPU | [INT8]*[100%,80%,50%,30%] | INT8 | INT8 | [1,2,3,4,5,6,7,8]
FPGA | ZCU104-FINN | na | na | na | [1,2,4,8,16,32,64,128,256,512,10000]
FPGA | ZCU104-BISMO | na | na | na | [2,4,8,16,32,64,128]
GPU | TX2-maxn | FP16,FP32 | FP16,FP32 | na | [1,2,4,8,16,32,64,128]
GPU | TX2-maxp | FP16,FP32 | FP16,FP32 | na | [1,2,4,8,16,32,64,128]
GPU | TX2-maxq | FP16,FP32 | FP16,FP32 | na | [1,2,4,8,16,32,64,128]
TPU | TPU-fast clk | na | INT8 | INT8 | [1]
TPU | TPU-slow clk | na | INT8 | INT8 | [1]
VLIW | NCS | FP16 | na | na | [1,2,4,8,16,32,64,128]
CPU | U96-Quadcore A53 | na | na | na | [2,4,8,16,32,64,128]

Line Plot

Boxplots

Pareto Graphs

The following Pareto graph presents accuracy versus performance in fps for all hardware platforms across different pruning and quantization configurations. This provides insight into accuracy-based comparisons.
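As a minimal sketch, assuming a hypothetical list of (fps, accuracy) measurement points, the Pareto frontier shown in such a plot can be extracted as follows:

```python
def pareto_frontier(points):
    """Return the points that are not dominated in both throughput and accuracy.

    `points` is an iterable of (fps, accuracy) tuples; a point is kept if no
    other point achieves both higher fps and higher accuracy.
    """
    # Sweep in order of decreasing fps, keeping points with strictly
    # increasing accuracy.
    best_accuracy = float("-inf")
    frontier = []
    for fps, acc in sorted(points, key=lambda p: p[0], reverse=True):
        if acc > best_accuracy:
            frontier.append((fps, acc))
            best_accuracy = acc
    return list(reversed(frontier))  # ascending fps

# Hypothetical measurements: (inputs/second, top-1 accuracy in %).
measurements = [(120, 76.1), (450, 74.9), (800, 70.3), (300, 74.9), (950, 68.2)]
print(pareto_frontier(measurements))
```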

Note: We could not reproduce Google's accuracies stated here for EfficientNets on the EdgeTPU.

Theoretical Pareto and Measured Pareto Overlapped

To easily assess how accurate the predictions were, the theoretical Pareto plot and the measured Pareto plot were overlapped. In the plot below we show both the theoretical (orange) and measured (blue) Pareto lines. All measured datapoints are represented as crosses and all theoretical datapoints as circles. Some theoretical datapoints have no matching measured datapoint, and vice versa. The theoretical Pareto curve lies, as expected, to the right of the measured one, as predictions sometimes differ from measurements.

Note: We could not reproduce Google's accuracies stated here for EfficientNets on the EdgeTPU.

Efficiency Plot

To understand the gap between the theoretical predictions and what was measured, an efficiency bar chart was created. The height of each bar reflects absolute performance, whereby all theoretical predictions are shown in red, theoretical peak performance in blue, and all measured datapoints in orange. The orange bars are annotated with the efficiency achieved, as a percentage of the predicted performance. Please note the logarithmic y-axis scale. The theoretical predictions take memory bottlenecks into account; as such, measured performance can actually exceed the predicted result, in which case the percentage can be above 100%.
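A minimal sketch of how the efficiency annotation could be computed and plotted, assuming hypothetical predicted and measured throughput values in inputs/second (the peak-performance bars are omitted for brevity):

```python
import matplotlib.pyplot as plt

# Hypothetical values (inputs/second); the real numbers come from the roofline
# predictions and the measurement runs described above.
platforms = ["ZCU102-DPU", "TX2-maxn", "EdgeTPU"]
predicted = [500.0, 320.0, 410.0]
measured = [430.0, 250.0, 445.0]

# Efficiency: measured throughput as a percentage of the predicted throughput.
efficiency = [100.0 * m / p for m, p in zip(measured, predicted)]

fig, ax = plt.subplots()
x = range(len(platforms))
ax.bar([i - 0.2 for i in x], predicted, width=0.4, color="red", label="predicted")
bars = ax.bar([i + 0.2 for i in x], measured, width=0.4, color="orange", label="measured")
for bar, eff in zip(bars, efficiency):
    ax.annotate(f"{eff:.0f}%", (bar.get_x() + bar.get_width() / 2, bar.get_height()),
                ha="center", va="bottom")
ax.set_yscale("log")  # logarithmic y-axis, as in the plot above
ax.set_xticks(list(x))
ax.set_xticklabels(platforms)
ax.set_ylabel("inputs/second")
ax.legend()
plt.show()
```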

ImageNet Power Measurements

The plot below shows the evolution of power consumption over time for the ImageNet machine learning task on Google's EdgeTPU.