Introduction

This page presents a theoretical analysis of both the hardware platforms and the CNN topologies. To give a general overview of all CNNs and hardware platforms included in our experiments, we provide the following three tables.

Tables

CNNs and Their Accuracy Over All Pruning and Quantization Variants

The table below provides a complete overview of all CNNs included in the experiments, together with their accuracy across all pruning and quantization variants.

All values are top-1 (top-5) accuracy in [%]; nm = not measured.

CNN             Pruning [%]   INT2    INT4    INT8            FP16            FP32
GoogLeNetv1     100           nm      nm      69.24 (88.45)   66.93 (87.83)   66.96 (87.84)
MobileNetv1     100           nm      nm      69.57 (87.71)   nm              nm
EfficientNet-S  100           nm      nm      77              nm              nm
EfficientNet-M  100           nm      nm      78.6            nm              nm
EfficientNet-L  100           nm      nm      80.2            nm              nm
ResNet-50       100           nm      nm      73.29 (91.26)   75.14 (92.12)   75.15 (92.11)
ResNet-50       80            nm      nm      73.30 (91.40)   nm              nm
ResNet-50       50            nm      nm      69.49 (91.00)   nm              nm
ResNet-50       30            nm      nm      68.83 (90.16)   nm              nm
CNV             100           86.86   87.4    nm              87.02           87.06
CNV             50            84.29   84.88   nm              85.55           85.6
CNV             25            79.89   81.09   nm              83.28           83.25
CNV             12.5          73.64   75.85   nm              77.82           77.84
MLP             100           98.75   98.77   nm              97.3            97.31
MLP             50            98.49   98.62   nm              97.45           97.46
MLP             25            98.04   98.29   nm              97.49           97.44
MLP             12.5          96.85   97.54   nm              97.95           97.15

CNNs and Their Compute and Memory Requirements

The next table shows the compute and memory requirements for all CNNs: the total number of operations in Giga-operations ([GOPs]), the model size in millions of weight elements ([ME]), and the operational intensity ([OI]) in operations per byte read from or written to memory.

CNN             Pruning [%]   Total OPs   Model Size   OI (INT2)    OI (INT4)    OI (INT8)    OI (FP16)    OI (FP32)
                              [GOPs]      [ME]         [Ops/Byte]   [Ops/Byte]   [Ops/Byte]   [Ops/Byte]   [Ops/Byte]
GoogLeNetv1     100           3.1         6            2093.97      1046.99      523.49       261.75       130.87
MobileNetv1     100           1.1         4.2          1075.47      537.74       268.87       134.43       67.22
ResNet-50       100           7.7         25.5         1210.84      605.42       302.71       151.36       75.68
ResNet-50       80            6.5         23.7         1086.59      543.3        271.65       135.82       67.91
ResNet-50       50            3.8         15.8         949.85       474.93       237.46       118.73       59.37
ResNet-50       30            2.5         10.1         970.16       485.08       242.54       121.27       60.64
EfficientNet-S  100           4.7         5.4          3481.48      1740.74      870.37       435.18       217.59
EfficientNet-M  100           7.4         6.9          4289.86      2144.93      1072.46      536.23       268.12
EfficientNet-L  100           19.4        10.6         7313.21      3656.6       1828.3       914.15       457.08
CNV             100           0.47        6.16         304.95       152.48       76.24        38.12        19.06
CNV             50            0.12        1.54         308.32       154.16       77.08        38.54        19.27
CNV             25            0.03        0.39         315.01       157.51       78.75        39.38        19.69
CNV             12.5          0.01        0.1          332.61       166.3        83.15        41.58        20.79
MLP             100           0.02        10.01        8            4            2            1            0.5
MLP             50            0.00582     2.91         8            4            2            1            0.5
MLP             25            0.0019      0.93         8            4            2            1            0.5
MLP             12.5          0.0007      0.33         8            4            2            1            0.5
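As a sketch, the OI columns above can be roughly reproduced from the total operations and model size, assuming that memory traffic is dominated by reading each weight element once at the given datatype width (an approximation; small deviations from the table come from rounding of the GOPs and model-size figures):

```python
def operational_intensity(total_ops: float, num_elements: float, bits: int) -> float:
    """OI in operations per byte, assuming memory traffic is dominated by
    reading each weight element once at the given datatype width."""
    bytes_moved = num_elements * bits / 8  # weight traffic scales with width
    return total_ops / bytes_moved

# ResNet-50 (100% pruning factor): 7.7 GOPs, 25.5 M weight elements
for bits in (2, 4, 8, 16, 32):
    oi = operational_intensity(7.7e9, 25.5e6, bits)
    print(f"{bits}-bit: {oi:.2f} Ops/Byte")
```

This also shows why OI doubles each time the datatype width is halved: the operation count is unchanged while the bytes moved shrink proportionally.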

We created interactive bar charts to better illustrate the compute and memory requirements of all CNNs from the previous table.

Hardware Platforms

The table below summarizes all included hardware platforms, each with its peak performance for the different datatypes (INTx, FPx), its memory bandwidth and memory capacity, as well as its thermal design power.

Hardware Platform        INT2       INT4       INT8       FP16       FP32       Memory Bandwidth   Memory Capacity   Power
                         [TOP/sec]  [TOP/sec]  [TOP/sec]  [TOP/sec]  [TOP/sec]  [GBps]             [GB]              [Watt]
Ultra96-DPU              na         na         0.96       na         na         4.26               2                 na
ZCU104-DPU               na         na         4.6        na         na         19.2               4                 na
ZCU102-DPU               na         na         6.71       na         na         19.2               4                 na
ZCU104-FINN              30.7       8.8        na         na         na         19.2               4                 na
ZCU104-BISMO             30.7       8.8        na         na         na         19.2               4                 na
TX2 - maxn               na         na         na         1.33       0.67       59.7               8                 15
TX2 - maxp               na         na         na         1.15       0.57       59.7               8                 15
TX2 - maxq               na         na         na         0.87       0.44       59.7               8                 15
EdgeTPU-fast             na         na         4          na         na         25.6               1                 2
EdgeTPU-slow             na         na         2          na         na         25.6               1                 2
NCS (MyriadX)            na         na         1          0.5        na         12.8               2                 1
U96-Quadcore A53-INT8    0.192      0.192      0.192      na         na         4.26               2                 na

To better illustrate the hardware platforms' peak performance and memory bandwidth, an interactive bar chart can be found below. Note that only the performance for natively supported datatypes is shown.

Overview of Theoretical Evaluation

Rooflines for All Hardware Platforms and CNNs

Combining application requirements with hardware platform characteristics enables performance predictions using UC Berkeley's roofline model. Assumptions about where a neural network's weights, activation tensors, and state are stored, combined with the sizes of the datatypes used, allow us to derive the arithmetic intensity of the network during inference. Combined with the roofline of a given hardware platform, this provides insight into whether a neural network will be memory or compute bound, as well as guidance on the theoretically achievable throughput.

*Applies to the following pruning factors: 100%, 50%, 25%, and 12.5%
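As a minimal sketch of this roofline logic (not the exact evaluation code; the peak performance and memory bandwidth figures are taken from the hardware table above):

```python
def attainable_perf(peak_ops: float, mem_bw_bytes: float, oi: float) -> float:
    """Roofline model: attainable performance [Op/s] is capped by either
    the compute roof (peak_ops) or the memory roof (mem_bw_bytes * OI)."""
    return min(peak_ops, mem_bw_bytes * oi)

# ZCU104-DPU at INT8: 4.6 TOP/sec peak, 19.2 GBps memory bandwidth
peak, bw = 4.6e12, 19.2e9
for name, oi in [("ResNet-50 100% (INT8)", 302.71), ("MLP 100% (INT8)", 2.0)]:
    perf = attainable_perf(peak, bw, oi)
    bound = "compute" if perf >= peak else "memory"
    print(f"{name}: {perf / 1e12:.3f} TOP/sec, {bound} bound")
```

On this platform, ResNet-50 sits to the right of the roofline's ridge point and is compute bound at the full 4.6 TOP/sec, while the MLP's very low OI leaves it memory bound at only about 0.038 TOP/sec.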

Performance Prediction

The following heatmaps show the theoretical performance of the listed hardware platforms across the machine learning tasks MNIST, ImageNet, and CIFAR-10. The metric used for theoretical performance is inputs per second.
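As a sketch of how such a theoretical inputs-per-second figure can be derived, the roofline-attainable performance is divided by the per-input workload (platform and network numbers taken from the tables above):

```python
def predicted_inputs_per_sec(peak_ops: float, mem_bw: float,
                             oi: float, ops_per_input: float) -> float:
    """Theoretical throughput: roofline-attainable Op/s divided by the
    number of operations required to process one input."""
    return min(peak_ops, mem_bw * oi) / ops_per_input

# ZCU104-DPU at INT8 running ResNet-50 100%: 7.7 GOPs per input, OI = 302.71
print(predicted_inputs_per_sec(4.6e12, 19.2e9, 302.71, 7.7e9))  # ≈ 597 inputs/second
```

Since this is a compute-bound case, the prediction is simply peak performance over the per-input operation count; for memory-bound cases the memory roof takes over.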

MNIST

For MNIST, the combination of quantization and pruning delivers some of the best performance results.

ImageNet

For ImageNet, the combination of quantization and pruning likewise delivers some of the best performance results.

CIFAR-10

Finally, for CIFAR-10, the combination of quantization and pruning again delivers some of the best performance results.

Theoretical Pareto Curves

In the following plots, we present a theoretical Pareto curve for each classification task.
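Such a curve can be computed as a sketch: from the set of (throughput, accuracy) points of all design variants, keep only those that no other point dominates in both dimensions. The design points below are hypothetical:

```python
def pareto_front(points):
    """Return the (throughput, accuracy) points not dominated by any other
    point, where higher is better in both dimensions."""
    return sorted(
        p for p in points
        if not any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in points)
    )

# Hypothetical (inputs/second, top-1 accuracy [%]) design points
pts = [(100, 75.1), (400, 73.3), (400, 69.5), (50, 75.2)]
print(pareto_front(pts))  # [(50, 75.2), (100, 75.1), (400, 73.3)]
```

The point (400, 69.5) drops out because (400, 73.3) matches its throughput at higher accuracy; the remaining points trace the theoretical accuracy-throughput trade-off.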

MNIST

ImageNet

CIFAR-10