Empirical Evaluation of Edge AI Deployment Strategies Involving Black-Box and White-Box Operators

Authors

Singh, Jaskirat

Date

2024-05-28

Type

thesis

Language

eng

Keyword

Edge AI, Deployment Strategies, Inference Latency, Model Performance

Abstract

Edge AI enables deploying models across Mobile, Edge, and Cloud (MEC) tiers using a wide range of ML model transformation operators. Although these operators are broadly categorized into white-box (training-based) and black-box (non-training-based) techniques, deciding which type of operator to use in an Edge AI setup to achieve a performance advantage is mostly left to the personal judgment of MLOps engineers. This study conducts inference experiments with three black-box operators (Partitioning, SPTQ, Early Exiting), three white-box operators (QAT, Pruning, Knowledge Distillation), and their combinations across the three MEC deployment tiers on four Computer Vision (CV) and two Natural Language Processing (NLP) models. We used a reproducible Docker-based simulation of the MEC tiers, in which sequential inference requests with inputs (images, texts) of widely varying sizes were issued to measure the latency introduced by each deployment strategy. Findings suggest that for CV models, Edge deployment using the hybrid SPTQ Early Exit black-box operator is preferred when lower latency (1.17x/1.45x over SPTQ/Early Exit) is the priority, at a medium accuracy loss in terms of effect size. However, if minimizing accuracy loss is the priority, the SPTQ black-box operator on the Edge tier should be used. For models with large input data samples (ResNet, ResNeXt, DUC), an Edge tier with higher network/computational capabilities is more viable than partitioning and Mobile/Cloud deployment strategies. A network-constrained Cloud tier is a better alternative for models with small input data samples (FCN, BERT, RoBERTa). Among the white-box operators, the Distilled operator achieves lower latency than QAT/Pruning in the Mobile (3.36x/3.34x) and Edge (2.66x/3.31x) tiers, at the cost of a small to medium accuracy loss in terms of effect size.
Moreover, the Distilled SPTQ hybrid operator should be preferred over the non-hybrid operators (Distilled/SPTQ/QAT/Pruned) when lower latency (1.52x/2.89x/3.93x/5.17x) is the priority in the Edge tier, at a small to medium accuracy loss in terms of effect size. This thesis aims to be a stepping stone in the field of MLOps, evaluating the benefits and trade-offs of deployment strategies with respect to latency and accuracy.
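The tier-selection findings above (large inputs favor a capable Edge tier; small inputs tolerate a network-constrained Cloud tier) follow from a simple cost model: end-to-end latency is input transfer time plus on-tier compute time. A minimal sketch of that model, with entirely hypothetical bandwidth and compute figures that are not measurements from this thesis:

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    bandwidth_mbps: float  # uplink from the client device to this tier
    compute_ms: float      # per-request model execution time on this tier

def request_latency_ms(tier: Tier, input_mb: float) -> float:
    """End-to-end latency = network transfer time + on-tier inference time."""
    transfer_ms = input_mb * 8 / tier.bandwidth_mbps * 1000
    return transfer_ms + tier.compute_ms

# Hypothetical tiers: the mobile device runs locally (no transfer) but is
# slow; the edge has high bandwidth and moderate compute; the cloud has
# the fastest compute but a constrained network path.
mobile = Tier("mobile", bandwidth_mbps=float("inf"), compute_ms=900.0)
edge = Tier("edge", bandwidth_mbps=400.0, compute_ms=70.0)
cloud = Tier("cloud", bandwidth_mbps=20.0, compute_ms=20.0)

# With these numbers, the cloud wins for the small input and the edge
# wins for the large one, mirroring the abstract's qualitative finding.
for size_mb in (0.1, 5.0):
    best = min((mobile, edge, cloud), key=lambda t: request_latency_ms(t, size_mb))
    print(f"{size_mb} MB input -> fastest tier: {best.name}")
```

The crossover point depends entirely on the assumed bandwidth and compute figures; the thesis measures these empirically per model and operator rather than assuming them.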

License

Queen's University's Thesis/Dissertation Non-Exclusive License for Deposit to QSpace and Library and Archives Canada
ProQuest PhD and Master's Theses International Dissemination Agreement
Intellectual Property Guidelines at Queen's University
Copying and Preserving Your Thesis
This publication is made available by the authority of the copyright owner solely for the purpose of private study and research and may not be copied or reproduced except as permitted by the copyright laws without written authority from the copyright owner.
