
HSU Repository > Graduate School > Department of Applied AI > 2. Thesis

Lightweighting Techniques and Block-Level Scheduling for Efficient Multi-DNN Execution


Type: Thesis

Abstract: In embedded environments, running multiple deep learning models concurrently causes execution delays due to resource contention among the models, which can lead to critical performance degradation in latency-sensitive systems. Mitigating this problem requires model lightweighting techniques, of which quantization and pruning are the most representative. However, these techniques have mostly been applied to classification models, and studies that apply them to detection and tracking models are comparatively rare. To extend this line of work, this thesis addresses efficient DNN execution from two complementary perspectives.
First, in a server environment, pruning and quantization were applied individually and in combination to a range of vision models, and the resulting changes in model size, parameter count, and accuracy were systematically analyzed. The experiments show that, in some models, combining the two techniques yields greater parameter reduction than either technique alone while keeping accuracy loss minimal. In addition, TensorRT-based runtime optimization, which optimizes inference without changing the model structure, was applied. Through additional computation-graph optimization and kernel fusion, it delivered inference speedups and memory-efficiency gains that are difficult to obtain with the lightweighting techniques alone.
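
As a rough illustration of what combining the two techniques means for parameter count and model size, the following is a minimal sketch on a flat list of weights. It assumes unstructured magnitude pruning and symmetric per-tensor int8 quantization; the abstract does not state which pruning or quantization variants the thesis uses, and the function names and the 50% sparsity value are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    pruned, removed = [], 0
    for w in weights:
        # Remove at most k weights, even if several share the threshold value.
        if abs(w) <= threshold and removed < k:
            pruned.append(0.0)
            removed += 1
        else:
            pruned.append(w)
    return pruned


def quantize_int8(weights):
    """Symmetric per-tensor quantization: returns (int8 values, scale)."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def summarize(weights, sparsity=0.5):
    """Apply pruning, then quantization, and report size-related statistics."""
    pruned = magnitude_prune(weights, sparsity)
    q, scale = quantize_int8(pruned)
    restored = [v * scale for v in q]
    return {
        "params": len(weights),
        "nonzero_after_pruning": sum(1 for w in pruned if w != 0.0),
        "fp32_bytes": 4 * len(weights),   # dense 32-bit storage
        "int8_bytes": len(q),             # dense 8-bit storage
        "scale": scale,
        "quantized": q,
        "max_abs_error": max(abs(a - b) for a, b in zip(pruned, restored)),
    }
```

Calling `summarize([0.9, -0.05, 0.4, 0.01, -0.7, 0.02, 0.3, -0.003])` keeps four nonzero weights and stores each value in one byte instead of four. Accuracy impact cannot be shown on a toy list and would have to be measured on the actual model.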
Second, to minimize execution delays when multiple DNN tasks run concurrently in an embedded environment, we propose block-level dynamic scheduling together with block-level dynamic replacement. Each model is divided into functional units called blocks, which serve as the units of execution. The scheduler identifies blocks that incur additional latency when executed in parallel and switches them to sequential execution. Moreover, using LAG, a metric that quantifies the execution delay of each block, blocks expected to incur significant latency are replaced at runtime with lightweight alternatives, maintaining a balance between latency and accuracy in real time. In experiments running heterogeneous DNNs concurrently on a representative embedded platform, the NVIDIA Jetson AGX Xavier, the proposed method reduces latency by up to 29.3% while preserving more than 90% of the baseline accuracy.
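
The per-block decision described above can be sketched as follows. This is an illustration under stated assumptions, not the thesis's algorithm: the abstract does not define LAG, so here it is assumed to be the ratio of a block's smoothed latency under concurrent execution to its latency when run alone, with the smoothing done by an exponential moving average (EMA appears only in the keywords). The class name, the smoothing factor, and both thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlockStats:
    baseline_ms: float               # latency when the block runs alone
    ema_ms: Optional[float] = None   # smoothed latency under concurrent execution

    def observe(self, latency_ms: float, alpha: float = 0.3) -> None:
        """Update the EMA with a newly measured concurrent latency."""
        if self.ema_ms is None:
            self.ema_ms = latency_ms
        else:
            self.ema_ms = alpha * latency_ms + (1 - alpha) * self.ema_ms

    def lag(self) -> float:
        """Assumed LAG: smoothed concurrent latency relative to the baseline."""
        if self.ema_ms is None:
            return 1.0  # no measurements yet, so assume no slowdown
        return self.ema_ms / self.baseline_ms


def decide(stats: BlockStats,
           sequential_threshold: float = 1.5,
           replace_threshold: float = 2.5) -> str:
    """Choose an execution mode for a block from its current LAG value."""
    lag = stats.lag()
    if lag >= replace_threshold:
        return "replace"      # swap in the lightweight variant of the block
    if lag >= sequential_threshold:
        return "sequential"   # parallel execution is hurting, so serialize
    return "parallel"
```

With a 10 ms baseline, successive concurrent measurements of 12 ms, 30 ms, and 50 ms move the decision from `"parallel"` to `"sequential"` to `"replace"`. The EMA keeps a single slow measurement from triggering a replacement on its own, which is one plausible reason to smooth the signal.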

【Keywords】Embedded Deep Learning, LAG, EMA, Model Compression, Block Switching, Multi-DNN Scheduling