
HSU Repository > Graduate School > Department of Information and Computer Engineering > 2. Thesis

Bitslicing for Block Ciphers on Embedded Microcontrollers and GPUs: Implementation Techniques and Trade-Offs


Type: Thesis

Abstract: This dissertation investigates bitslicing and fixslicing as table-free, timing-uniform software implementation paradigms for block ciphers across heterogeneous platforms, ranging from resource-constrained 32-bit microcontrollers to massively parallel NVIDIA GPUs. In such environments, overall security and performance depend not only on the cipher design but also on concrete implementation choices (especially state layout, packing/unpacking strategy, and the selected slice degree) made under strict constraints on registers, memory behavior, and parallel execution efficiency. We present three case studies. First, we implement SPEEDY-5/6/7-192 on ARM Cortex-M3 and RV32I-based RISC-V using a 6×32 bitsliced state representation. By combining SWAPMOVE-based packing, a Boolean-network realization of the 6-bit SubBox, and rotation-centric constant-time diffusion (including rotation-XOR fusion where supported), the SPEEDY-7-192 implementation improves from 15,407/18,096 cycles per byte (byte-oriented reference) to 85.1/109.2 cycles per byte on Cortex-M3/RV32I, respectively, while maintaining a fixed instruction trace and secret-independent memory access. Second, for AES-GCM on ARM Cortex-M4, we build a 2-way fixsliced AES-CTR core and integrate FACE-style caching directly in the fixsliced domain. We also evaluate two GHASH design points that expose a practical trade-off between performance and assurance: a compact 4-bit table multiplier and a table-free Karatsuba-based routine for strict constant-time deployments. FACE yields up to a 19.4% improvement for long-message AES-GCTR, and the 4-bit GHASH option is roughly twice as fast as the Karatsuba baseline, at the cost of secret-derived table indices. Third, we design high-degree bitsliced CUDA implementations of PRESENT and GIFT on an RTX 3060.
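Using 32-way bitslicing per thread, branch-free Boolean S-boxes, efficient bit permutations, and device-side bitsliced counter generation, we achieve peak exhaustive-search throughput of 214–584 Gbit/s and bulk-encryption throughput of up to 85 Gbit/s (including host-device transfers). Across these studies, we distill actionable cross-platform guidelines for choosing state representations and slice degrees to balance throughput, resource footprint, and timing-uniform execution on embedded microcontrollers and GPUs.

Abstract (translated from Korean): This dissertation analyzes the characteristics and performance of bitslicing- and fixslicing-based software implementations of block ciphers on heterogeneous platforms, from highly resource-constrained microcontrollers to high-performance GPUs, and systematically organizes the trade-offs between constant-time execution and practical performance. Modern cryptographic systems often share the same algorithms and modes of operation in environments where embedded devices and parallel accelerators coexist. In that setting, the security and performance of the overall system depend heavily not only on the design of the algorithm itself but also on implementation choices made on the concrete hardware, especially the state layout, the packing/unpacking strategy, and the bitslice degree. This dissertation presents three case studies. First, for SPEEDY-5/6/7-192 on ARM Cortex-M3 and RV32I-based RISC-V microcontrollers, we introduce a 6×32 bitsliced state representation and improve implementation efficiency through assembly-level optimization. Specifically, SWAPMOVE-based packing eliminates the table-based implementation, the 6-bit SubBox is implemented as a Boolean network, and the diffusion layer is mapped to rotation-centric operations (using rotation-XOR fusion where available) while preserving a constant-time execution structure. As a result, SPEEDY-7-192 improves substantially from the byte-oriented reference implementation (15,407/18,096 cycles per byte) to 85.1/109.2 cycles per byte on Cortex-M3/RV32I, respectively, while achieving secret-independent memory access and a fixed instruction execution flow. Second, for AES-GCM on ARM Cortex-M4, we integrate FACE-style caching directly inside the fixsliced domain on top of a 2-way fixsliced AES-CTR core, and we compare two GHASH design points to make the trade-off between performance and security (strictness of constant-time execution) explicit: (1) 4-bit table-based multiplication (high performance) and (2) table-free Karatsuba-based multiplication (constant-time oriented). FACE provides up to a 19.4% performance improvement for long-message AES-GCTR. For GHASH, the 4-bit table approach is roughly twice as fast as the Karatsuba baseline, but because it uses table-index accesses derived from the hash subkey and state, it trades away "strict constant-time" guarantees.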
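Third, we design high-degree bitsliced GPU implementations of PRESENT and GIFT on an NVIDIA RTX 3060. We apply 32-way bitslicing per thread, construct branch-free Boolean S-boxes and efficient bit permutations, and generate counters and indices directly in bitsliced form inside the kernel, thereby efficiently supporting both bulk encryption and exhaustive (key) search. The proposed implementations achieve peak throughput of 214–584 Gbit/s in exhaustive search and up to about 85 Gbit/s in bulk encryption, including host-device transfers.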
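Finally, synthesizing the three case studies, this dissertation presents practical guidelines for selecting the state representation and bitslice degree according to platform characteristics (register constraints, availability of rotate support, GPU occupancy and register pressure, and so on), and it summarizes design principles for offsetting packing/unpacking costs as well as implementation patterns for constant-time execution.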
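
Of the building blocks named in the abstract, SWAPMOVE is compact enough to sketch. The following is a minimal C illustration of the generic SWAPMOVE primitive commonly used for bitslice packing, not code taken from the thesis: the function name `swapmove`, its pointer-based signature, and the masks shown in the usage note are assumptions made for illustration. It exchanges the bits of one word selected by a mask with the bits of another word selected by the same mask shifted left by `n`, using only shifts, XORs, and an AND, so there are no secret-dependent branches or table lookups.

```c
#include <stdint.h>

/*
 * SWAPMOVE (illustrative sketch, hypothetical signature):
 * exchange the bits of *b selected by `mask` with the bits of *a
 * selected by (mask << n). Requires n < 32.
 * The operation is branch-free and performs no table lookups.
 */
static inline void swapmove(uint32_t *a, uint32_t *b, uint32_t mask, unsigned n)
{
    /* t has a 1 wherever the two bits to be exchanged differ */
    uint32_t t = (*b ^ (*a >> n)) & mask;

    *b ^= t;        /* flip the differing bits in b */
    *a ^= t << n;   /* flip the matching bits in a  */
}
```

For example, with `a = 0xFFFFFFFF` and `b = 0`, calling `swapmove(&a, &b, 0x55555555u, 1)` exchanges the odd-numbered bits of `a` with the even-numbered bits of `b`, leaving both words equal to `0x55555555`. Because the operation is a pure swap, applying the same call a second time restores the original inputs, which is why the same routine can serve for both packing and unpacking.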