


Welfare Blind Spot Prediction via Variable Expansion and Synthetic Data Integration: Structural missingness imputation with recall optimization


Type: Thesis

Alternative Title: Structural Missingness Imputation and Construction of a Recall-Centered Evaluation Framework

Abstract:

Park, Young-Sik, Major in MIS, Dept. of Business Administration, The Graduate School, Hansung University

Welfare blind spots refer to cases in which individuals in need of institutional support are not identified in a timely manner because of the limitations of administrative data. In particular, structural missingness, which arises when policy variables are introduced at different points in time, leaves specific variables absent from historical data and undermines both the timeliness of analysis and predictive accuracy. Conventional welfare-target detection systems have largely been evaluated on overall accuracy, thereby overlooking false negatives, that is, households actually in crisis that the system misses. Consequently, the most critical task in the policy field is to minimize the omission of welfare recipients, i.e., to optimize recall. However, relying solely on recall as the primary metric creates a dilemma in which welfare benefits must be extended indiscriminately to all applicants. Recall optimization should therefore be complemented by F1-score evaluation to achieve a balanced and practical assessment framework.

To address these issues, this study applies a progressive feature expansion method using synthetic data generated by a Tabular Variational AutoEncoder (TVAE). Recall was established as the primary evaluation metric and F1-score as the secondary, complementary metric, and a classification model optimized for both criteria was developed. Threshold adjustment was also employed to derive classification results aligned with policy objectives. The dataset consisted of 3,280,593 welfare application records accumulated from January 2018 to November 2023, and three experimental phases were conducted.
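
As an illustration of the evaluation scheme described above, the following minimal Python sketch computes recall, precision, and F1 at a given decision threshold and then selects, among candidate thresholds that meet a recall floor, the one with the highest F1. The function names, the toy labels and scores, and the recall-floor selection rule are assumptions made for illustration; the abstract does not state how the threshold was chosen.

```python
def recall_precision_f1(y_true, scores, threshold):
    """Return (recall, precision, f1) when scores >= threshold are flagged positive."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f1


def pick_threshold(y_true, scores, candidates, min_recall):
    """Hypothetical rule: recall is the primary constraint, F1 the tie-breaker.

    Returns the candidate with the highest F1 among those whose recall is at
    least min_recall, or None if no candidate reaches the recall floor.
    """
    best_key, best_threshold = None, None
    for t in candidates:
        recall, _, f1 = recall_precision_f1(y_true, scores, t)
        if recall < min_recall:
            continue
        key = (f1, recall)
        if best_key is None or key > best_key:
            best_key, best_threshold = key, t
    return best_threshold


# Toy data (not from the thesis): 1 = household in crisis, 0 = not in crisis.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.45, 0.15, 0.6, 0.42, 0.3, 0.1, 0.05, 0.35]

best = pick_threshold(y_true, scores, candidates=(0.5, 0.4, 0.3), min_recall=0.7)
print(best)  # 0.4
```

On this toy data, lowering the threshold from 0.5 to 0.4 raises recall from 0.50 to 0.75 at the cost of one additional false positive, which is the kind of trade-off the secondary F1 criterion is meant to keep in check.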
In Phase 1, an experiment using only complete data (with no missing values) was performed to verify the effectiveness of progressive feature expansion. Applying Random Forest, XGBoost, and LightGBM algorithms demonstrated that feature expansion contributed to consistent improvements in recall, ROC-AUC, and other performance indicators.

In Phase 2, the quality of synthetic data was validated by comparing it with the original data using Wasserstein Distance and Jensen-Shannon Divergence. The results indicated that many variables showed distributions nearly identical to the original data, and in some cases, JSD values converged to zero, demonstrating perfect distributional alignment. These results strongly support the validity of TVAE-based imputation.

In Phase 3, both original and TVAE-generated synthetic data were combined and applied to the same feature expansion scenario, with performance evaluated primarily based on recall. The XGBoost model achieved a recall of 75.35% and an F1-score of 0.5082 at a threshold of 0.4, demonstrating significantly enhanced detection performance compared to scenarios without addressing structural missingness.

Further variable importance analysis revealed that housing instability-related factors, such as “households below a certain rent threshold,” “arrears in public rental housing,” and “failed emergency support applications,” were the most critical predictors of welfare blind spots. Conversely, some variables with extremely low frequency (e.g., confirmed neonatal hearing loss, suicide attempts) had limited predictive contribution. These findings imply that policymakers must weigh data collection efficiency and cost-effectiveness when allocating resources.

The contributions of this study are fourfold. First, it overcomes the limitations of accuracy-focused evaluations in previous research by establishing a recall-centered evaluation framework aligned with policy goals.
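Second, it empirically validates the applicability of TVAE-based synthetic data to real-world welfare risk information, demonstrating the feasibility of synthetic data utilization in data-driven public administration. Third, by combining feature expansion with variable importance analysis, the study suggests effective variable combinations and collection priorities for identifying welfare blind spots. Fourth, the experimental results can be applied to the Haengbok-eum system and local government big data-based welfare risk detection systems, providing a robust policy rationale for the efficient and equitable allocation of limited administrative resources.

In conclusion, this study presents an empirical methodology that integrates a recall-centered evaluation framework with generative model-based progressive feature expansion to minimize the omission of welfare recipients. The findings offer a pathway for advancing proactive, data-driven welfare risk detection systems and enhancing the reliability of the welfare delivery framework.

【Keywords】 Welfare blind spot, TVAE, Progressive variable expansion, Recall, Machine learning, Synthetic data, Structural missingness

A Combined Variable Expansion and Synthetic Data Model for Predicting Welfare Blind Spots: Structural Missingness Imputation and Construction of a Recall-Centered Evaluation Framework

Park, Young-Sik, Major in Management Information Systems, Department of Business Administration, The Graduate School, Hansung University

A welfare blind spot refers to the phenomenon in which people who need institutional support are not identified in time because of the limitations of administrative data. In particular, structural missingness, which arises because policy variables were introduced at different points in time, leaves certain variables absent from past data and undermines the timeliness of analysis and predictive accuracy. Existing systems for identifying welfare recipients have mainly been evaluated on overall accuracy, which overlooks false negatives, the error of missing households that are actually in crisis. The most important task in the policy field is therefore to minimize the omission of welfare recipients, that is, to optimize recall. However, if recall alone is set as the central metric, one faces the dilemma of having to provide welfare benefits to every applicant. An evaluation metric therefore needs to be carefully designed as a weighted harmonic mean that considers precision together with recall.

To address these problems, this study conducted experiments applying a progressive variable expansion technique that uses synthetic data generated by TVAE (Tabular Variational AutoEncoder). Recall was set as the primary evaluation metric and F1-score as the secondary, complementary metric, and a classification model optimized for both metrics was built. Threshold adjustment was also used to derive classification results that fit the policy objective. The research data comprise 3,280,593 welfare application records accumulated from January 2018 to November 2023, which were used in three experimental phases.

Phase 1 used only complete data, with no missing values, to verify whether progressively expanding the feature variables used in the analysis is effective. Applying three algorithms, Random Forest, XGBoost, and LightGBM, confirmed that variable expansion contributed overall to improvements in performance metrics such as recall and ROC-AUC.

Phase 2 validated the quality of the synthetic data by comparing it with the original data. Using Wasserstein Distance and Jensen-Shannon Divergence, many variables showed distributions nearly identical to the original, and for some variables the JSD value converged to 0, achieving perfect distributional agreement. This strongly supports the validity of TVAE-based imputation.

Finally, Phase 3 combined the original data with TVAE-generated synthetic data, ran the same variable expansion scenario, and evaluated performance with a focus on recall. The XGBoost model recorded a recall of 75.35% and an F1-score of 0.5082 at a threshold of 0.4, a markedly higher detection performance than before structural missingness was addressed.

The additional variable importance analysis showed that variables related to housing instability, such as 'households at or below the monthly rent threshold,' 'arrears in public rental housing,' and 'experience of being rejected for emergency support,' were the key factors explaining welfare blind spots. In contrast, some variables with extremely low frequency (e.g., confirmed neonatal hearing loss, history of suicide attempts) were found to have limited predictive contribution. This suggests that collection efficiency and cost-effectiveness should be considered when allocating policy resources in the future.

The contributions of this study are as follows. First, it overcomes the accuracy-centered bias of previous research and establishes a recall-centered evaluation framework, presenting a classification model that fits the policy objective. Second, by applying TVAE-based synthetic data to the analysis of real welfare crisis information in order to compensate for structural missingness, it empirically demonstrates the potential of synthetic data in data-driven administration. Third, through variable expansion and variable importance analysis, it proposes effective variable combinations and collection priorities for identifying welfare blind spots. Fourth, the experimental results are applicable to the Haengbok-eum system and to local government big data-based welfare crisis detection systems, and can serve as a policy basis for allocating limited administrative resources efficiently and equitably.

In sum, this study presents an empirical methodology that minimizes the omission of welfare recipients through a new approach combining a recall-centered evaluation framework with generative model-based progressive variable expansion. It can contribute to advancing proactive, data-driven welfare crisis detection systems and to improving the reliability of the welfare delivery system.

【Keywords】 Welfare blind spot, TVAE, Progressive variable expansion, Recall, Machine learning, Synthetic data, Structural missingness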
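
The Phase 2 comparison between original and synthetic distributions can be illustrated with a short, dependency-free Python sketch. It computes the Jensen-Shannon divergence between two categorical distributions and a one-dimensional Wasserstein distance between two equal-size numeric samples. The function names, the base-2 logarithm, the equal-size restriction, and the toy values are assumptions for illustration; the abstract does not say which implementation or settings the study used.

```python
import math
from collections import Counter


def category_probs(values, categories):
    """Empirical probability of each category, in the order given by `categories`."""
    counts = Counter(values)
    total = len(values)
    return [counts[c] / total for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs: 0 for identical, 1 for disjoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0 here.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def wasserstein_1d(xs, ys):
    """1-D Wasserstein-1 distance for equal-size samples.

    For two samples of the same size this equals the mean absolute
    difference between their sorted values.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal-size samples")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Toy categorical variable (hypothetical values, not from the thesis).
categories = ["rent", "arrears", "none"]
original = ["rent", "rent", "arrears", "none"]
synthetic = ["rent", "arrears", "arrears", "none"]
jsd = js_divergence(category_probs(original, categories),
                    category_probs(synthetic, categories))

# Toy numeric variable (hypothetical values).
w = wasserstein_1d([10, 20, 30, 40], [12, 18, 33, 41])  # 2.0
```

A JSD of exactly zero, as reported for some variables, means the two empirical distributions coincide on that variable; it says nothing by itself about joint dependence between variables.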