Detecting File Encoding with ICU

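I have mostly used chardet for encoding detection, but there were cases where it did not detect the encoding well.

There is a library called ICU (International Components for Unicode), and when I tested it, it seemed to work well.

Simple, fast encoding detection → chardet
Accurate detection with broad text/internationalization support → ICU

Below is code that uses ICU to detect the encoding of a file given as input.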

#include <unicode/ucsdet.h>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

std::string detectEncoding(const std::string& filename) {
    // Read the whole file as raw bytes
    std::ifstream file(filename, std::ios::binary);
    if (!file) {
        throw std::runtime_error("Cannot open file: " + filename);
    }
    std::vector<char> buffer((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());

    // Create the ICU charset detector
    UErrorCode status = U_ZERO_ERROR;
    UCharsetDetector* csd = ucsdet_open(&status);
    if (U_FAILURE(status)) {
        throw std::runtime_error("Failed to create UCharsetDetector");
    }

    // Hand the file data to the detector (the length parameter is an int32_t)
    ucsdet_setText(csd, buffer.data(), static_cast<int32_t>(buffer.size()), &status);
    if (U_FAILURE(status)) {
        ucsdet_close(csd);
        throw std::runtime_error("Failed to set text");
    }

    // Run detection; the returned match is owned by the detector
    const UCharsetMatch* match = ucsdet_detect(csd, &status);
    if (U_FAILURE(status) || !match) {
        ucsdet_close(csd);
        throw std::runtime_error("Encoding detection failed");
    }

    // Copy the charset name into a std::string before closing the detector
    const char* encoding = ucsdet_getName(match, &status);
    std::string result = (U_SUCCESS(status) && encoding) ? encoding : "unknown";

    ucsdet_close(csd);
    return result;
}

int main(int argc, char* argv[]) {
    if (argc < 2) {
        std::cerr << "Usage: " << argv[0] << " <filename>" << std::endl;
        return 1;
    }
    try {
        std::string encoding = detectEncoding(argv[1]);
        std::cout << "Detected encoding: " << encoding << std::endl;
    } catch (const std::exception& e) {
        std::cerr << "Error: " << e.what() << std::endl;
        return 1;
    }
    return 0;
}
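
The function above reads the entire file into memory, and ucsdet_setText takes its length as an int32_t, so very large files do not fit this approach. One possible workaround is to pass the detector only a leading sample of the file. The sketch below is an assumption, not part of the original code: readPrefix is a hypothetical helper, and whether a prefix is enough for reliable detection depends on the file (cutting at an arbitrary byte can also split a multibyte character at the end of the sample).

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post):
// read at most maxBytes raw bytes from the start of a file.
std::vector<char> readPrefix(const std::string& filename, std::size_t maxBytes) {
    std::ifstream file(filename, std::ios::binary);
    if (!file) {
        throw std::runtime_error("Cannot open file: " + filename);
    }
    std::vector<char> buffer(maxBytes);
    file.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
    // gcount() is the number of bytes actually read,
    // which is smaller than maxBytes when the file is short.
    buffer.resize(static_cast<std::size_t>(file.gcount()));
    return buffer;
}
```

In detectEncoding, the buffer could then be filled with something like readPrefix(filename, 64 * 1024) instead of the istreambuf_iterator range; the 64 KiB figure is an arbitrary choice for illustration.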

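Compile and test it as follows. The input here is the source file itself, in its original form with Korean comments and string literals. The library names icuin and icudt are the ones used by Windows builds of ICU; on Linux the equivalents are typically -licui18n -licuuc -licudata.

$ g++ detect_encoding_with_icu.cpp -licuuc -licudt -licuin

$ ./a.exe ./detect_encoding_with_icu.cpp
Detected encoding: EUC-KR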

License: Attribution-NonCommercial-NoDerivs
