drug-price-kr: Drug Price Data Collection and Git Tracking Pipeline

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

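Goal: Build a repo that collects the HIRA drug reimbursement list (약제급여목록표), maps standard codes through the drug price master (약가마스터), and tracks price changes through the git diff of a single file, data/prices.csv.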

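Architecture: Download the HIRA Excel file directly → map standard codes with the drug price master (data.go.kr) → overwrite the single file data/prices.csv and commit. The git diff is the price change history. A GitHub Actions cron job runs the collection automatically once a month.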

Tech Stack: Python 3.11+, openpyxl, requests, csv (stdlib), GitHub Actions

Repo: https://github.com/PharmOS-kr/drug-price-kr
Issue: dkmin/clawd#22 (migration), dkmin/clawd#21 (initial collection, done)

CSV schema (after the change)

표준코드,제품코드,제품명,규격,단위,상한금액,업체명,주성분코드,주성분명,투여,분류
8806717050118,645302132,포크랄시럽(포수클로랄)_(9.5g/95mL),95(1),mL/병,129,한림제약(주),130830ASY,chloral hydrate 9.5g(0.1g/mL),내복,112

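PK: 표준코드 (standard code, the 13-digit KD code); falls back to 제품코드 (product code) when no standard code is mapped
Key tracked field: 상한금액 (price ceiling)
Sort order: 표준코드, then 제품코드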
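
The key and ordering rules above can be sketched as follows. The helper names `primary_key` and `sort_key` are hypothetical, and the second and third sample rows use made-up codes. The sort key mirrors the one used by `write_csv` in collect.py, which places rows without a 표준코드 after all mapped rows.

```python
def primary_key(row: dict) -> str:
    """Return 표준코드 when present, otherwise fall back to 제품코드."""
    return row.get("표준코드") or row["제품코드"]


def sort_key(row: dict) -> tuple:
    """Sort by 표준코드, then 제품코드; the "z" sentinel puts unmapped rows last."""
    return (row.get("표준코드") or "z", row["제품코드"])


rows = [
    {"표준코드": "", "제품코드": "645302132"},                # unmapped row
    {"표준코드": "8806717050118", "제품코드": "645302132"},
    {"표준코드": "8800000000001", "제품코드": "111111111"},  # made-up sample
]
rows.sort(key=sort_key)
keys = [primary_key(r) for r in rows]
print(keys)
```

Because digits compare lower than "z", mapped rows always come first and the unmapped row is identified by its 제품코드.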

File structure (after the change)

drug-price-kr/
├── README.md
├── LICENSE                            # CC BY 4.0
├── .github/workflows/collect.yml      # automatic collection once a month
├── scripts/
│   ├── collect.py                     # direct HIRA download + mapping + CSV output
│   ├── build_mapping.py               # drug price master → 제품코드↔표준코드 mapping
│   └── requirements.txt               # openpyxl, requests
├── data/
│   ├── prices.csv                     # ← single file, overwritten and committed each month
│   └── reference/
│       └── drug-master-20251031.csv   # drug price master (source for standard-code mapping)
└── .gitignore

Key change: data/{YYYY-MM}/prices.csv (monthly folders) → data/prices.csv (single file)
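
Because every snapshot overwrites the same path, two consecutive versions of data/prices.csv differ only where the data changed. The following is a minimal sketch of the comparison this layout enables, assuming the two versions are available as text (for example from `git show <rev>:data/prices.csv`). The function name `price_changes`, the shortened header, and the new price 125 are illustrative assumptions.

```python
import csv
import io


def price_changes(old_text: str, new_text: str) -> list[tuple[str, str, str]]:
    """Return (제품코드, old price, new price) for products whose 상한금액 changed."""

    def load(text: str) -> dict[str, str]:
        reader = csv.DictReader(io.StringIO(text))
        return {r["제품코드"]: r["상한금액"] for r in reader}

    old, new = load(old_text), load(new_text)
    # Only products present in both snapshots can have a changed price.
    return [
        (code, old[code], new[code])
        for code in sorted(old.keys() & new.keys())
        if old[code] != new[code]
    ]


OLD = "표준코드,제품코드,제품명,상한금액\n8806717050118,645302132,SAMPLE,129\n"
NEW = "표준코드,제품코드,제품명,상한금액\n8806717050118,645302132,SAMPLE,125\n"
changes = price_changes(OLD, NEW)
print(changes)
```

With monthly folders the same comparison would need two file paths; with a single file, the revision history supplies both sides.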

Task 1: Convert the drug price master to UTF-8 and store it

Files:

Create: data/reference/drug-master-20251031.csv

Step 1: Download the drug price master (skip if it is already in /tmp)

curl -sL -o /tmp/drug-price-master.csv \
  "https://www.data.go.kr/cmm/cmm/fileDownload.do?atchFileId=FILE_000000003550228&fileDetailSn=1&insertDataPrcus=N"

Step 2: Convert EUC-KR → UTF-8 and save

mkdir -p data/reference
iconv -f euc-kr -t utf-8 /tmp/drug-price-master.csv > data/reference/drug-master-20251031.csv
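
Where iconv is unavailable, the same conversion can be sketched in Python. The function name `convert_to_utf8` and the default source encoding are assumptions: cp949 is chosen because it decodes EUC-KR text and also the extended Hangul syllables that some Windows-exported files contain. Whether the master file needs that is not stated in this plan.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path, src_encoding: str = "cp949") -> int:
    """Re-encode src as UTF-8 at dst and return the number of newline characters."""
    text = src.read_bytes().decode(src_encoding)
    dst.parent.mkdir(parents=True, exist_ok=True)
    # Writing bytes keeps the original line endings untouched.
    dst.write_bytes(text.encode("utf-8"))
    return text.count("\n")


# Demonstration on a throwaway file instead of the real master.
with tempfile.TemporaryDirectory() as d:
    src = Path(d) / "master-euckr.csv"
    dst = Path(d) / "reference" / "master-utf8.csv"
    src.write_bytes("표준코드,제품코드\n".encode("euc-kr"))
    lines = convert_to_utf8(src, dst)
    converted = dst.read_text(encoding="utf-8")
    print(lines, converted.strip())
```

A decoding error here would indicate that the source file is not in the assumed encoding, which is worth knowing before the file is committed.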

Step 3: Verify the data

head -2 data/reference/drug-master-20251031.csv
wc -l data/reference/drug-master-20251031.csv

Expected: a 22-column header, ~298,000 rows, UTF-8

Step 4: Commit

git add data/reference/drug-master-20251031.csv
git commit -m "data: add drug master reference (298K items, data.go.kr 20251031)"

Task 2: Write the mapping script

Files:

Create: scripts/build_mapping.py

Step 1: Write build_mapping.py

Build a dictionary from the drug price master that maps 제품코드(개정후), or 제품코드 when the former is empty, to 표준코드.

#!/usr/bin/env python3
"""Build the 제품코드 → 표준코드 mapping from the drug price master."""

import csv
from pathlib import Path


def load_mapping(master_path: Path | None = None) -> dict[str, str]:
    """Return a 제품코드 (9 digits) → 표준코드 (13 digits) mapping."""
    if master_path is None:
        repo_root = Path(__file__).resolve().parent.parent
        master_path = repo_root / "data" / "reference" / "drug-master-20251031.csv"

    mapping: dict[str, str] = {}
    with open(master_path, encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            std_code = (row.get("표준코드") or "").strip()
            product_code = (row.get("제품코드(개정후)") or "").strip()
            if not product_code:
                product_code = (row.get("제품코드") or "").strip()
            if product_code and std_code:
                # If a product code appears on several rows, the last row wins.
                mapping[product_code] = std_code
    return mapping


if __name__ == "__main__":
    m = load_mapping()
    print(f"Loaded {len(m)} mappings")
    # Print a few sample entries
    for k, v in list(m.items())[:5]:
        print(f"  {k} → {v}")

Step 2: Test the mapping

python3 scripts/build_mapping.py

Expected: Loaded NNNNN mappings, followed by 5 sample entries
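
`load_mapping` stores one 표준코드 per product code, so if the master lists the same product code on several rows with different standard codes, only the last row survives. Whether that happens in the real file is an assumption here (for example, one standard code per package size), so a quick check may be worthwhile. The sketch below uses a hypothetical helper `find_ambiguous` and made-up sample codes; it applies the same column fallback as `load_mapping`.

```python
import csv
import io
from collections import defaultdict


def find_ambiguous(master_csv: str) -> dict[str, list[str]]:
    """Return product codes that appear with more than one 표준코드."""
    seen: dict[str, set[str]] = defaultdict(set)
    for row in csv.DictReader(io.StringIO(master_csv)):
        std = (row.get("표준코드") or "").strip()
        prod = (row.get("제품코드(개정후)") or "").strip()
        if not prod:
            prod = (row.get("제품코드") or "").strip()
        if prod and std:
            seen[prod].add(std)
    return {p: sorted(s) for p, s in seen.items() if len(s) > 1}


SAMPLE = (
    "표준코드,제품코드(개정후),제품코드\n"
    "8800000000011,111111111,\n"
    "8800000000028,111111111,\n"
    "8800000000035,,222222222\n"
)
ambiguous = find_ambiguous(SAMPLE)
print(ambiguous)
```

If the result is non-empty for the real master, the choice of which 표준코드 to keep becomes a design decision rather than an accident of row order.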

Step 3: Commit

git add scripts/build_mapping.py
git commit -m "feat: add product-code to standard-code mapping script"

Task 3: Replace collect.py with the direct HIRA download approach

Files:

Rewrite: scripts/collect.py
Modify: scripts/requirements.txt

Step 1: Update requirements.txt

requests>=2.31.0
openpyxl>=3.1.0

Step 2: Rewrite collect.py in full

#!/usr/bin/env python3
"""Download the HIRA drug reimbursement list → map standard codes → save data/prices.csv."""

import csv
import sys
import tempfile
from pathlib import Path

import openpyxl
import requests

from build_mapping import load_mapping

# HIRA drug reimbursement list download URL pattern
# brdBltNo is the post ID (passed as apndBrdBltNo); the latest post has to be looked up
HIRA_DOWNLOAD_URL = (
    "https://www.hira.or.kr/bbs/bbsCDownLoad.do"
    "?apndNo=1&apndBrdBltNo={blt_id}&apndBrdTyNo=1&apndBltNo=59"
)
# Latest post ID (1703 as of 2026-04)
LATEST_BLT_ID = 1703

CSV_HEADER = [
    "표준코드", "제품코드", "제품명", "규격", "단위",
    "상한금액", "업체명", "주성분코드", "주성분명", "투여", "분류",
]


def download_excel(blt_id: int) -> Path:
    """Download the Excel file from HIRA and return the temp file path."""
    url = HIRA_DOWNLOAD_URL.format(blt_id=blt_id)
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()

    # delete=False keeps the file on disk after close; main() removes it later.
    with tempfile.NamedTemporaryFile(suffix=".xlsx", delete=False) as tmp:
        tmp.write(resp.content)
    print(f"Downloaded {len(resp.content):,} bytes → {tmp.name}", file=sys.stderr)
    return Path(tmp.name)


def parse_new_format(rows: list[tuple]) -> list[dict]:
    """New format (16 columns, 2025-12 onward per the post ID table in Task 4): every row carries 주성분명."""
    items = []
    for row in rows[1:]:
        if len(row) >= 14:
            items.append({
                "제품코드": str(row[8] or ""),
                "제품명": str(row[9] or ""),
                "규격": str(row[11] or ""),
                "단위": str(row[12] or ""),
                "상한금액": str(row[13] or ""),
                "업체명": str(row[10] or ""),
                "주성분코드": str(row[5] or ""),
                "주성분명": str(row[7] or ""),
                "투여": str(row[1] or ""),
                "분류": str(row[2] or ""),
            })
    return items


def parse_old_format(rows: list[tuple]) -> list[dict]:
    """Old 2025 format (12 columns): ingredient-name rows are interleaved with product rows."""
    items = []
    current_ingredient = ""
    for row in rows[1:]:
        if row[5] is None:  # ingredient-name row
            current_ingredient = str(row[4] or "")
        else:  # product row
            items.append({
                "제품코드": str(row[4] or ""),
                "제품명": str(row[5] or ""),
                "규격": str(row[7] or ""),
                "단위": str(row[8] or ""),
                "상한금액": str(row[9] or ""),
                "업체명": str(row[6] or ""),
                "주성분코드": str(row[3] or ""),
                "주성분명": current_ingredient,
                "투여": str(row[1] or ""),
                "분류": str(row[2] or ""),
            })
    return items


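def parse_excel(xlsx_path: Path) -> list[dict]:
    """Parse the Excel file, detecting the format from the column count."""
    wb = openpyxl.load_workbook(xlsx_path, read_only=True)
    ws = wb.active
    title = ws.title  # read before closing the workbook
    rows = list(ws.iter_rows(values_only=True))
    wb.close()

    # Fail loudly instead of overwriting prices.csv with an empty file.
    if not rows:
        raise ValueError(f"No rows found in {xlsx_path}")
    ncols = len(rows[0])

    print(f"Sheet: {title}, {len(rows)-1} data rows, {ncols} cols", file=sys.stderr)

    # The new format has 16 columns and the old one 12; 14 separates the two.
    if ncols >= 14:
        return parse_new_format(rows)
    return parse_old_format(rows)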


def write_csv(items: list[dict], mapping: dict[str, str], output_path: Path) -> None:
    """Apply the standard-code mapping, then save as CSV."""
    for item in items:
        code = item["제품코드"]
        item["표준코드"] = mapping.get(code, "")

    # Sort by 표준코드, then 제품코드; the "z" sentinel puts unmapped rows last.
    items.sort(key=lambda x: (x["표준코드"] or "z", x["제품코드"]))
    output_path.parent.mkdir(parents=True, exist_ok=True)

    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=CSV_HEADER)
        writer.writeheader()
        writer.writerows(items)

    mapped = sum(1 for i in items if i["표준코드"])
    print(f"Saved {len(items)} rows ({mapped} mapped) → {output_path}", file=sys.stderr)


def main():
    import argparse

    parser = argparse.ArgumentParser(description="Collect the HIRA drug reimbursement list")
    parser.add_argument("--blt-id", type=int, default=LATEST_BLT_ID,
                        help="HIRA post ID (default: %(default)s)")
    args = parser.parse_args()

    repo_root = Path(__file__).resolve().parent.parent
    mapping = load_mapping()

    xlsx_path = download_excel(args.blt_id)
    try:
        items = parse_excel(xlsx_path)
    finally:
        xlsx_path.unlink()  # always delete the temp file, even if parsing fails
    output_path = repo_root / "data" / "prices.csv"
    write_csv(items, mapping, output_path)


if __name__ == "__main__":
    main()
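
The parsers above convert cells with `str(row[i] or "")`, which turns a numeric 0 into an empty string and would render a float cell such as 129.0 as "129.0". Whether the HIRA workbook contains such cells is not stated in this plan, so the following is only a defensive sketch. `cell_to_str` is a hypothetical helper, not part of collect.py.

```python
def cell_to_str(value) -> str:
    """Convert an Excel cell value to a stable string for CSV output."""
    if value is None:
        return ""
    # Render whole-number floats without the trailing ".0" to avoid diff noise.
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value).strip()


samples = [None, 0, 129.0, 12.5, " 645302132 "]
converted = [cell_to_str(v) for v in samples]
print(converted)
```

Keeping the string form stable matters here because any formatting drift between months would show up in git diff as a spurious price change.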

Step 3: Local test

cd scripts && python3 collect.py --blt-id 1703

Expected: data/prices.csv is created with ~21,888 rows and includes the 표준코드 column
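
The expectations above can also be checked programmatically. This is a minimal sketch: `validate_prices` and its `min_rows` threshold are hypothetical, and the sample row is the one from the CSV schema section.

```python
import csv
import io

EXPECTED_HEADER = [
    "표준코드", "제품코드", "제품명", "규격", "단위",
    "상한금액", "업체명", "주성분코드", "주성분명", "투여", "분류",
]


def validate_prices(csv_text: str, min_rows: int = 1) -> list[str]:
    """Return a list of problems; an empty list means the file looks sane."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != EXPECTED_HEADER:
        return [f"unexpected header: {reader.fieldnames}"]
    problems: list[str] = []
    rows = list(reader)
    if len(rows) < min_rows:
        problems.append(f"only {len(rows)} rows, expected at least {min_rows}")
    missing = sum(1 for r in rows if not r["제품코드"])
    if missing:
        problems.append(f"{missing} rows without 제품코드")
    return problems


SAMPLE = (
    "표준코드,제품코드,제품명,규격,단위,상한금액,업체명,주성분코드,주성분명,투여,분류\n"
    "8806717050118,645302132,포크랄시럽(포수클로랄)_(9.5g/95mL),95(1),mL/병,129,"
    "한림제약(주),130830ASY,chloral hydrate 9.5g(0.1g/mL),내복,112\n"
)
ok = validate_prices(SAMPLE)
too_few = validate_prices(SAMPLE, min_rows=20000)
print(ok, too_few)
```

For the real file, the text would come from `Path("data/prices.csv").read_text(encoding="utf-8")`, with `min_rows` set somewhat below the expected ~21,888.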

Step 4: Check the standard-code mapping rate

# Run from scripts/ (Step 3 left the shell there)
python3 -c "
import csv
with open('../data/prices.csv', encoding='utf-8') as f:
    rows = list(csv.DictReader(f))
mapped = sum(1 for r in rows if r['표준코드'])
print(f'{mapped}/{len(rows)} ({mapped*100//len(rows)}%) mapped')
"

Step 5: Commit

git add scripts/collect.py scripts/requirements.txt
git commit -m "feat: rewrite collect.py — HIRA download + standard code mapping"

Task 4: Recreate the Git history (force push)

⚠️ This deletes the entire existing history and rebuilds it from scratch. Only do this while the repo has 0 stars.

HIRA post ID → month mapping:

blt_id	Month	Format
1698	2025-11	old (12 columns)
1699	2025-12	new (16 columns)
1700	2026-01	new
1701	2026-02	new
1702	2026-03	new
1703	2026-04	new

Step 1: Commit the initial structure on a new orphan branch

git checkout --orphan rebuild
git rm -rf .
# Restore the base files
git checkout main -- LICENSE .gitignore scripts/build_mapping.py scripts/collect.py scripts/requirements.txt data/reference/drug-master-20251031.csv .github/workflows/collect.yml
git add -A
git commit -m "init: drug-price-kr, Git-tracked drug price data"

Step 2: Commit each month's data in order

For each month, run collect.py to overwrite data/prices.csv, then commit:

for entry in "1698:2025-11" "1699:2025-12" "1700:2026-01" "1701:2026-02" "1702:2026-03" "1703:2026-04"; do
  blt_id="${entry%%:*}"
  month="${entry##*:}"
  # Run from the repo root; stop the backfill if a download or parse fails
  python3 scripts/collect.py --blt-id "$blt_id" || break
  rows=$(tail -n +2 data/prices.csv | wc -l | tr -d ' ')
  git add data/prices.csv
  git commit -m "data: ${month} drug reimbursement list (${rows} items)

Source: HIRA drug reimbursement list (brdBltNo=${blt_id})"
  echo "=== ${month} done ==="
done

Step 3: Confirm that git diff works

git log --oneline
git diff HEAD~1 -- data/prices.csv | head -30
git log -p -S "타이레놀" -- data/prices.csv | head -20

Expected: the diff shows changes in 상한금액
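
To turn the raw diff into a list of price changes, the removed and added CSV lines can be paired by 제품코드. The sketch below assumes unified diff text as produced by `git diff` and the column order from the CSV schema section (제품코드 at index 1, 상한금액 at index 5). The function name `changed_prices` and the new price 125 in the sample are illustrative assumptions.

```python
import csv


def changed_prices(diff_text: str) -> dict[str, tuple[str, str]]:
    """Map 제품코드 → (old 상한금액, new 상한금액) from unified diff text."""
    removed: dict[str, str] = {}
    added: dict[str, str] = {}
    for line in diff_text.splitlines():
        # Skip file headers, hunk headers, and context lines.
        if line.startswith(("---", "+++")) or not line.startswith(("-", "+")):
            continue
        fields = next(csv.reader([line[1:]]))
        if len(fields) < 6 or fields[0] == "표준코드":
            continue  # blank line or the CSV header row
        target = removed if line[0] == "-" else added
        target[fields[1]] = fields[5]
    return {
        code: (removed[code], added[code])
        for code in removed.keys() & added.keys()
        if removed[code] != added[code]
    }


DIFF = """\
--- a/data/prices.csv
+++ b/data/prices.csv
@@ -2 +2 @@
-8806717050118,645302132,SAMPLE,95(1),mL/병,129,한림제약(주),130830ASY,x,내복,112
+8806717050118,645302132,SAMPLE,95(1),mL/병,125,한림제약(주),130830ASY,x,내복,112
"""
result = changed_prices(DIFF)
print(result)
```

Products that appear only as added or only as removed lines are new listings or delistings, which this sketch deliberately leaves out.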

Step 4: Add the README

git checkout main -- README.md
# Edit the README to match the new structure, then
git add README.md
git commit -m "docs: README for single-file git-tracked drug prices"

Step 5: Replace main and force push

git branch -M main
git push --force origin main

Task 5: Update GitHub Actions

Files:

Modify: .github/workflows/collect.yml

Step 1: Update the workflow for the HIRA download approach

name: Collect Drug Prices

on:
  schedule:
    - cron: '0 0 1 * *'  # 00:00 UTC on the 1st of each month
  workflow_dispatch:

jobs:
  collect:
    runs-on: ubuntu-latest
    permissions:
      contents: write

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install -r scripts/requirements.txt

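      - name: Collect data
        working-directory: scripts
        # Runs with the default LATEST_BLT_ID in collect.py; update that constant
        # (or pass --blt-id) when HIRA publishes a new post.
        run: python collect.py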

      - name: Check for changes
        id: diff
        run: |
          if git diff --quiet data/prices.csv; then
            echo "changed=false" >> "$GITHUB_OUTPUT"
          else
            echo "changed=true" >> "$GITHUB_OUTPUT"
          fi

      - name: Commit and push
        if: steps.diff.outputs.changed == 'true'
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add data/prices.csv
          git commit -m "data: $(date +%Y-%m) drug reimbursement list, automated collection"
          git push

Step 2: Commit

git add .github/workflows/collect.yml
git commit -m "ci: update workflow for HIRA direct download"
git push origin main

Task 6: Issue cleanup

Step 1: Close dkmin/clawd#22

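gh issue close 22 --repo dkmin/clawd --comment "Migration complete:
- [x] Stored the drug price master as UTF-8
- [x] Product code ↔ standard code mapping script
- [x] CSV schema change (added 표준코드)
- [x] Switched to a single file (data/prices.csv)
- [x] Recreated commit history + force push
- [x] Replaced collect.py with direct HIRA download
- [x] README update

Price changes can now be tracked with git diff:
  git diff HEAD~1 -- data/prices.csv
  git log -p -S '<product name>' -- data/prices.csv"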

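References

Data sources

Drug reimbursement list (drug prices and price ceilings): https://www.hira.or.kr/bbsDummy.do?pgmid=HIRAA030014050000
Drug price master (standard-code mapping): https://www.data.go.kr/data/15067462/fileData.do

legalize-kr model

https://github.com/legalize-kr/legalize-kr (768 stars)
Core idea: overwrite a single file and track changes to laws through git diff
drug-price-kr applies this pattern to drug prices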

dkmin/2026-04-08-drug-price-kr-data-collection.md
