data-comparison/CHANGES_SUMMARY.md
2025-08-20 15:55:21 +07:00

Changes Summary - Data Comparison Logic Fix

Issues Fixed

1. Removed All-Sheet Functionality

  • Problem: The tool processed all sheets together, so items on different sheets were flagged as duplicates of each other
  • Solution: Removed all-sheet functionality completely; the tool now processes one sheet at a time
  • Changes:
    • Replaced extract_kst_coordi_items() with extract_kst_coordi_items_for_sheet(sheet_name)
    • Updated all comparison methods to operate on a single sheet
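A minimal sketch of what sheet-specific extraction could look like. The row layout (dicts with `sheet`, `source`, `title`, and `episode` keys), the standalone function name `extract_items_for_sheet`, and the "Title - Episode N" key format are assumptions made for illustration; the real extract_kst_coordi_items_for_sheet(sheet_name) is a method in data_comparator.py and may be structured differently.

```python
def extract_items_for_sheet(rows, sheet_name):
    """Collect KST and Coordi item keys for one sheet only (hypothetical helper)."""
    kst_items, coordi_items = [], []
    for row in rows:
        # Rows from any other sheet are skipped, so they can never be
        # counted as duplicates of items on the requested sheet.
        if row["sheet"] != sheet_name:
            continue
        key = f'{row["title"]} - Episode {row["episode"]}'
        if row["source"] == "KST":
            kst_items.append(key)
        elif row["source"] == "Coordi":
            coordi_items.append(key)
    return kst_items, coordi_items


# Placeholder data: the same item appears on two different sheets.
rows = [
    {"sheet": "US URGENT", "source": "KST", "title": "Title A", "episode": "1"},
    {"sheet": "TH URGENT", "source": "KST", "title": "Title A", "episode": "1"},
    {"sheet": "US URGENT", "source": "Coordi", "title": "Title A", "episode": "1"},
]
us_kst, us_coordi = extract_items_for_sheet(rows, "US URGENT")
```

With this shape, the TH URGENT row never reaches the US URGENT lists, so the KST side of US URGENT holds the item once rather than twice.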

2. Fixed Duplicate Detection Logic

  • Problem: Items appearing once on each side were incorrectly marked as duplicates
  • Solution: Fixed _find_duplicates_in_list() to only return items that actually appear multiple times
  • Changes: Used Counter to count occurrences and only return items with count > 1
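The Counter-based fix described above can be sketched as follows. This is an illustration under assumptions, not the actual body of _find_duplicates_in_list(): it assumes items are hashable keys such as strings, and the name `find_duplicates_in_list` and the sample data are placeholders.

```python
from collections import Counter


def find_duplicates_in_list(items):
    """Return the items that occur more than once, in first-seen order."""
    counts = Counter(items)
    # Counter keeps insertion order, so results follow first appearance.
    return [item for item, count in counts.items() if count > 1]


# Placeholder data: "Title A" appears once on each side, "Title B" twice in KST.
kst = ["Title A - Episode 1", "Title B - Episode 2", "Title B - Episode 2"]
coordi = ["Title A - Episode 1"]
kst_dups = find_duplicates_in_list(kst)
coordi_dups = find_duplicates_in_list(coordi)
```

Because each list is counted on its own, an item that appears once in KST and once in Coordi has a count of 1 on both sides and is not reported.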

3. Implemented Mixed Duplicate Priority

  • Problem: The same item was listed as both a pure duplicate and a mixed duplicate
  • Solution: Mixed duplicates (items in both datasets with duplicates on one side) now take priority
  • Changes: Generate mixed duplicates first, then exclude those keys from pure duplicate lists
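One way to picture the priority rule, as a sketch under assumptions: the function name `classify_duplicates`, the returned dict layout, and the sample keys are hypothetical, and "mixed" is taken here to mean a key present in both datasets and repeated on at least one side.

```python
from collections import Counter


def classify_duplicates(kst_items, coordi_items):
    """Split duplicates into mixed, KST-only, and Coordi-only groups."""
    kst_counts = Counter(kst_items)
    coordi_counts = Counter(coordi_items)

    # Step 1: mixed duplicates are computed first.
    mixed = [
        key
        for key in kst_counts
        if key in coordi_counts
        and (kst_counts[key] > 1 or coordi_counts[key] > 1)
    ]
    mixed_keys = set(mixed)

    # Step 2: pure duplicates exclude every key already claimed as mixed.
    kst_only = [k for k, n in kst_counts.items() if n > 1 and k not in mixed_keys]
    coordi_only = [k for k, n in coordi_counts.items() if n > 1 and k not in mixed_keys]
    return {"mixed": mixed, "kst": kst_only, "coordi": coordi_only}


# Placeholder data: "A" is in both datasets and repeated in Coordi; "C" is Coordi-only.
kst = ["A - Episode 1", "B - Episode 2"]
coordi = ["A - Episode 1", "A - Episode 1", "C - Episode 3", "C - Episode 3"]
result = classify_duplicates(kst, coordi)
```

Here "A - Episode 1" lands only in the mixed group, while "C - Episode 3" remains a pure Coordi duplicate.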

4. Sheet-Specific Analysis Only

  • Problem: Cross-sheet contamination in duplicate detection
  • Solution: All analysis now happens within a single sheet context
  • Changes:
    • get_comparison_summary() now requires a sheet filter and defaults to the first sheet when none is given
    • Removed old filtering methods, replaced with sheet-specific extraction
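The default-to-first-sheet behaviour might look roughly like the sketch below. The helper `resolve_sheet`, its parameters, and the error handling are assumptions for illustration; the summary does not say how get_comparison_summary() resolves the sheet internally.

```python
def resolve_sheet(sheet_names, sheet_filter=None):
    """Pick the sheet to analyse, falling back to the first one (hypothetical helper)."""
    if not sheet_names:
        raise ValueError("no sheets available")
    if sheet_filter is None:
        # No filter given: analysis is still limited to exactly one sheet.
        return sheet_names[0]
    if sheet_filter not in sheet_names:
        raise ValueError(f"unknown sheet: {sheet_filter!r}")
    return sheet_filter


sheets = ["US URGENT", "TH URGENT"]
default_sheet = resolve_sheet(sheets)
chosen_sheet = resolve_sheet(sheets, "TH URGENT")
```

Resolving to a single name up front means no code path can fall back to analysing all sheets at once.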

BA Confirmed Cases - All Working

US URGENT Sheet

  • 금수의 영역 - Episode 17 → Coordi duplicate
  • 신결 - Episode 23 → Coordi duplicate
  • 트윈 가이드 - Episode 31 → Mixed duplicate (exists in both, duplicates in Coordi)
  • No longer shows 트윈 가이드 - Episode 31 as a pure Coordi duplicate

TH URGENT Sheet

  • 백라이트 - Episode 53-1x(휴재) → KST duplicate (doesn't appear in Coordi)

Code Changes Made

data_comparator.py

  1. New Methods:

    • extract_kst_coordi_items_for_sheet(sheet_name) - Sheet-specific extraction
    • categorize_mismatches_for_sheet(sheet_data) - Sheet-specific categorization
    • generate_mismatch_details_for_sheet() - Sheet-specific mismatch details with priority logic
    • group_by_title_for_sheet() - Sheet-specific grouping
  2. Updated Methods:

    • _find_duplicates_in_list() - Fixed to only return actual duplicates
    • get_comparison_summary() - Now sheet-specific only
    • print_comparison_summary() - Added sheet name to output
  3. Removed Methods:

    • extract_kst_coordi_items() - Replaced with sheet-specific version
    • categorize_mismatches() - Replaced with sheet-specific version
    • generate_mismatch_details() - Replaced with sheet-specific version
    • group_by_title() - Replaced with sheet-specific version
    • filter_by_sheet() - No longer needed
    • filter_grouped_data_by_sheet() - No longer needed
    • calculate_filtered_counts() - No longer needed

web_gui.py

  • Updated matched items extraction to use new grouped data structure
  • Removed dependency on old categorize_mismatches() method

Test Files

  • test_ba_confirmed_cases.py - New test to verify BA confirmed expectations
  • test_sheet_filtering.py - Updated to work with new sheet-specific logic

Performance Improvements

  • Faster analysis, since no cross-sheet processing takes place
  • More accurate duplicate detection
  • Cleaner separation of concerns between sheets

Verification

All tests pass:

  • Sheet filtering works correctly
  • Duplicate detection is accurate
  • BA confirmed cases match expectations
  • Web interface works properly
  • Mixed duplicates take priority over pure duplicates