Character Encoding Guide
Learn about common encoding issues in file comparison and how to resolve them.
What is Encoding?
Computers store all data as binary (0s and 1s). To save the letter "A" in a text file, this character must be converted to a specific number (byte value). This conversion rule is called "encoding."
The same character can be stored as different byte values depending on the encoding. For example, the Korean syllable "가" is stored as the three bytes EA B0 80 in UTF-8 but as the two bytes B0 A1 in EUC-KR.
If a file is read with the wrong encoding, characters appear garbled. This is commonly known as "character corruption" or "mojibake."
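As a small illustration of the two points above, the following Python sketch encodes one character under two encodings and then deliberately decodes UTF-8 bytes with the wrong one. The variable names are only illustrative.

```python
# One character, two encodings, two different byte sequences.
text = "가"  # U+AC00
utf8_bytes = text.encode("utf-8")
euckr_bytes = text.encode("euc-kr")
print(utf8_bytes.hex(" "))   # ea b0 80
print(euckr_bytes.hex(" "))  # b0 a1

# Mojibake: UTF-8 bytes interpreted as ISO-8859-1 (Latin-1).
korean = "한글"
garbled = korean.encode("utf-8").decode("latin-1")
print(len(korean), len(garbled))  # 2 6 (each byte became one character)

# Nothing was lost here because no bytes were replaced,
# so reversing the wrong step restores the text.
restored = garbled.encode("latin-1").decode("utf-8")
print(restored == korean)  # True
```

The garbled string is recoverable only because ISO-8859-1 keeps every byte; once a program substitutes replacement characters, the original bytes are gone.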
Major Encoding Types
UTF-8 is currently the most widely used encoding. It can represent every Unicode character, and over 95% of websites use UTF-8. ASCII characters use 1 byte, and other characters use 2-4 bytes. It's the default encoding on macOS and Linux.
EUC-KR is a legacy Korean encoding, and CP949 (also called Windows-949) is Microsoft's superset of it that adds the Hangul syllables EUC-KR lacks. CP949 has long been the default encoding on Korean Windows. Korean characters use 2 bytes, and ASCII characters use 1 byte. When Windows Notepad saves as "ANSI" on a Korean system, it actually saves as CP949.
UTF-16 is the encoding used internally by Windows. All characters are represented in 2 or 4 bytes. Depending on byte order, it's divided into UTF-16LE (Little Endian) and UTF-16BE (Big Endian).
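A short Python sketch of the byte-order difference, using the standard "utf-16-le" and "utf-16-be" codecs. The sample characters are arbitrary.

```python
text = "A가"  # U+0041, U+AC00

le = text.encode("utf-16-le")
be = text.encode("utf-16-be")
print(le.hex(" "))  # 41 00 00 ac
print(be.hex(" "))  # 00 41 ac 00

# A character outside the Basic Multilingual Plane needs a
# surrogate pair, which is why some characters take 4 bytes.
emoji = "\U0001F600"
emoji_le = emoji.encode("utf-16-le")
print(len(emoji_le))  # 4
```

The two outputs contain the same code units with the bytes of each unit swapped, which is why a reader needs a BOM or outside knowledge to pick the right byte order.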
ISO-8859-1 (Latin-1) is an encoding for Western European languages. It maps each of the 256 possible byte values to exactly one character. It cannot represent Korean, but because every byte sequence decodes without error, it is sometimes used as a fallback when reading binary data as text.
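The fallback property of ISO-8859-1 can be checked directly. This is a minimal Python sketch using the built-in "latin-1" codec; the variable names are illustrative.

```python
# Every possible byte value decodes, and the round trip is lossless.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")  # never raises
round_trip_ok = decoded.encode("latin-1") == all_bytes
print(len(decoded), round_trip_ok)  # 256 True

# Korean text, however, cannot be encoded as ISO-8859-1.
try:
    "한글".encode("latin-1")
    korean_fits = True
except UnicodeEncodeError:
    korean_fits = False
print(korean_fits)  # False
```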
DiffMate's Automatic Encoding Detection
DiffMate automatically detects encoding in the following order when opening files:
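Step 1 - BOM Check: Checks whether a BOM (Byte Order Mark) exists at the beginning of the file.
UTF-8 BOM: 0xEF 0xBB 0xBF
UTF-16LE BOM: 0xFF 0xFE
UTF-16BE BOM: 0xFE 0xFF
Step 2 - Try UTF-8: If there is no BOM, it first attempts strict UTF-8 decoding. UTF-8 has tight rules about which byte sequences are valid, so text in another encoding almost always fails to decode. Success therefore means the file is very likely UTF-8 (pure ASCII files also pass, since ASCII is a subset of UTF-8).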
Step 3 - Try EUC-KR: If UTF-8 fails, it attempts EUC-KR (CP949) decoding. Most Korean files succeed at this step.
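Step 4 - Fallback: If all of the above fail, it tries ISO-8859-1, UTF-16LE, and UTF-16BE in that order. Because ISO-8859-1 maps every byte value to a character, decoding with it does not fail, so files that reach this step are normally read as ISO-8859-1.
This process is a browser port of the Python backend's encoding detection logic. It uses the TextDecoder API's fatal option, which makes decoding throw an error on invalid byte sequences instead of silently inserting replacement characters, so incorrect encodings are rejected.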
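The four steps above can be sketched in Python roughly as follows. This is not DiffMate's actual code: the function name detect_and_decode, the choice of the "cp949" codec for the EUC-KR step, and the error raised at the end are assumptions made for illustration.

```python
import codecs

# Step 1 candidates: (BOM bytes, codec for the rest of the file).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]

# Steps 2-4 in the order the guide lists them.
# Note: ISO-8859-1 decodes any byte string, so the entries
# after it are only reached if it is removed from the list.
CANDIDATES = ["utf-8", "cp949", "iso-8859-1", "utf-16-le", "utf-16-be"]


def detect_and_decode(data: bytes) -> tuple[str, str]:
    """Return (encoding_name, text) for the first strategy that works."""
    # Step 1: a BOM, if present, decides the encoding.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):].decode(name)

    # Steps 2-4: strict decoding (the default), first success wins.
    for name in CANDIDATES:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue

    raise ValueError("no candidate encoding could decode the data")


print(detect_and_decode("한글".encode("cp949"))[0])  # cp949
print(detect_and_decode(b"\xef\xbb\xbfhi"))          # ('utf-8', 'hi')
```

Strict decoding plays the role that the fatal option plays for TextDecoder: a wrong guess raises an error instead of producing replacement characters, which is what lets the loop move on to the next candidate.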
How to Fix Encoding Issues
Methods to try when characters appear garbled:
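Convert encoding with Notepad: Open the file in Notepad, choose "Save As," select "UTF-8" as the encoding, and save. This resolves many encoding issues, provided Notepad displays the text correctly when it opens the file. If the text is already garbled in Notepad, saving it will not repair it.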
Check encoding with VS Code: When you open a file in VS Code, the current encoding is shown in the bottom status bar. Click it to reopen with a different encoding or save with a different encoding.
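Use iconv on Linux/Mac: Convert the encoding in the terminal with this command:
iconv -f euc-kr -t utf-8 input.txt > output.txt
If the file contains characters outside EUC-KR (common for files saved on Korean Windows), use -f cp949 instead.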
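Where iconv is not available, the same conversion can be done with a few lines of Python. This is a minimal sketch; the function name convert_file and its parameter names are hypothetical, and the defaults simply mirror the iconv command above.

```python
from pathlib import Path


def convert_file(src, dst, source_encoding="euc-kr", target_encoding="utf-8"):
    """Re-encode src into dst, failing loudly if source_encoding is wrong."""
    raw = Path(src).read_bytes()
    # Strict decoding: raises UnicodeDecodeError rather than
    # writing corrupted text when the source encoding is wrong.
    text = raw.decode(source_encoding)
    # Working on bytes keeps line endings exactly as they were.
    Path(dst).write_bytes(text.encode(target_encoding))
```

A call such as convert_file("input.txt", "output.txt") corresponds to the iconv command, and passing source_encoding="cp949" covers files saved on Korean Windows.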
Excel CSV encoding issues: When saving CSV in Excel, select "CSV UTF-8 (Comma delimited)" to save as UTF-8. Regular "CSV" saves in the system's default encoding (EUC-KR on Korean Windows).
Encoding FAQ
Q: How can I tell what encoding a file uses? A: Most text editors (VS Code, Notepad++, etc.) display the encoding at the bottom when a file is opened. Alternatively, use the "file -i filename" command on Linux or "file -I filename" on macOS. Keep in mind that these tools make an educated guess from the file's bytes.
Q: What's the difference between UTF-8 and UTF-8 BOM? A: UTF-8 BOM adds 3 bytes (EF BB BF) at the file's beginning. These bytes indicate "this file is UTF-8." Both formats work in most programs, but some programs don't recognize the BOM and treat it as content, showing stray characters at the start of the file or breaking the first line.
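The BOM behavior described in this answer can be reproduced with Python's "utf-8" and "utf-8-sig" codecs. A minimal sketch, with an arbitrary sample string:

```python
import codecs

content = "id,name\n"
plain = content.encode("utf-8")         # no BOM
with_bom = content.encode("utf-8-sig")  # EF BB BF + the same bytes

print(with_bom == codecs.BOM_UTF8 + plain)  # True
print(len(with_bom) - len(plain))           # 3

# A reader that ignores the BOM keeps it as an invisible
# U+FEFF character glued to the first field.
naive = with_bom.decode("utf-8")
print(naive.startswith("\ufeff"))  # True

# "utf-8-sig" strips the BOM when present and tolerates its absence.
print(with_bom.decode("utf-8-sig") == content)  # True
print(plain.decode("utf-8-sig") == content)     # True
```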
Q: Why are file sizes different for the same content? A: Different encodings use different byte counts for the same character. Each Hangul syllable takes 3 bytes in UTF-8 but 2 bytes in EUC-KR. So files with lots of Korean are smaller in EUC-KR, while English-heavy files are nearly identical in size because both encodings store ASCII characters in 1 byte.
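The size difference is easy to measure. In this Python sketch the helper name encoded_sizes is hypothetical, and "cp949" stands in for EUC-KR.

```python
def encoded_sizes(text, encodings=("utf-8", "cp949", "utf-16-le")):
    """Return the number of bytes text occupies in each encoding."""
    return {name: len(text.encode(name)) for name in encodings}


print(encoded_sizes("안녕하세요"))
# {'utf-8': 15, 'cp949': 10, 'utf-16-le': 10}

print(encoded_sizes("Hello"))
# {'utf-8': 5, 'cp949': 5, 'utf-16-le': 10}
```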
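Q: Can data be lost from incorrect encoding conversion? A: Yes. For example, if you open an EUC-KR file as UTF-8 and save it (without proper conversion), the bytes that are invalid in UTF-8 are replaced with the replacement character (U+FFFD), and the original Korean text cannot be recovered from the saved file. Always specify the correct source encoding before converting, and keep a copy of the original file.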
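A small Python sketch of this failure, contrasting a lenient decode with the wrong encoding against a proper conversion. The variable names are illustrative, and errors="replace" imitates an editor that substitutes U+FFFD instead of reporting an error.

```python
original = "한글".encode("cp949")  # bytes as stored in the legacy file

# Wrong: decode as UTF-8 leniently, then save.
wrong_text = original.decode("utf-8", errors="replace")
wrong_saved = wrong_text.encode("utf-8")
print("\ufffd" in wrong_text)  # True: bytes were replaced
# The saved bytes no longer contain the original byte sequence,
# so there is nothing left to convert back from.
print(original in wrong_saved)  # False

# Right: decode with the real source encoding, then re-encode.
right_saved = original.decode("cp949").encode("utf-8")
print(right_saved.decode("utf-8"))  # 한글
```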