DiffMate

Supported File Formats Detailed Guide

A detailed guide to the file formats you can compare in DiffMate, with the characteristics and limitations of each.

TXT (Plain Text)

TXT files are the most basic text file format. They contain only pure text without formatting information and can be read by virtually all operating systems and programs.

DiffMate compares TXT files line by line and color-codes added, deleted, and modified lines. Within a modified line, it highlights the individual characters that changed.
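The two-level comparison described above (lines first, then characters inside modified lines) can be sketched with Python's standard difflib module. This is only an illustration of the idea, not DiffMate's actual implementation; the function names diff_lines and char_changes are hypothetical.

```python
import difflib


def diff_lines(old_text, new_text):
    """Classify line-level changes as 'modified', 'deleted', or 'added'."""
    old, new = old_text.splitlines(), new_text.splitlines()
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        old_block, new_block = old[i1:i2], new[j1:j2]
        paired = min(len(old_block), len(new_block))
        # Lines that can be paired up are treated as modified;
        # any leftovers are pure deletions or additions.
        changes += [("modified", o, n) for o, n in zip(old_block, new_block)]
        changes += [("deleted", o, None) for o in old_block[paired:]]
        changes += [("added", None, n) for n in new_block[paired:]]
    return changes


def char_changes(old_line, new_line):
    """Return the character-level edits inside one modified line."""
    matcher = difflib.SequenceMatcher(None, old_line, new_line, autojunk=False)
    return [
        (tag, old_line[i1:i2], new_line[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

A viewer would typically call diff_lines first and then run char_changes only on the pairs reported as modified, which keeps the character-level work proportional to the number of changed lines.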

DiffMate detects the file encoding automatically. It tries UTF-8, EUC-KR (CP949), ISO-8859-1, UTF-16LE, and UTF-16BE in turn to find the encoding that best decodes the file. Files with a BOM (Byte Order Mark) are also handled correctly.
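A minimal sketch of sequential encoding detection follows. It is an assumption-laden illustration rather than DiffMate's code: the function name decode_bytes is hypothetical, and the sketch checks for a BOM before trying anything else. Because strict ISO-8859-1 (latin-1) decoding accepts any byte sequence, UTF-16 files are only recognized through their BOM here; how DiffMate itself distinguishes BOM-less UTF-16 is not stated in this guide.

```python
import codecs

# BOM signatures are checked first (an assumption of this sketch).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Tried in order when no BOM is present.
FALLBACKS = ["utf-8", "cp949", "latin-1"]


def decode_bytes(data):
    """Return (text, encoding_name) for raw file bytes."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            # Strip the BOM so it does not appear as a stray character.
            return data[len(bom):].decode(encoding), encoding
    for encoding in FALLBACKS:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte, so this is only reached if FALLBACKS is changed.
    raise ValueError("no candidate encoding could decode the data")
```

Trying strict UTF-8 first is a common choice because byte sequences from other encodings rarely happen to be valid UTF-8, so a successful decode is a strong signal.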

Line ending style matters when comparing TXT files. If one file uses Windows line endings (CRLF) and the other uses macOS/Linux line endings (LF), lines with identical content may be reported as different. When possible, compare files created on the same OS.
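If recreating the files on one OS is not practical, one possible workaround is to normalize line endings before comparing. The helper below is a hypothetical preprocessing step you would run yourself, not a DiffMate feature.

```python
def normalize_newlines(text):
    """Convert CRLF (Windows) and lone CR line endings to LF."""
    # Replace CRLF first so the second pass does not double-convert it.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

After normalization, two files that differ only in line ending style produce identical text and therefore no reported differences.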

There is no fixed file size limit, but because processing relies on browser memory, files in the tens of MB may take longer to process. Typical text files (up to a few MB) are compared almost instantly.

CSV (Comma-Separated Values)

CSV stores data separated by commas and is widely used for database exports, spreadsheet sharing, and API data exchange.

DiffMate compares CSV files as text: each row is treated as a single line, and the files are compared line by line. It automatically recognizes several delimiters, including commas (,), tabs (\t), and semicolons (;).
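How a delimiter might be recognized can be sketched with a simple counting heuristic: a plausible delimiter appears the same number of times on every sampled line. This is a naive illustration under that assumption, not DiffMate's detection logic; guess_delimiter is a hypothetical name, and the heuristic ignores quoted fields.

```python
def guess_delimiter(text, candidates=(",", "\t", ";")):
    """Guess the delimiter from the first few non-empty lines."""
    lines = [line for line in text.splitlines() if line.strip()][:10]
    best, best_count = ",", 0  # fall back to a comma when nothing matches
    for delimiter in candidates:
        counts = [line.count(delimiter) for line in lines]
        # Accept a candidate only if it occurs equally often on every line,
        # and prefer the candidate that splits lines into the most fields.
        consistent = bool(counts) and counts[0] > 0 and len(set(counts)) == 1
        if consistent and counts[0] > best_count:
            best, best_count = delimiter, counts[0]
    return best
```

A heuristic like this can be fooled by data such as decimal commas inside semicolon-separated files, which is one reason to check the detected delimiter when results look wrong.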

The advantage of CSV comparison is that you can identify data differences quickly without spreadsheet software. Even across thousands of rows, changed cells, added rows, and deleted rows are color-coded.

Note that a cell containing a comma must be enclosed in double quotes to be parsed correctly. Also, when one CSV has a header row and the other does not, interpret the results with care.
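The quoting rule can be illustrated with Python's standard csv module. The example only shows why quotes matter for any CSV parser; parse_row is a hypothetical helper, not part of DiffMate.

```python
import csv


def parse_row(line, delimiter=","):
    """Split one CSV line into cells, honoring double-quoted fields."""
    return next(csv.reader([line], delimiter=delimiter))


quoted = '1,"Seoul, Korea",42'
unquoted = "1,Seoul, Korea,42"

cells_quoted = parse_row(quoted)      # the quoted comma stays inside one cell
cells_unquoted = parse_row(unquoted)  # the comma splits the city into two cells
```

With quotes, the row has three cells and the second is "Seoul, Korea". Without them, the same data is read as four cells, so every later column shifts by one.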

Regarding encoding, CSV files containing Korean are often saved in EUC-KR (especially when using "Save as CSV" in Windows Excel). DiffMate automatically detects and handles this.

XLSX (Microsoft Excel)

XLSX is Microsoft Excel's default file format, with a complex structure that can include multiple sheets, formulas, formatting, and charts.

DiffMate uses the SheetJS library to parse XLSX files. It converts the first sheet's data to text and compares line by line. Cell data in each row is separated by tabs for comparison.
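The conversion from sheet rows to comparable text can be sketched as follows. The sketch assumes the first sheet has already been parsed into a list of rows (the SheetJS parsing step itself is not shown), uses str() as a stand-in for the displayed cell value, and the name rows_to_lines is hypothetical.

```python
def rows_to_lines(rows):
    """Turn parsed sheet rows into tab-separated text lines."""
    lines = []
    for row in rows:
        # Empty cells become empty strings so column positions are preserved.
        cells = ["" if value is None else str(value) for value in row]
        lines.append("\t".join(cells))
    return lines
```

Each resulting line can then go through the same line-by-line comparison used for TXT files, which is why a change in a single cell shows up as one modified line.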

The comparison target is the displayed value in cells. For cells containing formulas, the calculated result is compared, not the formula itself. Cell formatting (fonts, colors, borders, etc.) is not compared.

Current limitations: Only the first sheet is compared. To compare a specific sheet in a multi-sheet file, save that sheet as a separate file first.

Non-text elements like charts, images, and pivot tables are not compared. Only pure cell data is compared.

Merged cells are unmerged for comparison. The visual layout may differ from the original.

XLS (legacy Excel format) is also supported since SheetJS can handle both formats.

PDF (Portable Document Format)

PDF preserves document layout when a file is shared and is most commonly used for contracts, reports, manuals, and other official documents.

DiffMate uses pdf.js (a PDF rendering library developed by Mozilla) to extract text from PDFs. It arranges text elements on each page by Y-coordinate to form lines, then compares them.
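Forming lines from positioned text can be sketched like this. The sketch assumes each text fragment has already been reduced to an (x, y, text) tuple in PDF coordinates, where y grows upward; the tolerance value and the name items_to_lines are assumptions made for illustration, not details taken from DiffMate.

```python
def items_to_lines(items, tolerance=2.0):
    """Group (x, y, text) fragments into lines by similar y-coordinate."""
    lines = []  # each entry: [y of the line, [(x, text), ...]]
    # Sort top to bottom: larger y is higher on the page in PDF coordinates.
    for x, y, text in sorted(items, key=lambda item: -item[1]):
        if lines and abs(lines[-1][0] - y) <= tolerance:
            lines[-1][1].append((x, text))  # same visual line
        else:
            lines.append([y, [(x, text)]])  # start a new line
    # Within a line, order fragments left to right by x.
    return [" ".join(text for _, text in sorted(frags)) for _, frags in lines]
```

Because grouping depends only on coordinates, fragments from different columns that share a y-coordinate end up on the same line, which is consistent with the multi-column caveat later in this section.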

Because of how PDF text extraction works, line breaks in the extracted text may differ from those in the original document. PDF was designed as a print-oriented format, so text flow information is not fully preserved.

Text cannot be extracted from scanned PDFs (PDFs composed of images). Only PDFs that have been processed with OCR (Optical Character Recognition) can be compared as text, so add a text layer first using Adobe Acrobat or another OCR tool.

Encrypted PDFs may have restricted text extraction. You'll need to remove the password or use a PDF that allows text copying before comparison.

For PDFs containing tables, cell separation may not be accurate. When tabular data needs precise comparison, converting the PDF to Excel or CSV before comparing is recommended.

PDFs with multi-column layouts may also yield text in an unexpected order. Single-column PDFs give the most accurate comparison results.

Format Comparison Table

Compare characteristics of each file format at a glance:

TXT: Most accurate comparison possible, automatic encoding detection, no file size limit
CSV: Accurate row-by-row comparison, automatic delimiter recognition, automatic encoding detection
XLSX: First sheet cell value comparison, formatting/chart comparison not possible, XLS also supported
PDF: Text-based comparison, layout differences possible, scanned PDFs not supported

Common features across all formats: All processing occurs in the browser, and files are never transmitted to external servers. Character-level highlighting shows exact differences, and comparison results can be saved as files.
