Text diffing is the process of comparing two pieces of text and identifying the differences between them. A "diff" (short for "difference") is the output of this comparison -- a structured representation of what has changed from one version to another. The concept is fundamental to software development, version control, document management, and any workflow where tracking changes matters.
The idea of automated text comparison dates back to the early 1970s. The original diff utility was written by Douglas McIlroy at Bell Labs and shipped with the 5th Edition of Unix in 1974. Its algorithm, developed with James W. Hunt and described in their 1976 paper "An Algorithm for Differential File Comparison," computes a longest common subsequence (LCS) of the lines of the two files; it is closely related to the algorithm James W. Hunt and Thomas G. Szymanski published in 1977. This foundational work laid the groundwork for modern diff tools.
At its core, a diff tool answers a simple question: "What changed between version A and version B?" The answer is expressed as a series of edit operations -- typically insertions, deletions, and modifications -- that would transform the original text into the modified text. This sequence of operations is called an "edit script."
Today, diffing is everywhere. Every time you review a pull request on GitHub, examine a commit log, compare configuration files before a deployment, or track revisions in a document, you are using diff technology. Understanding how diffing works -- and how to use diff tools effectively -- is an essential skill for developers, system administrators, technical writers, and anyone who works with text.
The applications extend beyond code. Legal professionals compare contract revisions. Translators compare source and translated documents. Data engineers compare schema definitions. Security analysts compare system configurations to detect unauthorized changes. The underlying technology is the same: algorithmic text comparison.
Understanding how diff algorithms work gives you insight into why certain outputs look the way they do, and helps you choose the right tool and settings for your use case. The problem of computing the "best" diff between two texts is well-studied in computer science, with several algorithms offering different tradeoffs between speed, memory usage, and output quality.
Most diff algorithms are based on the Longest Common Subsequence (LCS) problem. Given two sequences, the LCS is the longest sequence of elements that appear in both, in the same order, but not necessarily contiguously. For text comparison, each "element" is typically a line of text.
For example, given two files:
File A:    File B:
line 1     line 1
line 2     line 3
line 3     line 4
line 4     line 5
The LCS is ["line 1", "line 3", "line 4"]. The diff output shows that "line 2" was deleted from File A, and "line 5" was added in File B. Everything in the LCS is "unchanged" -- it serves as the anchor points around which insertions and deletions are identified.
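The example above can be reproduced with a small dynamic-programming sketch. This is an illustrative, quadratic-time version rather than what production diff tools use; the function names `lcs` and `line_diff` are hypothetical.

```python
def lcs(a, b):
    """Return one longest common subsequence of the sequences a and b."""
    # table[i][j] holds the LCS length of a[i:] and b[j:].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the top-left corner to recover the subsequence.
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


def line_diff(a, b):
    """Return (tag, line) pairs: ' ' unchanged, '-' deleted, '+' added."""
    out, i, j = [], 0, 0
    for line in lcs(a, b):
        # Everything skipped before the next anchor is a deletion or insertion.
        while a[i] != line:
            out.append(("-", a[i]))
            i += 1
        while b[j] != line:
            out.append(("+", b[j]))
            j += 1
        out.append((" ", line))
        i += 1
        j += 1
    out.extend(("-", line) for line in a[i:])
    out.extend(("+", line) for line in b[j:])
    return out


file_a = ["line 1", "line 2", "line 3", "line 4"]
file_b = ["line 1", "line 3", "line 4", "line 5"]
for tag, line in line_diff(file_a, file_b):
    print(tag, line)
```

The LCS lines come out tagged as unchanged, "line 2" as a deletion, and "line 5" as an addition, matching the description above.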
The most widely used diff algorithm in practice is Eugene W. Myers' algorithm, published in 1986 in the paper "An O(ND) Difference Algorithm and Its Variations." Git, GNU diff, and most modern diff tools use Myers' algorithm or a variation of it.
Myers' algorithm models the diff problem as a graph search. Imagine a 2D grid where the x-axis represents lines of File A and the y-axis represents lines of File B. Moving right means deleting a line from A; moving down means inserting a line from B. When lines match (A[x] == B[y]), you can move diagonally -- which is "free" because matching lines require no edit operations.
The algorithm finds the path from the top-left corner to the bottom-right corner that uses the fewest horizontal and vertical moves (i.e., the fewest insertions and deletions). This produces the shortest edit script -- the minimum set of changes needed to transform A into B.
The time complexity is O(ND), where N is the sum of the lengths of both files and D is the size of the minimum edit script (the number of differences). This means the algorithm is fast when the files are similar (small D) and slower when the files are very different (large D). For most practical use cases -- comparing two versions of a file that differ in a few places -- the algorithm is extremely efficient.
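The greedy search at the heart of Myers' algorithm can be sketched in a few lines. The version below only computes D, the length of the shortest edit script, and does not reconstruct the script itself; it is a simplified illustration, and the function name `shortest_edit_distance` is hypothetical.

```python
def shortest_edit_distance(a, b):
    """Return D, the minimum number of insertions plus deletions
    needed to turn sequence a into sequence b."""
    n, m = len(a), len(b)
    # furthest[k] is the largest x reached so far on diagonal k, where k = x - y.
    furthest = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and furthest[k - 1] < furthest[k + 1]):
                x = furthest[k + 1]        # step down: insert a line from b
            else:
                x = furthest[k - 1] + 1    # step right: delete a line from a
            y = x - k
            # Follow the diagonal for free while lines match.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            furthest[k] = x
            if x >= n and y >= m:
                return d
    return n + m  # not reached: d = n + m always suffices


print(shortest_edit_distance(
    ["line 1", "line 2", "line 3", "line 4"],
    ["line 1", "line 3", "line 4", "line 5"],
))
```

The outer loop runs once per candidate value of D and stops at the first one that reaches the bottom-right corner, which is why similar files finish after very few iterations.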
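Patience diff is a variation that often produces more human-readable output. It works by first finding "unique" matching lines -- lines that appear exactly once in each file. From these it keeps the longest set that occurs in the same order in both files (found with a patience-sorting pass, which gives the algorithm its name). These lines serve as anchors that split the problem into smaller sub-problems. The algorithm then recursively diffs the text between the anchors, falling back to a standard algorithm where no unique lines remain.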
The advantage of patience diff is that it tends to align changes along structural boundaries. For example, when comparing code where a function was added between two existing functions, patience diff will correctly show the new function as an insertion, rather than producing a confusing interleaving of old and new code. Git supports patience diff via git diff --patience.
Histogram diff is an extension of patience diff developed for the JGit project (the Java implementation of Git). It uses a histogram of line occurrences to find low-occurrence matching lines as anchors. This approach handles cases where no truly unique lines exist (which would cause patience diff to fall back to Myers' algorithm). Histogram diff is available in Git via git diff --histogram and is the default diff algorithm in JGit.
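The anchor-finding step of patience diff can be sketched as follows. This shows only how anchors are chosen (unique common lines, then a longest increasing subsequence of their positions); the recursion between anchors is omitted, and the function name `patience_anchors` is hypothetical.

```python
from bisect import bisect_left
from collections import Counter


def patience_anchors(a, b):
    """Return (index_in_a, index_in_b) pairs for lines that are unique in
    both files and appear in the same relative order in both."""
    count_a, count_b = Counter(a), Counter(b)
    pos_in_b = {line: j for j, line in enumerate(b) if count_b[line] == 1}
    # Candidate pairs, ordered by their position in a.
    pairs = [(i, pos_in_b[line]) for i, line in enumerate(a)
             if count_a[line] == 1 and line in pos_in_b]

    # Longest increasing subsequence of the b positions (patience sorting).
    tops = []        # tops[p]: smallest b position ending a run of length p + 1
    tops_idx = []    # index into pairs of the entry stored in tops[p]
    prev = [None] * len(pairs)
    for idx, (_, j) in enumerate(pairs):
        p = bisect_left(tops, j)
        prev[idx] = tops_idx[p - 1] if p > 0 else None
        if p == len(tops):
            tops.append(j)
            tops_idx.append(idx)
        else:
            tops[p] = j
            tops_idx[p] = idx

    # Follow the back-pointers from the end of the longest run.
    anchors = []
    idx = tops_idx[-1] if tops_idx else None
    while idx is not None:
        anchors.append(pairs[idx])
        idx = prev[idx]
    return anchors[::-1]


old = ["A", "x", "B", "x", "C"]
new = ["A", "x", "N", "x", "B", "x", "C"]
print(patience_anchors(old, new))
```

Here the repeated line "x" is ignored as a candidate, so the anchors fall on "A", "B", and "C", and the new line "N" ends up in the gap between two anchors.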
Standard diff algorithms compare text line by line. But when you need finer granularity -- for example, to see exactly which word changed within a line -- word-level or character-level diffing is used. These approaches apply the same LCS algorithms but use words or characters as the comparison units instead of lines.
Word-level diffing is particularly useful for prose and documentation, where a single changed word in a long paragraph would otherwise show as the entire line being modified. Many modern diff tools and code review platforms combine line-level diffing (to identify changed lines) with word-level or character-level highlighting within those lines (to pinpoint exactly what changed).
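As a rough illustration of changing the comparison unit, the sketch below splits two strings into words and compares the word lists. It relies on Python's standard `difflib.SequenceMatcher`, which uses its own matching heuristic rather than an LCS or Myers implementation, so treat it as a stand-in; the `[-...-]` and `{+...+}` markers imitate the plain output of `git diff --word-diff`, and the function name `word_diff` is hypothetical.

```python
import difflib


def word_diff(old, new):
    """Return a single string with removed words wrapped in [-...-]
    and added words wrapped in {+...+}."""
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
            continue
        if i1 < i2:  # words present only in the old text
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if j1 < j2:  # words present only in the new text
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)


print(word_diff("the quick brown fox", "the quick red fox"))
```

Splitting on whitespace discards the original spacing; a tool that needs to preserve it would tokenize differently.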
Diff tools can present their output in several formats, each suited to different use cases. Understanding these formats helps you read diffs efficiently and choose the right presentation for your needs.
Unified diff is the most common format in modern development. It shows changes in a single stream with context lines for orientation. Each chunk (or "hunk") starts with a header showing the line numbers in both files:
--- a/config.yaml
+++ b/config.yaml
@@ -10,7 +10,8 @@
 database:
   host: localhost
   port: 5432
-  name: mydb_dev
+  name: mydb_prod
+  ssl: true
   pool_size: 10
 cache:
   enabled: true
Lines prefixed with - exist only in the original file (deletions). Lines prefixed with + exist only in the modified file (additions). Lines with neither marker are context (in the raw format they begin with a single space) -- they appear in both files and help you locate the change. The @@ header indicates that this chunk starts at line 10 in the original file and spans 7 lines there, and starts at line 10 in the modified file and spans 8 lines.
Unified diff is the default output of git diff and is the standard format for patch files. It is compact, readable, and unambiguous.
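Reading hunk headers programmatically is straightforward. The sketch below parses the `@@ -start,count +start,count @@` line with a regular expression; when a count is omitted, the unified format treats it as 1. The names `HUNK_RE` and `parse_hunk_header` are hypothetical.

```python
import re

# Matches "@@ -10,7 +10,8 @@", optionally followed by trailing text.
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line):
    """Return (old_start, old_count, new_start, new_count) for a hunk header."""
    match = HUNK_RE.match(line)
    if match is None:
        raise ValueError(f"not a hunk header: {line!r}")
    old_start, old_count, new_start, new_count = match.groups()
    return (
        int(old_start),
        int(old_count) if old_count is not None else 1,
        int(new_start),
        int(new_count) if new_count is not None else 1,
    )


print(parse_hunk_header("@@ -10,7 +10,8 @@"))
```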
Side-by-side diff displays the original and modified text in two parallel columns. Added lines appear only in the right column, deleted lines appear only in the left column, and modified lines appear in both columns with differences highlighted. This format is highly visual and makes it easy to see changes at a glance.
Side-by-side diff is the preferred format in graphical diff tools, code review platforms, and online diff checkers. While it requires more horizontal space than unified diff, it provides a more intuitive view of how the text has changed.
Inline diff (also called "rendered diff" or "rich diff") shows the original text with additions highlighted in green and deletions highlighted in red, often with strikethrough styling on deleted text. This format is commonly used in document comparison tools, CMS revision history, and word-processor "track changes" features.
Context diff is an older format that shows changed lines with surrounding context. It uses *** to mark the original file section and --- for the modified file section, with ! to indicate changed lines. While largely superseded by unified diff, you may encounter context diffs in legacy systems and older patch files.
The original Unix diff output format uses a terse notation like 3c3 (line 3 changed), 5a6,7 (lines 6-7 added after line 5), or 8,10d7 (lines 8-10 deleted). This format is the most compact but also the hardest for humans to read. It is rarely used directly today but remains the default output format of the traditional diff command without flags.
Code review is perhaps the most important application of diff tools. Every pull request, merge request, or code review session revolves around examining the diff -- the set of changes a developer is proposing to merge into the codebase. Reading diffs effectively is a core skill for software engineers.
When you open a pull request on GitHub, GitLab, or Bitbucket, the platform shows you a diff of all changes in the branch compared to the base branch. This diff is computed by running the equivalent of git diff base-branch...feature-branch. The output shows every file that was added, modified, or deleted, with changes highlighted at the line level and often at the word level within changed lines.
Effective code review requires more than just reading the diff top to bottom.
Large diffs (hundreds or thousands of lines changed) are notoriously difficult to review. Research has shown that review quality drops significantly as diff size increases.
Modern code review platforms allow you to leave comments on specific lines of a diff. This is a powerful feature for targeted feedback.
While diff tools originated in software development, document comparison is an equally important use case. Any workflow involving iterative editing of text documents benefits from the ability to see exactly what changed between versions.
In legal work, precision matters. When a contract goes through multiple rounds of negotiation, each party needs to see exactly what the other side changed. A single word change -- from "may" to "shall," or "reasonable" to "best" -- can have significant legal implications. Diff tools designed for legal documents highlight every character-level change, ensuring nothing slips through unnoticed.
Unlike code diffs, legal document diffs typically need to handle rich text formatting, paragraph reflows, and non-structural whitespace changes. Specialized legal comparison tools (like document comparison features in Microsoft Word) account for these differences, but even a plain-text diff checker can be invaluable for comparing Markdown or plain-text drafts.
Technical writers frequently compare documentation versions to review edits, track contributions, and ensure consistency. When documentation is stored as plain text (Markdown, reStructuredText, AsciiDoc), standard diff tools work perfectly. The key challenge is distinguishing meaningful content changes from formatting or structural changes like line rewrapping.
For documentation review, word-level diffing is particularly valuable. A line-level diff might show an entire paragraph as changed when only one sentence was modified. Word-level highlighting within changed lines reveals the actual edit.
Content management systems often include built-in revision comparison. Editors can view the diff between any two revisions of a page or article, seeing additions in green and deletions in red.
Comparing structured data like CSV files or database exports requires special consideration. A standard line-level diff can identify changed rows, but it does not understand column structure. Specialized CSV diff tools can show changes at the cell level, identify added or removed columns, and handle row reordering. For quick ad-hoc comparisons, however, a text diff checker often suffices -- sort the data first if row order is not significant.
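A cell-level comparison can be sketched with the standard `csv` module. The example assumes both files have a header row and a column that uniquely identifies each row (here called `id`, an assumption for illustration); rows are matched by that key, so row order does not matter. The function name `csv_cell_diff` is hypothetical.

```python
import csv
import io


def csv_cell_diff(old_text, new_text, key):
    """Compare two CSV documents row by row, matching rows on the key column."""
    def load(text):
        return {row[key]: row for row in csv.DictReader(io.StringIO(text))}

    old, new = load(old_text), load(new_text)
    changes = []
    for k in sorted(old.keys() | new.keys()):
        if k not in new:
            changes.append(("row removed", k))
        elif k not in old:
            changes.append(("row added", k))
        else:
            # Report each differing cell in columns both versions share.
            for column, value in old[k].items():
                if column in new[k] and new[k][column] != value:
                    changes.append(("cell changed", k, column, value, new[k][column]))
    return changes


old_csv = "id,name,city\n1,Alice,NYC\n2,Bob,LA\n"
new_csv = "id,name,city\n2,Bob,SF\n1,Alice,NYC\n3,Cara,Austin\n"
for change in csv_cell_diff(old_csv, new_csv, key="id"):
    print(change)
```

A line-level diff of the same two inputs would flag the reordered rows as changes; matching on a key avoids that.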
Configuration files are critical infrastructure. A single incorrect value in a configuration file can cause an outage, a security vulnerability, or data loss. Diffing is an essential practice for managing configuration changes safely.
In modern DevOps workflows, infrastructure is defined in configuration files -- Terraform HCL, Kubernetes YAML, Ansible playbooks, CloudFormation templates, Dockerfiles. Every change to these files is reviewed as a diff before being applied. The terraform plan command, for example, produces a diff-like output showing what infrastructure changes will be made.
Configuration diffs require extra scrutiny because the blast radius can be enormous. A small change to a load balancer configuration might affect millions of users. A security group change might expose internal services to the internet.
Comparing configuration between environments (development, staging, production) is a common use case. When something works in staging but fails in production, diffing the configuration files between environments can quickly reveal the discrepancy -- a different database URL, a missing feature flag, a different timeout value.
# Compare production and staging Kubernetes configs
diff production/deployment.yaml staging/deployment.yaml
# Compare environment variable files
diff .env.production .env.staging
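For environment files, comparing parsed keys can be easier to read than a raw line diff, and it avoids printing secret values. The sketch below assumes simple `KEY=VALUE` lines with `#` comments and no quoting, `export` prefixes, or multi-line values; the function names are hypothetical.

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def env_diff(a_text, b_text):
    """Report which keys are missing or differ, without exposing the values."""
    a, b = parse_env(a_text), parse_env(b_text)
    return {
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        "different": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }


production = "# production\nDB_URL=postgres://prod\nTIMEOUT=30\nFEATURE_X=on\n"
staging = "DB_URL=postgres://staging\nTIMEOUT=30\nDEBUG=1\n"
print(env_diff(production, staging))
```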
Many compliance frameworks require tracking and reviewing all configuration changes. Diffing provides an auditable record of exactly what changed, when, and by whom. Storing configuration in version control (Git) provides a complete diff history that satisfies audit requirements.
Security teams use configuration diffing to detect unauthorized changes. By periodically comparing the current state of configuration files against a known-good baseline, drift can be detected and investigated. Tools like OSSEC, Tripwire, and AWS Config automate this process, but manual diffing remains a valuable skill for incident investigation.
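The baseline comparison described above can be approximated with file hashes. This is a minimal sketch of the idea, not how OSSEC, Tripwire, or AWS Config work internally; it records a SHA-256 digest per file and reports what was added, removed, or modified since the baseline. The function names `snapshot` and `drift` are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Map each file path under root (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def drift(baseline, current):
    """Compare two snapshots and list the paths that differ."""
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "modified": sorted(
            path for path in baseline.keys() & current.keys()
            if baseline[path] != current[path]
        ),
    }
```

In practice the baseline snapshot would be stored somewhere the monitored host cannot modify; once a file shows up as modified, an ordinary text diff against the known-good copy shows what changed.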
Database migration tools generate diffs between the current schema and the desired schema. Tools like Alembic (Python/SQLAlchemy), Flyway, and Liquibase produce migration scripts that are essentially diffs -- they describe the sequence of ALTER TABLE, CREATE INDEX, and other DDL operations needed to transform the current schema into the target schema.
Reviewing schema migration diffs is critical because database changes are often irreversible (or expensive to reverse). Dropping a column, changing a data type, or modifying an index can have significant performance and data integrity implications.
Git is the most widely used version control system, and git diff is one of the most frequently used Git commands. Understanding its options and output is essential for any developer working with Git.
# Show unstaged changes (working directory vs staging area)
git diff
# Show staged changes (staging area vs last commit)
git diff --staged
# Compare two branches
git diff main..feature-branch
# Compare two specific commits
git diff abc123 def456
# Show changes in a specific file
git diff -- path/to/file.js
# Show changes introduced by a specific commit
git show abc123
Git diff supports numerous flags that control the output format and comparison behavior:
| Flag | Description |
|---|---|
| --stat | Show a summary of files changed with insertion/deletion counts |
| --name-only | Show only the names of changed files |
| --name-status | Show names and status (Added, Modified, Deleted, Renamed) |
| -w | Ignore all whitespace differences |
| -b | Ignore changes in amount of whitespace |
| --ignore-blank-lines | Ignore changes that only add or remove blank lines |
| --word-diff | Show word-level differences inline |
| --color-words | Show word-level diffs with color highlighting |
| -U<n> | Show <n> lines of context (default is 3) |
| --patience | Use the patience diff algorithm |
| --histogram | Use the histogram diff algorithm |
| --no-index | Compare two files outside a Git repository |
A Git diff output starts with several header lines before the actual changes:
diff --git a/src/utils.js b/src/utils.js
index 3a4b5c6..7d8e9f0 100644
--- a/src/utils.js
+++ b/src/utils.js
@@ -42,7 +42,9 @@ function processData(input) {
The first line identifies the compared files. The index line shows the abbreviated object hashes and file mode. The --- and +++ lines label the original and modified versions. The @@ line (called the "hunk header") shows line numbers and, when available, the enclosing function name -- a feature called "funcname" that helps orient you in the code.
Git can detect when a file was renamed (and optionally modified). Use git diff -M to enable rename detection, or git diff -M90% to set the similarity threshold (90% means files must be at least 90% similar to be considered a rename). When a rename is detected, the diff shows only the content changes, not the entire file as deleted-and-recreated.
# Detect renames with default 50% threshold
git diff -M
# Detect renames with 80% similarity threshold
git diff -M80%
# Also detect copies
git diff -C
Git diffs can be saved as patch files and applied to other repositories or branches. This is useful for sharing changes without direct repository access:
# Generate a patch file from the last commit
git format-patch -1 HEAD
# Generate a diff and save it
git diff > my-changes.patch
# Apply a patch
git apply my-changes.patch
# Apply a formatted patch (includes commit message and author)
git am 0001-fix-bug.patch
Standard diff tools are line-oriented -- they do not understand the structure of the content they are comparing. Semantic diff tools parse the content (as code, JSON, XML, YAML, etc.) and compare the structural representation rather than the raw text. This produces more meaningful diffs that ignore irrelevant formatting changes.
For example, consider two JSON files that differ only in key order and whitespace:
// Version A
{"name": "Alice", "age": 30, "city": "NYC"}

// Version B
{
  "age": 30,
  "city": "NYC",
  "name": "Alice"
}
A standard text diff would show every line as changed. A semantic JSON diff would report no differences, because the two objects are equivalent. Semantic diffing is available for many formats through specialized tools and editor plugins.
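For JSON, a basic semantic comparison needs only the standard `json` module: parse both documents and compare the resulting values, so key order and whitespace stop mattering. The sketch below also walks nested objects to report where values differ. Arrays are compared as whole values here, which is a simplification, and the function name `json_changes` is hypothetical.

```python
import json


def json_changes(old, new, path=""):
    """Return (kind, path, old_value, new_value) tuples for two parsed JSON values."""
    if isinstance(old, dict) and isinstance(new, dict):
        changes = []
        for key in sorted(old.keys() | new.keys()):
            child = f"{path}.{key}" if path else key
            if key not in new:
                changes.append(("removed", child, old[key], None))
            elif key not in old:
                changes.append(("added", child, None, new[key]))
            else:
                changes.extend(json_changes(old[key], new[key], child))
        return changes
    # Scalars and arrays are compared as whole values.
    return [] if old == new else [("changed", path, old, new)]


version_a = '{"name": "Alice", "age": 30, "city": "NYC"}'
version_b = '{\n  "age": 30,\n  "city": "NYC",\n  "name": "Alice"\n}'
print(json_changes(json.loads(version_a), json.loads(version_b)))
```

For the two versions shown earlier the result is an empty list, since the parsed objects are equal.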
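Three-way merging is an extension of diffing that compares three versions of a file: the common ancestor (base), and two modified versions (ours and theirs). The merge algorithm diffs both modified versions against the base, then combines the changes: a region changed on only one side takes that side's version, a region changed identically on both sides is taken once, and a region changed differently on both sides is reported as a conflict for a human to resolve.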
Three-way merging is the foundation of Git's merge and rebase operations. Tools like vimdiff, VS Code's merge editor, and kdiff3 provide visual three-way merge interfaces.
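The decision rule behind a three-way merge can be shown with a deliberately simplified sketch. It assumes all three versions have the same number of lines (edits change lines in place, nothing is inserted or removed), which sidesteps the alignment step a real merge performs by diffing each side against the base. The function name `merge3_aligned` is hypothetical, and the conflict markers imitate Git's style.

```python
def merge3_aligned(base, ours, theirs):
    """Merge three equal-length line lists; return (merged_lines, conflict_indexes)."""
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles in-place line edits")
    merged, conflicts = [], []
    for index, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t:        # both sides agree (changed identically or unchanged)
            merged.append(o)
        elif o == b:      # only theirs changed this line
            merged.append(t)
        elif t == b:      # only ours changed this line
            merged.append(o)
        else:             # both changed it differently: conflict
            conflicts.append(index)
            merged.extend(["<<<<<<< ours", o, "=======", t, ">>>>>>> theirs"])
    return merged, conflicts


base = ["host: localhost", "port: 5432", "pool_size: 10"]
ours = ["host: db.internal", "port: 5432", "pool_size: 20"]
theirs = ["host: localhost", "port: 6432", "pool_size: 50"]
merged, conflicts = merge3_aligned(base, ours, theirs)
print("\n".join(merged))
```

The first two lines merge cleanly because only one side touched each; the third is flagged because both sides changed it to different values.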
Sometimes you need to compare entire directory trees, not just individual files. Directory diff tools recursively compare two directories, showing files that are added, deleted, modified, or identical.
# Compare two directories recursively
diff -rq dir1/ dir2/
# Using git diff to compare directories
git diff --no-index dir1/ dir2/
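The same kind of report can be produced from a script with Python's standard `filecmp` module. This sketch walks both trees and collects relative paths; note that `filecmp.dircmp` treats files with the same type, size, and modification time as equal without reading them, so it is a quick check rather than a byte-for-byte guarantee. The function name `compare_dirs` is hypothetical.

```python
import filecmp


def compare_dirs(left, right):
    """Recursively compare two directories and return relative paths by category."""
    report = {"only_left": [], "only_right": [], "different": []}

    def walk(cmp, prefix):
        report["only_left"] += [prefix + name for name in cmp.left_only]
        report["only_right"] += [prefix + name for name in cmp.right_only]
        report["different"] += [prefix + name for name in cmp.diff_files]
        # Descend into subdirectories that exist on both sides.
        for name, sub in cmp.subdirs.items():
            walk(sub, prefix + name + "/")

    walk(filecmp.dircmp(left, right), "")
    return {kind: sorted(paths) for kind, paths in report.items()}
```

Calling `compare_dirs("dir1", "dir2")` returns three sorted lists; an entry under `only_left` or `only_right` may be a file or a whole directory that is missing from the other side.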
Real-world diffs often contain noise -- changes that are technically different but not meaningful. Whitespace and blank-line changes, which the -w, -b, and --ignore-blank-lines flags exist to suppress, are typical examples.
Most diff tools support ignore patterns, regex-based exclusions, or custom comparison functions that let you filter out noise and focus on substantive changes.
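One way to filter noise is to normalize both inputs before diffing them. The sketch below unifies line endings, strips trailing whitespace, and masks ISO-8601 timestamps with a placeholder before handing the lines to the standard `difflib.unified_diff`. The timestamp pattern and the choice of what counts as noise are assumptions for illustration, and the names `normalize` and `quiet_diff` are hypothetical.

```python
import difflib
import re

# Assumed noise pattern: UTC timestamps such as 2024-01-01T00:00:00Z.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")


def normalize(text):
    """Unify line endings, drop trailing whitespace, and mask timestamps."""
    lines = text.replace("\r\n", "\n").split("\n")
    return [TIMESTAMP.sub("<timestamp>", line.rstrip()) for line in lines]


def quiet_diff(old, new):
    """Return unified diff lines computed on the normalized inputs."""
    return list(difflib.unified_diff(
        normalize(old), normalize(new), "old", "new", lineterm=""))


old = "a = 1  \r\nbuilt: 2024-01-01T00:00:00Z\r\n"
new = "a = 1\nbuilt: 2024-06-30T12:00:00Z\n"
print(quiet_diff(old, new))
```

Because the two inputs differ only in line endings, trailing spaces, and the timestamp, the normalized diff is empty; a real change such as `a = 2` would still show up.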
While standard diff tools operate on text, binary delta tools such as bsdiff and xdelta, and delta formats such as VCDIFF, can compute and encode efficient deltas between binary files. These are used in software update systems (to distribute patches rather than full files), backup systems (for incremental backups), and version control (Git uses a custom binary delta format for pack files).
For human consumption, binary diffs are typically presented through specialized viewers: image diff tools that overlay before/after images or highlight pixel differences, hex editors that show byte-level changes, and structured format viewers that understand specific file formats (PDF, Office documents, etc.).
Follow these guidelines to get the most out of diff tools and produce clean, reviewable diffs in your own work.
The single most impactful practice for clean diffs is keeping changes focused. Each commit or pull request should address one concern: a bug fix, a feature addition, a refactoring, or a formatting cleanup -- but not all of these mixed together. Mixed-purpose diffs are hard to review, hard to revert, and hard to understand in the commit history.
If you need to reformat code (changing indentation, wrapping lines, applying a linter) and also make logic changes, do them in separate commits. A formatting commit will touch many lines but make no functional change. A logic commit will touch few lines but require careful review. Combining them forces the reviewer to distinguish formatting noise from meaningful changes -- a tedious and error-prone process.
A diff shows what changed, but the commit message explains why. Good commit messages provide context that makes the diff easier to understand. When reviewing a diff months later, the commit message is often the only clue to the developer's intent.
Before pushing a commit or opening a pull request, review your own diff. Run git diff --staged to see exactly what you are about to commit.
Set up your diff tools to match your workflow. Configure your Git diff tool, merge tool, and editor integrations once, and they will save you time on every future diff operation:
# Set VS Code as your diff tool
git config --global diff.tool vscode
git config --global difftool.vscode.cmd 'code --wait --diff $LOCAL $REMOTE'
# Set VS Code as your merge tool
git config --global merge.tool vscode
git config --global mergetool.vscode.cmd 'code --wait $MERGED'
# Use patience diff by default
git config --global diff.algorithm patience
# Enable rename detection by default
git config --global diff.renames true
Git can show function names in hunk headers if it knows the file's language. Configure .gitattributes to enable language-specific diff drivers:
# .gitattributes
*.py diff=python
*.rb diff=ruby
*.java diff=java
*.go diff=golang
*.rs diff=rust
With these settings, hunk headers will show the enclosing function or class name, making it much easier to orient yourself in a diff without additional context.
When comparing files with many changes, use word-level diffing to see exactly what changed within each line. The git diff --word-diff or git diff --color-words commands are invaluable for reviewing changes to prose, configuration values, or any content where line-level diffs are too coarse.
Our free Diff Checker tool lets you compare two pieces of text instantly in your browser. No data is sent to any server -- all comparison is performed locally on your machine using client-side JavaScript.
Paste your original text on the left and the modified text on the right, then click "Compare." The tool highlights additions, deletions, and modifications with clear color coding. Matching lines are aligned so you can scan the diff visually.
Switch to unified view for a compact single-column display. Added lines are highlighted in green, deleted lines in red, and context lines appear without highlighting. This view mirrors the format used by git diff and patch files.
Within changed lines, the tool highlights the specific characters that differ. This makes it easy to spot the exact change when a line has been partially modified -- no more squinting to find the one changed character in a long line of code.