How to Quickly Compare the Contents of Two Files: An Approach Faster Than diff

Linux

2024-03-29 20:22:58

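How to Efficiently Compare the Contents of Two Files

Problem

On Unix/Linux systems you often need to check whether two files have the same contents. The diff command is commonly used for this, but it is built to compute line-by-line differences, which is more work than an equality check requires and can be slow on large files. This article describes a custom algorithm that answers only the question "are these two files identical?" and can therefore finish sooner than diff.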

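Solution

The custom algorithm compares the files in stages, from cheapest to most expensive:

File size comparison: If the two files differ in size, their contents must differ. This check needs only file metadata, not file data.
File hash comparison: Compute a hash of each file (for example MD5 or SHA256). If the hashes differ, the contents differ. Note that computing a hash reads the whole file.
Byte-by-byte comparison: If both the sizes and the hashes match, compare the contents byte by byte to rule out a hash collision.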
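The first stage can be sketched on its own. The helper below is a minimal illustration under stated assumptions, not part of the article's program: the name same_size is hypothetical, and it uses stat() on the paths so that no file has to be opened or read.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper: compares file sizes using metadata only.
 * Returns 1 if the sizes match, 0 if they differ, -1 if stat() fails. */
int same_size(const char *path1, const char *path2) {
    struct stat st1, st2;

    if (stat(path1, &st1) != 0 || stat(path2, &st2) != 0)
        return -1;  /* missing file, permission problem, etc. */

    return st1.st_size == st2.st_size ? 1 : 0;
}
```

A caller would move on to the hash and byte-by-byte stages only when same_size returns 1; a return value of 0 already proves the files differ.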

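Advantages

Compared with the diff command, the algorithm has the following advantages:

Speed: The size check rejects many mismatches without reading any file data. The hash stage still reads each file in full, so it saves time mainly when a digest is already cached or when the files live on different machines and only the digests need to be exchanged.
Memory efficiency: The files are processed in fixed-size buffers, so the whole file never has to be loaded into memory.
Accuracy: The final byte-by-byte pass confirms that files with matching size and hash are truly identical.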
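The memory-efficiency point can be illustrated with a small chunked comparison. This is a sketch, not the article's implementation: the function name files_equal and the 4096-byte buffer size are assumptions chosen for the example, and it uses standard C stdio rather than raw file descriptors.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: compares two files in fixed-size chunks.
 * Returns 1 if identical, 0 if different, -1 on open or read error.
 * Memory use is two 4096-byte buffers regardless of file size. */
int files_equal(const char *path1, const char *path2) {
    unsigned char buf1[4096], buf2[4096];
    FILE *f1 = fopen(path1, "rb");
    FILE *f2 = fopen(path2, "rb");
    int result = 1;

    if (f1 == NULL || f2 == NULL) {
        result = -1;
    } else {
        size_t n1, n2;
        do {
            n1 = fread(buf1, 1, sizeof(buf1), f1);
            n2 = fread(buf2, 1, sizeof(buf2), f2);
            if (ferror(f1) || ferror(f2)) {
                result = -1;
                break;
            }
            /* Stop at the first chunk that differs in length or content. */
            if (n1 != n2 || memcmp(buf1, buf2, n1) != 0) {
                result = 0;
                break;
            }
        } while (n1 == sizeof(buf1));  /* a short chunk means end of file */
    }

    if (f1 != NULL) fclose(f1);
    if (f2 != NULL) fclose(f2);
    return result;
}
```

Because the loop exits at the first mismatching chunk, two files that differ near the start are rejected after reading only a few kilobytes.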

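Implementation

A C implementation of the custom algorithm follows. It uses OpenSSL for MD5, so it must be linked with -lcrypto: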

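#include <stdio.h>
#include <stdlib.h>
#include <string.h>      /* memcmp */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <openssl/md5.h> /* MD5_* is deprecated since OpenSSL 3.0 but still available */

/* Read up to len bytes, retrying on short reads.
 * Returns the number of bytes read (less than len only at EOF) or -1 on error. */
static ssize_t read_full(int fd, unsigned char *buf, size_t len) {
    size_t total = 0;
    while (total < len) {
        ssize_t n = read(fd, buf + total, len - total);
        if (n < 0) return -1;
        if (n == 0) break;  /* end of file */
        total += (size_t)n;
    }
    return (ssize_t)total;
}

/* Hash the file contents in chunks, then rewind the descriptor so the
 * later byte-by-byte pass starts from the beginning. Returns 0 or -1. */
static int md5_fd(int fd, unsigned char digest[MD5_DIGEST_LENGTH]) {
    unsigned char buf[4096];
    MD5_CTX ctx;
    ssize_t n;

    MD5_Init(&ctx);
    while ((n = read_full(fd, buf, sizeof(buf))) > 0)
        MD5_Update(&ctx, buf, (size_t)n);
    MD5_Final(digest, &ctx);

    if (n < 0 || lseek(fd, 0, SEEK_SET) < 0) return -1;
    return 0;
}

int main(int argc, char *argv[]) {
    // Check arguments
    if (argc != 3) {
        fprintf(stderr, "Usage: %s file1 file2\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    // Open the files
    int fd1 = open(argv[1], O_RDONLY);
    int fd2 = open(argv[2], O_RDONLY);
    if (fd1 < 0 || fd2 < 0) {
        perror("open");
        exit(EXIT_FAILURE);
    }

    // Get the file sizes
    struct stat st1;
    struct stat st2;
    if (fstat(fd1, &st1) < 0 || fstat(fd2, &st2) < 0) {
        perror("fstat");
        exit(EXIT_FAILURE);
    }

    // Compare the file sizes
    if (st1.st_size != st2.st_size) {
        printf("Files have different sizes.\n");
        exit(EXIT_FAILURE);
    }

    // Compute the file hashes (over the file contents, not the file names)
    unsigned char md5_1[MD5_DIGEST_LENGTH];
    unsigned char md5_2[MD5_DIGEST_LENGTH];
    if (md5_fd(fd1, md5_1) < 0 || md5_fd(fd2, md5_2) < 0) {
        perror("read");
        exit(EXIT_FAILURE);
    }

    // Compare the file hashes
    if (memcmp(md5_1, md5_2, MD5_DIGEST_LENGTH) != 0) {
        printf("Files have different hashes.\n");
        exit(EXIT_FAILURE);
    }

    // Byte-by-byte comparison, one chunk at a time
    unsigned char buf1[4096];
    unsigned char buf2[4096];
    ssize_t n1, n2;
    do {
        n1 = read_full(fd1, buf1, sizeof(buf1));
        n2 = read_full(fd2, buf2, sizeof(buf2));
        if (n1 < 0 || n2 < 0) {
            perror("read");
            exit(EXIT_FAILURE);
        }
        if (n1 != n2 || memcmp(buf1, buf2, (size_t)n1) != 0) {
            printf("Files have different contents.\n");
            exit(EXIT_FAILURE);
        }
    } while (n1 > 0);

    // Report the result
    printf("Files have the same contents.\n");
    close(fd1);
    close(fd2);
    return EXIT_SUCCESS;
}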