Pitfalls of Processing Large Numbers of Files in Parallel with Shell Scripts: How to Avoid Interrupted Runs and Skipped Files

2024-03-08 15:46:23

Pitfalls of Processing Large Numbers of Files in Parallel in Shell Scripts

Introduction

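When there are very many files to process, running jobs in parallel makes fuller use of the available CPU cores and shortens the total run time. Doing this in a Shell script brings its own challenges, especially when the number of parallel processes has to be capped. This article looks at one common pitfall in that situation and shows how to fix it.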

The Problem

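Suppose we have a Shell script that processes every file in a folder holding a large number of files. To avoid overloading the system, we limit the number of processes running at the same time to 15. The script does not behave as expected, however: only some of the files are processed, and in an alternating pattern.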

Where the Pitfall Lies

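The root of the problem lies in how the wait -n command is used. When the maximum number of parallel processes is reached, the script calls wait -n to wait for a running process to finish. wait -n returns as soon as any one background job exits, rather than waiting for a specific process or for all of them. If the loop iteration that called wait -n then moves on to the next file without launching its own, that file is never processed, which is consistent with the alternating pattern described above.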
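The following is a minimal sketch of how such a loop can skip items. It is a reconstruction under assumptions, not the original script: sleep stands in for the real per-file job, and the item names, the limit of 2, and the variables running, launched, and skipped are purely illustrative. It assumes Bash 4.3 or later for wait -n.

```shell
#!/usr/bin/env bash
# Hypothetical reproduction of the pitfall; 'sleep' replaces the real job.
MAX_PARALLEL=2
running=0
launched=""
skipped=""

for item in a b c d e f; do
    if [ "$running" -lt "$MAX_PARALLEL" ]; then
        # A slot is free: start the job and record the item.
        sleep 0.1 &
        running=$((running + 1))
        launched="${launched:+$launched }$item"
    else
        # Buggy pattern: this iteration only waits and frees a slot,
        # so "$item" itself is never started.
        wait -n
        running=$((running - 1))
        skipped="${skipped:+$skipped }$item"
    fi
done

# Wait for the jobs that are still running.
wait

echo "launched: $launched"
echo "skipped:  $skipped"
```

With these inputs the loop launches a, b, d, and f and skips c and e: once the limit is reached, every iteration that waits drops its own item.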

The Fix

To process all files correctly, the script needs the following changes:

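Use wait instead of wait -n: with no arguments, wait blocks until all child processes have finished. Calling it when the limit is reached lets the whole batch complete; the counter is then reset and the current file is launched in the same iteration instead of being skipped.
Call wait once more after the loop: a final wait outside the loop ensures that the jobs still running when the loop ends have finished before the script exits.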

The Revised Script

#!/usr/bin/env bash

# Host folders containing the input files and receiving the output files
INPUT_FILES_FOLDER="/mnt/data/INPUT"
OUTPUT_FILES_FOLDER="/mnt/data/OUTPUT"

# Docker image to run
DOCKER_IMAGE="my-docker-image:v5.1.0"

# Maximum number of Docker containers to run in parallel
MAX_PARALLEL=15

# Number of containers started in the current batch
CURRENT_PARALLEL=0

# Process every file in the input folder, at most MAX_PARALLEL at a time
process_files() {
    for file in "$INPUT_FILES_FOLDER"/*; do
        # Skip the literal pattern if the folder is empty and the glob did not match
        [ -e "$file" ] || continue

        input_file=$(basename "$file")
        output_file="PROCESSED_${input_file}"

        # Paths as seen inside the container, where /mnt/data/ is mounted at /data
        input_folder_file="/data/INPUT/${input_file}"
        output_folder_file="/data/OUTPUT/${output_file}"

        echo "Input File: $input_file"
        echo "Output File: $output_file"

        echo "Input Folder + File: $input_folder_file"
        echo "Output Folder + File: $output_folder_file"

        # When the limit is reached, wait for the whole batch to finish and
        # reset the counter; the current file is then launched, not skipped
        if [ "$CURRENT_PARALLEL" -ge "$MAX_PARALLEL" ]; then
            wait
            CURRENT_PARALLEL=0
        fi

        # Run the Docker container in the background, passing the file as input
        docker run --rm -v /mnt/data/:/data "$DOCKER_IMAGE" -i "$input_folder_file" -o "$output_folder_file" &

        # Count the container that was just started
        CURRENT_PARALLEL=$((CURRENT_PARALLEL + 1))
    done

    # Wait for the containers of the final batch to finish
    wait
}

# Call the function to process files
process_files
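
For comparison, wait -n can also be used without losing files, provided the current item is always launched after the wait returns. The sketch below is only an illustration under assumptions: it requires Bash 4.3 or later, and process_one, RESULTS, running, the item list, and the limit of 3 are hypothetical stand-ins for the real Docker command and file list.

```shell
#!/usr/bin/env bash
# Sketch only: requires Bash 4.3+ for 'wait -n'.
MAX_PARALLEL=3
running=0

# Temporary file that collects one line per processed item.
RESULTS=$(mktemp)

# Hypothetical stand-in for the real per-file command (e.g. docker run ...).
process_one() {
    sleep 0.1
    echo "$1" >> "$RESULTS"
}

for item in 1 2 3 4 5 6 7 8; do
    # If every slot is taken, block until any one job exits...
    if [ "$running" -ge "$MAX_PARALLEL" ]; then
        wait -n
        running=$((running - 1))
    fi
    # ...then start the current item in the same iteration, so nothing is skipped.
    process_one "$item" &
    running=$((running + 1))
done

# Wait for the jobs still running after the loop ends.
wait

processed=$(wc -l < "$RESULTS" | tr -d ' ')
echo "processed $processed items"
rm -f "$RESULTS"
```

Compared with waiting for a whole batch, this keeps the slots busy, because a new job starts as soon as any one finishes. The trade-off is the dependency on a Bash version that supports wait -n.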