Parallel diff and cmp on binary files? #121

@pauschuu

I just had a revelation:

$ time b3sum dreamshaper_8\ \(1\).safetensors dreamshaper_8.safetensors
771c807db56dbfc33feda5638d920f6c507db971da44772ee44a08dc38c3b437  dreamshaper_8 (1).safetensors
771c807db56dbfc33feda5638d920f6c507db971da44772ee44a08dc38c3b437  dreamshaper_8.safetensors

real    0m0.172s
user    0m2.193s
sys     0m0.423s


$ time cmp dreamshaper_8\ \(1\).safetensors dreamshaper_8.safetensors

real    0m0.596s
user    0m0.183s
sys     0m0.411s

$ time diff dreamshaper_8\ \(1\).safetensors dreamshaper_8.safetensors

real    0m0.509s
user    0m0.079s
sys     0m0.428s

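As you can see, even though b3sum does extra work (calculating a hash), it finishes in far less wall-clock time (0.172s real versus 0.596s for cmp and 0.509s for diff) because it leverages parallelism: its 2.193s of user time is spread across several cores.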

Wouldn't it be a good improvement to bring parallelism to some of the tools like diff and cmp?
Maybe with a new (non-standard) option?
Or maybe even by default, because why not?

I guess diff has a special code path once it has determined that a file is binary, right? Parallelizing that code path shouldn't be much of a problem.
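
To make the idea concrete, here is a minimal sketch of a chunked, parallel equality check. It is written in Python purely for illustration and is not how diff or cmp are implemented. The names `files_equal_parallel`, `_chunk_equal`, and `CHUNK_SIZE` are hypothetical, and the chunk size is an untuned guess. Note that in CPython the file reads can overlap across threads, but the byte comparison itself holds the interpreter lock, so this shows the structure of the approach rather than the speedup a native implementation might get.

```python
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 8 * 1024 * 1024  # hypothetical chunk size, not tuned


def _chunk_equal(path_a, path_b, offset, length):
    """Compare one byte range of both files; each call opens its own handles."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        fa.seek(offset)
        fb.seek(offset)
        return fa.read(length) == fb.read(length)


def files_equal_parallel(path_a, path_b, workers=None, chunk_size=CHUNK_SIZE):
    """Return True if both files have identical contents."""
    size = os.path.getsize(path_a)
    if size != os.path.getsize(path_b):
        return False  # files of different sizes can never be identical

    offsets = range(0, size, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda off: _chunk_equal(path_a, path_b, off, chunk_size),
            offsets,
        )
        # all() stops consuming at the first mismatch, but chunks that were
        # already submitted still run to completion when the pool shuts down.
        return all(results)
```

Checking the file sizes first is what makes the split safe: once both files are known to have the same length, every chunk can be compared independently and the results combined with a simple "all equal" test.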

This whole topic can be pushed even further when comparing directories: the individual files could be diffed in parallel.
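
The directory case can be sketched the same way: one comparison task per file name that exists in both directories. This is again only an illustration in Python, not a proposal for the actual implementation. `differing_files` is a hypothetical helper, it looks at top-level regular files only (no recursion), and it relies on the standard library's `filecmp.cmp` with `shallow=False` to force a byte-by-byte comparison.

```python
import filecmp
import os
from concurrent.futures import ThreadPoolExecutor


def differing_files(dir_a, dir_b, workers=None):
    """Return the sorted names of common top-level files whose contents differ."""
    # Only names present in both directories, and only regular files.
    common = sorted(
        name
        for name in set(os.listdir(dir_a)) & set(os.listdir(dir_b))
        if os.path.isfile(os.path.join(dir_a, name))
        and os.path.isfile(os.path.join(dir_b, name))
    )

    def same(name):
        # shallow=False compares file contents instead of trusting os.stat().
        return filecmp.cmp(
            os.path.join(dir_a, name),
            os.path.join(dir_b, name),
            shallow=False,
        )

    # One task per file pair; the pool decides how many run at once.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        flags = list(pool.map(same, common))

    return [name for name, equal in zip(common, flags) if not equal]
```

Per-file parallelism is the easier of the two ideas, since each pair of files is already an independent unit of work and the output only needs to be reported in a stable order.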

Come on it's 2025! :)
