maildir_deduplicate package

Submodules

maildir_deduplicate.cli module

maildir_deduplicate.cli.validate_regexp(ctx, param, value)

Validate and compile regular expression.
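A minimal sketch of validating and compiling a pattern, using only the standard library. The helper name compile_or_fail and the use of ValueError are assumptions made for illustration; the documented callback also receives ctx and param and may report errors differently.

```python
import re


def compile_or_fail(value):
    """Hypothetical helper: return a compiled pattern, or None if no value."""
    if not value:
        return None
    try:
        return re.compile(value)
    except re.error as exc:
        # Surface the parser's message so the user can fix the pattern.
        raise ValueError(
            "invalid regular expression {!r}: {}".format(value, exc))
```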

maildir_deduplicate.cli.validate_maildirs(ctx, param, value)

Check that folders are maildirs.
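One plausible way to check a folder, sketched under the assumption that a maildir is recognized by its cur, new and tmp subfolders (the layout the maildir format prescribes). The helper name looks_like_maildir is hypothetical, and the real function may apply other criteria.

```python
import os


def looks_like_maildir(path):
    """Hypothetical check: a maildir holds cur/, new/ and tmp/ subfolders."""
    return all(
        os.path.isdir(os.path.join(path, sub))
        for sub in ("cur", "new", "tmp"))
```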

maildir_deduplicate.deduplicate module

class maildir_deduplicate.deduplicate.DuplicateSet(hash_key, mail_path_set, conf)

Bases: object

A duplicate set of mails sharing the same hash.

Implements all deletion strategies applicable to a set of duplicate mails.

size

cachedproperty is used like property, except that the wrapped method is called only once. This is commonly used to implement lazy attributes.

After the property has been accessed, the value is stored on the instance itself, using the same name as the cachedproperty. This allows the cache to be cleared with delattr(), or through manipulating the object’s __dict__.
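The caching behaviour described above can be illustrated with a simplified stand-in. This is a sketch, not the decorator the package actually uses; the Folder class and its size value are made up for the example.

```python
class cachedproperty(object):
    """Simplified stand-in for the decorator described above."""

    def __init__(self, func):
        self.func = func
        self.__doc__ = func.__doc__

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        # Store the result on the instance under the same name, so later
        # lookups find it in __dict__ and never reach this descriptor.
        value = obj.__dict__[self.func.__name__] = self.func(obj)
        return value


class Folder(object):
    """Hypothetical class counting how often the wrapped method runs."""
    calls = 0

    @cachedproperty
    def size(self):
        Folder.calls += 1
        return 42


box = Folder()
box.size        # first access: the wrapped method runs
box.size        # second access: read straight from box.__dict__
del box.size    # clear the cache
box.size        # the wrapped method runs again
```

Because the stand-in defines no __set__ or __delete__, the value stored in the instance __dict__ takes precedence over the descriptor, which is what makes delattr() work as a cache reset.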

newest_timestamp

oldest_timestamp

biggest_size

smallest_size

delete(mail)

Delete a mail from the filesystem.

check_differences()

In-depth check of mail differences.

Compare all mails of the duplicate set with each other, both in size and content. Raise an error if we’re not within the limits imposed by the threshold setting.
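The pairwise comparison can be sketched as follows for the size half of the check. The function name check_sizes, the plain list of sizes and the ValueError are assumptions for illustration; the package raises its own exceptions, and the content check would follow the same pattern.

```python
from itertools import combinations


def check_sizes(sizes, size_threshold):
    """Hypothetical check: every pair of sizes must stay within threshold."""
    for size_a, size_b in combinations(sizes, 2):
        delta = abs(size_a - size_b)
        if delta > size_threshold:
            raise ValueError(
                "size difference of {} bytes exceeds threshold of {}".format(
                    delta, size_threshold))
```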

diff(mail_a, mail_b)

Return difference in bytes between two mails’ normalized body.

TODO: rewrite the diff algorithm to not rely on naive unified diff result parsing.
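For illustration, a naive measurement based on parsing a unified diff might look like the sketch below. It is an assumption about the general approach, not the package's code: diff_size is a hypothetical name, and it counts characters rather than encoded bytes.

```python
import difflib


def diff_size(lines_a, lines_b):
    """Hypothetical helper: count characters on added and removed lines."""
    total = 0
    for line in difflib.unified_diff(lines_a, lines_b, n=0, lineterm=""):
        # Naive parsing: a removed line whose content starts with "--" also
        # begins with "---" and is wrongly skipped as a file header.
        if line.startswith(("---", "+++", "@@")):
            continue
        if line.startswith(("+", "-")):
            total += len(line) - 1  # drop the leading +/- marker
    return total
```

The comment in the loop shows the kind of fragility that relying on diff output parsing brings.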

pretty_diff(mail_a, mail_b)

Return a verbose unified diff between two mails’ normalized body.

apply_strategy()

Apply deduplication with the configured strategy.

Transform strategy keyword into its method ID, and call it.
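A sketch of this kind of dispatch, assuming strategy keywords are hyphenated (for example delete-older) and map to method names by replacing hyphens with underscores. The Demo class and its return value are hypothetical.

```python
class Demo(object):
    """Hypothetical class dispatching a strategy keyword to a method."""

    def __init__(self, strategy):
        self.strategy = strategy

    def delete_older(self):
        return "delete_older called"

    def apply_strategy(self):
        # Turn the keyword into a method ID, then look it up and call it.
        method_id = self.strategy.replace("-", "_")
        if not hasattr(self, method_id):
            raise NotImplementedError(
                "unsupported strategy: {}".format(self.strategy))
        return getattr(self, method_id)()
```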

dedupe()

Perform the deduplication and its preliminary checks.

delete_older()

Delete all older duplicates.

Only keeps the subset sharing the most recent timestamp.

delete_oldest()

Delete all the oldest duplicates.

Keeps all mails of the duplicate set except those sharing the oldest timestamp.
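The difference between the "older" and "oldest" variants can be sketched on a plain mapping of path to timestamp. The function names and the mapping are assumptions for illustration, and the sketch does not guard against the case where every mail shares the same timestamp. The newer/newest and size-based strategies mirror the same selection logic.

```python
def select_older(timestamps):
    """Hypothetical: everything strictly older than the newest timestamp."""
    newest = max(timestamps.values())
    return {path for path, ts in timestamps.items() if ts < newest}


def select_oldest(timestamps):
    """Hypothetical: only the mails sharing the oldest timestamp."""
    oldest = min(timestamps.values())
    return {path for path, ts in timestamps.items() if ts == oldest}


mails = {"a": 1, "b": 2, "c": 3, "d": 3}
older = select_older(mails)     # "a" and "b" go, "c" and "d" are kept
oldest = select_oldest(mails)   # only "a" goes
```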

delete_newer()

Delete all newer duplicates.

Only keeps the subset sharing the oldest timestamp.

delete_newest()

Delete all the newest duplicates.

Keeps all mails of the duplicate set except those sharing the newest timestamp.

delete_smaller()

Delete all smaller duplicates.

Only keeps the subset sharing the biggest size.

delete_smallest()

Delete all the smallest duplicates.

Keeps all mails of the duplicate set except those sharing the smallest size.

delete_bigger()

Delete all bigger duplicates.

Only keeps the subset sharing the smallest size.

delete_biggest()

Delete all the biggest duplicates.

Keeps all mails of the duplicate set except those sharing the biggest size.

delete_matching_path()

Delete all duplicates whose file path matches the regexp.

delete_non_matching_path()

Delete all duplicates whose file path doesn’t match the regexp.
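Both path-based strategies amount to partitioning the set with the compiled regexp. The sketch below assumes the pattern is applied with search() anywhere in the path; whether the package uses search() or match() is not stated here, and split_by_path is a hypothetical name.

```python
import re


def split_by_path(paths, regexp):
    """Hypothetical helper: return (matching, non_matching) path sets."""
    matching = {path for path in paths if regexp.search(path)}
    return matching, set(paths) - matching


pattern = re.compile(r"/\.Trash/")
hits, misses = split_by_path(
    ["/mail/.Trash/cur/1", "/mail/INBOX/cur/1"], pattern)
```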

class maildir_deduplicate.deduplicate.Deduplicate(conf)

Bases: object

Read messages from maildirs and perform a deduplication.

Messages are grouped together in a DuplicateSet.

static canonical_path(path)

Return a normalized, canonical path to a file or folder.

Removes all symbolic links encountered in the path to detect natural mail and maildir duplicates on the fly.
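A minimal sketch of such a normalization with the standard library, assuming a POSIX filesystem. The exact chain of os.path calls is an assumption; the point is that a symlinked path and its target end up identical, so the same maildir is not processed twice.

```python
import os


def canonical_path(path):
    """Sketch: expand ~, make absolute, resolve symlinks, normalize case."""
    return os.path.normcase(os.path.realpath(
        os.path.abspath(os.path.expanduser(path))))
```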

add_maildir(maildir_path)

Load up a maildir and compute hash for each mail found.
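The general idea can be sketched with the standard mailbox module. Everything specific here is an assumption for illustration: the function name hash_maildir, the choice of headers and the SHA-224 digest are not taken from the package, which computes its hash from canonicalized headers in the Mail class.

```python
import hashlib
import mailbox


def hash_maildir(maildir_path, headers=("Date", "From", "Subject")):
    """Hypothetical: group message keys of a maildir by a header digest."""
    groups = {}
    box = mailbox.Maildir(maildir_path, factory=None, create=False)
    for key, message in box.iteritems():
        # Build a stable text from the selected headers, then hash it.
        text = "\n".join(
            "{}: {}".format(name, message.get(name, "")) for name in headers)
        digest = hashlib.sha224(text.encode("utf-8")).hexdigest()
        groups.setdefault(digest, set()).add(key)
    return groups
```

Any group holding more than one key is a candidate duplicate set.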

run()

Run the deduplication process.

We apply the removal strategy one duplicate set at a time to keep memory footprint low and make the log of actions easier to read.

report()

Print user-friendly statistics and metrics.

maildir_deduplicate.mail module

class maildir_deduplicate.mail.Mail(path, conf)

Bases: object

Encapsulate a single mail and its metadata.

message

timestamp

size

body_lines

subject

hash_key

header_text

canonical_headers

static canonical_header_value(header, value)

Module contents

Expose package-wide elements.

exception maildir_deduplicate.InsufficientHeadersError

Bases: exceptions.Exception

Issue was encountered with email headers.

exception maildir_deduplicate.MissingMessageID

Bases: exceptions.Exception

No Message-ID header found in email headers.

exception maildir_deduplicate.SizeDiffAboveThreshold

Bases: exceptions.Exception

Difference in mail size is greater than threshold.

exception maildir_deduplicate.ContentDiffAboveThreshold

Bases: exceptions.Exception

Difference in mail content is greater than threshold.

class maildir_deduplicate.Config(**kwargs)

Bases: object

Holds global configuration.

default_conf = {'content_threshold': 768, 'progress': True, 'dry_run': False, 'size_threshold': 512, 'regexp': None, 'show_diff': False, 'message_id': False, 'time_source': None, 'strategy': None}
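How keyword arguments might combine with default_conf can be sketched as below. Only the default values come from the listing above; the class name SketchConfig, the rejection of unknown keys and the attribute-style access are assumptions for illustration.

```python
class SketchConfig(object):
    """Hypothetical stand-in: defaults overridden by keyword arguments."""

    default_conf = {
        'content_threshold': 768, 'progress': True, 'dry_run': False,
        'size_threshold': 512, 'regexp': None, 'show_diff': False,
        'message_id': False, 'time_source': None, 'strategy': None}

    def __init__(self, **kwargs):
        unknown = set(kwargs) - set(self.default_conf)
        if unknown:
            raise ValueError(
                "unknown settings: {}".format(sorted(unknown)))
        # Start from the defaults, then apply the caller's overrides.
        self.conf = dict(self.default_conf, **kwargs)

    def __getattr__(self, name):
        # Only called when normal lookup fails: fall back to the settings.
        conf = self.__dict__.get('conf', {})
        if name in conf:
            return conf[name]
        raise AttributeError(name)


conf = SketchConfig(dry_run=True, strategy='delete-older')
```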