Dar Documentation

Dar/Libdar Internals - Notes

Introduction

Here is a collection of notes. They have been written after the implementation of a given feature, mainly for further reference but also as user information. The idea behind these notes is to record the implementation choices, the arguments and reasons that led to them, and also to let users be informed about the decisions made and provide their remarks without having to dig deeply into the code to learn dar's internals.

EA & differential backup

Brief presentation of EA:

EA stands for Extended Attributes. In a Unix filesystem, a regular file is composed of a set of bytes (the data) and an inode (containing the metadata). The inode adds properties to the file, such as owner, group, permissions and dates (last modification date of the data [mtime], last access date to the data [atime], and last inode change date [ctime]). Last, the name of the file is not contained in the inode, but in the directory(ies) it is linked to. When a file is linked more than once in the directory tree, we speak about "hard links". This way the same data and associated inode appear several times in the same or different directories. This is not the same as a symbolic link, which is a file that contains the path to another file (which may or may not exist). A symbolic link has its own inode. OK, now let's talk about EA:

Extended Attributes are a recent feature of Unix filesystems (at the time of this writing, year 2002). They are not part of the inode, nor part of the data, nor part of a given directory. They are stored beside the inode and are a set of key/value pairs. The owner of the file can define any key and associate to it an arbitrary data value. The user also has the means to list the keys and to remove a particular key from the EA set. What are they used for? Simply a more flexible way to associate information to a file.

One particularly interesting use of EA is ACL: Access Control Lists. ACL are implemented using EA (on Linux) and add a finer grain to file access permissions. For more information on EA and ACL, see the site of Andreas Grunbacher:

http://acl.bestbits.at/

File forks under Mac OS X also rely on EA, as do the security features brought by SELinux.

EA & Differential Backup

To determine that an EA has changed, dar looks at the ctime value of the inode. If ctime has changed (due to an EA change, but also due to a permission or owner change), dar saves the EA. ctime also changes if atime or mtime changes, so if you access or modify a file, dar will consider that the EA have changed as well. This is not really fair, I admit.
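As an illustration only (this is not libdar code), the ctime-based detection can be sketched with the POSIX stat() call:

// minimal sketch (not libdar code): detect a possible EA change by
// comparing the inode ctime against the value recorded at the time
// of the archive of reference
#include <sys/stat.h>
#include <ctime>

bool ea_possibly_changed(const char *path, time_t ctime_of_reference)
{
    struct stat st;

    if(stat(path, &st) != 0)
        return true; // cannot stat: be conservative and resave

    // any inode change (EA, permission, ownership, but also a data
    // modification) updates st_ctime, which is why this test may
    // report EA changes that did not actually occur
    return st.st_ctime != ctime_of_reference;
}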

It may sound better to compare EA one by one, and record those that have changed or have been deleted. But to be able to do so, all EA and their associated values would have to reside in the archive table of contents (the catalogue), which is by design stored in memory. As EA can grow up to 64 KB per file, this can lead to a quick saturation of the virtual memory, which is already heavily solicited by other catalogue information.

These two schemes imply different patterns for saving EA in the archive. In the first case (no EA in memory except at the time of operating on them), to avoid skipping in the archive (and asking the user to change disks too often, or just putting pressure on the cache and disk I/O), EA must be stored beside the data of the file (if present). Thus they must be distributed all along the archive (except at the end, which only contains the catalogue).

In the second case (EA are loaded in memory for comparison), EA must reside beside or within the catalogue, in any case at the end of the archive, so that the user does not need all disks just to take an archive as reference.

As the catalogue already grows fast with the number of files to save (from a few bytes per hard link to around 400 bytes per directory inode), the memory-saving option has been adopted.

Thus, EA change detection is based on ctime change. Unfortunately, no system call permits restoring ctime. Therefore, restoring a differential backup after its reference has been restored would present the restored inodes as more recent than those in the differential archive, and the -r option would then prevent any EA restoration. As a consequence, -r has been disabled for EA: it only concerns data contents. If you don't want to restore any EA but just more recent data, you can use the following: -r -u "*"

Archive structure in brief

The Slice Level

A slice is composed of a header, data and trailer (the trailer appeared with archive format version 8)

+--------+-------------------------------------------+-------+
| header |                   Data                    |Trailer|
|        |                                           |       |
+--------+-------------------------------------------+-------+

The slice header is composed of:

+-------+----------+------+-----------+-------+
| Magic | internal | flag | extension |  TLV  |
| Num.  |   name   | byte |   byte    | list  |
+-------+----------+------+-----------+-------+

The TLV list will receive any future new field related to the slice header.

The header is the first thing to be written and, before format 8, if the current slice was not the last slice (all data to write could not fit in it), the flag field was updated to indicate that another slice follows. Since archive format 8, the flag is set to a specific value indicating that the information telling whether the slice is the last or not is placed in the slice trailer, a new "structure" that appeared with that format and is located at the end of each slice.

The header is also the first part to be read.

A TLV list is of course a list of TLV:

+-------+-------+-------+-------+--...--+-------+
| Number| TLV 1 | TLV 2 | TLV 3 |       | TLV n |
| of TLV|       |       |       |       |       |
+-------+-------+-------+-------+--...--+-------+

Each TLV item is, as commonly defined, a set of three fields:

+---------+-------------------------+------------------------+
|  Type   |         Length          |         Value          |
|(2 bytes)| (arbitrary large value) | (arbitrary large data) |
+---------+-------------------------+------------------------+

The 2-byte type is large enough for today's needs (65535 different types while only three are used); however, type 65535 is reserved for future use and will signal a new format for the type field.
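As an illustration only, here is a sketch of what reading such a TLV item could look like. The read_varlen() helper is hypothetical and stands for dar's arbitrary-length integer decoding; the byte order of the type field is also an assumption:

// illustration only, not the libdar implementation
#include <cstdint>
#include <vector>
#include <istream>

struct tlv_item
{
    uint16_t type;              // 2-byte type field, value 65535 reserved
    std::vector<char> value;    // "length" bytes of raw data
};

// hypothetical helper standing for dar's arbitrary-length integer decoding
extern uint64_t read_varlen(std::istream & in);

tlv_item read_tlv(std::istream & in)
{
    tlv_item ret;
    unsigned char t[2];

    in.read(reinterpret_cast<char *>(t), 2);            // fixed-size type field
    ret.type = static_cast<uint16_t>(t[0]) << 8 | t[1]; // byte order is an assumption

    uint64_t len = read_varlen(in);                      // arbitrarily large length
    ret.value.resize(len);
    in.read(ret.value.data(), len);                      // raw value of "len" bytes
    return ret;
}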

To know in which slice and at which position to find a particular piece of data, dar needs to know each slice's size. This is the reason why each slice contains the slicing information, in particular the last slice. In older versions, dar had to read the first slice first to get this slicing information; it could then read the archive contents at the end of the last slice. Today, reading the last slice, dar can fetch the slicing scheme from the slice header (what we just detailed) and fetch the archive contents at the end of this same last slice.

The trailer (which is one byte long) is new since archive format version 8 (released with 2.4.0). It contains the value that was located in the header flag field in older archive formats, telling whether the slice is the last of the archive or not. When writing down a single-sliced archive (no -s option provided), both the header and the trailer tell that the slice is the last of the archive (duplicated information). However, when doing a multi-sliced archive, it is not possible to know whether a slice is the last before reaching the requested amount of data per slice (which depends on the amount of bytes to save, compression ratio, encryption overhead, etc.). Thus the header flag contains a value telling that, to know whether the slice is the last or not, one must read the trailer.

In older formats, it was necessary to seek back and update the header with the correct information when a new slice had to be created. But keeping this behavior, it would not have been possible to compute a digest "on the fly" (see --hash option). The addition of the trailer was required for that feature: computing an md5, sha1, ... hash for each slice. But this costs one byte per slice, yes.

Data Name

As seen above in the header fields, we have among others the three following identifiers:

As already said, the magic number is constant and lets dar be (almost) sure a given file is a dar slice file; it is in particular based on that field that the common unix 'file' command identifies a dar archive. Also briefly explained, the internal_name is an identifier that lets dar be almost sure that several slices belong to the same archive (problems can arise if two archives of the same basename have their slices mixed together: dar will see that and report it to the user).

The new and not yet described field is the "data_name". The data_name field is also present in the archive catalogue (the table of content) of each archive. It may be the same value as the one in the slice headers (normal archives) or another value if the archive results from a catalogue isolation process.

Why this field? A new feature with release 2.4.0 is the ability to use an extracted catalogue as a backup of the internal catalogue of a given archive. Comparing the data_name value of the catalogue resulting from the isolation operation to the data_name value present in the slices of an archive to rescue, dar can be (almost) sure that the extracted catalogue matches the data present in the archive the user is trying to use it with.

In brief:

Fields                          Normal    Resliced using   Resulting from   Isolated archive resliced
                                archive   dar_xform        isolation        with dar_xform
------------------------------  --------  ---------------  ---------------  -------------------------
Internal_name (slice header)       A           B                 C                     D
data_name (slice header)           A           A                 C                     C
data_name (archive catalogue)      A           A                 A                     A

Archive Level

The archive level describes the structure of the slices' data fields (removing the header and trailer of each slice), once they are all stuck together from slice to slice:

+---------+----------------------------+-----------+--------+---------+--------+
| version |            Data            | catalogue | term 1 | version | term 2 |
| header  |                            |           |        | trailer |        |
+---------+----------------------------+-----------+--------+---------+--------+

version header

The version header is almost a duplication of the version trailer. It is used when reading an archive in sequential mode, to be able to prepare the proper compression layer and to know whether escape sequence marks are present in the archive. It is present by default but absent when removing tape marks (-at option).

version trailer

The version trailer (which may still be called "version header" in some parts of the documentation, because it was originally located at the beginning of the archive in previous archive formats) is composed of:

+---------+------+--------------+------+---------+--------+----------+---------------+-----------+------+-------+------------+-------+-----+
| edition | algo | command line | flag | initial | crypto | crypted  | gnupg crypted | reference | salt | salt  | iteration  | hash  | CRC |
|         |      |              |      | offset  | algo   | key size | sym. key      | slicing   | size | (KDF) | count(KDF) | (KDF) |     |
+---------+------+--------------+------+---------+--------+----------+---------------+-----------+------+-------+------------+-------+-----+

The trailer is used when reading an archive in direct access mode, to build the proper compression layer, escape layer (needed if marks have been inserted in the archive, to un-escape data that could otherwise be taken as an escape sequence mark) and encryption layer.

Slice Layout

slice_layout is a class holding information about the first slice size, the size of the following slices, as well as the size of the slice headers. It is used in class sar to store the slicing information read from class header, which for historical reasons does not use the slice_layout class but reads/writes those fields as independent ones. Class sar thus relies on class header to read or write this information from/to an archive.

This class is also used in class header_version which is used to read/write the archive header/trailer at the beginning and at the end of the backup. The use of the class slice_layout in that structure is only present for isolated catalogues and contains the slicing information of the archive of reference. The class header_version relies on the slice_layout::read() and slice_layout::write() methods to read and write those fields to/from an archive's trailer/header. Here, this information is needed for the -Tslice option to work when applied to an isolated catalogue.

The isolation process: at the end of archive creation, either read from the filesystem (normal isolation) or just written to the filesystem (on-fly isolation), the field archive::i_archive::slices holds the slice layout of the current archive. It has been read from the sar layer if the archive is sliced, or is set to a default value if there is no sar layer and thus no slicing. The isolation process is defined in archive::i_archive::op_isolate(): it transmits the slices information to macro_tools_create_layers() which, in particular, uses it to set up the header_version. Once the layers are built by this routine, this header_version structure is written at the beginning of the archive (version header). After the catalogue has been written to these layers, op_isolate() calls macro_tools_close_layers(), giving it this same generated header_version (containing the slice_layout of the current archive, which will be the reference for the isolated catalogue under construction), for it to be written at the end of the archive (this is then the "version trailer").
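A simplified sketch of the kind of information class slice_layout holds, and of how dar can use it to locate data for options like -Tslice (field names are assumptions, and slice sizes are taken here as pure data capacities, headers excluded):

// simplified sketch, not the actual libdar class: the real slice_layout
// handles dar's infinint type and its own read()/write() methods
#include <cstdint>

struct slice_layout_sketch
{
    uint64_t first_slice_size;   // data capacity of the first slice (-S option)
    uint64_t other_slice_size;   // data capacity of the other slices (-s option)
    uint64_t slice_header_size;  // size of each slice header

    // given an offset in the concatenated data, compute in which slice
    // and at which position inside that slice the data is located
    void locate(uint64_t offset, uint64_t & slice_num, uint64_t & in_slice) const
    {
        if(offset < first_slice_size)
        {
            slice_num = 1;
            in_slice = offset + slice_header_size;
        }
        else
        {
            uint64_t rest = offset - first_slice_size;
            slice_num = 2 + rest / other_slice_size;
            in_slice = rest % other_slice_size + slice_header_size;
        }
    }
};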

The data

The data is a sequence of file contents, with EA and FSA if present. When tape marks are used, a copy of the CRC is placed after the file's data and after the file's EA, to be used when reading the archive in sequential mode. This CRC is also dropped into the catalogue, which is located at the end of the archive, to be used when reading the archive in direct access mode (the default). Last, when binary delta is used, a file signature may follow the file's data:

....--+---------------------+-------+----+-----+------------+------------+----+-----------+-----+--....
      |     file1 data      | delta | EA | FSA | file2 data | file3 data | EA |   file4   | FSA |
      | (may be compressed) |  sig  |    |     | (no EA/FSA)|            |    | delta sig |     |
....--+---------------------+-------+----+-----+------------+------------+----+-----------+-----+--....

In the previous archive example, we find:

file1 shows all fields that can be associated with an inode, but none is mandatory, though if present they always follow this order:

When tape marks are not set, the CRCs listed above are not present inline in the archive, though they are still stored in the catalogue at the end of the archive (see below).

More precisely, when delta signatures are combined with tape marks, there are additional fields present beyond just the delta sig and its CRC:

+------+------+-------+---------------+----------+--------+
| base | sig  | sig   |   sig data    | data CRC | result |
| CRC  | size | block | (if size > 0) |    if    |  CRC   |
| (*)  |      | len   |               | size > 0 |        |
+------+------+-------+---------------+----------+--------+

The catalogue

The catalogue contains all inode, directory structure and hard link information, as well as the data and EA CRCs. The directory structure is stored in a simple way: the inode of a directory comes first, then the inodes of the files it contains, then a special entry named "EOD" for End of Directory. Considering the following tree:

- toto
  | titi
  | tutu
  | tata
  | | blup
  | +--
  | boum
  | coucou
  +---

it would generate the following sequence for catalogue storage:

+------+------+------+------+------+-----+------+--------+-----+
| toto | titi | tutu | tata | blup | EOD | boum | coucou | EOD |
|      |      |      |      |      |     |      |        |     |
+------+------+------+------+------+-----+------+--------+-----+

The EOD entries take one byte. This way, there is no need to store the full path of each file: just the filename is recorded. The file order and the EOD entries can be used to rebuild the relative path of each entry.
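For illustration, a small sketch (not libdar code) showing how relative paths can be rebuilt from such a sequence of entries and EOD markers:

// illustration only: rebuild relative paths from a catalogue-like
// sequence of entries, where a directory pushes one level and the
// special "EOD" entry pops back to the parent directory
#include <iostream>
#include <string>
#include <vector>

struct entry
{
    std::string name;     // empty name stands for EOD here
    bool is_directory;
};

int main()
{
    std::vector<entry> seq = {
        {"toto", true}, {"titi", false}, {"tutu", false},
        {"tata", true}, {"blup", false}, {"", false},      // EOD closes tata
        {"boum", false}, {"coucou", false}, {"", false}    // EOD closes toto
    };

    std::vector<std::string> path; // current directory stack

    for(const entry & e : seq)
    {
        if(e.name.empty())         // EOD: go back to the parent directory
        {
            path.pop_back();
            continue;
        }

        std::string full;
        for(const std::string & p : path)
            full += p + "/";
        std::cout << full + e.name << std::endl;

        if(e.is_directory)         // following entries are located inside it
            path.push_back(e.name);
    }
    return 0;
}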

To be complete, the previous sequence is preceded by the data_name and in_place fields (shown in the diagram below), and is then followed by a CRC calculated on the whole dumped catalogue data:

+-----------+----------+-------------------------------------+-----+
| data_name | in_place |<--------catalogue content---------->| CRC |
|           |          | toto|titi|tata|blup|EOD|boum|...     |     |
+-----------+----------+-------------------------------------+-----+

The terminator

The terminator stores the position of the beginning of the catalogue; it is the last thing to be written. Thus dar first reads the terminator, then the catalogue. Well, there are now two terminators, both meant to be read backward. The second terminator points to the beginning of the version trailer, which is read first in direct access mode. The first terminator points to the start of the catalogue, which is read once the ad hoc compression and encryption layers have been built based on the information found in the version trailer.

All Together

Here is an example of how data can be structured in a four sliced archive:

+--------+--------+------------------------+--+
| slice  | version|     file data + EA     |Tr|
| header | header |                        |  |
+--------+--------+------------------------+--+

the first slice (just above) has been defined smaller using the -S option

+--------+-----------------------------------------------------------------+--+
| slice  |                         file data + EA                          |Tr|
| header |                                                                 |  |
+--------+-----------------------------------------------------------------+--+

+--------+-----------------------------------------------------------------+--+
| slice  |                         file data + EA                          |Tr|
| header |                                                                 |  |
+--------+-----------------------------------------------------------------+--+

+--------+---------------------+-----------+-------+---------+--------+--+
| slice  |   file data + EA    | catalogue | term 1| version | term 2 |Tr|
| header |                     |           |       | trailer |        |  |
+--------+---------------------+-----------+-------+---------+--------+--+

the last slice is smaller because there was not enough data to make it full.

Other Levels

Things get a bit more complicated if we consider compression and encryption. The way the problem is addressed in dar's code is a bit like networks are designed in computer science, using the notion of layers.

Here, there is an additional constraint: a given layer may or may not be present (encryption, compression, slicing for example). So all layers must have the same interface to serve the layer above them.

This interface is defined by the pure virtual class named generic_file, which provides generic methods for reading, writing, skipping, and getting the current offset when writing or reading data to any layer. The compression, escape, encryption and slicing (sar) layers are examples of implementations of this interface.
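As a purely illustrative sketch (method names below are assumptions, not the actual generic_file prototypes), such a uniform layer interface could look like this:

// illustration only: the actual libdar generic_file class has a richer
// interface (CRC handling, terminate(), ...) and different prototypes
#include <cstddef>
#include <cstdint>

class layer_interface
{
public:
    virtual ~layer_interface() = default;

    virtual size_t read(char *buffer, size_t size) = 0;        // read data from this layer
    virtual void write(const char *buffer, size_t size) = 0;   // write data to this layer
    virtual void skip(uint64_t absolute_offset) = 0;           // seek to a given offset
    virtual uint64_t get_position() const = 0;                 // current offset in this layer
};

// a compression or encryption layer would implement this interface and
// itself hold a pointer to the layer below it, so that layers can be
// stacked regardless of which optional layers are present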

Here below is an example of a possible layer stacking libdar uses to address a particular combination of commands and options:

 archive       +------+----+------+--...........--+-----------+
 layout        | file | EA | file |               | catalogue |
               | data |    | data |               |           |
               +------+----+------+--...........--+-----------+
                   |          |                        |
 sparse file   +-------+  +-------+                    |
 detection     | spars.|  | sparse|                    |
 layer         | detec.|  | detect|                    |
 (optional)    +-------+  +-------+                    |
                   |          |                        |
 delta sig     +-------+  +-------+                    |
 (optional)    | delta |  | delta |                    |
               |  sig  |  |  sig  |                    |
               +-------+  +-------+                    |
                   |          |                        |
                   v          v                        v
 compression   +-----------------------------------------------+
               |              (compressed) data                |
               +-----------------------------------------------+
                                      |
                                      v
 escape layer  +-----------------------------------------------+
 (optional)    |       escaped data / escape sequences         |
               +-----------------------------------------------+
                  |                              |    First Terminateur
 elastic       +---+                             |       +----+---+
 buffers       |EEE|                             |       | T1 |EEE|
               +---+                             |       +----+---+
                  |                              |            |
                  v                              v            v
 cipher        +------------------------------------------------------+   Second
               |      (encrypted) data / cache if no encryption       |   Terminateur
               +------------------------------------------------------+       |
                  |                                                            v
               +--------+                                        +---------+----+
               | header |                                        | trailer | T2 |
               +--------+                                        +---------+----+
                  |                                                   |
                  v                                                   v
               +--------------------------------------------------------------+
               |                            data                              |
               +--------------------------------------------------------------+
                    |           |           |                       |
               +---------+ +---------+ +---------+             +---------+
               |hash_file| |hash_file| |hash_file|     ...     |hash_file|   (hash_file
               +---------+ +---------+ +---------+             +---------+    are optional)
                    |           |           |                       |
               +---------+ +---------+ +---------+             +---------+
               |HH| data | |HH| data | |HH| data |     ...     |HH| data |   HH = slice
               +---------+ +---------+ +---------+             +---------+   headers
                 slice 1     slice 2     slice 3                 slice 7

The elastic buffers are here to prevent a plain text attack, where one knows which data is expected at a given place and tries to guess the cipher by comparing the expected data with the encrypted one. As dar generates structured archives, there would be some possibility to use this attack to crack an archive's encryption. To overcome this problem, elastic buffers have been added at the beginning and at the end of the encrypted data. This way it is not possible to know where a given archive structure is located within the encrypted data. The elastic buffers are random data that contain, at a random place, a pattern that tells the overall size of the buffer (a size which is itself randomly chosen at archive creation). The pattern is of the form ">###<" where the hash field (###) contains the elastic buffer size in binary. Small elastic buffers can be "><" for two bytes or "X" for one byte, but as they are encrypted beside the archive data, it is not possible to determine their size for someone who does not hold the archive encryption key. Elastic buffers are usually several kilobytes long. Here follows an example of an elastic buffer:

972037219>20<8172839

For clarity, the size field between '>' and '<' has been written in decimal instead of binary, as well as the random data inside the elastic buffer. The location of the size field '>20<' is also randomly chosen at creation time.
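For illustration only, here is a sketch that locates the size pattern inside an already-decrypted elastic buffer. For clarity it parses the size as decimal text, like in the example above, whereas dar stores it in binary and the real parsing differs:

// illustration only: find the ">size<" pattern inside a decrypted
// elastic buffer and return the total size it declares
#include <string>
#include <optional>
#include <cstdlib>

std::optional<unsigned long> elastic_size(const std::string & buf)
{
    if(buf == "X")                       // one-byte elastic buffer
        return 1;
    if(buf == "><")                      // two-byte elastic buffer
        return 2;

    std::string::size_type start = buf.find('>');
    std::string::size_type end = buf.find('<', start);
    if(start == std::string::npos || end == std::string::npos)
        return std::nullopt;             // not a valid elastic buffer

    return std::strtoul(buf.substr(start + 1, end - start - 1).c_str(),
                        nullptr, 10);
}

// elastic_size("972037219>20<8172839") returns 20, the total length of
// the example elastic buffer given above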

A terminateur is a short structure that is intended to be read backward. It gives the absolute position of a given item within the archive: the second terminateur lets dar skip to the beginning of the archive trailer; the first terminateur (eventually encrypted) lets dar skip to the beginning of the catalogue.

Overflow in arithmetic integer operations

Some code explanation about the detection of integer arithmetic operation overflows. We speak about *unsigned* integers, and we have only portable, standard ways to detect overflows when using 32-bit or 64-bit integers in place of infinint.

Written in binary notation, a number is a finite sequence of digits (0 or 1). To obtain the original number from its binary representation, we must multiply each digit by successive powers of two. For example, the binary representation "101101" designates the number N where: N = 2^5 + 2^3 + 2^2 + 2^0

In that context we will say that 5 is the maximum power of N (the power of the highest non-null binary digit).

For the addition operation ("+"), if an overflow occurs, the result is less than both operands, so overflow is not difficult to detect: it suffices to check that the result is greater than or equal to one of the operands. To convince yourself, note that in case of overflow the computed result equals the true sum minus 2^N (for N-bit unsigned integers); as each operand is strictly less than 2^N, subtracting 2^N from the true sum necessarily gives a value smaller than either operand.

For the subtraction operation ("-"), if the second operand is greater than the first there will be an overflow (the result must be unsigned, thus positive), else there will not be any overflow. Detection is thus even simpler.

for division ("/") and modulo ("%") operations, there is never an overflow (there is just the illicit division by zero).

For the multiplication operation ("*"), a heuristic has been chosen to quickly detect overflow; the drawback is that it may trigger false overflows when numbers get near the maximum possible integer value. Here is the heuristic used:

given A and B two integers, which max powers are m and n respectively, we have:

A < 2^(m+1)

and

B < 2^(n+1)

thus we also have:

A.B < 2^(m+1).2^(n+1)

which is:

A.B < 2^(m+n+2)

In consequence, we know that the maximum power of the product of A by B is at most m+n+1. As long as m+n+1 is less than or equal to the maximum power of the integer type, there will be no overflow; otherwise we consider there will be an overflow, even if this may not always be the case (this is a heuristic algorithm).
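Here is an illustrative sketch of these checks for 64-bit unsigned integers (not the actual libdar code):

// illustrative sketch of the overflow checks described above,
// for 64-bit unsigned integers
#include <cstdint>
#include <stdexcept>

// position of the highest non-null bit ("maximum power")
static unsigned max_power(uint64_t x)
{
    unsigned p = 0;
    while(x >>= 1)
        ++p;
    return p;
}

uint64_t checked_add(uint64_t a, uint64_t b)
{
    uint64_t r = a + b;
    if(r < a)                       // overflow: result smaller than an operand
        throw std::overflow_error("addition overflow");
    return r;
}

uint64_t checked_sub(uint64_t a, uint64_t b)
{
    if(b > a)                       // unsigned result would be negative
        throw std::overflow_error("subtraction overflow");
    return a - b;
}

uint64_t checked_mult(uint64_t a, uint64_t b)
{
    // heuristic: the product's maximum power is at most m+n+1, so we
    // require m+n+1 <= 63; this may reject some products that would
    // actually fit (false positives near the upper limit)
    if(a != 0 && b != 0 && max_power(a) + max_power(b) + 1 > 63)
        throw std::overflow_error("multiplication overflow (heuristic)");
    return a * b;
}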

libdar and thread-safe requirement

The following should only concern those who plan to use libdar in their own programs.

If you expect to have only one thread using libdar, there is no problem; of course, you will however have to call one of the get_version() functions first, as usual. Things change if you intend to have several concurrent threads using the libdar library in the same process.

libdar is thread-safe but you need to respect certain conditions:

Thread-safe support must have been activated at libdar compilation time. Several 'configure' options have an impact on this thread-safe support:

--enable-test-memory
is a debug option that prevents libdar from being thread-safe, so don't use it unless debugging non-thread-related behaviors.
--disable-thread-safe
this disables thread-safe support in libdar; it may be used for debugging purposes or to marginally speed up libdar execution

If you want to rely on multi-thread support in libdar, a good practice is to check that a call to libdar::compile_time::thread_safe() returns true. If not, your program should behave accordingly: either abort with an error, or have all libdar interactions done by a single thread at a time.

IMPORTANT
It is vital to call one of the get_version(...) functions as the first call to libdar and to wait for its return before assuming multi-thread support is ready for use.
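A minimal sketch of this recommended startup sequence could look as follows (check the API reference manual for the exact prototypes of your libdar version):

// minimal sketch of the startup sequence described above
#include <dar/libdar.hpp>
#include <iostream>

int main()
{
    libdar::U_I major, medium, minor;

    // must be the very first libdar call, before any thread uses the library
    libdar::get_version(major, medium, minor);

    if(!libdar::compile_time::thread_safe())
    {
        std::cerr << "libdar was built without thread-safe support" << std::endl;
        // here: either abort, or make sure only one thread at a time uses libdar
    }

    std::cout << "libdar version " << major << "." << medium << "." << minor << std::endl;
    return 0;
}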

For more information about libdar and its API, check the doc/api_tutorial.html document and the API reference manual under doc/html/index.html

Native Language Support

Native Language Support (NLS) is the ability a given program has to display its messages in different languages. For dar, this is implemented using the gettext tools. This tool must be installed on the system for dar to be able to display messages in a language other than English (some would joke that this is more Frenglish than good old English; they might be right, so feel free to report any too-Frenchy syntax, spelling or grammar).

The dar behavior is the following:

If NLS is available you just have to set the LANG environment variable to your locale settings to change the language in which dar displays its messages (see ABOUT-NLS for more about the LANG variable).

Just for information, gettext() is the name of the call that translates strings in the program. This call is implemented in the library called 'libintl' (intl for Internationalization).

Refer to the ABOUT-NLS file at the root of the source package to learn more about the way to display dar's messages in your own language. Note that not all languages are yet supported; it is up to you to send me a translation into your language and/or contact a translating team as explained in ABOUT-NLS.

To know which languages are supported by dar, read the po/LINGUAS file and check out for the presence of the corresponding *.po files in this directory.

Dar Release Process

General view

Dar does not follow the so-called modern continuous development process, also known as agile. First, dar is not devops code, even though it can be used for devops; dar is rather a (free) product outside any particular customer-specific context. Second, keeping the good old development process that separates development from quality engineering in different phases over time brings the expected robustness a backup tool must have.

This separation takes the form of branches of a tree where the trunk is the development code, which grows when receiving new features, and the branches carry the so called releases. Several releases can be found on a given branch but branches only receive bug fixes between two releases.

Several branches may be active at a time, leading to several concurrent releases (for example, releases 2.5.22 and 2.6.5 have the same date).

Last, all bug fixes from an older and still active branch are merged into the more recent branches up to the trunk, which is the development code. But the modification of code in branches is kept as minimal as possible.

Phasing

Development Phase:

During that phase, dar receives new features. At this stage sources are modified and unit-tested after each feature addition. These modifications and additions take place on the trunk.

Frozen API Phase:

At this stage, no new feature that would change the API is added. The API shall be documented enough to let API users give their feedback about the design and its implementation.

During this time, development may continue with whatever is necessary as long as it does not change the API: documentation of the whole project, problem fixes in libdar, new features in the command-line part of the source, and so on. A set of non-regression tests is also updated and run to check that new features work as expected with old ones and that old ones still work fine together.

Pre-release Phase:

This phase is announced on the dar-news mailing-list, to which you can subscribe to also be informed about new releases, security issues, and other major problems.

The goal of this phase is to let anyone test the release candidate in their own environment and report any code building problem and any bug met to the pre-release mailing-list.

Usually this code is on the trunk (the master in GIT) but pre-built packages are also provided daily for testing.

Release Phase:

When the pre-release ends, the first official release is provided (its last version number is zero, like for release 2.7.0). It is made broadly available by means of the sourceforge mirrors.

A new branch (like branch_2.7.x) is created to hold the released code, and will receive any further releases by means of bug fixes (releases 2.7.1, 2.7.2, ...).

As for any new release, an email is sent with the list of changes to the dar-news mailing-list. During that phase, users are welcome to report bugs/problems, either by first asking for support on the dar-support mailing-list or by opening a bug report.

Interim Releases:

This type of release is fast to set up and will later become an official release. Doing so saves the time needed to update the website, generate the windows binaries, push data to the sourceforge mirrors, and so on. Interim releases are provided for validating bug fixes, and are set up from the latest code of a branch. They are usually named after the future release they will produce, suffixed with RC# (for Release Candidate). For example, we got version 2.6.6.RC2 in the past. Note that interim releases are deleted once the official release is done (release 2.6.6 in our example).

Dar's Versions

package release version

Dar packages are released during the pre-release phase (see above). Each version is identified by three numbers separated by dots, like for example version 2.3.0:

Note that release versioning is completely different from what is done for the Linux kernel: here, for dar, all versioned packages are stable released software, and thus stability increases with the last number of the version.

Libdar version

Unfortunately, the release version does not give much information about the compatibility of different libdar versions from the point of view of an external application, which has not been released with libdar and may be faced with different libdar versions. So libdar has its own version. It is also a three-number version (for example, libdar version 6.2.7), but each number has a different meaning:

Other versions

Beside the libdar library, you can find several command-line applications: dar, dar_xform, dar_slave, dar_manager, dar_cp and dar_split. These have their own versions which are, here too, made of three numbers. Their meaning is the same as for the package release version: the last number increases upon bug fixes, the middle one upon new features, the first one upon major architecture changes.

Archive format version

When new features come, it is sometimes necessary to change the structure of the archive. To be able to know the format used in the archive, a field is present in each archive that defines this format.

Currently the archive format version is 11, meaning there are around 11 different versions of the dar archive format, plus the version increments that were needed to fix bugs (like versions 8.1 and 10.1), but dar and libdar have been developed to be able to read them all (all those that existed when the given version was released).

Cross reference matrix

OK, you may now find that this is a bit complex, so a list of versions is given below. Just remember that there are two points of view: the command-line user and the external application developer.

Date release (and dar version) Archive format Database Format libdar version dar_xform dar_slave dar_manager dar_cp dar_split
April 2nd, 2002 1.0.0 01 ----- ----- ----- ----- ----- ----- -----
April 24th, 2002 1.0.1 01 ----- ----- ----- ----- ----- ----- -----
May 8th, 2002 1.0.2 01 ----- ----- ----- ----- ----- ----- -----
May 27th, 2002 1.0.3 01 ----- ----- ----- ----- ----- ----- -----
June 26th, 2002 1.1.0 02 ----- ----- 1.0.0 1.0.0 ----- ----- -----
Nov. 4th, 2002 1.2.0 03 01 ----- 1.1.0 1.1.0 1.0.0 ----- -----
Jan. 10th, 2003 1.2.1 03 01 ----- 1.1.0 1.1.0 1.0.0 ----- -----
May 19th, 2003 1.3.0 03 01 ----- 1.1.0 1.1.0 1.1.0 ----- -----
Nov. 2nd, 2003 2.0.0 03 01 1.0.0 1.1.0 1.1.0 1.2.0 1.0.0 -----
Nov. 21th, 2003 2.0.1 03 01 1.0.1 1.1.0 1.1.0 1.2.0 1.0.0 -----
Dec. 7th, 2003 2.0.2 03 01 1.0.2 1.1.0 1.1.0 1.2.0 1.0.0 -----
Dec. 14th, 2003 2.0.3 03 01 1.0.2 1.1.0 1.1.0 1.2.1 1.0.0 -----
Jan. 3rd, 2004 2.0.4 03 01 1.0.2 1.1.0 1.1.0 1.2.1 1.0.0 -----
Feb. 8th, 2004 2.1.0 03 01 2.0.0 1.2.0 1.2.0 1.2.1 1.0.0 -----
March 5th, 2004 2.1.1 03 01 2.0.1 1.2.1 1.2.1 1.2.2 1.0.0 -----
March 12th, 2004 2.1.2 03 01 2.0.2 1.2.1 1.2.1 1.2.2 1.0.0 -----
May 6th, 2004 2.1.3 03 01 2.0.3 1.2.1 1.2.1 1.2.2 1.0.1 -----
July 13th, 2004 2.1.4 03 01 2.0.4 1.2.1 1.2.1 1.2.2 1.0.1 -----
Sept. 12th, 2004 2.1.5 03 01 2.0.5 1.2.1 1.2.1 1.2.2 1.0.1 -----
Jan. 29th, 2005 2.1.6 03 01 2.0.5 1.2.1 1.2.1 1.2.2 1.0.1 -----
Jan. 30th, 2005 2.2.0 04 01 3.0.0 1.3.0 1.3.0 1.3.0 1.0.1 -----
Feb. 20th, 2005 2.2.1 04 01 3.0.1 1.3.1 1.3.1 1.3.1 1.0.1 -----
May 12th, 2005 2.2.2 04 01 3.0.2 1.3.1 1.3.1 1.3.1 1.0.2 -----
Sept. 13th, 2005 2.2.3 04 01 3.1.0 1.3.1 1.3.1 1.3.1 1.0.2 -----
Nov. 5th, 2005 2.2.4 04 01 3.1.1 1.3.1 1.3.1 1.3.1 1.0.2 -----
Dec. 6th, 2005 2.2.5 04 01 3.1.2 1.3.1 1.3.1 1.3.1 1.0.2 -----
Jan. 19th, 2006 2.2.6 04 01 3.1.3 1.3.1 1.3.1 1.3.1 1.0.3 -----
Feb. 24th, 2006 2.2.7 04 01 3.1.4 1.3.1 1.3.1 1.3.1 1.0.3 -----
Feb. 24th, 2006 2.3.0 05 01 4.0.0 1.4.0 1.3.2 1.4.0 1.1.0 -----
June 26th, 2006 2.3.1 05 01 4.0.1 1.4.0 1.3.2 1.4.0 1.1.0 -----
Oct. 30th, 2006 2.3.2 05 01 4.0.2 1.4.0 1.3.2 1.4.0 1.1.0 -----
Feb. 24th, 2007 2.3.3 05 01 4.1.0 1.4.0 1.3.2 1.4.1 1.2.0 -----
June 30th, 2007 2.3.4 06 01 4.3.0 1.4.0 1.3.2 1.4.1 1.2.0 -----
Aug. 28th, 2007 2.3.5 06 01 4.4.0 1.4.1 1.3.3 1.4.2 1.2.1 -----
Sept. 29th, 2007 2.3.6 06 01 4.4.1 1.4.1 1.3.3 1.4.2 1.2.1 -----
Feb. 10th, 2008 2.3.7 06 01 4.4.2 1.4.2 1.3.4 1.4.3 1.2.2 -----
June 20th, 2008 2.3.8 07 01 4.4.3 1.4.2 1.3.4 1.4.3 1.2.2 -----
May 22nd, 2009 2.3.9 07 01 4.4.4 1.4.2 1.3.4 1.4.3 1.2.2 -----
April 9th, 2010 2.3.10 07 01 4.4.5 1.4.2 1.3.4 1.4.3 1.2.2 -----
March 13th, 2011 2.3.11 07 01 4.5.0 1.4.3 1.3.4 1.4.3 1.2.2 -----
February 25th, 2012 2.3.12 07 01 4.5.1 1.4.3 1.3.4 1.4.3 1.2.2 -----
June 2nd, 2011 2.4.0 08 02 5.0.0 1.5.0 1.4.0 1.5.0 1.2.3 -----
July 21st, 2011 2.4.1 08 02 5.1.0 1.5.0 1.4.0 1.6.0 1.2.3 -----
Sept. 5th, 2011 2.4.2 08 02 5.1.1 1.5.0 1.4.0 1.6.0 1.2.3 -----
February 25th, 2012 2.4.3 08 03 5.2.0 1.5.0 1.4.0 1.7.0 1.2.3 -----
March 17th, 2012 2.4.4 08 03 5.2.1 1.5.0 1.4.0 1.7.1 1.2.3 -----
April 15th, 2012 2.4.5 08 03 5.2.2 1.5.1 1.4.1 1.7.2 1.2.4 -----
June 24th, 2012 2.4.6 08 03 5.2.3 1.5.2 1.4.2 1.7.3 1.2.5 -----
July 5th, 2012 2.4.7 08 03 5.2.4 1.5.2 1.4.3 1.7.3 1.2.5 -----
September 9th, 2012 2.4.8 08 03 5.3.0 1.5.3 1.4.4 1.7.4 1.2.6 -----
January 6th, 2013 2.4.9 08 03 5.3.1 1.5.3 1.4.4 1.7.4 1.2.7 -----
March 9th, 2013 2.4.10 08 03 5.3.2 1.5.3 1.4.4 1.7.4 1.2.7 -----
Aug. 26th, 2013 2.4.11 08 03 5.4.0 1.5.4 1.4.5 1.7.5 1.2.8 -----
January 19th, 2014 2.4.12 08 03 5.5.0 1.5.4 1.4.5 1.7.6 1.2.8 -----
April 21st, 2014 2.4.13 08 03 5.6.0 1.5.5 1.4.5 1.7.7 1.2.8 -----
June 15th, 2014 2.4.14 08 03 5.6.1 1.5.5 1.4.5 1.7.7 1.2.8 -----
September 6th, 2014 2.4.15 08 03 5.6.2 1.5.6 1.4.6 1.7.8 1.2.8 -----
January 18th, 2015 2.4.16 08 03 5.6.3 1.5.6 1.4.6 1.7.8 1.2.8 -----
January 31th, 2015 2.4.17 08 03 5.6.4 1.5.6 1.4.6 1.7.8 1.2.8 -----
August 30th, 2015 2.4.18 08.1 03 5.6.5 1.5.6 1.4.6 1.7.8 1.2.8 -----
October 4th, 2015 2.4.19 08.1 03 5.6.6 1.5.6 1.4.6 1.7.8 1.2.8 -----
November 21th, 2015 2.4.20 08.1 03 5.6.7 1.5.8 1.4.8 1.7.10 1.2.10 -----
April 24th, 2016 2.4.21 08.1 03 5.6.8 1.5.9 1.4.9 1.7.11 1.2.10 -----
June 5th, 2016 2.4.22 08.1 03 5.6.9 1.5.9 1.4.9 1.7.11 1.2.10 -----
October 29th, 2016 2.4.23 08.1 03 5.6.9 1.5.9 1.4.9 1.7.11 1.2.10 -----
January 21st, 2017 2.4.24 08.1 03 5.6.10 1.5.9 1.4.9 1.7.11 1.2.10 -----
October 4th, 2015 2.5.0 09 04 5.7.0 1.5.7 1.4.7 1.7.9 1.2.9 1.0.0
October 17th, 2015 2.5.1 09 04 5.7.1 1.5.8 1.4.8 1.7.10 1.2.10 1.0.0
November 21st, 2015 2.5.2 09 04 5.7.2 1.5.8 1.4.8 1.7.10 1.2.10 1.0.0
January 4th, 2016 2.5.3 09 04 5.7.3 1.5.8 1.4.8 1.7.10 1.2.10 1.0.0
April 24th, 2016 2.5.4 09 04 5.8.0 1.5.9 1.4.9 1.7.11 1.2.10 1.0.0
June 5th, 2016 2.5.5 09 04 5.8.1 1.5.9 1.4.9 1.7.11 1.2.10 1.0.0
September 10th, 2016 2.5.6 09 04 5.8.2 1.5.9 1.4.9 1.7.11 1.2.10 1.0.0
October 29th, 2016 2.5.7 09 04 5.8.3 1.5.9 1.4.9 1.7.11 1.2.10 1.0.0
January 2nd, 2017 2.5.8 09 04 5.8.4 1.5.9 1.4.9 1.7.11 1.2.10 1.0.0
January 21st, 2017 2.5.9 09 04 5.9.0 1.5.9 1.4.9 1.7.11 1.2.10 1.0.0
April 4th, 2017 2.5.10 09 04 5.10.0 1.5.9 1.4.9 1.7.11 1.2.10 1.0.0
June 23rd, 2017 2.5.11 09 04 5.11.0 1.5.9 1.4.9 1.7.12 1.2.10 1.0.0
September 2nd,2017 2.5.12 09 04 5.11.1 1.5.9 1.4.9 1.7.12 1.2.10 1.0.0
October 28th, 2017 2.5.13 09 04 5.12.0 1.5.10 1.4.10 1.7.13 1.2.10 1.0.0
December 20th, 2017 2.5.14 09 04 5.12.1 1.5.10 1.4.10 1.7.13 1.2.10 1.1.1
April 28th, 2018 2.5.15 09 04 5.12.2 1.5.10 1.4.10 1.7.13 1.2.10 1.1.1
July 19th, 2018 2.5.16 09 04 5.12.3 1.5.10 1.4.10 1.7.13 1.2.10 1.1.1
September 30th, 2018 2.5.17 09 04 5.13.0 1.5.10 1.4.10 1.7.13 1.2.10 1.1.1
December 8th, 2018 2.5.18 09 04 5.13.1 1.5.10 1.4.10 1.7.13 1.2.10 1.1.1
January 19th, 2019 2.5.19 09 04 5.13.2 1.5.11 1.4.11 1.7.14 1.2.10 1.1.1
February 9th, 2019 2.5.20 09 04 5.13.3 1.5.11 1.4.11 1.7.14 1.2.10 1.1.1
May 25th, 2019 2.5.21 09 04 5.13.4 1.5.11 1.4.11 1.7.14 1.2.10 1.1.1
July 6th, 2019 2.5.22 09 04 5.13.5 1.5.11 1.4.11 1.7.14 1.2.10 1.1.1
December 16th, 2018 2.6.0 10 05 6.0.0 1.6.0 1.5.0 1.8.0 1.2.11 1.1.2
January 19th, 2019 2.6.1 10 05 6.0.1 1.6.0 1.5.0 1.8.0 1.2.11 1.1.2
February 9th, 2019 2.6.2 10 05 6.0.2 1.6.1 1.5.1 1.8.1 1.2.12 1.1.2
March 30th, 2019 2.6.3 10.1 05 6.1.0 1.6.1 1.5.1 1.8.1 1.2.12 1.1.2
May 25th, 2019 2.6.4 10.1 05 6.1.1 1.6.1 1.5.1 1.8.1 1.2.12 1.1.2
July 6th, 2019 2.6.5 10.1 05 6.1.2 1.6.1 1.5.1 1.8.1 1.2.12 1.1.2
September 21st, 2019 2.6.6 10.1 05 6.2.0 1.6.2 1.5.2 1.8.2 1.2.12 1.1.2
January 12th, 2020 2.6.7 10.1 05 6.2.1 1.6.2 1.5.2 1.8.2 1.2.12 1.1.2
February 8th, 2020 2.6.8 10.1 05 6.2.2 1.6.2 1.5.2 1.8.2 1.2.12 1.1.2
March 22nd, 2020 2.6.9 10.1 05 6.2.3 1.6.2 1.5.2 1.8.2 1.2.12 1.1.2
May 31st, 2020 2.6.10 10.1 05 6.2.4 1.6.3 1.5.3 1.8.3 1.2.13 1.1.3
September 5th, 2020 2.6.11 10.1 05 6.2.5 1.6.3 1.5.3 1.8.3 1.2.13 1.1.3
September 11th, 2020 2.6.12 10.1 05 6.2.6 1.6.3 1.5.3 1.8.3 1.2.13 1.1.3
November 8th, 2020 2.6.13 10.1 05 6.2.7 1.6.3 1.5.3 1.8.3 1.2.13 1.1.3
March 14th, 2021 2.6.14 10.1 05 6.2.8 1.6.3 1.5.3 1.8.3 1.2.13 1.1.3
May 13th, 2021 2.6.15 10.1 05 6.2.9 1.6.3 1.5.3 1.8.3 1.2.13 1.1.3
January 2nd, 2022 2.6.16 10.1 05 6.2.10 1.6.3 1.5.3 1.8.3 1.2.13 1.1.3
April 24th, 2021 2.7.0 11 06 6.3.0 1.7.0 1.6.0 1.9.0 1.2.13 1.2.0
May 13th, 2021 2.7.1 11.1 06 6.4.0 1.7.0 1.6.0 1.9.0 1.2.13 1.2.0
September 25th, 2021 2.7.2 11.1 06 6.4.1 1.7.0 1.6.0 1.9.0 1.2.13 1.2.0
January 2nd, 2022 2.7.3 11.1 06 6.4.2 1.7.0 1.6.0 1.9.0 1.2.13 1.2.1
March 13th, 2022 2.7.4 11.1 06 6.4.3 1.7.0 1.6.0 1.9.0 1.2.13 1.2.1
April 13th, 2022 2.7.5 11.1 06 6.4.4 1.7.0 1.6.0 1.9.0 1.2.13 1.2.1
June 19th, 2022 2.7.6 11.1 06 6.4.5 1.7.0 1.6.0 1.9.0 1.3.0 1.2.2
August 7th, 2022 2.7.7 11.1 06 6.4.6 1.7.0 1.6.0 1.9.0 1.3.0 1.2.2
November 29th, 2022 2.7.8 11.1 06 6.5.0 1.7.0 1.6.0 1.9.0 1.3.0 1.2.2
March 26th, 2023 2.7.9 11.2 06 6.5.1 1.7.0 1.6.0 1.9.0 1.3.0 1.2.2
June 25th, 2023 2.7.10 11.2 06 6.6.0 1.7.0 1.6.0 1.9.0 1.3.0 1.2.2
August 5th, 2023 2.7.11 11.2 06 6.6.1 1.7.0 1.6.0 1.9.0 1.3.0 1.2.2
September 3rd, 2023 2.7.12 11.3 06 6.7.0 1.7.0 1.6.0 1.9.0 1.3.0 1.2.2
October 1st, 2023 2.7.13 11.3 06 6.7.1 1.7.0 1.6.0 1.9.0 1.3.0 1.2.2
March 23rd, 2024 2.7.14 11.3 06 6.7.2 1.7.0 1.6.0 1.9.0 1.3.0 1.2.2
June 6th, 2024 2.7.15 11.3 06 6.7.3 1.7.0 1.6.0 1.9.0 1.3.0 1.2.2

Scrambling (weak encryption)

Before strong encryption was implemented, dar had only a very simple and weak encryption mechanism. It remains available in current releases under the "scram" algorithm name. Its main advantage is that it does not rely on any external library: it is completely part of libdar.

How does it work?

Consider the pass phrase as a string, thus a sequence of bytes, thus a sequence of integers, each one between 0 and 255 (inclusive). The data to "scramble" is also a sequence of bytes, usually much longer than the pass phrase. The principle is to add the pass phrase to the data byte by byte, modulo 256, repeating the pass phrase all along the archive. Let's take an example:

The pass phrase is "he\220lo" (where \220 is the character whose decimal value is 220). The data is "example".

Taken from ASCII standard we have:

              e    x    a    m    p    l    e
            101  120   97  109  112  108  101

     +        h    e \220    l    o    h    e
            104  101  220  108  111  104  101
     ---------------------------------------------
            205  221  317  217  223  212  202
     ---------------------------------------------
     modulo
     256 :  205  221   61  217  223  212  202
           \205 \221    = \217 \223 \212 \202

Thus the data "example" will be written in the archive as \205\221=\217\223\212\202

This method allows decoding any portion without knowing the rest of the data. It does not consume many resources to compute, but it is terribly weak and easy to crack. Of course, the data is more difficult to retrieve without the key when the key is long. Today dar can also use strong encryption (blowfish and a few others) and, thanks to per-block encryption, can still avoid reading the whole archive to restore any single file.
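For illustration, the "scram" principle can be sketched as follows (this is not the libdar implementation):

// illustration of the "scram" principle: add the pass phrase byte by
// byte to the data, modulo 256; subtracting it back recovers the data
#include <string>
#include <iostream>

std::string scramble(const std::string & data, const std::string & pass)
{
    std::string out = data;
    for(std::string::size_type i = 0; i < out.size(); ++i)
        out[i] = static_cast<char>(
            (static_cast<unsigned char>(data[i])
             + static_cast<unsigned char>(pass[i % pass.size()])) % 256);
    return out;
}

std::string unscramble(const std::string & data, const std::string & pass)
{
    std::string out = data;
    for(std::string::size_type i = 0; i < out.size(); ++i)
        out[i] = static_cast<char>(
            (256 + static_cast<unsigned char>(data[i])
             - static_cast<unsigned char>(pass[i % pass.size()])) % 256);
    return out;
}

int main()
{
    // "he\220lo": build the pass phrase with the character of decimal value 220
    std::string pass = std::string("he") + static_cast<char>(220) + "lo";
    std::string clear = "example";

    std::string scrambled = scramble(clear, pass);
    std::cout << (unscramble(scrambled, pass) == clear ? "ok" : "ko") << std::endl;
    return 0;
}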

Strong encryption internals

Encryption per block

To not completely break the possibility to directly access a file within a dar backup, the archive is not encrypted as a whole (as an external program like openssl would probably do). The encryption is done per large block of data. This way, each block can be decrypted independently from the others, and if you want to read some data somewhere you only need to decrypt the whole block(s) it is located in.

Such an encryption block size can range from 10 bytes to 4 GB (10 kiB is the default value). We will designate these as libdar cipher blocks, to differentiate them from what the underlying cipher algorithm uses as a block.
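As an illustration of why such blocks allow direct access, here is a trivial sketch of locating the libdar cipher block that holds a given offset of clear data (assuming, for simplicity, that clear and ciphered blocks have the same size):

// illustration only: only the block containing the wanted offset
// needs to be deciphered, not the whole archive
#include <cstdint>
#include <iostream>

int main()
{
    const uint64_t block_size = 10240;   // default libdar cipher block size

    uint64_t wanted_offset = 123456;     // offset of the clear data we want to read

    uint64_t block_number = wanted_offset / block_size;  // also used to derive the block's IV
    uint64_t offset_in_block = wanted_offset % block_size;

    std::cout << "decipher block " << block_number
              << " and read at offset " << offset_in_block
              << " inside it" << std::endl;
    return 0;
}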

CBC and IV

Inside each libdar cipher block, dar relies on the CBC mode (Cipher Block Chaining). In this mode, the data of a libdar cipher block is split into small plaintext blocks (a few tens of bytes, as defined by the cipher algorithm). These plaintext blocks are not ciphered independently: the ciphering result of one block is mixed with the next plaintext block before that one is ciphered. The advantage of doing so is that a repeated data pattern does not lead to a repeated encryption pattern. However, the first plaintext block of a libdar cipher block is not mixed with anything. To cope with that, an Initial Vector (IV) can be provided to 'randomize' the encryption result, though one has to record what the IV was at encryption time to be able to decipher the data.

The IV for each libdar cipher block is derived from its number inside the dar backup/archive and from the encryption key. The result is that even if you have the same data in two (or more) libdar cipher blocks, it will not result in the same encrypted data in each of them.

Elastic buffer

An "elastic buffer" is introduced at the beginning and at the end of the archive, to protect against plain text attack. The elastic buffer size randomly varies and is defined at execution time. It is composed of randomly generated data (using gcry_create_nonce()).

But to be able, at reading time, to determine the amount of random garbage that has been put at the beginning and at the end of the archive by means of such elastic buffers, a small structure is randomly dropped inside the garbage that tells the overall size of this random data:

Two mark characters '>' and '<' delimit a size field, which indicates the byte size of the elastic buffer. The size field is randomly placed in the elastic buffer. Last, the buffer is encrypted with the rest of the data. Typical elastic buffer sizes range from 1 byte to 10 kB, for both the initial and the terminal elastic buffer.

Elastic buffers may also be used inside libdar encryption blocks, when the libdar cipher block size is not a multiple of the ciphering algorithm's plaintext block size. Such an elastic buffer is added as padding at the end of the libdar cipher block.

Key Derivation Function - KDF

Just above, we saw:

First, what if you always use the same password/passphrase for different archives? Breaking one archive will let the attacker access all archives or backups. Worse, the attacker has much more data to work on and can apply statistical decryption methods to reach their goal.

Second, directly using a human-provided password would provide weak encryption, because among all the possible key values, human choices lead to a very restricted set: only letters, digits, spaces and possibly punctuation characters, with a certain order at word and phrase level; even 5ubst1tu1n9 (substituting) letters with digits does not bring much more randomness... Consider that of the 256 different values a byte can take, we barely use 50 of them. Consider that among the many letter combinations available, we are far from using all of them (how many English words do you know that contain 'xz'? ...or even more unreadable, longer sequences?)

To get one step further in securing your data, libdar makes use of a Key Derivation Function (KDF):

Strong encryption uses a secret key to cipher/uncipher data; we will say "password" to simplify in the following.

To overcome this human weakness, the state of the art is to use a Key Derivation Function, which takes as input the human-provided password plus a "salt" and outputs a stronger key (the salt is described below). This function is based on a hash algorithm (sha1 by default, but this can be modified since archive format 10/release 2.6.0) and an iteration count (2000 by default before format 10, 200000 iterations by default since release 2.6.0). The KDF mixes salt and passphrase, hashes the result, and applies the hashing algorithm repeatedly as many times as the selected iteration count; the final result is the hashed password.

Note that as an alternative to PBKDF2 (from PKCS#5 v2), which is the name of the KDF algorithm we have just described, release 2.7.0 brings the modern argon2 KDF, which is the default when libdar has been built against the libargon2 library. Argon2 also makes use of a salt and an iteration count, so what was described above stays valid at a global level.

So the KDF transforms the well-known human namespace of secret keys into a new key namespace, not wider, but not that well known. Of course, anyone with knowledge of the salt, iteration count and algorithm is able to compute it, but it costs a lot of effort in computing resources! And this is the objective: there is no need to hide or keep secret the KDF parameters, when an attacker that would require hours or days to crack a password based on a dictionary attack now needs years or centuries to do so.
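For illustration, the PBKDF2 step described above could be sketched with libgcrypt's gcry_kdf_derive(); the parameters shown (sha1, 200000 iterations, 32-byte output) are examples, not necessarily what libdar selects for a given cipher and archive format:

// rough sketch of the PBKDF2 step, using libgcrypt
#include <gcrypt.h>
#include <cstring>

bool hash_password(const char *password,
                   const unsigned char *salt, size_t salt_len,
                   unsigned char *key, size_t key_len)
{
    gcry_error_t err = gcry_kdf_derive(password, std::strlen(password),
                                       GCRY_KDF_PBKDF2, GCRY_MD_SHA1,
                                       salt, salt_len,
                                       200000,        // iteration count
                                       key_len, key); // derived ("hashed") key
    return err == 0;
}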

Choosing the salt

The salt is a value randomly chosen by libdar and stored in clear in the archive header, beside the iteration count and hash algorithm used for the KDF. Thus, even if the user has the same password for different archives, the effective key used for strong encryption will differ from archive to archive or from backup to backup, making it much more difficult for an attacker to crack an archive using statistical methods over a large number of archives sharing the same human-provided password. Second point, a dictionary attack is much more costly to succeed: you need either a hundred thousand times more time or a hundred thousand times more CPU power.

In summary

In summary, the salt randomizes keys between archives, elastic buffers randomize data location inside the archive, the IV randomizes encryption between libdar encryption blocks, and CBC mode randomizes encryption within a libdar encryption block.

Here follows a diagram of the way key, block number, cipher algorithm and Initial Vector (IV) interact together:

              +--------------------- [ algo in CBC mode ] ---------------> main key handle
 algorithm ---+                              ^                                     |
     |                                       |                                     |
     +---> max key len                       |                                     |
     |                                       |                                     |
     v                                       |                                     |
 password --> [ KDF ] ------> hashed_key ----+--------------+                      |
                 ^                                          |                      |
 salt -----------+                                          v                      |
                 |                                   [ SHA1/SHA256 ]               |
 hash algo ------+                                          |                      |
                 |                                          v                      |
 iteration ------+                                   essiv_password                |
 count                                                      |                      |
                                                            v                      |
                                         [ Blowfish/AES256 in ECB mode ]           |
                                                            |                      |
 . . . . . . . . . . .                                      v                      |
 .  Initialization  .                               essiv key handle               |
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .|. . . . . . . . . . . |. . . .
                                                            v                      |
 libdar cipher block's number ---------------------->  [ encrypt ] ----> IV ---+   |
                                                                                |   |
                                                                                v   v
 libdar cipher block's data ------------------------------------------> [ encrypt/decrypt ] ---> data
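As a rough sketch of the ESSIV-style IV derivation suggested by this diagram, using libgcrypt (algorithm choices, buffer handling and byte order are assumptions, not the exact libdar code):

// rough sketch: derive the IV of a libdar cipher block from its number
// and from the hashed key, as outlined in the diagram above
#include <gcrypt.h>
#include <cstring>
#include <cstdint>

void derive_iv(const void *hashed_key, size_t key_len,
               uint64_t block_number, unsigned char iv[16])
{
    unsigned char essiv_key[32];                 // SHA256 digest size
    gcry_md_hash_buffer(GCRY_MD_SHA256, essiv_key, hashed_key, key_len);

    gcry_cipher_hd_t hd;
    gcry_cipher_open(&hd, GCRY_CIPHER_AES256, GCRY_CIPHER_MODE_ECB, 0);
    gcry_cipher_setkey(hd, essiv_key, sizeof(essiv_key));

    unsigned char block[16];                     // AES block holding the block number
    std::memset(block, 0, sizeof(block));
    std::memcpy(block, &block_number, sizeof(block_number)); // byte order simplified

    // the IV is the encryption of the libdar cipher block number with the
    // essiv key: same key and block number give the same IV, but the IV
    // cannot be predicted without the key (error handling omitted)
    gcry_cipher_encrypt(hd, iv, 16, block, 16);
    gcry_cipher_close(hd);
}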

Public key based encryption internals

Dar does not encrypt the whole archive with a recipient's public key, but rather randomly chooses a (long) password for symmetric encryption (as seen above, except that it needs no KDF and salt, as it can select a much better random key than a human can do), encrypts that password with the recipient's public keys (eventually signing it with your own private key) and drops a copy of this ciphered/signed data into the archive. At reading time, dar reads the archive header to find the encrypted password, decrypts it using the user's private key, then uses that password, now in clear, to decrypt the rest of the archive with the ad hoc symmetric algorithm.

Signing is done on the randomly chosen ciphering key as seen above, but also on the catalogue (the table of content located at the end of the archive). More precisely the catalogue is sha512-hashed and the resulting hash is signed. The catalogue hash and its signature are stored encrypted in the archive right after the catalogue.

Why not using only asymmetric encryption end to end?

First, for a given fixed amount of data, the resulting ciphered data size may vary. Thus sticking ciphered blocks together in the archive would not make it easy to know where a block starts and where its end is located. Second, doing it that way allows an archive to have several different recipients: the password is ciphered for each of them and the archive is readable by any specified recipient, while they do not share any key. Doing so has very little impact on archive size.

Why not using signing from end to end?

First, this can be done without dar, just by using gpg on the slices of an archive, so there is no advantage in having such a feature inside dar. The other reason is time: signing the whole archive would be long, and it would also be very painful to validate a given archive's authenticity (time to read a whole archive). Here we can leverage the fact that the archive is encrypted (thus tampering with the archive would be seen when deciphering it) and that dar uses several mechanisms (compression, clear-text CRC) to detect data corruption.

Security restriction when signing an archive for multiple recipients

For a multi-recipient archive, any recipient can decode the signed and encrypted key. One could even reuse this same encryption key, which may also be signed by the original sender, to build a new archive with totally different content. Since 2.7.0 a warning is issued when this condition is met, but the action is still allowed as it may make sense when exchanging data between a set of people that trust each other but still need to validate the authenticity of the exchanged data (over the Internet for example).

But exploiting this security restriction should still be difficult, as the table of contents (the catalogue) is also signed within the archive. Touching any part of it would break this signature, thus the archive table of contents cannot be tampered with if the signature algorithm is strong. This catalogue contains the CRC of any file data, EA and FSA, as well as the size of the compressed data. Thus the attacker may modify a file's data, but must manage for the resulting modification to still satisfy the CRC and have the same length. Worse, if compression is used, the modification must not exceed the amount of compressed data and must remain compatible with the compression algorithm that was used for that file in the original archive.

In short, while this security restriction exists and is exploitable in theory, it is not so easy to carry out for a large portion of the archive.

Multi-threading in libdar

Ciphering

The ciphering in dar/libdar is performed by large blocks of data (see the -# option and friends); it was thus possible (added in 2.7.0) to parallelize the ciphering/deciphering operations, assigning different blocks to different threads without changing the archive format. The measured performance gain is quite good, but usually hidden by the compression load or disk I/O. More than two threads for ciphering/deciphering does not bring much benefit to the overall processing time.

To assign blocks to worker threads (those that run the ciphering algorithm), two helper threads are needed: one that copies data into blocks of memory and scatters them to the workers so they can work in total autonomy, and another one that gathers the resulting blocks of memory in the correct order.

While the performance gain is there and no change in the archive format was necessary to leverage multi-threading for ciphering/deciphering, this has some impact on the memory usage required by dar. Each thread needs a pair of blocks (one for the clear data, one for the ciphered equivalent). In addition, the scattering and gathering structures of blocks to/from the workers (which we will globally designate under the term "ratelier") can each hold N blocks of memory (N being the number of workers); in fact the size of these structures has been chosen a bit wider: N+N/2. Last, the helper thread that feeds data to the workers may hold one pair of blocks under construction, and the gathering thread may obtain from the ratelier_gather a whole structure (N+N/2 blocks) to work with. For N workers, the number of memory blocks is N+N/2 for each ratelier, plus a pair per worker, plus N+N/2 for the gathering thread and one pair for the scattering thread, which makes around 7*N blocks of additional memory requirement. Concretely, having 5 threads with encryption blocks of 10240 bytes (10 KiB) leads dar to allocate 350 KiB (7x5x10 KiB) of additional memory compared to the single-threaded implementation.
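A back-of-the-envelope estimate of that additional memory, following the ~7*N blocks formula above:

// estimate of the additional memory used by parallel ciphering
#include <cstdint>
#include <iostream>

int main()
{
    const uint64_t block_size = 10240;  // -# option: cipher block size in bytes
    const unsigned workers = 5;         // -G option: number of worker threads

    uint64_t extra_bytes = 7ULL * workers * block_size;   // ~7*N blocks
    std::cout << "about " << extra_bytes / 1024
              << " KiB of additional memory" << std::endl; // 350 KiB in this case
    return 0;
}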

Compression

For compression, things have been more complicated, as before release 2.7.0 compression was done per file in streaming mode. In that mode dar has to provide a sequential flow of bytes to the compression engine and at the same time retrieve the compressed/decompressed counterpart. The APIs provided by the different libraries (libz, libbzip2, liblzo2, liblzma5, libzstd, liblz4) do not provide a multi-threaded approach for this mode of compressing data; however, this mode gives the best compression ratio you can expect for a given file, compression algorithm and compression level.

Before 2.7.0, parallelization had been tried, with little success, by running compression and encryption in different threads. Usually one of the threads spent most of its time waiting for the other and the overall performance gain was small due to the added thread management and synchronization overhead. This scheme has been completely abandoned in favor of the one exposed above for encryption. To extend this method to compression, a new compression mode had to be added beside the legacy streaming one: the block compression mode, which is described below. Note that since release 2.7.0 both modes coexist, and you are still able to read old archives and generate new ones with the old single thread/streaming approach.

The implementation is quite similar to the encryption approach: a set of workers proceeds to the compression or decompression of one block at a time, and two helper threads are used, one feeding blocks on one side and the other gathering the results on the other side.

The consequence is a slightly degraded compression ratio when the block size is chosen small. With larger blocks there is no noticeable compression ratio difference, though this has an impact on the memory usage. That's a balance to find between what memory you have and want to use, how much time you have and want to spend on compression, and how much space you have and want to use to store the resulting archive. Choosing a block size of 102400 bytes (see the extended syntax of the -z option) with 4 threads (see -G option) will lead libdar to allocate 2800 KiB (~2.7 MiB) in addition to what libdar already uses for the single thread approach (same formula as described above for encryption). If you use both parallel encryption and parallel compression at the same time you have to sum the additional memory requirements of each feature.

Command-line

The number of threads for both encryption and compression is driven by the -G option. See the man page for details. Note that, as described above, multi-threaded compression/decompression is only possible with the new per block compression method, which means you have to specify a compression block size (see -z option).

Streaming versus block compression mode

Streaming compression

Since its first days with version 1.0.0, dar/libdar has provided streaming compression. It started with the gzip algorithm, was later extended to bzip2, then lzo, xz/lzma, and with release 2.7.0 two new algorithms have been added: zstd and lz4.

Streaming compression lets libdar provide uncompressed data bytes to the compression engine and, from time to time, get back some compressed data bytes in return. There is no limitation in size: a stream of uncompressed data enters the compression engine from which exits another stream of compressed data. You do not need very large memory buffers to feed and gather data to/from the engine, and you can obtain a very good compression ratio. Though the process cannot be parallelized as the stream of incoming data must be provided to this single engine in the expected order. The process is similar for decompression: you give back to the engine the compressed data. It does not matter if you feed the engine by smaller or larger groups of bytes than what the engine provided when compressing; as long as you keep the byte sequence order, the resulting uncompressed stream will be the original one you gave for compression.

Such streaming compression is natively provided by libz, libbz2, libxz/lzma and libzstd. So dar stores these compressed data as provided: an unstructured and arbitrarily long sequence of bytes corresponding to the (compressed) data of a given file. Yes, dar compresses file per file: a new file, a new stream.
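As an illustration of this streaming mode, here is a minimal sketch using libz's deflate interface (one of the libraries listed above). It is not code extracted from libdar and skips most error handling; it only shows that the engine is fed arbitrary chunks of input and hands back compressed bytes whenever it sees fit:

// minimal sketch of streaming compression with libz (illustrative only)
#include <zlib.h>
#include <cstdio>
#include <cstring>
#include <vector>

int main()
{
    const char *input = "stream of uncompressed data fed chunk by chunk to the engine ...";
    std::vector<unsigned char> output;

    z_stream strm;
    std::memset(&strm, 0, sizeof(strm));
    if(deflateInit(&strm, Z_DEFAULT_COMPRESSION) != Z_OK)
        return 1;

    // feed the engine with small input chunks, as dar does while reading a file,
    // and gather whatever compressed bytes come out at each step
    size_t total = std::strlen(input), offset = 0;
    unsigned char buf[64];
    while(offset < total)
    {
        size_t chunk = (total - offset < 8) ? total - offset : 8;   // arbitrary small chunks
        strm.next_in = (Bytef *)(input + offset);
        strm.avail_in = (uInt)chunk;
        offset += chunk;
        int flush = (offset == total) ? Z_FINISH : Z_NO_FLUSH;
        do
        {
            strm.next_out = buf;
            strm.avail_out = sizeof(buf);
            deflate(&strm, flush);                                  // may or may not produce bytes
            output.insert(output.end(), buf, buf + (sizeof(buf) - strm.avail_out));
        }
        while(strm.avail_out == 0);
    }
    deflateEnd(&strm);

    std::printf("%zu bytes in, %zu compressed bytes out\n", total, output.size());
    return 0;
}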

Block compression

For the lzo algorithm (added with the 2.4.0 release) the liblzo2 library does not provide such a streaming interface, only a per buffer compression is available. That is to say, you have to provide a block of uncompressed data from which you get a variable amount of compressed bytes. To decompress the data you have to provide back this exact block to the decompression engine. If you stuck the blocks together and did not provide them back at their original boundaries, the decompression would fail.
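The following sketch shows this per buffer mode with the liblzo2 calls this chapter refers to. It is not taken from libdar's code and skips error checking; note that the decompression call must receive the exact compressed block produced earlier:

// minimal sketch of liblzo2 per buffer compression (illustrative only)
#include <lzo/lzo1x.h>
#include <cstdio>
#include <vector>

int main()
{
    if(lzo_init() != LZO_E_OK)
        return 1;

    std::vector<unsigned char> clear(240 * 1024, 'x');                           // one 240 KiB block of clear data
    std::vector<unsigned char> compr(clear.size() + clear.size() / 16 + 64 + 3); // worst case output size
    std::vector<unsigned char> wrkmem(LZO1X_1_MEM_COMPRESS);                     // compressor working memory

    lzo_uint compr_len = compr.size();
    lzo1x_1_compress(clear.data(), clear.size(), compr.data(), &compr_len, wrkmem.data());

    // decompression only works when given back the exact compressed block:
    // this is why dar records the block boundaries inside the archive
    std::vector<unsigned char> restored(clear.size());
    lzo_uint restored_len = restored.size();
    lzo1x_decompress_safe(compr.data(), compr_len, restored.data(), &restored_len, NULL);

    std::printf("240 KiB block -> %lu compressed bytes -> %lu bytes restored\n",
                (unsigned long)compr_len, (unsigned long)restored_len);
    return 0;
}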

For that reason dar uses a structure inside the compressed file's data to record where each block starts and ends; this way it is possible to feed the decompression engine with coherent compressed data and get back the original clear data.

The structure used is made of an arbitrarily long sequence of blocks. Each block is constituted of three fields:

+--------+-----------+---------------------+--------+------------+-----------------+-----+--------+----------------------+
| Type=D |block size | compressed data ... | Type=D | block size | compressed data | ... | type=E | block size = 0 (EOF) |
+--------+-----------+---------------------+--------+------------+-----------------+-----+--------+----------------------+
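The sketch below shows how such a sequence could be read back, handing each compressed block at its exact boundaries to the decompression engine. It is illustrative only: the actual width and encoding of the type and block size fields in dar's archive format are not the plain one-byte character and 32 bit host-order integer assumed here:

// sketch of reading a sequence of compressed blocks (illustrative only)
#include <cstdint>
#include <cstdio>
#include <vector>

static bool read_exact(std::FILE *f, void *buf, size_t len)
{
    return std::fread(buf, 1, len, f) == len;
}

// reads blocks one at a time until the terminal type 'E' / size 0 block
static bool read_compressed_blocks(std::FILE *f)
{
    for(;;)
    {
        char type;
        std::uint32_t size;                      // simplified encoding of the block size field
        if(!read_exact(f, &type, 1) || !read_exact(f, &size, sizeof(size)))
            return false;                        // truncated data
        if(type == 'E' && size == 0)
            return true;                         // end of this file's compressed data
        if(type != 'D')
            return false;                        // unknown block type
        std::vector<char> compressed(size);
        if(size > 0 && !read_exact(f, compressed.data(), size))
            return false;
        // here the compressed block would be handed, at its exact
        // boundaries, to the decompression engine (lzo, lz4, ...)
    }
}

int main(int argc, char *argv[])
{
    if(argc != 2)
        return 1;
    std::FILE *f = std::fopen(argv[1], "rb");
    if(f == NULL)
        return 1;
    bool ok = read_compressed_blocks(f);
    std::fclose(f);
    return ok ? 0 : 1;
}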

One thing you cannot guess from a given compressed block is how much uncompressed data it will generate, yet you have to prepare and allocate in memory a buffer large enough to receive it. It has been decided to slice the file's data into uncompressed blocks not larger than the maximum liblzo2 can handle, which is 240 KiB. So at decompression time, whatever the size of the compressed block, we can provide a 240 KiB buffer to receive the decompressed data and repeat the process as many times as necessary to process the whole compressed data of a given file.

With release 2.7.0, when adding the lz4 compression algorithm, it was found that the "streaming API" of the lz4 library was not really a streaming API, as it requires providing back the exact compressed block and not a stream of bytes for the decompression to be possible. For that reason, lz4 streaming is done the exact same way as what has been implemented for lzo streaming: this is block compression with a 240 KiB block size.

The new block compression feature available starting with release 2.7.0 is thus just an extension of this method, giving the user the ability to select the maximum size of the uncompressed data block that will be compressed at a time. The -z option has been extended to allow one to specify that block size. If set to zero we stick to the streaming mode, else we use block compression.

But for a given algorithm (except lzo and lz4) streaming mode and block mode are not compatible: the first has no structure (no blocks, and thus cannot be processed in parallel), the second adds some non-compressed bytes beside the compressed data (the block headers). For that reason the new archive format records the block size used at creation time in the archive header and trailer, which first lets libdar allocate a buffer large enough to receive the decompressed data and second use the correct mode to retrieve the original data.

Multi threading

As mentioned earlier, to provide multi-threading support for compression (added in release 2.7.0) the use of per block compression is necessary. In the dar archive, this mode of compression provides the same structure of compressed blocks as described above for lzo, but with the additional freedom to choose the block size (which stays fixed at 240 KiB for the lzo and lz4 pseudo-streaming modes) and any compression algorithm supported by libdar.

But having a per block structure in the archive is not enough to get multi-threaded processing: you also have to set up the code that reads in advance, feeds blocks in parallel to different threads, each having their own compression/decompression engine set up and ready for use, then gathers the resulting work and reorders it if necessary before writing it down to the filesystem. The method used here is very similar to what was used to add multi-threading support for encryption/decryption (see the previous chapter).

Last, while multi-threading is not possible for real streaming compression, it is available in dar/libdar for the lz4 and lzo algorithms, which are implemented as 240 KiB block compression. For the other algorithms, streaming compression uses a different API of the libraries libdar relies on than block compression does, and cannot be parallelized.