Cassandra 1.0でLeveled Compaction Strategy (LCS)がサポートされたが、DataStaxのドキュメントを読んでも今一つ動作が理解できない。
http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra

他にあまりドキュメントも整備されていないので、LCSの元になったGoogle LevelDBのほうを調べてみた。
http://leveldb.googlecode.com/svn/trunk/doc/impl.html

以下おおざっぱな翻訳

Files

The implementation of leveldb is similar in spirit to the representation of a single Bigtable tablet (section 5.3). However the organization of the files that make up the representation is somewhat different and is explained below.

Leveldbの実装は単一のBigtableテーブレット表現に概念的に似ています。
しかしながらそのファイル構成は下記に示すように若干異なります。

Each database is represented by a set of files stored in a directory. There are several different types of files as documented below:

それぞれのデータベースは特定のディレクトリに配置されたファイル群によって表現されます。以下に示すように、ファイルには数種類の異なるタイプがあります。

Log files

A log file (*.log) stores a sequence of recent updates. Each update is appended to the current log file. When the log file reaches a pre-determined size (approximately 4MB by default), it is converted to a sorted table (see below) and a new log file is created for future updates.

logファイルは直近の更新情報の順列を格納しています。logファイルが事前に設定された大きさ（デフォルトでは約4MB）に達するとそれはsorted table（後述）に変換され、新たな更新を格納する為に新しいlogファイルが作成されます。

A copy of the current log file is kept in an in-memory structure (the memtable). This copy is consulted on every read so that read operations reflect all logged updates.

使用中のlogファイルのコピーがメモリ内のデータ構造に保持されています（memtable）。読み出し操作にすべての更新が反映されるように、すべての読み出しでこのコピーが参照されます。

Sorted tables

A sorted table (*.sst) stores a sequence of entries sorted by key. Each entry is either a value for the key, or a deletion marker for the key. (Deletion markers are kept around to hide obsolete values present in older sorted tables).

Sorted table (*.sst) はキーによってソートされたエントリ列を格納します。それぞれのエントリはキーと値のペアか、もしくはキーに対する削除マーカーです。(削除マーカーは過去に作成されたsorted tableに格納されている値を無効にするために保持されます。)

The set of sorted tables are organized into a sequence of levels. The sorted table generated from a log file is placed in a special young level (also called level-0). When the number of young files exceeds a certain threshold (currently four), all of the young files are merged together with all of the overlapping level-1 files to produce a sequence of new level-1 files (we create a new level-1 file for every 2MB of data.)

sorted tableは一連のレベルに組織化されています。logファイルから作成されたsorted tableは特別なYoung Levelに配置されます。(level-0とも呼ばれます。)

Youngファイルが一定の閾値(現在は４)を超えると、すべてのyoung fileはキーがオーバーラップするすべてのlevel-1ファイルと共にマージされ、新しい一連のlevel-1ファイルが作成されます。(2MBごとに新たなlevel-1ファイルが作成されます。)

Files in the young level may contain overlapping keys. However files in other levels have distinct non-overlapping key ranges. Consider level number L where L >= 1. When the combined size of files in level-L exceeds (10^L) MB (i.e., 10MB for level-1, 100MB for level-2, ...), one file in level-L, and all of the overlapping files in level-(L+1) are merged to form a set of new files for level-(L+1). These merges have the effect of gradually migrating new updates from the young level to the largest level using only bulk reads and writes (i.e., minimizing expensive seeks).

young levelでは、ファイルの間で重複するキーが含まれる場合あります。しかし他のレベルではファイル間でキーはオーバーラップしません。

L >= 1であるレベルを想定して下さい。Level-Lのファイルの合計サイズが10^L MB (level-1では10MB、level-2では100MB) を超過すると、level-Lから1ファイルを選択し、そのファイルとキーが重複するすべてのlevel-(L+1)ファイルをマージして、新たなlevel-(L+1)ファイルを(複数)作成します。

これらのマージ操作により更新はyoung levelから最大のlevelまで、

バルクリード、バルクライトのみを使用して徐々にマイグレートされます。

(高価なseekの使用を最小化します。)

Manifest

A MANIFEST file lists the set of sorted tables that make up each level, the corresponding key ranges, and other important metadata. A new MANIFEST file (with a new number embedded in the file name) is created whenever the database is reopened. The MANIFEST file is formatted as a log, and changes made to the serving state (as files are added or removed) are appended to this log.

MANIFESTファイルはそれぞれのlevelを構成するsorted tableファイルのリストであり、キーレンジ情報他の重要なメタデータを含みます。データベースがreopenされるたびに新しいMANIFESTファイルが作成されます。MANIFESTファイルはログとして整形されており、ファイルの追加
や削除のような状態の変更はこのログに追記されます。

Current

CURRENT is a simple text file that contains the name of the latest MANIFEST file.

CURRENTは最新のMANIFESTファイルの名前が格納されている単純なテキストファイルです。

Info logs

Informational messages are printed to files named LOG and LOG.old.

動作情報に関するメッセージはLOGおよびLOG.oldという名称のファイルに出力されます。

Others

Other files used for miscellaneous purposes may also be present (LOCK, *.dbtmp).

その他の用途のためにいくつか雑多なファイルが存在します。(LOCK, *.dbtmp)

Level 0

When the log file grows above a certain size (1MB by default):

logファイルがあるサイズを超えた場合(デフォルトでは1MB):

Create a brand new memtable and log file and direct future updates here
新しいmemtableとlogファイルを作成し、新たな更新を振り向ける。
In the background:
Backgroundでは:
- Write the contents of the previous memtable to an sstable
- 以前のmemtableの内容をsstableに書き込む
- Discard the memtable
- memtableを破棄する
- Delete the old log file and the old memtable
- 古いlogファイルとmemtableを削除する
- Add the new sstable to the young (level-0) level.
- 新しいsstableをyoung(level-0)レベルに追加する

Compactions

When the size of level L exceeds its limit, we compact it in a background thread. The compaction picks a file from level L and all overlapping files from the next level L+1. Note that if a level-L file overlaps only part of a level-(L+1) file, the entire file at level-(L+1) is used as an input to the compaction and will be discarded after the compaction. Aside: because level-0 is special (files in it may overlap each other), we treat compactions from level-0 to level-1 specially: a level-0 compaction may pick more than one level-0 file in case some of these files overlap each other.

level Lのサイズがリミットを超過したら、バックグラウンドスレッドでcompactionが実行されます。

Compactionはlevel Lから一つのファイルを選択し、更にそのファイルとオーバーラップするすべてのlevel L+1ファイルを選択します。

level-Lファイルと、あるlevel-(L+1)ファイルの一部分だけが重複していたとしても、そのlevel-(L+1)ファイル全体がコンパクションの入力として使用され、コンパクション完了後に削除されます。

level-0は特殊(ファイルが互いに重複していてもよい)なので、level-0からlevel-1へのコンパクションも特別に扱われます。level-0のファイルが互いに一部重複している場合、level-0コンパクションはオーバーラップした複数のlevel-0ファイルをコンパクション対象として選択します。

A compaction merges the contents of the picked files to produce a sequence of level-(L+1) files. We switch to producing a new level-(L+1) file after the current output file has reached the target file size (2MB). We also switch to a new output file when the key range of the current output file has grown enough to overlap more then ten level-(L+2) files. This last rule ensures that a later compaction of a level-(L+1) file will not pick up too much data from level-(L+2).

コンパクションは選択した複数のファイルからlevel-(L+1)ファイルのシーケンスを生成します。

出力先のファイルがターゲットファイルサイズ(2MB)に達したら、出力を切り替え、新しいlevel-(L+1)ファイルを作成します。

また作成中の出力ファイルのキーレンジが10個以上のlevel-(L+2)ファイルとオーバーラップするようになった場合にも出力先を新たなファイルに切り替えます。

この最後のルールはlevel-(L+1)のコンパクションにおいて、level-(L+2)のファイルが過剰に処理されることを抑制します。

The old files are discarded and the new files are added to the serving state.

古いファイルは廃棄され、新しいファイルがserving stateに追加されます。

Compactions for a particular level rotate through the key space. In more detail, for each level L, we remember the ending key of the last compaction at level L. The next compaction for level L will pick the first file that starts after this key (wrapping around to the beginning of the key space if there is no such file).

特定のlevelにおけるCompactionはキースペースを通じて循環的に実行されます。

詳細には、それぞれのlevelにおいて、そのレベルで最後に実施したコンパクションの最終キーを記憶します。

level-Lの次回のコンパクションは記憶されたキー以降のキーから始まるファイルを選択します。そのようなファイルが存在しない場合は、キー空間の先頭に戻ります。

Compactions drop overwritten values. They also drop deletion markers if there are no higher numbered levels that contain a file whose range overlaps the current key.

コンパクションにより、上書きされた値は削除されます。またより上位のレベルにオーバーラップするファイルが存在しない削除マーカーも削除されます。

Timing

Level-0 compactions will read up to four 1MB files from level-0, and at worst all the level-1 files (10MB). I.e., we will read 14MB and write 14MB.

Level-0コンパクションは４個までの1MBファイルをlevel-0から、また最大全てのlevel-1ファイル(合計10MB)を読みます。つまり最大14MBの読み込み、14MBの書き込みを行う可能性があります。

Other than the special level-0 compactions, we will pick one 2MB file from level L. In the worst case, this will overlap ~ 12 files from level L+1 (10 because level-(L+1) is ten times the size of level-L, and another two at the boundaries since the file ranges at level-L will usually not be aligned with the file ranges at level-L+1). The compaction will therefore read 26MB and write 26MB. Assuming a disk IO rate of 100MB/s (ballpark range for modern drives), the worst compaction cost will be approximately 0.5 second.

特殊なケースであるlevel-0コンパクションを除き、Level Lからは1つの2MBファイルを選択します。

最悪のケースでは、このデータはlevel L+1の12個のファイルとオーバーラップします。level L+1 は最大 Level Lのサイズの10倍となるので10個、更にlevel Lのファイルレンジは通常level L+1のファイルレンジと揃っていないので、前後の境界に各1個、合計12個となります。

このためlevel Lコンパクションは26MB読み、26MB書きます。

disk IOレートが100MB/s(最近のドライブの概算値)とするとコンパクションに要する時間は最悪0.5秒程度と予測できます。

If we throttle the background writing to something small, say 10% of the full 100MB/s speed, a compaction may take up to 5 seconds. If the user is writing at 10MB/s, we might build up lots of level-0 files (~50 to hold the 5*10MB). This may signficantly increase the cost of reads due to the overhead of merging more files together on every read.

バックグラウンドにおける書き込みを小さな値、例えば100MB/sの10%に制限するとコンパクションには最大5秒要することになります。

ユーザが10MB/sで書き込みを行っている場合、多数のlevel-0ファイルが作成されることが予想されます。(10MB/s * 5sec = 50MB, 1MB/fileでは50個程度) この状態ではすべてのリードにおいて多数のファイルをマージする必要があるため、リードのコストが非常に高くなるでしょう。

Solution 1: To reduce this problem, we might want to increase the log switching threshold when the number of level-0 files is large. Though the downside is that the larger this threshold, the more memory we will need to hold the corresponding memtable.

解決策１：この問題を軽減するためには、level-0ファイル数が大きい場合にlog切り替えの閾値を増やすことが考えられます。トレードオフとして、閾値を大きくするにつれて、対応するmemtableを保持するためにより大きなメモリ領域が必要になります。

Solution 2: We might want to decrease write rate artificially when the number of level-0 files goes up.

解決策２:level-0ファイルの個数が増えた場合に、書き込みのレートを人為的に減少させることが考えられます。

Solution 3: We work on reducing the cost of very wide merges. Perhaps most of the level-0 files will have their blocks sitting uncompressed in the cache and we will only need to worry about the O(N) complexity in the merging iterator.

解決策3:我々は非常に多数のファイルのマージのコストを減少させることに取り組んでいます。ほとんどのlevel-0ファイルはキャッシュ内で未圧縮の状態で保持されていると思われるので、マージイタレータでのO(N)の複雑性だけを考慮すればいいでしょう。

Number of files

Instead of always making 2MB files, we could make larger files for larger levels to reduce the total file count, though at the expense of more bursty compactions. Alternatively, we could shard the set of files into multiple directories.

常に2MBのファイルを作成する代わりに、上位レベルではより大きなファイルを作成することもできます。これにより、全体のファイル数を減らすことができますが、一方でより「爆発的な」コンパクションを引き起こすことになります。代わりにファイル群を複数のディレクトリにシャーディングすることもできます。

An experiment on an ext3 filesystem on Feb 04, 2011 shows the following timings to do 100K file opens in directories with varying number of files:

2011年2月4日にext3ファイルシステム上で実施した実験では、100K個のファイルを開くのに必要な時間をディレクトリごとのファイル数を変化させて測定したところ、次のような結果となりました。

Files in directory	Microseconds to open a file
1000	9
10000	10
100000	16

So maybe even the sharding is not necessary on modern filesystems?

従って近代的なファイルシステムではシャーディングは必要ないかもしれません。

Recovery

Read CURRENT to find name of the latest committed MANIFEST
Read the named MANIFEST file
Clean up stale files
We could open all sstables here, but it is probably better to be lazy...
Convert log chunk to a new level-0 sstable
Start directing new writes to a new log file with recovered sequence#

Garbage collection of files

DeleteObsoleteFiles() is called at the end of every compaction and at the end of recovery. It finds the names of all files in the database. It deletes all log files that are not the current log file. It deletes all table files that are not referenced from some level and are not the output of an active compaction.

その2へ

楽しいプログラム

2012年2月9日木曜日

CassandraのLeveled Compactionについて調べる