NoSQL Zone is brought to you in partnership with:

Ayende Rahien is working for Hibernating Rhinos LTD, a Israeli based company producing developer productivity tools for OLTP applications such as NHibernate Profiler (nhprof.com), Linq to SQL Profiler(l2sprof.com), Entity Framework Profiler (efprof.com) and more. Ayende is a DZone MVB and is not an employee of DZone and has posted 462 posts at DZone. You can read more from them at their website. View Full User Profile

Reviewing LevelDB: Part VI, the Log is Base for Atomicity

03.29.2013
| 2304 views |
  • submit to reddit

Here we are starting to get into the interesting bits. How do we actually write to disk. There are two parts of that. The first part is the log file. This is were all the recent values are stored, and it is an unsorted backup for the MemTable in case of crashes.

Let us see how this actually works. There are two classes which are involved in this manner. leveldb::log::Writer and leveldb::WritableFile. I think that WritableFile is the leveldb abstraction, so it is bound to be simpler. We’ll take a look at that first.

Here is what it looks like:

        // A file abstraction for sequential writing.  The implementation

        // must provide buffering since callers may append small fragments

        // at a time to the file.

        class WritableFile {

         public:

          WritableFile() { }

          virtual ~WritableFile();

        

          virtual Status Append(const Slice& data) = 0;

         virtual Status Close() = 0;

         virtual Status Flush() = 0;

         virtual Status Sync() = 0;

        

        private:

         // No copying allowed

         WritableFile(const WritableFile&);

         void operator=(const WritableFile&);

       };

Pretty simple, overall. There is the buffering requirement, but that is pretty easy overall. Note that this is a C++ interface. There is a bunch of implementations, but the one that I think will be relevant here is PosixMmapFile. So much for it being simple. As I mentioned, this is Posix code that I am reading, and I have to do a lot of lookup into the man pages. The implementation isn’t that interesting, to be fair, and full of mmap files on posix minutia. So I am going to skip it.

I wonder why the choice was map to use memory mapped files, since the API exposed here is pretty much perfect for streams. As you can imagine from the code, calling Apend() just writes the values to the mmap file, flush is a no op, and Sync() actually ask the file system to write the values to disk and wait on that. I am guessing that the use of mmap files is related to the fact that mmap files are used extensively in the rest of the code base (for reads) and that gives leveldb the benefit of using the OS memory manager as the buffer.

Now that we got what a WritableFile is like, let us see what the leveldb::log::Writer is like. In terms of the interface, it is pretty slick, it has a single public method:

    Status AddRecord(const Slice& slice);

As a remind, those two are used together in the DBImpl::Write() method, like so:

    status = log_->AddRecord(WriteBatchInternal::Contents(updates));

    if (status.ok() && options.sync) {

     status = logfile_->Sync();

    }

From the API look of things, it appears that this is a matter of simply forwarding the call from one implementation to another. But a lot more is actually going on:

    Status Writer::AddRecord(const Slice& slice) {

      const char* ptr = slice.data();

      size_t left = slice.size();

     

      // Fragment the record if necessary and emit it.  Note that if slice

      // is empty, we still want to iterate once to emit a single

      // zero-length record

      Status s;

      bool begin = true;

     do {

       const int leftover = kBlockSize - block_offset_;

       assert(leftover >= 0);

       if (leftover < kHeaderSize) {

         // Switch to a new block

         if (leftover > 0) {

           // Fill the trailer (literal below relies on kHeaderSize being 7)

           assert(kHeaderSize == 7);

           dest_->Append(Slice("\x00\x00\x00\x00\x00\x00", leftover));

         }

         block_offset_ = 0;

       }

    

       // Invariant: we never leave < kHeaderSize bytes in a block.

       assert(kBlockSize - block_offset_ - kHeaderSize >= 0);

    

       const size_t avail = kBlockSize - block_offset_ - kHeaderSize;

       const size_t fragment_length = (left < avail) ? left : avail;

    

       RecordType type;

       const bool end = (left == fragment_length);

       if (begin && end) {

         type = kFullType;

       } else if (begin) {

         type = kFirstType;

       } else if (end) {

         type = kLastType;

       } else {

         type = kMiddleType;

       }

    

       s = EmitPhysicalRecord(type, ptr, fragment_length);

       ptr += fragment_length;

       left -= fragment_length;

       begin = false;

     } while (s.ok() && left > 0);

     return s;

   }

Let us see if we do a lot here. But I don’t know yet what is going on. From the first glance, it appears that we are looking at fragmenting the value into multiple records, and we might want to enter zero length records (no idea what that is for?maybe compactions?).

It appears that we write in blocks of 32Kb at a time. Line 12 – 21 are dealing with how to finalize the block when you have no more space. (Basically fill in with nulls).

Lines 26 – 40 just set the figure out what the type of the record that we are going to work (a full record, all of which can sit in a single buffer, a first record, which is the start in a sequence of items or middle / end, which is obvious).

And then we just emit the physical record to disk, and move on. I am not really sure what the reasoning is behind it. It may be to avoid having to read records that are far too big?

I looked at EmitPhysicalRecord to see what we have there and it is nothing much, it writes the header, including CRC computation, but that is pretty much it. So far, a lot of questions, but not a lot of answers. Maybe I’ll get them when I’ll start looking at the reading portion of the code. But that will be in another post.


Published at DZone with permission of Ayende Rahien, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)