By day, Python hacker. By night, semi-pro ping pong player. Benjamin has posted 2 posts at DZone.

Some Useful Tips for MongoDB Memory Management

03.22.2012

Feed Mongo!!

Several months ago at Yipit, we decided to cross the NoSQL Rubicon and port a large portion of our data storage from MySQL over to MongoDB.

One of the main drivers behind our move to Mongo was the composition of our data (namely, that of our recommendation engine), which consists of loosely structured, denormalized objects best represented as JSON-style documents. Here’s an example of a typical recommendation object.

{
    "_id" : ObjectId("4f15adc6e9f3d887f2ba7a31"),
    "uid" : 274790,
    "did" : 720717,
    "city_id" : 4,
    "score" : 10000,
    "published" : ISODate("2012-01-17T17:20:54Z"),
    "active" : true,
    "recs" : [
        {
            "category" : "TagMatch",
            "details" : {
                "tag_name" : "Treats",
                "tag_id" : NumberLong(30)
            }
        },
        {
            "category" : "LocationMatch",
            "details" : {
                "city" : "New York",
                "distance" : 0.3662158792607932
            }
        }
    ]
}

How Key Expansion Causes Memory Bloat

Because any given recommendation can have a number of arbitrary nested attributes, Mongo’s “schemaless” style is much preferred to the fixed schema approach imposed by a relational database. 

The downside here, though, is that this structure produces extreme data duplication. Whereas a MySQL column name is stored only once for a given table, the equivalent JSON attribute name is repeated in every document in a collection.

Why Memory Management in Mongo is Crucial

When your data set is sufficiently small, this redundancy is usually acceptable; however, once you begin to scale up, it becomes less palatable. At Yipit, an average key size of 100 bytes per document, spread over roughly 65 million documents, adds somewhere between 7 GB and 10 GB of data (factoring in indexes) without providing much value at all.
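As a rough check on that figure, here is the arithmetic for the key names alone. This is only a back-of-the-envelope sketch using the two numbers quoted above; the variable names are mine, and the index overhead that pushes the total toward 7 GB to 10 GB is not broken out here.

```python
# Back-of-the-envelope estimate of the space spent on repeated key names.
AVG_KEY_BYTES_PER_DOC = 100    # average key size per document, as quoted above
DOCUMENT_COUNT = 65000000      # roughly 65 million documents

key_bytes = AVG_KEY_BYTES_PER_DOC * DOCUMENT_COUNT
key_gb = key_bytes / 1e9       # decimal gigabytes

print("%.1f GB of key names before indexes" % key_gb)
```

That comes to about 6.5 GB of pure key text before any index is counted.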

Mongo is so awesome, on good days, because it maps data files to memory. Memory-based reads and writes are super fast. On the other hand, Mongo is absolutely not awesome once your working data set surpasses the memory capacity of the underlying box. At that point, painful page faults and locking issues ensue. Worse yet, if your indexes begin to grow too large to remain in memory, you are in trouble (seriously, don’t let that happen).

Quick Tips on Memory Management

You can get around this memory problem in a number of ways.  Here’s a non-exhaustive list of options:

  • Add higher-memory machines or more shards if cash is not a major constraint (I would recommend the latter to minimize the scourge of write locks).

  • Actively utilize the “_id” key to hold meaningful data, instead of always storing the default ObjectId.

  • Use namespacing tricks for your collections. In other words, create separate collections for recommendations in different cities, rather than storing a city key within each collection document.

  • Embed documents rather than linking them implicitly in your code.

  • Store the right data types for your values (e.g., integers are more space efficient than strings).

  • Get creative about non-duplicative indexing on compound keys.
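To get a feel for how much the data type and key length tips can save, here is a small size estimator. It is a sketch, not a driver: `bson_size` is a hypothetical helper that follows the element layout in the BSON specification for flat documents only, and it assumes the driver stores an integer as a 32-bit value whenever it fits.

```python
def bson_size(doc):
    """Estimate the BSON size in bytes of a flat dict (str/int/float/bool values only)."""
    size = 4 + 1  # int32 document length prefix + trailing null byte
    for key, value in doc.items():
        size += 1 + len(key.encode("utf-8")) + 1  # type byte + key + null terminator
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            size += 1
        elif isinstance(value, int):
            # Assumption: the driver picks int32 when the value fits, else int64
            size += 4 if -2**31 <= value < 2**31 else 8
        elif isinstance(value, float):
            size += 8
        elif isinstance(value, str):
            size += 4 + len(value.encode("utf-8")) + 1  # length prefix + bytes + null
        else:
            raise TypeError("unsupported value type: %r" % type(value))
    return size


as_strings = {"city_id": "4", "score": "10000"}
as_ints = {"city_id": 4, "score": 10000}
short_keys = {"c": 4, "s": 10000}

for label, doc in [("string values", as_strings),
                   ("integer values", as_ints),
                   ("short keys", short_keys)]:
    print("%-15s %d bytes" % (label, bson_size(doc)))
```

For this two-field document the estimate drops from 37 bytes to 29 bytes by switching to integers, and to 19 bytes once the keys are shortened as well.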

The Key Compression Approach

After you’ve checked off those options, you may still wish to cut down on stored key length. The easiest path here probably involves creating a translation table in your filesystem that compresses keys on the way to Mongo from your code and then decompresses during the return trip. 

For simplicity’s sake, a developer could hardcode the translations, updating the table on schema changes. While that works, it would be nice if there were a Mongo ORM for Python that just handled it for us automatically. It just so happens that MongoEngine is a useful, Django-style ORM on top of the PyMongo driver. Sadly, it does not handle key compression.
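The hardcoded approach might look something like the following. This is a minimal sketch under my own assumptions: the names `KEY_MAP`, `compress_keys` and `decompress_keys` are hypothetical, the short keys are picked by hand, and only top-level keys are translated.

```python
# Hand-maintained translation table: long key used in code -> short key stored in Mongo.
KEY_MAP = {
    "city_id": "c",
    "score": "s",
    "published": "p",
    "active": "a",
}
REVERSE_KEY_MAP = {short: long_name for long_name, short in KEY_MAP.items()}

# Two long keys sharing one short key would silently lose data, so fail loudly.
assert len(REVERSE_KEY_MAP) == len(KEY_MAP), "short keys must be unique"


def compress_keys(doc, key_map=KEY_MAP):
    """Rename top-level keys on the way to the database; unknown keys pass through."""
    return {key_map.get(key, key): value for key, value in doc.items()}


def decompress_keys(doc):
    """Restore the long key names on the way back from the database."""
    return compress_keys(doc, REVERSE_KEY_MAP)


rec = {"_id": 1, "city_id": 4, "score": 10000, "active": True}
stored = compress_keys(rec)
print(stored)
print(decompress_keys(stored) == rec)
```

The table has to be updated by hand on every schema change, which is exactly the chore an ORM-level solution would remove.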

Automatic Compression Tool

As a weekend project, I thought that it would be cool to add this functionality. Here’s an initial crack at it (warning: it may not be production ready).

import datetime

from mongoengine.base import TopLevelDocumentMetaclass

# NOTE: `connection` (used in _mapping_collection below) is not defined in this
# snippet. It is assumed to be a handle to the MongoDB database that is set up
# elsewhere in the module.

class CompressedKeyDocumentMetaclass(TopLevelDocumentMetaclass):

    def __new__(cls, name, bases, attrs):
        """
        MongoEngine Document Classes access the 'TopLevelDocumentMetaclass'
        __metaclass__. We allow that metaclass to set attrs on the class
        and then compress the fields in the event that the instantiated
        Document Class contains the meta attr 'compress_keys'
        
        That is not the most efficient flow. Going forward, we should
        either fork MongoEngine and insert this logic directly into
        the TopLevelDocumentMetaclass
        OR
        process attrs before instantiating the new_class
        """

        new_class = super(CompressedKeyDocumentMetaclass, cls).__new__(cls, name, bases, attrs)
        if ('meta' in attrs and attrs['meta'].get('compress_keys', False)):
            # Start from an empty mapping so the index loop below never hits
            # an undefined name when the class has no _fields
            key_mapping = {}
            if hasattr(new_class, '_fields'):
                key_mapping = new_class._map_fields()
            # HANDLE INDEX CREATION HERE by resetting cls._meta['indexes']
            if new_class._meta.get('indexes'):
                for index in new_class._meta.get('indexes'):
                    fields = index['fields']
                    i_list = []
                    for f in fields:
                        raw_field_name = f[0]
                        # Fall back to the raw name if the field was never mapped
                        compressed_name = key_mapping.get(raw_field_name, raw_field_name)
                        direction = f[1]
                        i_list.append((compressed_name, direction))
                    index['fields'] = i_list
        return new_class

    @property
    def _mapping_collection(cls):
        """
        Connects to (or produces on first lookup) a mapping collection
        whose name is created by appending '_mapping' to the MongoEngine class
        """
        collection_name = '%s_mapping' % cls._get_collection_name()
        return getattr(connection, collection_name)

    def _is_embedded_field(cls, field):
        """
        Checks whether a given field is an EmbeddedObject
        """
        return hasattr(field, 'field') and getattr(field, 'field') is not None
    
    def _field_name_set(cls, subfield=None):
        """
        Returns a set of all field names within that nested level
        If field is embedded, this method returns nested level field names
        """
        if not subfield:
            fields = cls._fields.values()
        else:
            fields = subfield.field.document_type._fields.values()
        return set(f.name for f in fields)
        
    def _set_fields(cls, fields, collection=None, document=None):
        """
        Set mapped collection values here. Handles all fields the first
        time a class is evaluated. Subsequently, handles only changed fields
        
        Records the uncompressed field name, the compressed field name,
        and the datetime at which a field is added to the class
        
        Compressed names represent the minimum unique, sequential slices
        of a full string.
            'test' --> 't'
            'trial' --> 'tr'
        We could avoid multiple chars here in a variety of ways;
        Advantage of this route is that compressed field name more clearly
        relate to uncompressed names.

        Logic in the range iterator attempts to handle collisions such as:
            'rechandler' --> 'r'
            'recluse' --> 're'
            'recsize' --> 'rec'
            'rec' -->
        May not be necessary (or elegant).
        
        Embedded fields are handled recursively. May be possible to compress
        directly on EmbeddedObject class but it was not working for me.
        Should revisit that possibility.
        
        TODO: Handle setting of embedded fields whose parent field has not changed.
        """
        import random

        old_fields = dict((k, v) for k, v in document.items()) if document else {}
        old_fields_name_set = set(f.get('db_key') for f in old_fields.values())
        new_fields_dict = {}
        for f in fields:
            f_len = len(f.db_field)
            if f.db_field not in ('_id', '_cls', '_types'):
                # Avoid edge case where substrings collide
                for i in xrange(f_len + 5):
                    packed_name = f.db_field[:i + 1]
                    if not old_fields_name_set or packed_name not in old_fields_name_set:
                        new_fields_dict[f.name] = {'db_key': packed_name, 'set': datetime.datetime.now()}
                        old_fields_name_set.add(packed_name)
                        f.db_field = packed_name
                        break
                    if i > f_len:
                        # Every prefix (and the full name) is already taken, so append
                        # a random digit and let the next iteration retry with that name
                        f.db_field = '%s_%d' % (packed_name, random.randrange(1, 10))
            else:
                new_fields_dict[f.db_field] = {'db_key': f.db_field, 'set': datetime.datetime.now()}
            # Handle Embedded Documents recursively
            if cls._is_embedded_field(f):
                embedded_fields = cls._set_fields(f.field.document_type._fields.values(), document=document)
                embed_dict = {}
                for embed in embedded_fields:
                    embed_dict[embed.name] = {'db_key': embed.db_field, 'set': datetime.datetime.now()}
                new_fields_dict[f.name].update({'embedded_fields': embed_dict})

        if collection:
            if document:
                obj = {'%s.db_key' % old_fields.items()[0][0]: old_fields.items()[0][1].get('db_key')}
                collection.update(obj, {'$set': new_fields_dict})
            else:
                collection.save(new_fields_dict)
            return new_fields_dict
        else:
            return fields
            
    def _unset_fields(cls, collection, field_key, field_value, document, embedded_key=None, embedded_key_packed=None):
        """
        Unsets mapped fields by looking up the appropriate key in the mapped
        collection document and adding an "unset" attribute (refactor naming here to avoid modifier confusion).
    
        This marks the datetime that the field was inactivated, rather than deleting the field.
        If you were to delete the field, new compacted names could conflict with existing documents
        in the collection. Embedded fields are unset as well.
        """
        if field_key not in ('_id', '_cls', '_types'):
            if not embedded_key:
                old_doc = {'%s.db_key' % (field_key): field_value}
                new_doc = {'%s.unset' % field_key: datetime.datetime.now()}
                collection.update(old_doc, {'$set': new_doc})
            else:
                old_doc = {'%s.embedded_fields.%s.db_key' % (field_key, embedded_key): embedded_key_packed}
                new_doc = {'%s.embedded_fields.%s.unset' % (field_key, embedded_key): datetime.datetime.now()}
                collection.update(old_doc, {'$set': new_doc})
    
    def _pack_field(cls, field, dict_key, dict_value):
        if dict_key == field.name:
            field.db_field = dict_value.get('db_key')
        return field.db_field
        
    def _map_fields(cls):
        collection = cls._mapping_collection
        meta_keys_doc = collection.find_one()
        cls_fields = cls._fields.values()
        cls_field_set = cls._field_name_set()
        
        if not meta_keys_doc:
            meta_keys_doc = cls._set_fields(cls_fields, collection=collection)
        else:
            new_fields = [f for f in cls_fields \
                if (f.name not in meta_keys_doc.keys()
                and f.name is not None)]
            if new_fields:
                fields_dict = cls._set_fields(new_fields, collection=collection, document=meta_keys_doc)
                meta_keys_doc.update(fields_dict)

        key_mapping = {}
        for field_key, field_value in meta_keys_doc.items():
            # Unset inactive top level fields
            if not field_key in cls_field_set and not meta_keys_doc[field_key].get('unset'):
                cls._unset_fields(collection, field_key, field_value['db_key'], meta_keys_doc)
            else:
                for cf in cls_fields:
                    # Unset inactive embedded fields
                    if cls._is_embedded_field(cf):
                        for k, v in meta_keys_doc[cf.name]['embedded_fields'].items():
                            embed_field_set = cls._field_name_set(cf)
                            if not v.get('unset') and k not in embed_field_set:
                                cls._unset_fields(collection, cf.name, cf.db_field, meta_keys_doc, embedded_key=k, embedded_key_packed=v.get('db_key'))
                        
                    if field_key == cf.name:
                        # Map all active field names within the class obj to compacted names
                        # Happens everytime as opposed to the _set_fields method
                        key_mapping[field_key] = cls._pack_field(cf, field_key, field_value)
                        if cls._is_embedded_field(cf):
                            for f in cf.field.document_type._fields.values():
                                sub_key = field_value.get('embedded_fields').get(f.name)
                                if sub_key:
                                    f.db_field = sub_key.get('db_key')
        return key_mapping

The docstrings and inline comments are fairly extensive, but I should repeat a couple of main points:

  • This logic adds some overhead to the process of defining a class. This happens only once, when the class is loaded, and quick benchmarking suggests that the cost is not prohibitive. That said, there are a couple of ways to improve the efficiency of this code: you could move it directly into the TopLevelDocumentMetaclass, or you could process the attrs before instantiating the class. Both would avoid the double work incurred here.

  • Embedded fields are not handled completely in this code.  The first time you set an embedded document, the underlying fields will be compressed.  However, if you change the nested fields subsequently but do not change the parent field, the nested fields will not be reset.  This means that you’ll have an uncompressed key for each nested field that you change.  You can get around this by dropping the mapped collection and recreating it (simple operation).  I plan to handle this logic in the code shortly.

  • Indexing in the meta attribute of the class should work as expected, though I would generally suggest that you set indexes administratively as a best practice.
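The naming rule described in the _set_fields docstring (take the shortest leading slice of the field name that is not already in use) can be isolated from MongoEngine entirely. The sketch below is a standalone illustration rather than the metaclass code itself: the function names are hypothetical, it is written for Python 3, and it resolves exhausted prefixes with a deterministic counter where the code above appends a random digit.

```python
def shortest_unused_prefix(name, taken):
    """Return the shortest leading slice of `name` that is not in `taken`."""
    for end in range(1, len(name) + 1):
        candidate = name[:end]
        if candidate not in taken:
            return candidate
    # Every prefix, including the full name, is taken: fall back to a numbered suffix.
    suffix = 1
    while "%s_%d" % (name, suffix) in taken:
        suffix += 1
    return "%s_%d" % (name, suffix)


def assign_keys(names):
    """Map each field name to a compressed key, in the order the names are given."""
    taken = set()
    mapping = {}
    for name in names:
        key = shortest_unused_prefix(name, taken)
        taken.add(key)
        mapping[name] = key
    return mapping


print(assign_keys(["test", "trial"]))
print(assign_keys(["rechandler", "recluse", "recsize", "rec"]))
```

Because assignment depends on order, the same set of field names can produce different keys if the fields are seen in a different sequence, which is one reason the mapping has to be persisted rather than recomputed.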

The Final Mapped Output

Here is a working example of the code (you’ll need to add an abstract class to make this work).

from mongoengine import fields
from mongoengine.document import Document, EmbeddedDocument

class DecisionDocument(EmbeddedDocument):

    guilt = fields.StringField(required=True)
    details = fields.DictField()
    
    meta = {'allow_inheritance': False}


class BaseDocument(Document):

    __metaclass__ = CompressedKeyDocumentMetaclass

    meta = {'abstract': True}


class TrialDocument(BaseDocument):

    jury = fields.StringField()
    judge = fields.IntField()
    decision = fields.ListField(fields.EmbeddedDocumentField(DecisionDocument))
    
    meta = {'allow_inheritance': False, 'compress_keys': True}

When you define the TrialDocument class, the following document will be created in a collection titled “trial_document_mapping”.

{
    "_id" : {
        "set" : ISODate("2012-02-08T18:37:52.191Z"),
        "db_key" : "_id"
    },
    "judge" : {
        "set" : ISODate("2012-02-08T18:37:52.191Z"),
        "db_key" : "j"
    },
    "decision" : {
        "set" : ISODate("2012-02-08T18:37:52.191Z"),
        "db_key" : "d",
        "embedded_fields" : {
            "guilt" : {
                "set" : ISODate("2012-02-08T18:37:52.191Z"),
                "db_key" : "g"
            },
            "details" : {
                "set" : ISODate("2012-02-08T18:37:52.191Z"),
                "db_key" : "d"
            }
        }
    },
    "jury" : {
        "set" : ISODate("2012-02-08T18:37:52.191Z"),
        "db_key" : "ju"
    }
}

If you were to then remove the judge field from the TrialDocument and add a reporter field, you’d get the following:

{
    "_id" : {
        "set" : ISODate("2012-02-08T18:37:52.191Z"),
        "db_key" : "_id"
    },
    "decision" : {
        "set" : ISODate("2012-02-08T18:37:52.191Z"),
        "db_key" : "d",
        "embedded_fields" : {
            "guilt" : {
                "set" : ISODate("2012-02-08T18:37:52.191Z"),
                "db_key" : "g"
            },
            "details" : {
                "set" : ISODate("2012-02-08T18:37:52.191Z"),
                "db_key" : "d"
            }
        }
    },
    "judge" : {
        "db_key" : "j",
        "set" : ISODate("2012-02-08T18:37:52.191Z"),
        "unset" : ISODate("2012-02-08T18:40:41.856Z")
    },
    "jury" : {
        "set" : ISODate("2012-02-08T18:37:52.191Z"),
        "db_key" : "ju"
    },
    "reporter" : {
        "set" : ISODate("2012-02-08T18:40:41.856Z"),
        "db_key" : "r"
    }
}
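Note that the removed judge field keeps its entry and simply gains an "unset" timestamp. The reason is easiest to see in isolation: documents already written still contain the key "j", so it must never be handed to a different field. Here is a rough sketch of that idea with plain dicts; `retire_field` and `add_field` are hypothetical helpers, not part of the metaclass.

```python
import datetime


def retire_field(mapping, name, now=None):
    """Mark a field as inactive but keep its compressed key reserved."""
    mapping[name]["unset"] = now or datetime.datetime.now()


def add_field(mapping, name, now=None):
    """Give `name` the shortest prefix not used by any entry, active or retired."""
    taken = set(entry["db_key"] for entry in mapping.values())
    for end in range(1, len(name) + 1):
        if name[:end] not in taken:
            mapping[name] = {"db_key": name[:end],
                             "set": now or datetime.datetime.now()}
            return mapping[name]["db_key"]
    raise ValueError("no free prefix left for %r" % name)


stamp = datetime.datetime(2012, 2, 8, 18, 37, 52)
mapping = {
    "judge": {"db_key": "j", "set": stamp},
    "jury": {"db_key": "ju", "set": stamp},
}

retire_field(mapping, "judge")
new_key = add_field(mapping, "justice")
print(new_key)
print(sorted(mapping))
```

Had the judge entry been deleted outright, the new field would have been given "j", and old documents would have been read back with judge values under the wrong name.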

If you were to then go into the shell, you could interact with MongoEngine like this:

from test.app import DecisionDocument, TrialDocument

subdoc = DecisionDocument(guilt='definitely', details={'bad_person_score': 'Very Bad Man'})
doc = TrialDocument(decision=[subdoc], reporter='Tom', jury='peers')
doc.save()

-----Object Created------

{
    "_id" : ObjectId("4f32c2f4d0ba3a4b5e000000"),
    "ju" : "peers",
    "r" : "Tom",
    "d" : [
        {
            "d" : {
                "bad_person_score" : "Very Bad Man"
            },
            "g" : "definitely"
        }
    ]
}

Success! We’ve got compressed keys. Just one thing before we go. Beyond key space optimization, this is also a quick primer for smart value storage. Never use long string field values like this if you can help it (we can definitely help it here by using integers).
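If you ever need to read these compressed documents without MongoEngine (from a script or a reporting job, say), the mapping document holds everything required to restore the long names. The following is a sketch only: `expand_keys` is a hypothetical helper, the mapping is the one shown earlier with the timestamps left out, and the ObjectId is written as a plain string so the example runs without a driver.

```python
def expand_keys(stored, mapping):
    """Rename compressed keys back to field names, recursing into embedded fields."""
    # Retired entries stay in the lookup so that older documents still decode.
    reverse = dict((entry["db_key"], (name, entry)) for name, entry in mapping.items())
    expanded = {}
    for key, value in stored.items():
        name, entry = reverse.get(key, (key, {}))
        sub_mapping = entry.get("embedded_fields")
        if sub_mapping and isinstance(value, list):
            value = [expand_keys(item, sub_mapping) for item in value]
        elif sub_mapping and isinstance(value, dict):
            value = expand_keys(value, sub_mapping)
        expanded[name] = value
    return expanded


mapping = {
    "_id": {"db_key": "_id"},
    "decision": {"db_key": "d",
                 "embedded_fields": {"guilt": {"db_key": "g"},
                                     "details": {"db_key": "d"}}},
    "judge": {"db_key": "j", "unset": "2012-02-08T18:40:41.856Z"},
    "jury": {"db_key": "ju"},
    "reporter": {"db_key": "r"},
}

stored = {
    "_id": "4f32c2f4d0ba3a4b5e000000",
    "ju": "peers",
    "r": "Tom",
    "d": [{"d": {"bad_person_score": "Very Bad Man"}, "g": "definitely"}],
}

restored = expand_keys(stored, mapping)
print(restored)
```

The same key "d" maps to decision at the top level and to details inside the embedded document, which works because each nesting level is resolved against its own section of the mapping.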

Next Steps Ahead

Hope that’s interesting and (even better) useful. I’ll try to update this post once I’ve worked out all the kinks with embedded objects and sped up the class instantiation process.

 

 

Published at DZone with permission of its author, Benjamin Plesser. (source)
