Striking the right balance between write and query performance in a MongoDB database distributed across clustered servers depends on choosing an appropriate hash-based shard key. Conversely, choosing the wrong key can slow writes and reads to a crawl, in addition to squandering server storage space.
Hash-based sharding was introduced in version 2.4 of MongoDB as a way to distribute a collection's data evenly across the shards in a cluster. As the MongoDB manual explains, your choice of shard key -- and the resulting data partitioning -- is a balancing act between write and query performance.
Using a randomized shard key offers many benefits for scaling write operations in a cluster, because consecutive writes land on different shards. Random shard keys can cause query performance to suffer, however, because they don't support query isolation: documents with neighboring key values end up on different shards, so mongos must send many queries to all or nearly all shards. Step-by-step instructions for creating a hashed index and for sharding a collection using a hashed shard key are provided in the MongoDB manual.
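The tradeoff can be seen in a small simulation. This is a minimal sketch rather than MongoDB's actual routing: the four-shard cluster, the key counts, and the toy_hash function (a truncated SHA-256) are all assumptions made for illustration. It compares which shards receive a burst of sequential writes, and which shards a range query must visit, under a ranged key and a hashed key.

```python
import hashlib

NUM_SHARDS = 4        # hypothetical cluster size
TOTAL_KEYS = 10_000   # hypothetical number of sequential key values


def toy_hash(value: int) -> int:
    """Stand-in 64-bit hash (truncated SHA-256 of the decimal string).

    An assumption for illustration only, not MongoDB's real hash function.
    """
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def ranged_shard(value: int) -> int:
    """Ranged key: each shard owns one contiguous slice of the key space."""
    return value * NUM_SHARDS // TOTAL_KEYS


def hashed_shard(value: int) -> int:
    """Hashed key: each shard owns one contiguous slice of the hash space."""
    return (toy_hash(value) * NUM_SHARDS) >> 64


def shards_touched(keys, route) -> set:
    """Return the set of shard numbers that the given keys are routed to."""
    return {route(key) for key in keys}


# The 1,000 most recent inserts of a monotonically increasing key.
recent_writes = range(TOTAL_KEYS - 1000, TOTAL_KEYS)
# A range query over 100 neighboring key values.
range_query = range(100, 200)

print("writes, ranged key:", sorted(shards_touched(recent_writes, ranged_shard)))
print("writes, hashed key:", sorted(shards_touched(recent_writes, hashed_shard)))
print("query,  ranged key:", sorted(shards_touched(range_query, ranged_shard)))
print("query,  hashed key:", sorted(shards_touched(range_query, hashed_shard)))
```

With the ranged key, the recent writes all pile onto one shard while the range query stays on one shard; with the hashed key, the writes spread out but the same range query fans out across the cluster.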
As straightforward as the concept of hash-based sharding appears, implementing the technique on a live MongoDB database can be anything but trouble-free. A post on the MongoDB blog highlights the tradeoffs required to establish the optimal sharding balance for a specific database.
Once you've named the collection to be sharded and the hashed "identifier" field for the documents in the collection, you create the hashed index on the field and then shard the collection using the same field. The post uses as an example a collection named "mydb.webcrawler" and an identifier field named "url".
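As a rough sketch of those two steps from Python, the helper below builds the hashed index specification and the shardCollection command for the "mydb.webcrawler" collection and "url" field used in the post. The function names build_hashed_shard_plan and apply_plan are hypothetical, and apply_plan assumes it is handed a pymongo MongoClient connected to a mongos; it has not been run against a live cluster here.

```python
def build_hashed_shard_plan(namespace: str, field: str) -> dict:
    """Describe the hashed index and shardCollection command for a namespace."""
    db_name, coll_name = namespace.split(".", 1)
    return {
        "database": db_name,
        "collection": coll_name,
        # Index specification: the field paired with the "hashed" index type.
        "index_keys": [(field, "hashed")],
        # Admin command: shard the collection on the same hashed field.
        "shard_command": {"shardCollection": namespace, "key": {field: "hashed"}},
    }


def apply_plan(client, plan: dict) -> None:
    """Apply the plan through a mongos.

    `client` is assumed to be a pymongo.MongoClient. Older MongoDB releases
    may also need sharding enabled on the database before this will succeed.
    """
    collection = client[plan["database"]][plan["collection"]]
    collection.create_index(plan["index_keys"])   # step 1: hashed index
    client.admin.command(plan["shard_command"])   # step 2: shard on that field


PLAN = build_hashed_shard_plan("mydb.webcrawler", "url")
print(PLAN["index_keys"])
print(PLAN["shard_command"])
```

Keeping the plan separate from the code that applies it makes it easy to inspect exactly which field will be hashed before anything touches the cluster.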
While it's best to pre-split and shard the collection before adding data, when you shard an existing collection the balancer automatically moves chunks to ensure even distribution of the data. The split and moveChunk commands work with hash-based shard keys, but the "find" option can be problematic because the specifier document is hashed to locate the containing chunk. The solution is to use the "bounds" parameter when manipulating chunks or entire collections manually.
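To illustrate addressing a chunk by its bounds, the sketch below builds split and moveChunk command documents for the example collection. The helper names, the bound values, and the shard name "shard0001" are all hypothetical; in practice the bounds would be copied from the chunk metadata in the cluster's config database, and the documents would be sent as admin commands through a mongos.

```python
def chunk_bounds(field: str, lower: int, upper: int) -> list:
    """Build the [lower, upper] pair that identifies a chunk by hashed value."""
    return [{field: lower}, {field: upper}]


def split_by_bounds(namespace: str, field: str, lower: int, upper: int) -> dict:
    """Command document asking to split the chunk with exactly these bounds."""
    return {"split": namespace, "bounds": chunk_bounds(field, lower, upper)}


def move_chunk_by_bounds(namespace: str, field: str, lower: int, upper: int,
                         to_shard: str) -> dict:
    """Command document asking to move the chunk with these bounds to a shard."""
    return {
        "moveChunk": namespace,
        "bounds": chunk_bounds(field, lower, upper),
        "to": to_shard,
    }


# Hypothetical 64-bit hashed bounds; real values come from the chunk metadata.
LOWER = -4611686018427387902
UPPER = 4611686018427387902

SPLIT_CMD = split_by_bounds("mydb.webcrawler", "url", LOWER, UPPER)
MOVE_CMD = move_chunk_by_bounds("mydb.webcrawler", "url", LOWER, UPPER, "shard0001")
print(SPLIT_CMD)
print(MOVE_CMD)
```

Because the bounds are already hashed values, nothing is hashed a second time when the chunk is looked up, which is the pitfall of passing an ordinary document to "find".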
When hash-based sharding impedes performance
The consequences of choosing the wrong shard key when you hash a MongoDB collection are demonstrated in a Stack Overflow post from September 2013. After sharding a collection by hashed _id, the resulting _id_hashed index was taking up nearly a gigabyte of space. The poster asked whether the index could be deleted because only the _id field is used to query the document.
Hash-based sharding requires a hashed index on the shard key, which is used to determine the target shard for all subsequent queries. In this case, the optimizer uses the _id index because it is unique and generates a more efficient plan, but the cluster still requires the _id_hashed index, so it cannot be dropped.
In an October 14, 2014, post on the Wenda.io site, the process of applying a hash-based shard to a particular field is explained. The goal is to allow the application to generate a static hash for the field value so that the hash will always be the same if the value is the same.
When you designate a field in a document as a hash-sharded field, a hash value for that field is generated automatically just before the document is read or written. Outgoing queries are assigned that same hash value, which is always used for shard targeting. However, this can affect default chunk balancing and depends on selection of an appropriate hash function.
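A small sketch of that targeting step follows. It is not MongoDB's implementation: the chunk layout, the shard names, and the toy_hash function (a truncated SHA-256) are assumptions for illustration. The point it shows is that the hash is deterministic, so a query on a given field value is always routed to the chunk, and therefore the shard, that received the document.

```python
import bisect
import hashlib

# Hypothetical chunk layout over a 64-bit hash space: each entry is the
# exclusive upper bound of a chunk, paired by position with its owning shard.
CHUNK_UPPER_BOUNDS = [2**62, 2**63, 3 * 2**62, 2**64]
CHUNK_OWNERS = ["shardA", "shardB", "shardC", "shardD"]


def toy_hash(value: str) -> int:
    """Stand-in 64-bit hash; an assumption, not MongoDB's real hash function."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def target_shard(value: str) -> str:
    """Hash the field value, then find the chunk whose range contains the hash."""
    index = bisect.bisect_right(CHUNK_UPPER_BOUNDS, toy_hash(value))
    return CHUNK_OWNERS[index]


url = "http://example.com/page"           # hypothetical field value
shard_on_write = target_shard(url)        # where an insert would be routed
shard_on_query = target_shard(url)        # where a later equality query goes
print(shard_on_write, shard_on_query, shard_on_write == shard_on_query)
```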
Much of the hassle of managing MongoDB collections -- as well as MySQL, Redis, and ElasticSearch databases -- is eliminated by the simple interface of the Morpheus database-as-a-service (DBaaS). Morpheus lets you provision, deploy, and host heterogeneous databases via a single console.
Morpheus is the first and only DBaaS that supports SQL, NoSQL, and in-memory databases. A free full replica set is deployed with each database instance you provision, and your MySQL and Redis databases are backed up. Visit the Morpheus site to create a free account.