Just finished up at Mongo Day Boston. Learned a few interesting things (coming from someone who knew virtually nothing about Mongo).
Mongo sharding uses PNUTS-style range partitioning. Re-balancing is accomplished by moving partitions between shards while maintaining an op log of concurrent changes, with a final sync of the op log at the end of the move. The default partition size (the point at which a partition might be split) is 200 megabytes. I assume recovery uses the same online data-shipping functionality.
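The dance is roughly: bulk copy, replay, final sync. A minimal sketch of that shape (all names are mine, not Mongo's actual code):

```java
import java.util.List;

// Hypothetical sketch of live chunk migration (not Mongo's actual code):
// bulk-copy the chunk while writes continue, then replay the op log
// entries that accumulated during the copy until the backlog drains.
class ChunkMigration {
    interface Shard {
        List<byte[]> readChunk(String chunkId);             // bulk read of the chunk
        void writeChunk(String chunkId, List<byte[]> docs);
        List<Runnable> drainOpLog(String chunkId);          // ops logged since the last
                                                            // drain, bound to the destination
    }

    void migrate(Shard from, Shard to, String chunkId) {
        // 1. Copy the bulk of the data; concurrent writes go to the op log.
        to.writeChunk(chunkId, from.readChunk(chunkId));

        // 2. Replay op log entries until the backlog is empty.
        List<Runnable> backlog = from.drainOpLog(chunkId);
        while (!backlog.isEmpty()) {
            for (Runnable op : backlog) op.run();           // apply a logged change
            backlog = from.drainOpLog(chunkId);
        }
        // 3. In a real system a brief write pause and final sync would
        //    happen here before ownership of the chunk flips to `to`.
    }
}
```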
Partition info (and other cluster metadata) is stored in a separate Mongo replica set dedicated to this purpose. Having it in a separate set of processes seems a little odd, but I can see why.
Replica sets use a third dedicated process to vote on which member is the master of the replica set. I suppose this helps when you don't have an odd number of replicas.
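If I understood the voting correctly, the point of the extra process is just to make a majority possible. A toy illustration (the simple majority rule is my assumption):

```java
// Hypothetical illustration of why an extra voter helps: with two
// data-bearing members a tie is possible, so a third vote breaks it.
class Election {
    static boolean hasMajority(int votesFor, int totalVoters) {
        return votesFor > totalVoters / 2;
    }

    public static void main(String[] args) {
        // Two replicas split by a network partition: one vote each, no majority.
        System.out.println(hasMajority(1, 2)); // false -> no master elected
        // Add the dedicated voter: whichever side it can see gets 2 of 3 votes.
        System.out.println(hasMajority(2, 3)); // true  -> a master is elected
    }
}
```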
A proxy process that poses as a mongod is used to make interacting with a cluster transparent. To the actual Mongo processes in the cluster, the proxy looks like a regular Mongo client; to client applications, it looks like a single Mongo server, but one backed by the resources of the entire cluster. You can still connect to Mongo shards directly if you are into that sort of thing (don't!).
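This is what keeps client code shard-agnostic. A sketch using the (legacy) 2.x Java driver with a hypothetical host name; the only thing that changes versus talking to a single mongod is which process you point the client at:

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

// Sketch using the (legacy) 2.x Java driver; host/port are hypothetical.
public class ProxyDemo {
    public static void main(String[] args) throws Exception {
        // Connect to the proxy instead of an individual shard.
        MongoClient client = new MongoClient("mongos.example.com", 27017);
        DBCollection c = client.getDB("test").getCollection("events");
        // The proxy routes this to whichever shard owns the key range.
        c.insert(new BasicDBObject("msg", "hello from the client"));
        client.close();
    }
}
```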
Sort of a Unixy, use/reuse-many-small-hammers approach to sharding. It does seem to have a lot of moving parts for the user to manage.
The WAL is going to use the op log that already exists for replicating to slaves. I forgot to ask what the approach to compacting/truncating the log is going to be. Any discussion of the exact approach they will take, and when, is of course tentative.
Uses a thread per connection. Node.js fans beware. They have not experienced the joy of libevent yet.
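For anyone who hasn't seen the model: a minimal thread-per-connection server looks roughly like this (my sketch in Java, obviously not Mongo's internals):

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

// Minimal sketch of the thread-per-connection model: every accepted
// socket gets its own thread, which blocks on IO. Simple to write,
// but each idle client costs a full thread stack. (Java 9+.)
public class ThreadPerConnection {
    public static void main(String[] args) throws IOException {
        ServerSocket server = new ServerSocket(9999);
        while (true) {
            Socket conn = server.accept();           // blocks until a client connects
            new Thread(() -> handle(conn)).start();  // one thread per connection
        }
    }

    static void handle(Socket conn) {
        try (conn) {
            // The thread sleeps in the kernel while waiting for bytes;
            // here it just echoes input back to the client.
            conn.getInputStream().transferTo(conn.getOutputStream());
        } catch (IOException ignored) {
        }
    }
}
```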
They plan to split the global RW lock, I think down to the document level. It will be interesting to see whether the bottleneck (not a knock, everything has a bottleneck) is lock contention or IO. According to this, lock contention could be it. I would like to learn more about how the IO pattern of the virtual memory system translates to actual disk operations in real use.
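The general shape of the change, as I understand it, is replacing one process-wide lock with many finer-grained ones. A hypothetical before/after sketch (the stripe count and hashing scheme are my invention):

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical before/after sketch of splitting a global RW lock
// into finer-grained locks, here striped by document id hash.
class LockSplitting {
    // Before: every read/write in the process contends on one lock.
    static final ReadWriteLock GLOBAL = new ReentrantReadWriteLock();

    // After: contention only between operations on the same stripe.
    static final ReadWriteLock[] STRIPES = new ReadWriteLock[64];
    static {
        for (int i = 0; i < STRIPES.length; i++) {
            STRIPES[i] = new ReentrantReadWriteLock();
        }
    }

    static ReadWriteLock lockFor(String docId) {
        return STRIPES[Math.floorMod(docId.hashCode(), STRIPES.length)];
    }

    static void update(String docId, Runnable mutation) {
        ReadWriteLock lock = lockFor(docId);  // was: GLOBAL
        lock.writeLock().lock();
        try {
            mutation.run();
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```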
Someone stated that one of the design goals was to push complexity into the client libraries (libraries that 10gen puts a lot of work into). Hopefully I am not butchering that statement. I believe I heard that node failure and fail-over are transparent to clients, which shows the advantage of that approach.
I also got a peek at the on-disk format. It is what you would expect if you had to manage your own heap, which is essentially what it does: free lists of various sizes, with objects occasionally being moved.
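The general pattern, sketched from memory (bucket sizes and layout are my guesses, not the actual format):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a size-bucketed free list allocator, the
// general shape the on-disk format suggested: freed records of similar
// sizes sit on per-bucket lists and get reused for new objects.
class FreeListHeap {
    // Bucket i holds free records of size <= 32 << i bytes (sizes are mine).
    private final Deque<Long>[] buckets;
    private long nextFreeOffset = 0;

    @SuppressWarnings("unchecked")
    FreeListHeap() {
        buckets = new Deque[8];
        for (int i = 0; i < buckets.length; i++) buckets[i] = new ArrayDeque<>();
    }

    private int bucketFor(int size) {
        int i = 0;
        while ((32 << i) < size) i++;
        return i;
    }

    long allocate(int size) {
        Deque<Long> bucket = buckets[bucketFor(size)];
        if (!bucket.isEmpty()) return bucket.pop();  // reuse a freed record
        long offset = nextFreeOffset;                // else extend the heap
        nextFreeOffset += 32 << bucketFor(size);
        return offset;
    }

    void free(long offset, int size) {
        buckets[bucketFor(size)].push(offset);       // back onto a free list
    }
}
```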
BSON is nice. Hooray for length-prefixed values and lazy (or no) deserialization. I die a little inside every time I have to copy an object off/on the Java heap and deserialize it. I have taken to just leaving things in ByteBuffers and wrapping them with objects that pull fields out of the ByteBuffer on demand.
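Here is the rough shape of the ByteBuffer-wrapping trick I mean, using an invented length-prefixed record format rather than real BSON:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Sketch of the wrap-a-ByteBuffer pattern. The record format is invented
// for illustration: an int32 length prefix followed by a UTF-8 payload.
// Because the length comes first, records can be skipped without ever
// deserializing their contents.
class LazyRecord {
    private final ByteBuffer buf; // shared view, never copied onto the heap

    LazyRecord(ByteBuffer buf) {
        this.buf = buf.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    }

    int length() {
        return buf.getInt(buf.position()); // peek at the prefix, no copying
    }

    // Deserialize only when a caller actually asks for the value.
    String payload() {
        byte[] bytes = new byte[length()];
        ByteBuffer view = buf.duplicate();
        view.position(buf.position() + 4);
        view.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Skipping a record you don't care about costs one int read.
    int nextRecordPosition() {
        return buf.position() + 4 + length();
    }
}
```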
Thank you to 10gen for hosting the event.