
I have an instance of MongoDB 4.0.6 with WiredTiger, and the documentation says that db.serverStatus() includes a storageEngine section in its output with some info on the WiredTiger flags. On my instance there is no storageEngine section in the output at all. Am I missing something?

To be specific, I have a sharded cluster with a 3-node replica set where one of the nodes is an arbiter, which leads to an ever-growing WiredTigerLAS.wt file when creating many records quickly. The documentation says we can turn off readConcernMajority with '--enableMajorityReadConcern false', which we have done, but that does not stop the lookaside file growth. I'd like to check the status of that flag, and again following the documentation I'm using db.serverStatus(), but contrary to the documentation the output contains no storageEngine section at all.
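For reference, the mongod processes are started with '--enableMajorityReadConcern false' as described above, and this is roughly what I've been running in the shell to check for the section; the storageEngine lookup is the part that comes back empty for me:

db.serverStatus().storageEngine        // per the 4.0 docs this should be a small section, e.g. { name: "wiredTiger", ... }, but it's undefined here
Object.keys(db.serverStatus())         // just to list which top-level sections the output actually contains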

Maybe I should just change the arbiter into a data-bearing node, but I was hoping to avoid the cost (a rough sketch of that reconfiguration is below, after the rs.conf() output). Any help on this issue is appreciated.

Adding rs.status() (note that the secondary is currently resyncing from the primary); a note on how I'm checking replication lag follows the output:

{
        "set" : "shardRS1",
        "date" : ISODate("2019-06-06T17:09:06.986Z"),
        "myState" : 2,
        "term" : NumberLong(66),
        "syncingTo" : "1b:27018",
        "syncSourceHost" : "1b:27018",
        "syncSourceId" : 0,
        "heartbeatIntervalMillis" : NumberLong(2000),
        "optimes" : {
                "lastCommittedOpTime" : {
                        "ts" : Timestamp(1559840946, 1),
                        "t" : NumberLong(66)
                },
                "readConcernMajorityOpTime" : {
                        "ts" : Timestamp(1559840946, 1),
                        "t" : NumberLong(66)
                },
                "appliedOpTime" : {
                        "ts" : Timestamp(1559840946, 1),
                        "t" : NumberLong(66)
                },
                "durableOpTime" : {
                        "ts" : Timestamp(1559840946, 1),
                        "t" : NumberLong(66)
                }
        },
        "members" : [
                {
                        "_id" : 0,
                        "name" : "1b:27018",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 177202,
                        "optime" : {
                                "ts" : Timestamp(1559840936, 1),
                                "t" : NumberLong(66)
                        },
                        "optimeDurable" : {
                                "ts" : Timestamp(1559840936, 1),
                                "t" : NumberLong(66)
                        },
                        "optimeDate" : ISODate("2019-06-06T17:08:56Z"),
                        "optimeDurableDate" : ISODate("2019-06-06T17:08:56Z"),
                        "lastHeartbeat" : ISODate("2019-06-06T17:09:05.722Z"),
                        "lastHeartbeatRecv" : ISODate("2019-06-06T17:09:05.263Z"),
                        "pingMs" : NumberLong(0),
                        "lastHeartbeatMessage" : "",
                        "syncingTo" : "",
                        "syncSourceHost" : "",
                        "syncSourceId" : -1,
                        "infoMessage" : "",
                        "electionTime" : Timestamp(1559662230, 1),
                        "electionDate" : ISODate("2019-06-04T15:30:30Z"),
                        "configVersion" : 2
                },
                {
                        "_id" : 1,
                        "name" : "1d:27018",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 177203,
                        "optime" : {
                                "ts" : Timestamp(1559840946, 1),
                                "t" : NumberLong(66)
                        },
                        "optimeDate" : ISODate("2019-06-06T17:09:06Z"),
                        "syncingTo" : "1b:27018",
                        "syncSourceHost" : "1b:27018",
                        "syncSourceId" : 0,
                        "infoMessage" : "",
                        "configVersion" : 2,
                        "self" : true,
                        "lastHeartbeatMessage" : ""
                },
                {
                        "_id" : 2,
                        "name" : "1c:27018",
                        "health" : 1,
                        "state" : 7,
                        "stateStr" : "ARBITER",
                        "uptime" : 177202,
                        "lastHeartbeat" : ISODate("2019-06-06T17:09:05.933Z"),
                        "lastHeartbeatRecv" : ISODate("2019-06-06T17:09:05.468Z"),
                        "pingMs" : NumberLong(1),
                        "lastHeartbeatMessage" : "",
                        "syncingTo" : "",
                        "syncSourceHost" : "",
                        "syncSourceId" : -1,
                        "infoMessage" : "",
                        "configVersion" : 2
                }
        ],
        "ok" : 1,
        "operationTime" : Timestamp(1559840946, 1),
        "$gleStats" : {
                "lastOpTime" : Timestamp(0, 0),
                "electionId" : ObjectId("000000000000000000000000")
        },
        "lastCommittedOpTime" : Timestamp(1559840946, 1),
        "$configServerState" : {
                "opTime" : {
                        "ts" : Timestamp(1559840926, 1),
                        "t" : NumberLong(35)
                }
        },
        "$clusterTime" : {
                "clusterTime" : Timestamp(1559840946, 1),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        }
}
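
To gauge replication lag (asked about in the comments below) I've been comparing the optimeDate values of the members above and using the shell helper, roughly like this on the shard primary:

rs.printSlaveReplicationInfo()   // prints how far behind the primary each secondary's last applied op is

// or by hand from rs.status(), using the optimeDate fields shown above
var s = rs.status();
var primary = s.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
var secondary = s.members.filter(function (m) { return m.stateStr === "SECONDARY"; })[0];
print("lag in seconds: " + (primary.optimeDate - secondary.optimeDate) / 1000);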

Adding rs.conf():

{
        "_id" : "shardRS1",
        "version" : 2,
        "protocolVersion" : NumberLong(1),
        "writeConcernMajorityJournalDefault" : true,
        "members" : [
                {
                        "_id" : 0,
                        "host" : "1b:27018",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 1,
                        "tags" : {

                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                },
                {
                        "_id" : 1,
                        "host" : "1d:27018",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 1,
                        "tags" : {

                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                },
                {
                        "_id" : 2,
                        "host" : "1c:27018",
                        "arbiterOnly" : true,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 0,
                        "tags" : {

                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                }
        ],
        "settings" : {
                "chainingAllowed" : true,
                "heartbeatIntervalMillis" : 2000,
                "heartbeatTimeoutSecs" : 10,
                "electionTimeoutMillis" : 10000,
                "catchUpTimeoutMillis" : -1,
                "catchUpTakeoverDelayMillis" : 30000,
                "getLastErrorModes" : {

                },
                "getLastErrorDefaults" : {
                        "w" : 1,
                        "wtimeout" : 0
                },
                "replicaSetId" : ObjectId("5c8743e6b6916b268c071e3a")
        }
}
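
And in case I do end up converting 1c from an arbiter into a data-bearing member: my understanding is that an arbiter can't be changed in place with rs.reconfig(), so the rough sketch (please correct me if this is wrong) would be to shut down the arbiter, clear its data directory, restart mongod on 1c:27018 as a regular shard member, and then from the primary:

rs.remove("1c:27018")                                 // drop the arbiter from the replica set config
rs.add({ host: "1c:27018", priority: 1, votes: 1 })   // re-add it as a normal data-bearing member
rs.conf().members                                     // confirm arbiterOnly is no longer set for 1c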
  • The WiredTiger lookaside file is a disk-based swap mechanism for the WiredTiger in-memory cache (analogous to a swapfile in Linux). Read concern majority can contribute to growth of the lookaside file in a Primary-Secondary-Arbiter configuration with a Secondary down, but the lookaside growing with all of your replica set members online is usually an indication that your WiredTiger cache is too small to keep up with your workload. If you want to check the status of server config parameters, the shell command to use is db.serverCmdLineOpts().parsed. – Stennie Jun 01 '19 at 22:35
  • Are all of your data-bearing members using WiredTiger and MongoDB 4.0.6? Also, your description mentions sharding: is this deployment a three-member replica set on its own, or a replica set that is part of a sharded cluster? Lastly, what is the output of version() and db.version() in the mongo shell? – Stennie Jun 01 '19 at 22:35
  • Thank you both. I did look at db.serverCmdLineOpts() but none of the WiredTiger params show up there (a sketch of that check is below, after these comments). It seems to me that if the workload exceeds the replication rate, which I presume is the issue, then filling the hard drive and crashing is not a useful response. I realize that data must exist somewhere. However, the total size of the data being written is a few gigabytes, under 10. The size of the lookaside file grows to over 120 gigabytes, many times the size of the data. The response of the storage engine is ridiculous. And there should be some way to tell the... – david Jun 03 '19 at 15:52
  • ...system to wait for all replicas to confirm before returning from the insert, slowing the process to the available power of the cluster, as opposed to crashing the system, which is exactly the wrong response in a database. But yes, we have 2 data nodes and 1 arbiter, all 4.0.6, from both version() and db.version(). Do you think I can avoid this by artificially slowing the insert rate, since the database seems unable to keep itself safe? Is there some parameter I can set that will force fully journaled replication as opposed to the ridiculous lookaside crash? Thanks again. – david Jun 03 '19 at 15:53
  • Is this as simple as setting 'w: 2, j: true'? – david Jun 03 '19 at 16:08
  • Given your description so far, a lookaside file growing to 100+ GB suggests an error in your configuration, but more details are needed. If the WiredTiger cache is full and data can't be evicted, the alternative to a disk-based lookaside file would be stopping writes until the cache pressure clears. This is similar to your O/S not having enough RAM and needing to use swap or virtual memory: you don't want a workload that relies on significant lookaside. With reasonable resources for your workload there should be no (or minimal) need for lookaside. – Stennie Jun 04 '19 at 08:03
  • Using w:2 or w:majority write concern would throttle writes to keep up with replication, but it's not clear if that is the actual reason for the growth of your lookaside file. Is there significant replication lag (which could cause cache pressure)? Can you edit your question to include the output of rs.status() and rs.conf()? Have you changed any other server options from the default? How much RAM do your instances have? What specific O/S and version are you using? Are these bare metal instances, VMs, containers, ... ? I know that's a lot of questions, but this should be solvable. – Stennie Jun 04 '19 at 08:10
  • Your questions are incredibly helpful, so the quantity of them is not a problem. I'll do my best to answer. We have, as I've said, 2 data-bearing nodes and 1 arbiter, all 4.0.6. None of the nodes has any users other than myself currently, although they are intended for production. The production use case is very different from the current one, which is to pull 8 million records from one collection, which is {key:, name:, value:{}}, and insert them into another collection, which is {key:, items:{name:}}, essentially restructuring the data. – david Jun 04 '19 at 19:17
  • We have been pulling a lot of records, in the billions, from the current production cluster (Mongo 2.4) into this new Mongo 4.0.6 cluster. All of these prior records are a simple get from the 2.4 collections and an insert into the 4.0.6 collections, with no restructuring at all (pure inserts). And the data rates have not been a problem: no blooming lookaside file growth. I find it odd that this simple restructuring of far fewer records is causing a problem. – david Jun 04 '19 at 19:20
  • The throughput for the restructuring task is no higher than the previous copy tasks, which didn't seem to have replication lag. Maybe it's the set/update operations as opposed to plain inserts. I'm not sure what the restructuring task's replication lag is. Is there a way to track that metric (I'll go look on my own as well, as that's something I should know)? I assume there must be. We have 4 shards, but I assume they are roughly equivalent, so I'll add rs.status() and rs.conf(). – david Jun 04 '19 at 19:24
  • All shards start up with --wiredTigerCacheSizeGB 4.0 --serviceExecutor adaptive --enableMajorityReadConcern false, which are the only changes from the defaults other than log and data locations. The instances have 16GB of RAM and Ubuntu 16.04, and are AWS VMs: r4.xlarge for the data nodes, t3.medium for the arbiter. Anything else I can provide? – david Jun 04 '19 at 19:34
  • I can attempt to guess the proper throughput rate, but I'd really rather the database tell me how fast it can go. So if w:2 works (and I think it has to be w:2, because w:majority, which is the current default?, does not take the arbiter into account and is part of the problem?), that would be fine. The code operating on Mongo will happily go as fast as Mongo returns from the updates, so if that is throttled properly to keep the lookaside from growing, that's good. – david Jun 04 '19 at 19:37
  • w:2 does the trick (a sketch of the call is below, after these comments). Throughput collapses, but the transfer is stable, which is the important part. I'd like to improve the throughput, but I've gotten used to how slow Mongo is. We use it for holding data whose structure is not amenable to a SQL database, or for when there are more than a couple hundred million rows and a SQL database would be more difficult to scale, so this works. If you have any suggestions for improving throughput I'd be happy to hear them, though. Thanks for all the help. – david Jun 05 '19 at 00:03
  • Unfortunately the rs.status() output wasn't helpful since the secondary was re-syncing at the time, but I'm curious about replication lag. If an increased write concern resolved your issue, it seems likely that the primary was getting ahead of your secondary and data was being pinned in the primary cache waiting for replication. Once reads start dipping into disk the performance will suffer, but a 100GB lookaside is unexpectedly large. Note: the default write concern is w:1 (not w:majority). Increasing the size of the WT cache may help somewhat but isn't likely to resolve this. – Stennie Jun 06 '19 at 07:36
  • I would try removing the 4GB cache setting and test with the default cache size, which would be 50% of (RAM-1GB), or 7.5GB. I'd also suggest testing without the adaptive serviceExecutor flag (which is still experimental as of MongoDB 4.0). Some workloads had performance issues with the adaptive serviceExecutor, so we haven't been confident enough to make this the default execution model yet. I didn't find any relevant open issues, but it would be good to remove any potential impact since your problem seems easily reproducible. – Stennie Jun 06 '19 at 07:36
  • I updated with the current rs.status() in case that's helpful at all. The reason for the 4GB cache setting is that each data-bearing node has 4 shards, with 2 being primary on each node at any one time (which we found helpful in distributing write load with w:1, at least in our Mongo 2.4 setup), so we're probably giving it too much memory actually, but we're not memory bound, and I agree that probably won't be a solution, but I can play with it. I'll try again without the serviceExecutor flag (the startup line I plan to test is below, after these comments). – david Jun 06 '19 at 17:19
  • The lookaside file reaches 120 GB before it consumes the drive, which I found surprising as well. I don't expect the cache file to be as compact as the BSON representation in memory, but I'm sure it's many times the size of the actual data we're trying to update the collection with. I know Mongo provides a range of trade-offs I can make between throughput and resiliency, and those don't bother me. If I lose data on some unexpected shutdown due to choices I've made, at least those were mine. But a data store should never trade stability for throughput. – david Jun 06 '19 at 17:24
  • A crashed service is worth nothing. I find it disconcerting that the default settings are such that any prolonged throughput rate that outruns replication will result in a crash. Far better to limit the throughput; if I'm not happy with that, I can add capacity, which is the major benefit of a horizontally scalable data store like Mongo. A crash in a data store, by contrast, especially one so unexpected and fast, is a huge mess. If you're going to leave this as is, it needs a much more prominent and thorough section in the documentation. – david Jun 06 '19 at 17:31
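
For anyone who lands here later: the flag check suggested in the comments looks roughly like this in the shell (the replication sub-document is my guess at where the command-line option ends up in the parsed output):

db.serverCmdLineOpts().parsed                 // full parsed startup configuration
db.serverCmdLineOpts().parsed.replication     // I expect something like { ..., enableMajorityReadConcern: false } here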
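The write concern that stabilized the transfer (the "w:2 does the trick" comment above) looks roughly like this; the collection, batch, and field names here are placeholders rather than our real schema:

// placeholder names; the relevant part is the per-operation writeConcern
db.restructured.insertMany(batch, { writeConcern: { w: 2, j: true } })

// or per document during the restructuring step
db.restructured.updateOne(
        { key: someKey },
        { $set: { "items.someName": someValue } },
        { upsert: true, writeConcern: { w: 2, j: true } }
)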
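Finally, the startup line I plan to test next, based on the suggestions above: the same as before, but with the default cache size and without the adaptive serviceExecutor (log and data paths omitted as usual):

# current startup options (other than paths)
mongod --wiredTigerCacheSizeGB 4.0 --serviceExecutor adaptive --enableMajorityReadConcern false ...

# next test run: default WiredTiger cache (50% of RAM minus 1GB), default serviceExecutor
mongod --enableMajorityReadConcern false ...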

0 Answers