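On Auto Commit and Soft Commit

Source: http://stackoverflow.com/questions/17654266/solr-autocommit-vs-autosoftcommit/17666569#17666569

In one sentence: Solr soft commits are not free, and they use a fair amount of memory. How to combine the two kinds of commit depends on the use case.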

I always shudder at this, because any recommendation will be wrong in some cases. My first recommendation would be to not overthink the problem. Some very smart people have tried to make the entire process robust. Try the simple things first and only tweak things as necessary. In particular, look at the size of your transaction logs and adjust your hard commit intervals to keep these “reasonably sized”. Remember that the penalty is mostly the replay-time involved if you restart after a JVM crash. Is 15 seconds tolerable? Why go smaller then?

We’ve seen situations in which the hard commit interval is much shorter than the soft commit interval; see the bulk indexing bit below.

These are places to start.

HEAVY (BULK) INDEXING

The assumption here is that you’re interested in getting lots of data to the index as quickly as possible for search sometime in the future. I’m thinking original loads of a data source etc.

Set your soft commit interval quite long, as in 10 minutes. Soft commit is about visibility, and my assumption here is that bulk indexing isn’t about near real time searching, so don’t do the extra work of opening any kind of searcher.

Set your hard commit intervals to 15 seconds, openSearcher=false. Again, the assumption is that you’re going to be just blasting data at Solr. The worst case here is that you restart your system and have to replay 15 seconds or so of data from your tlog. If your system is bouncing up and down more often than that, fix the reason for that first.

Only after you’ve tried the simple things should you consider refinements; they’re usually only required in unusual circumstances. But they include:

- Turning off the tlog completely for the bulk-load operation.
- Indexing offline with some kind of map-reduce process.
- Only having a leader per shard, no replicas for the load, then turning on replicas later and letting them do old-style replication to catch up. Note that this is automatic: if the node discovers it is “too far” out of sync with the leader, it initiates an old-style replication. After it has caught up, it’ll get documents as they’re indexed to the leader and keep its own tlog.
- etc.
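As a rough sketch, the two bulk-indexing intervals could be written in solrconfig.xml along these lines. The autoCommit and autoSoftCommit elements are Solr's standard update handler settings and maxTime is in milliseconds; the enclosing updateHandler element is assumed to match a stock configuration, so merge the two inner elements into whatever your file already has rather than pasting the whole block.

```xml
<!-- Sketch only: bulk (heavy) indexing settings, to be merged into the
     existing updateHandler section of solrconfig.xml. -->
<updateHandler class="solr.DirectUpdateHandler2">

  <!-- Hard commit every 15 seconds (15000 ms), without opening a
       new searcher, so it affects durability but not visibility. -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

  <!-- Soft commit (visibility) every 10 minutes (600000 ms). -->
  <autoSoftCommit>
    <maxTime>600000</maxTime>
  </autoSoftCommit>

</updateHandler>
```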

INDEX-HEAVY, QUERY-LIGHT

By this I mean, say, searching log files. This is the case where you have a lot of data coming at the system pretty much all the time. But the query load is quite light, often to troubleshoot or analyze usage.

Set your soft commit interval quite long, up to the maximum latency you can stand for documents to be visible. This could be just a couple of minutes or much longer, maybe even hours, with the capability of issuing a hard commit (openSearcher=true) or soft commit on demand.

Set your hard commit to 15 seconds, openSearcher=false.

INDEX-LIGHT, QUERY-LIGHT OR HEAVY

This is a relatively static index that sometimes gets a small burst of indexing. Say every 5-10 minutes (or longer) you do an update.

Unless NRT functionality is required, I’d omit soft commits in this situation and do hard commits every 5-10 minutes with openSearcher=true. This is a situation in which, if you’re indexing with a single external indexing process, it might make sense to have the client issue the hard commit.
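If the commits are left to Solr rather than to the client, this case might look roughly like the following in solrconfig.xml. The 5-minute value is just the lower end of the range mentioned above, picked for illustration; the enclosing updateHandler element is assumed to match a stock configuration.

```xml
<!-- Sketch only: index-light settings, to be merged into the existing
     updateHandler section of solrconfig.xml. -->
<updateHandler class="solr.DirectUpdateHandler2">

  <!-- Hard commit every 5 minutes (300000 ms, an illustrative value)
       and open a new searcher so the changes become visible. -->
  <autoCommit>
    <maxTime>300000</maxTime>
    <openSearcher>true</openSearcher>
  </autoCommit>

  <!-- No autoSoftCommit element here: this sketch relies on the
       hard commit alone for visibility. -->

</updateHandler>
```

If a single external indexing process issues the hard commit itself, the autoCommit element above becomes a safety net rather than the main mechanism.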

INDEX-HEAVY, QUERY-HEAVY

This is the Near Real Time (NRT) case, and is really the trickiest of the lot. This one will require experimentation, but here’s where I’d start:

Set your soft commit interval to as long as you can stand. Don’t listen to your product manager who says “we need no more than 1 second latency”. Really. Push back hard and see if the user is best served or will even notice. Soft commits and NRT are pretty amazing, but they’re not free.

Set your hard commit interval to 15 seconds.
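Because the soft commit interval in this case has to be found by experiment, one option is to make it overridable with Solr's ${property:default} substitution, so it can be changed with a system property at startup instead of by editing the file. This is only a sketch: the 30000 ms default is a placeholder and not a recommendation from the text, openSearcher=false is assumed by analogy with the other cases, and the property name follows the convention used in stock Solr configurations.

```xml
<!-- Sketch only: NRT settings, to be merged into the existing
     updateHandler section of solrconfig.xml. -->
<updateHandler class="solr.DirectUpdateHandler2">

  <!-- Hard commit every 15 seconds (15000 ms). openSearcher=false is
       an assumption here; visibility is left to the soft commit. -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

  <!-- Soft commit interval, overridable at startup with
       -Dsolr.autoSoftCommit.maxTime=... ; 30000 ms is a placeholder
       default to be tuned by experiment. -->
  <autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:30000}</maxTime>
  </autoSoftCommit>

</updateHandler>
```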
