
I have roughly 50,000 records that meet my condition for batch processing. Say my batch job begins at 10:01 AM and ends at 10:10 AM, with the batch size set to 200.

Can someone tell me what would happen in the following cases?

a) If a user creates a record at 10:03 AM that meets my condition, will this record also be processed?

b) Are the records being processed in Batch 1 technically locked until that batch has been processed completely?

c) What happens if a record in Batch N no longer meets my condition, but that batch has already started processing?

d) Is it possible to set up parallel processing of sorts, so that Salesforce processes multiple batches in parallel instead of running one batch at a time?

I searched online and found manuals and instructions on how to run Batch Apex, but I could not get clarity on these specific cases.

user25311

1 Answer


My understanding is that all the records that make up the batch are determined when the Database.QueryLocator is created in the start method.

The set of records in that query locator will not change for the duration of the batch execution. I originally believed that the data in the execute List<sObject> would appear as it existed when the query locator was created, but that turns out not to be the case.

Update: as per comments from Stephen Willcock and his associated blog post, Batch Apex Query Behaviour:

only the IDs of the records [are] put into temporary storage. The platform will retrieve the records by ID just-in-time to be processed for each batch.

If any of the records are modified by another user or process while the Batch job is running, it is likely that it will be the modified version of the records which will be passed into the execute, and not the version of the records as they were when the start method was called.

Any records that are created or modified to subsequently meet the SOQL query after the query locator is created will not be detected.
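
Given that behaviour, it seems sensible for execute to re-check the condition rather than trust the filter applied in start. Below is a minimal sketch of that idea, not a definitive implementation. The class name and the Needs_Processing__c checkbox field on Account are hypothetical and only there for illustration. Note that the query in start selects every field that execute reads, as pointed out in the comments.

```apex
// Sketch only: Needs_Processing__c is a hypothetical checkbox field on Account.
public class ReprocessAccountsBatch implements Database.Batchable<sObject> {

    public Database.QueryLocator start(Database.BatchableContext bc) {
        // The set of record IDs is fixed here. Records created or changed
        // afterwards so that they match the filter are not picked up by this job.
        return Database.getQueryLocator(
            'SELECT Id, Name, Needs_Processing__c ' +
            'FROM Account WHERE Needs_Processing__c = true'
        );
    }

    public void execute(Database.BatchableContext bc, List<sObject> scope) {
        List<Account> toUpdate = new List<Account>();
        for (Account acc : (List<Account>) scope) {
            // Field values may have changed since start() ran, so re-check
            // the condition before doing any work on the record.
            if (acc.Needs_Processing__c != true) {
                continue;
            }
            acc.Needs_Processing__c = false;
            toUpdate.add(acc);
        }
        update toUpdate;
    }

    public void finish(Database.BatchableContext bc) {
        // Nothing to do in this sketch.
    }
}
```

The job would be started with a scope size of 200 via Database.executeBatch(new ReprocessAccountsBatch(), 200);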

Record locking wouldn't occur across the entire batch job, as you can't use FOR UPDATE in a batch SOQL query. If you do try this, you get the error message:

Locking is implied for each batch execution and therefore FOR UPDATE should not be specified

It will however lock the records within each execution of execute. See QueryLocator.getQuery() usage:

You cannot use the FOR UPDATE keywords with a getQueryLocator query to lock a set of records. The start method automatically locks the set of records in the batch.

If you can segment your data into distinct sets via the SOQL query, then you could start a separate batch job for each segment at the same time. This doesn't guarantee that the jobs will actually run in parallel, although they may.
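
As a rough sketch of that segmentation idea, one option is to pass the segment into the batch class through its constructor and submit one job per segment. The class name, the Region__c field, and the region values are all hypothetical. The platform still decides when each submitted job actually runs.

```apex
// Sketch only: Region__c is a hypothetical picklist/text field on Account.
public class RegionAccountsBatch implements Database.Batchable<sObject> {

    private final String region;

    public RegionAccountsBatch(String region) {
        this.region = region;
    }

    public Database.QueryLocator start(Database.BatchableContext bc) {
        // Each job only sees the records for its own segment, so the
        // jobs do not compete for the same rows.
        return Database.getQueryLocator(
            [SELECT Id, Name FROM Account WHERE Region__c = :region]
        );
    }

    public void execute(Database.BatchableContext bc, List<sObject> scope) {
        // Per-segment processing goes here.
    }

    public void finish(Database.BatchableContext bc) {
        // Nothing to do in this sketch.
    }
}
```

Submitting the jobs, for example from anonymous Apex:

```apex
// One job per segment; whether they run concurrently is up to the platform.
for (String region : new List<String>{ 'EMEA', 'APAC', 'AMER' }) {
    Database.executeBatch(new RegionAccountsBatch(region), 200);
}
```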

Daniel Ballinger
  • I think that the set of record IDs for the entire job is established in the start method, but the records are queried during the job before execute (and I don't think the query-chunks are guaranteed to be 1:1 with the batches), therefore the records could be changed between start and execute. This means that records which no longer would have been selected are still selected, but the record data in the execute may be different from when the job started – Stephen Willcock Oct 23 '15 at 09:58
  • @StephenWillcock In hindsight what you say would make a lot of sense. Just have the IDs to drive the QueryLocator (Server Side Cursor). Especially when dealing with large volumes of records. I'm still trying to find some official documentation on it. Reading comments in Only query Id field in start() to speed up Apex Batch confuses the matter a bit, as it implies the data is frozen. I'll do some more research and experiments. – Daniel Ballinger Oct 24 '15 at 07:37
  • Yes Daniel, I haven't been able to find any concrete documentation either. If I remember correctly, I took my understanding from a conversation with someone authoritative at Salesforce, and this was backed up by developer observations in the office. I'm also doing some more thorough testing to clarify my own understanding which I'll share. – Stephen Willcock Oct 24 '15 at 09:38
  • Daniel, my tests seem to confirm what I thought - that only the IDs and not the records are fixed in the start. I've written it up here http://wp.me/p2XgT8-al with a link to my test code. Let me know what you find. – Stephen Willcock Oct 25 '15 at 23:00
  • @StephenWillcock Thanks for the update. Your findings seem pretty conclusive. I've updated the answer to reflect this. – Daniel Ballinger Oct 26 '15 at 20:32
  • @StephenWillcock Could I make one addition? Even though it might be true that records are loaded on the fly before the execute() function, Salesforce will only load those fields that are specified in the QueryLocator in the start() method! So just querying the ID in the start() method is not sufficient, because then the execute() function will only have records that have an ID and no other field at all. – Willem Mulder Mar 21 '17 at 10:02
  • It strikes me that the official documentation statement (the last quote in Daniel's answer) contains an inaccuracy. It makes no sense for the locks to be applied during start and that the locking would be applied around each call to execute. Has anyone else raised this point with Salesforce? (I have provided documentation feedback questioning this statement a couple of times now. Of course, I have heard nothing back.) – Phil W Jul 26 '19 at 07:57