25

We're experiencing a very annoying issue with a Batch job processing a "Large Data Volume" (1.5M records out of a table containing 4.5M records). The last statement of our batch start method is: return Database.getQueryLocator(soql); and although the query completes successfully, returning 1.5M rows, the batch doesn't kick off and a CPU timeout error is reported.

I believe the CPU timeout cannot be attributed to any of our logic, since the query is our last code statement and it completes successfully (see the pseudo-code, debug log, and screenshots of the CPU timeline and Apex Jobs page provided below). I strongly suspect the issue is related to the "housekeeping" sfdcfox mentions here, but that's just a weak suspicion: if that were the case, I suppose I wouldn't be the only one experiencing the issue.

We have optimized the SOQL so that only the Id field of the targeted records is selected: the SOQL contains no child/sub-queries, and the records it returns are as lightweight as they can be, since only their Id field is selected. Child queries are a known factor that causes CPU timeouts in batch start() methods, but after our optimization that factor can be ruled out. I wonder, then, whether there are others.

The question (and bounty) is out for a clear, documented, detailed answer to this: what factors can affect the post-processing that SFDC performs on the record set returned by a batch start() method, other than the presence of child queries or parent records?

Pseudo-code that gives an idea of my batch:

global with sharing class APseudoBatch implements Database.Batchable<SObject> {

  private String soqlString = 'SELECT alias.Id FROM MyObject__c alias WHERE ...';

  global Database.QueryLocator start(Database.BatchableContext BC) {

    MyOtherObject__c[] prevBatchData = [SELECT Id FROM MyOtherObject__c WHERE ...];

    delete prevBatchData;

    return Database.getQueryLocator(soqlString);

  }

  (...)

}
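For reference, the batch-chaining alternative suggested in the comments below (moving the delete out of start() into its own batch, then kicking off the main job from finish()) would look roughly like this. This is only a sketch: the CleanupBatch name is invented, and the WHERE clauses are elided as in the pseudo-code above.

```apex
global with sharing class CleanupBatch implements Database.Batchable<SObject> {

  global Database.QueryLocator start(Database.BatchableContext BC) {
    // Only the cleanup query lives here; no DML in start().
    return Database.getQueryLocator('SELECT Id FROM MyOtherObject__c WHERE ...');
  }

  global void execute(Database.BatchableContext BC, List<SObject> scope) {
    // The delete now runs in execute(), within its own set of limits.
    delete scope;
  }

  global void finish(Database.BatchableContext BC) {
    // Chain the main LDV batch only after the cleanup has completed.
    Database.executeBatch(new APseudoBatch());
  }
}
```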

Debug log

34.0 APEX_CODE,DEBUG;APEX_PROFILING,INFO;CALLOUT,INFO;DB,INFO;SYSTEM,DEBUG;VALIDATION,INFO;VISUALFORCE,INFO;WORKFLOW,INFO

05:12:19.152 (152753319)|SOQL_EXECUTE_BEGIN|[159]|Aggregations:0|SELECT id FROM AAA__c

05:12:19.152 (158857573)|SOQL_EXECUTE_END|[159]|Rows:0

05:12:19.152 (163217923)|SOQL_EXECUTE_BEGIN|[165]|Aggregations:0|SELECT id FROM BBB__c

05:12:19.152 (167389830)|SOQL_EXECUTE_END|[165]|Rows:1

05:12:19.152 (167994190)|DML_BEGIN|[405]|Op:Delete|Type:BBB__c|Rows:1

05:12:19.152 (401425184)|DML_END|[405]

05:12:19.152 (402249359)|SOQL_EXECUTE_BEGIN|[172]|Aggregations:0|SELECT id FROM CCC__c

05:12:19.152 (406737724)|SOQL_EXECUTE_END|[172]|Rows:4

05:12:19.152 (407218557)|DML_BEGIN|[405]|Op:Delete|Type:CCC__c|Rows:4

05:12:21.104 (2104679115)|DML_END|[405]

05:12:21.104 (2110622491)|DML_BEGIN|[409]|Op:Upsert|Type:DDD__c|Rows:4

05:12:21.104 (2183712913)|DML_END|[409]

05:12:21.104 (2192886624)|SOQL_EXECUTE_BEGIN|[188]|Aggregations:0|Select s.Id From LDV__c

05:13:25.773 (66773435973)|SOQL_EXECUTE_END|[188]|Rows:1517052

(Line 188 is the last line of my start() method. The debug log ends there; I've pasted it in full, as it is.)

Screenshot of Apex Jobs page

Failed batch line from Apex Jobs page

The SOQL completes successfully at 05:13:25, yet the Batch Job is marked failed at 05:16 on the Apex Jobs page. Strange, isn't it?

Screenshot of Execution Timeline

timeline from dev console

LAST UPDATE

Changed the batch query, removing all the WHERE criteria. The behavior of the CPU timeout changed: it now happens 10 minutes after the SOQL completes (as the linked sfdcfox post suggests) instead of 2-3 minutes as in my previous debug logs. The table contains 3.5M records now.

Adrian Larson
MLucci
    Please post the rest of the code in your start method. Asking if anyone else is experiencing the same problem leads to an opinion based discussion rather than one based on facts. If you have data to post from Dev Console for the Execution Timeline, I'm confident that would be helpful as well. – crmprogdev Jun 13 '16 at 13:22
  • a) I cannot post the rest of the code for security reasons; b) I believe asking whether others are experiencing the same doesn't lead to an opinion-based discussion, just to "a discussion" where people can share common experiences with issues; c) the SOQL is the last statement of the start method, and the timeline shows nothing related after that, that part being beyond our control – MLucci Jun 13 '16 at 13:34
  • 5
    You should put the delete in a separate batch and use batch chaining instead. – Adrian Larson Jun 13 '16 at 17:02
  • @AdrianLarson agree that's an optimization that could be done, but I don't think that's impacting anything in my case - I've just added the debug log in my question, you can see there that SOQLs and DMLs performed before the final LDV SOQL are targeting a max. of 4 records – MLucci Jun 13 '16 at 17:15
  • 2
    Just throwing out an idea: try limiting the records in the query and returning them (hit & try), and let me know the outcome. If this works, you can use a scheduler class to call this batch class and achieve the functionality – Krishna Bidwai Jun 13 '16 at 17:26
  • 4
    You can separate it into two batches and execute the second batch from the finish method of the first. In the second batch, query the results that were not retrieved in the first batch. This way you can reduce the timeout error. Just an idea – Anurag Jun 15 '16 at 13:57
  • 1
    your link to the sfdcfox post is broken – cropredy Jun 15 '16 at 14:57
  • @cropredy thx for noticing - corrected the link now – MLucci Jun 15 '16 at 15:04
  • How about adding a LIMIT clause to the SOQL. See if it works with 10,000 records and increase it to find the breaking point. – Jagular Jun 15 '16 at 16:46
  • 2
    Try removing the where clauses. It's counter intuitive but you might be better off determining which records to act on in the execute instead of the query. – Scott Morrison Jun 15 '16 at 20:29
  • Thank you all for the suggestions so far. I am sure I can get my data processing going somehow, but I am first trying to find an answer to the question "what is causing the CPU timeout" before jumping to alternative solutions. I can't really accept that I have to resort to fuzzy workarounds without an explanation of why the only/main facility SF provides to process LDV fails at doing so. The SOQL completes and is optimized (no aggregations, no parent/child query, only the Id field selected). And when it completes there's still plenty of CPU time left and no further operation performed by our code. – MLucci Jun 15 '16 at 21:25
  • 4
    @MLucci I think you're missing the point of the suggestions. You first need to get your batch method working and can then add things back in to see where, what & why it's breaking! It could be that something that's not being reported is causing it to break. You won't be able to see that until you eliminate these other things and can get something to actually work. Once you can, you can increase record limits, add where clauses, etc to see what it is that causes your batch code to break. – crmprogdev Jun 16 '16 at 14:26
  • Agree with your observation @crmprogdev... Unfortunately my hands are tied in terms of the time I can spend troubleshooting by trial and error, and there's also the fact that it is a managed package and the issue happens only with one specific customer, which makes issuing a patch for each attempt lengthy. I'll eventually have to push my manager to dedicate more time/effort to ripping out the logic. Still, the bitterness of having to waste time because the platform doesn't behave is hard to swallow; that's why I hope to hear from somebody who faced and resolved the same issue. – MLucci Jun 16 '16 at 14:40
  • But I see that until I do the meticulous job you mentioned, there's little chance of really identifying what "the same issue" is, either... ahi ahi ahi... – MLucci Jun 16 '16 at 14:43
  • 2
    Perhaps it's a difference in perspective. I don't see it as "trial and error". It's simply a methodical approach to diagnosing the root cause of an issue. Simplifying, then slowly adding complexity back in, is a perfectly valid diagnostic technique that's used in most every industry. – crmprogdev Jun 16 '16 at 14:49
  • @crmprogdev If you look back at the suggestions, they were mostly proposing solutions (batch chaining, hit & try, limit, etc.), so yes, following through on them looks like a trial-and-error process to me. Yours was not, but still that's no reason for using such a condescending tone. – MLucci Jun 16 '16 at 15:49
  • Am sorry if you took what I said as being condescending. Batch chaining, is one way of simplifying what your batch does. Adding limit clauses is another way of testing to see if something else is happening that doesn't show before the time limit is exceeded. It can be progressively increased until the problem occurs again. I see these as all being valid approaches to trying to resolve your issue. You asked for insight and users responded with options you can explore. What to do with them is up to you. It truly matters not to me as it's not my problem to solve. Best of all, the advice was free. – crmprogdev Jun 16 '16 at 16:27
  • 2
    Have seen the occasion where Salesforce's Query Locator will time out in SF's code with even the simplest filter in the start method. Only recourse was to remove all where conditions and process every record validating against criteria. – Eric Jun 16 '16 at 17:47
  • Thanks for giving a suggestion linked to the specific case and citing direct experience. Your comment gives me hope it would as the scenario you mentioned seems very related to ours. (Scott Morrison gave the same suggestion before, but I really needed to hear something that linked it to my scenario) – MLucci Jun 16 '16 at 18:21
  • @Eric tried that and... now the CPU timeout is happening 10 minutes after the SOQL has completed (not 2-3 minutes like in my previously posted debug/screenshots). 3.5M total records (table size changed as customer tried to shrink/aggregate data to overcome the issue). Opened a new support case now. – MLucci Jun 20 '16 at 14:56
  • I see there is still no reasonable answer as to why it is happening. In my projects I always process around 20k items in a batch and process the rest in the next batch. There obviously is an overhead of marking the processed items. Totally agree that the documentation says you can process 50M which I haven't ever been able to achieve. – Artur Nov 30 '16 at 22:02

4 Answers

2

Every time you ask Salesforce for data, you make a call to the Salesforce servers. The Apex CPU time limit is 10,000 milliseconds (10 seconds) per synchronous transaction, and 60,000 milliseconds for asynchronous contexts such as batch jobs. If a transaction consumes more CPU time than that, it fails with the status "Apex CPU time limit exceeded". CPU time is consumed by executing Apex code, workflow, and so on; time spent waiting on database operations does not count against it.

I feel that in your case your start method works fine, as there was no failure while fetching the data (that is why your SOQL completes at some point and the batch only fails a few minutes later). Also, you will find separate debug logs for the start method, each execute chunk, and the finish method. Suppose you have 5 chunks; then you may get 7 debug logs for your batch class (1 for start, 5 for execute, and 1 for finish). If your very first debug log shows no error, that means your start method is fine.

I feel the CPU limit is being hit somewhere in the execute method, since you say the batch fails a few minutes after the SOQL executes.

Possible causes of the failure:

  1. a heavy subquery (another query inside your execute method)
  2. a for loop nested inside another for loop
  3. very lengthy code in the execute method

Solutions:

  1. You can use a subquery, but make sure it is optimized and returns data quickly. Alternatively, make your class stateful (Database.Stateful) and, instead of writing a subquery, run the same query inside your start method and gather the data into a map for later use.
  2. Try to eliminate nested for loops, as they are the main consumers of CPU time. Split the nested loop into two separate loops: use a collection to temporarily store the results of the first loop, then feed it to the second loop to get the outcome.
  3. If your logic is lengthy, split it across multiple batches. Process the data with the first half of your logic, store the results in a collection such as a map, place the other half of your logic in another batch, and call that batch from the finish method with the temporary collection as input.
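As an illustration of point 2, a nested matching loop can usually be flattened with a Map keyed by Id, turning O(n*m) comparisons into a single pass. The object and field names below are invented for the example:

```apex
// Instead of matching children to parents with a nested loop:
//   for (Parent__c p : parents)
//     for (Child__c c : children)
//       if (c.Parent__c == p.Id) { ... }

// ...build a Map once, then look up each child's parent in constant time.
Map<Id, Parent__c> parentsById = new Map<Id, Parent__c>(parents);
for (Child__c c : children) {
    Parent__c p = parentsById.get(c.Parent__c);
    if (p != null) {
        // process the matched (parent, child) pair
    }
}
```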
Adrian Larson
karan
0

Reason for the CPU timeout (in my case): a subquery

I tried the SOQL Query Plan and ran the same query in the Developer Console and Workbench, but it did not time out there. Still, I gave my idea a shot: I removed the subquery from the start method and ran that query in the execute method instead. And voilà, it worked. So the subquery in the SOQL was causing the CPU time limit error, and the job was not even able to enter the execute method. I got lost because there were no debug logs; instead of spending time on the actual issue, I spent a lot of time figuring out why no debug log was showing up.

Lessons learned

If you are getting a CPU time limit error (in my experience I have seen many), avoid using a sub-query in the start method; it is costly and time-consuming. I need to tell my colleagues from previous projects to check their code for subqueries wherever they are getting CPU errors. And if the batch Apex error says "First error: ...", the error is in the start method, so don't even try to hunt for execute-method debug logs.

This is just a shot at your CPU timeout issue in the start method of the batch.

This could be helpful: Batch Apex – First error: Apex CPU time limit exceeded

goose
0

Having checked all the answers and comments, what I can suggest is:

  • Have you tried removing the delete from the start method of the batch, putting it in a separate batch, and using batch chaining? (as @Adrian Larson suggested in the comments)

  • In my opinion you don't have to remove the WHERE clause from the SOQL query. What you can remove are subqueries (in case you use them in the start method), as others have suggested. In my experience I've had the same error, and removing the subqueries from the start method and querying the child records in the execute method solved the issue for me.
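A sketch of that refactoring, with illustrative object and relationship names: the locator query stays flat, and the child records are queried per chunk inside execute().

```apex
// Before (subquery in start(), prone to the timeout described above):
//   SELECT Id, (SELECT Id FROM Children__r) FROM Parent__c

global Database.QueryLocator start(Database.BatchableContext BC) {
    // Keep the locator query flat: parent Ids only, no subquery.
    return Database.getQueryLocator('SELECT Id FROM Parent__c');
}

global void execute(Database.BatchableContext BC, List<Parent__c> scope) {
    // Query the child records for this chunk only.
    List<Child__c> children = [
        SELECT Id, Parent__c FROM Child__c WHERE Parent__c IN :scope
    ];
    // ... process children ...
}
```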

P.S. You might have solved the issue already, as this is a two-year-old question. It would be nice if you updated this thread with the solution to your issue or accepted the correct answer.

Thanks

Enrik Hysko
0

Had the same issue; removing a subquery from the start method solved it for us.

GeorgeA