We have only two templates leveraging the cache tag and they both have a construct similar to:
{% extends "_layout" %}
{% block content %}
{% cache globally using key entry.url ~ "relatedArticles" for 1 hour %}
... bunch of relational queries ...
{% endcache %}
...
{% endblock %}
Once or twice a day, the DeleteStaleTemplateCaches task gets stuck and requires manual "killing". When this occurs, the craft_tasks table shows, say, currentStep 1309 of totalSteps 12800... Marking the task as errored and choosing "Retry task" generates a new job, but that one reports only 670 totalSteps... and of course it succeeds. Why is the total step count so different on retry?
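For anyone hitting the same thing, the stuck row can be inspected directly in the database. A sketch, assuming the default "craft_" table prefix and the column names visible in the craft_tasks table mentioned above:

```sql
-- Inspect recent tasks and their progress; a stuck task will sit
-- at the same currentStep while dateUpdated stops advancing.
SELECT id, type, status, currentStep, totalSteps, dateUpdated
FROM craft_tasks
ORDER BY dateCreated DESC
LIMIT 10;
```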
There is no clue in the logs: only entries for the retried task, showing its steps and its successful result. There is nothing about the stuck task that I killed.
Currently, we have 3 nodes sharing a common file system (except for most of the "runtime" folder) and a common database, but each node runs its own memcached server (one per node). Craft is configured to use memcached. Could the issue be that more than one node starts the delete task at the same time, i.e. all of them trying to clean up the same database table?
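If concurrent runs turn out to be the problem, the usual remedy is a named lock in the one store all nodes share (the database), since the per-node memcached instances cannot see each other. A minimal Python sketch of the idea, with a plain dict standing in for the shared database (Craft does not expose this API; the names here are hypothetical):

```python
# Stand-in for a lock table (or MySQL GET_LOCK) in the shared database.
shared_store = {}

def try_acquire(lock_name, node_id):
    """Atomically claim the named lock; returns True for exactly one caller."""
    if lock_name in shared_store:
        return False
    shared_store[lock_name] = node_id
    return True

def release(lock_name, node_id):
    """Release the lock, but only if this node is the one holding it."""
    if shared_store.get(lock_name) == node_id:
        del shared_store[lock_name]

# Only the first node to ask gets to run the stale-cache cleanup;
# the others skip it instead of hammering the same table.
winner = [n for n in ("node1", "node2", "node3")
          if try_acquire("DeleteStaleTemplateCaches", n)]
print(winner)  # ['node1']
```

In a real multi-node setup the dict would be replaced by something with the same atomic claim-or-fail semantics, e.g. a unique-keyed INSERT or MySQL's GET_LOCK() against the shared database.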
Besides this specific problem, I would be very grateful if you could explain how the different caches (data, templates, etc.) work within Craft, or point me to where I can find that information. I'm not clear on what is cached in memcached vs. the database vs. the file system (if anything).
Thanks.
After more investigation, I ended up removing request_terminate_timeout from PHP-FPM and leaving it at its default (off). So for the gateway timeout, only PHP's max_execution_time = 300 and the fastcgi_read_timeout 300 are in place. I eventually got a "failed" job instead of a "stuck" one, and got some logs! To fix that failure I increased MySQL's max_user_connections (which was at 50).
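For reference, a sketch of the settings involved after the change. File locations and exact syntax vary by install; the only value the post actually states is the old max_user_connections of 50, so the new value below is hypothetical:

```
# php.ini -- hard PHP execution cap
max_execution_time = 300

# nginx fastcgi config (fastcgi_read_timeout is an nginx directive)
fastcgi_read_timeout 300;

# PHP-FPM pool config: request_terminate_timeout removed,
# i.e. left at its default of 0 (off)

# MySQL (my.cnf, [mysqld] section) -- was 50; the post does not
# say what it was raised to, 100 is a placeholder
max_user_connections = 100
```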
I will report back if I get another stuck job. Thanks!
– Mathieu P. Feb 27 '15 at 18:11