I created a Flask server that serves a single ONNX Runtime session from multiple threads.
With 3 concurrent threads the GPU stays under 8 GB and the program runs fine. With 4 threads, GPU memory exceeds 8 GB and the program fails with: onnxruntime::CudaCall CUBLAS failure 3: CUBLAS_STATUS_ALLOC_FAILED.
I understand the problem is GPU memory exhaustion, but I would like the program to degrade gracefully rather than crash.
So I tried to limit the number of threads by setting intra_op_num_threads = 2, inter_op_num_threads = 2, and os.environ["OMP_NUM_THREADS"] = "2", but none of these helped.
I also tried 'gpu_mem_limit', which didn't work either.
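For reference, the configuration form I was attempting looked roughly like this. The values here are illustrative only (gpu_mem_limit is in bytes), and the provider_options list is the documented way to pass CUDA EP options; "model_XXX" is the placeholder path from the snippet below:

```python
import onnxruntime as rt

# Illustrative values: cap ORT's own threading at 2 and the CUDA
# memory arena at ~6 GB. Note gpu_mem_limit only bounds the arena,
# not all CUDA allocations (e.g. cuBLAS workspaces).
so = rt.SessionOptions()
so.intra_op_num_threads = 2
so.inter_op_num_threads = 2

cuda_opts = {
    "gpu_mem_limit": 6 * 1024 * 1024 * 1024,
    "arena_extend_strategy": "kSameAsRequested",
}

sess = rt.InferenceSession(
    "model_XXX",  # placeholder model path from the question
    sess_options=so,
    providers=["CUDAExecutionProvider"],
    provider_options=[cuda_opts],
)
```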
```python
import onnxruntime as rt
from flask import Flask, request

app = Flask(__name__)
sess = rt.InferenceSession(model_XXX, providers=['CUDAExecutionProvider'])

@app.route('/algorithm', methods=['POST'])
def parser():
    prediction = sess.run(...)
    return prediction

if __name__ == '__main__':
    app.run(host='127.0.0.1', port=12345, threaded=True)
```
My understanding is that the Flask HTTP server may be using a different sess for each call.
How can I make every call use the same ONNX Runtime session?
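As far as I can tell, a module-level object in Flask's threaded mode is already shared by all request threads, so the crash is more likely concurrent use of one session than multiple sessions. A quick stdlib check (FakeSession is a stand-in for the real InferenceSession) illustrates the sharing:

```python
import threading

class FakeSession:
    """Stand-in for rt.InferenceSession; only its identity matters here."""
    pass

sess = FakeSession()  # module-level, like the Flask example above
seen_ids = []
lock = threading.Lock()

def handler():
    # Each "request thread" records which session object it sees
    with lock:
        seen_ids.append(id(sess))

threads = [threading.Thread(target=handler) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(set(seen_ids)))  # → 1: all threads share the single session
```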
System information
- OS Platform and Distribution: Windows 10
- ONNX Runtime version: 1.8
- Python version: Python 3.7
- GPU model and memory: RTX 3070, 8 GB