I created a Flask server that serves a single ONNX Runtime session from multiple threads.
With 3 concurrent threads the GPU stays under 8 GB and the program runs fine. With 4 threads, GPU memory exceeds 8 GB and the program fails with: onnxruntime::CudaCall CUBLAS failure 3: CUBLAS_STATUS_ALLOC_FAILED.
I understand the problem is GPU memory exhaustion, but I would like the program to degrade gracefully rather than crash.
So I tried to limit the number of threads by setting intra_op_num_threads = 2, inter_op_num_threads = 2, and os.environ["OMP_NUM_THREADS"] = "2", but none of these helped.
I also tried 'gpu_mem_limit', which didn't work either.
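For reference, the configuration form I was attempting looked roughly like this. The values here are illustrative only (gpu_mem_limit is in bytes), and the provider_options list is the documented way to pass CUDA EP options; "model_XXX" is the placeholder path from the snippet below:

```python
import onnxruntime as rt

# Illustrative values: cap ORT's own threading at 2 and the CUDA
# memory arena at ~6 GB. Note gpu_mem_limit only bounds the arena,
# not all CUDA allocations (e.g. cuBLAS workspaces).
so = rt.SessionOptions()
so.intra_op_num_threads = 2
so.inter_op_num_threads = 2

cuda_opts = {
    "gpu_mem_limit": 6 * 1024 * 1024 * 1024,
    "arena_extend_strategy": "kSameAsRequested",
}

sess = rt.InferenceSession(
    "model_XXX",  # placeholder model path from the question
    sess_options=so,
    providers=["CUDAExecutionProvider"],
    provider_options=[cuda_opts],
)
```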
```python
import onnxruntime as rt
from flask import Flask, request

app = Flask(__name__)
sess = rt.InferenceSession(model_XXX, providers=['CUDAExecutionProvider'])

@app.route('/algorithm', methods=['POST'])
def parser():
    prediction = sess.run(...)
    return prediction

if __name__ == '__main__':
    app.run(host='127.0.0.1', port=12345, threaded=True)
```
My understanding is that the Flask HTTP server may be using a different sess for each call.
How can I make every call use the same ONNX Runtime session?
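As far as I can tell, a module-level object in Flask's threaded mode is already shared by all request threads, so the crash is more likely concurrent use of one session than multiple sessions. A quick stdlib check (FakeSession is a stand-in for the real InferenceSession) illustrates the sharing:

```python
import threading

class FakeSession:
    """Stand-in for rt.InferenceSession; only its identity matters here."""
    pass

sess = FakeSession()  # module-level, like the Flask example above
seen_ids = []
lock = threading.Lock()

def handler():
    # Each "request thread" records which session object it sees
    with lock:
        seen_ids.append(id(sess))

threads = [threading.Thread(target=handler) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(set(seen_ids)))  # → 1: all threads share the single session
```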
System information
- OS Platform and Distribution: Windows 10
- ONNX Runtime version: 1.8
- Python version: Python 3.7
- GPU model and memory: RTX 3070, 8 GB