I've always gotten a lot of help from this community, and I'm asking for your help with the error below.
My training environment is as follows.
[ Host Machine ]
OS: Ubuntu 20.04
GPU: RTX 3090
docker version: 20.10.7
[ Training Code ]
Python version: 2.7
TensorFlow: 1.13
I am testing this GitHub code for research: https://github.com/Google-Health/records-research/tree/master/graph-convolutional-transformer
I found that RTX 30XX GPUs need CUDA 11 or later, but the training code needs CUDA 10 to use the GPU during training.
So I assumed that running the code in a Docker image was the only option.
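The version mismatch described above can be sketched as a small check. This is only an illustration: the table of minimum CUDA versions is an assumption (the RTX 30XX entry follows the "CUDA 11 or later" statement in this post, the other rows are recalled from NVIDIA's release notes), and `cuda_can_target` is a hypothetical helper name, not part of any library.

```python
# Oldest CUDA toolkit version assumed to be usable on each GPU compute
# capability. ASSUMPTION: illustrative values, verify against NVIDIA's docs.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing
    (8, 6): (11, 0),  # Ampere RTX 30XX, "CUDA 11 or later" as stated above
}


def cuda_can_target(cuda_version, compute_capability):
    """Return True if the toolkit version is new enough for the GPU.

    Hypothetical helper: both arguments are (major, minor) tuples.
    """
    return cuda_version >= MIN_CUDA_FOR_CAPABILITY[compute_capability]


# CUDA 10.0 is what TensorFlow 1.13 loads (libcublas.so.10.0 in the log).
print(cuda_can_target((10, 0), (8, 6)))  # CUDA 10.0 on an RTX 3090
print(cuda_can_target((11, 4), (8, 6)))  # CUDA 11.4 on an RTX 3090
```

Under these assumptions the first call reports that CUDA 10.0 is too old for the card, which is the conflict I am trying to work around.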
[ What I've tried ]
1. Using Docker
I used the Docker images below.
tensorflow/tensorflow:1.13.2-gpu
tensorflow/tensorflow:1.15.0-gpu
nvcr.io/nvidia/tensorflow:20.01-tf1-py2
However, all three Docker images produce the same error.
Error Message
2021-12-30 08:00:16.577900: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2021-12-30 08:00:25.437865: E tensorflow/stream_executor/cuda/cuda_blas.cc:698] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
File "./train.py", line 70, in <module>
tf.app.run(main)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "./train.py", line 63, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/training.py", line 471, in train_and_evaluate
return executor.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/training.py", line 611, in run
return self.run_local()
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/training.py", line 712, in run_local
saving_listeners=saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model_default
saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1407, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 676, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1171, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1327, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1091, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(3232, 128), b.shape=(128, 128), m=3232, n=128, k=128
[[node graph_convolutional_transformer/dense_2/Tensordot/MatMul (defined at /tf/graph_convolutional_transformer.py:325) ]]
[[node add_9 (defined at /tf/graph_convolutional_transformer.py:758) ]]
Caused by op u'graph_convolutional_transformer/dense_2/Tensordot/MatMul', defined at:
File "./train.py", line 70, in <module>
tf.app.run(main)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "./train.py", line 63, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/training.py", line 471, in train_and_evaluate
return executor.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/training.py", line 611, in run
return self.run_local()
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/training.py", line 712, in run_local
saving_listeners=saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1154, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1112, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/tf/graph_convolutional_transformer.py", line 792, in model_fn
model, feature_embedder, features, training)
File "/tf/graph_convolutional_transformer.py", line 720, in get_prediction
embeddings, masks[:, :, None], guide, prior_guide, training)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 554, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "/tf/graph_convolutional_transformer.py", line 325, in call
v = self._layers['V'][i](features)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 554, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/layers/core.py", line 968, in call
outputs = standard_ops.tensordot(inputs, self.kernel, [[rank - 1], [0]])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 3583, in tensordot
ab_matmul = matmul(a_reshape, b_reshape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 2455, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 5333, in mat_mul
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(3232, 128), b.shape=(128, 128), m=3232, n=128, k=128
[[node graph_convolutional_transformer/dense_2/Tensordot/MatMul (defined at /tf/graph_convolutional_transformer.py:325) ]]
[[node add_9 (defined at /tf/graph_convolutional_transformer.py:758) ]]
nvidia-smi output inside the Docker container
Thu Dec 30 08:06:33 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.86 Driver Version: 470.86 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| 0% 36C P8 21W / 420W | 19MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
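The two outputs above report different CUDA numbers: the log shows the container loading libcublas 10.0, while nvidia-smi shows 11.4, which is the highest CUDA version the driver supports and not the toolkit installed in the image. A minimal sketch that pulls both numbers out of the text, assuming only the line formats shown here; the function names and regular expressions are hypothetical and have not been checked against other driver or TensorFlow versions.

```python
import re


def toolkit_cublas_version(log_line):
    """Extract (major, minor) from a 'libcublas.so.X.Y' mention, or None."""
    match = re.search(r"libcublas\.so\.(\d+)\.(\d+)", log_line)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)))


def driver_max_cuda_version(smi_header):
    """Extract (major, minor) from the 'CUDA Version: X.Y' field, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_header)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)))


# Lines copied from the outputs above.
log_line = "successfully opened CUDA library libcublas.so.10.0 locally"
smi_header = "| NVIDIA-SMI 470.86 Driver Version: 470.86 CUDA Version: 11.4 |"

print(toolkit_cublas_version(log_line))     # toolkit library used by TensorFlow
print(driver_max_cuda_version(smi_header))  # maximum version the driver supports
```

So the driver side looks fine for CUDA 11, but the libraries inside the image are still CUDA 10.0.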
I also tried these
I searched for the error "Blas GEMM launch failed" and found these suggested solutions:
tensorflow-gpu is not working with Blas GEMM launch failed
https://stackoverflow.com/a/65523597/17757583
tensorflow running error with cublas
However, none of these fixed the error.
So I tried another method, described below.
2. Other Method
I tried this method, but once the setup is done, the conda env only provides Python 3.x. (I need to use Python 2.7.)
Is there any other way to fix this error using Docker?