Error: Running Training Code (tensorflow 1.1x, python 2.7) with docker in RTX3090 Ubuntu20.04

Question

I've always gotten a lot of help from you guys.

I'm asking for your help with this error.

Below is my environment for training.

[ Host Machine ]

OS: Ubuntu 20.04

GPU: RTX 3090

docker version: 20.10.7

[ Training Code ]

Python version: 2.7

Tensorflow: 1.13

I was testing this GitHub code for research: https://github.com/Google-Health/records-research/tree/master/graph-convolutional-transformer

I found that RTX 30XX GPUs need CUDA 11 or later, but the training code needs CUDA 10 to train on the GPU.

So I thought I had to use a Docker image to get a CUDA 10 environment.

[ What I've tried ]

1. Using Docker

I used the Docker images below.

tensorflow/tensorflow:1.13.2-gpu
tensorflow/tensorflow:1.15.0-gpu
nvcr.io/nvidia/tensorflow:20.01-tf1-py2

However, all three Docker images produce the same error.

Error Message

2021-12-30 08:00:16.577900: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2021-12-30 08:00:25.437865: E tensorflow/stream_executor/cuda/cuda_blas.cc:698] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "./train.py", line 70, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "./train.py", line 63, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/training.py", line 611, in run
    return self.run_local()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/training.py", line 712, in run_local
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1407, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 676, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1171, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1327, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1091, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(3232, 128), b.shape=(128, 128), m=3232, n=128, k=128
         [[node graph_convolutional_transformer/dense_2/Tensordot/MatMul (defined at /tf/graph_convolutional_transformer.py:325) ]]
         [[node add_9 (defined at /tf/graph_convolutional_transformer.py:758) ]]

Caused by op u'graph_convolutional_transformer/dense_2/Tensordot/MatMul', defined at:
  File "./train.py", line 70, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "./train.py", line 63, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/training.py", line 611, in run
    return self.run_local()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/training.py", line 712, in run_local
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1154, in _train_model_default
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1112, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/tf/graph_convolutional_transformer.py", line 792, in model_fn
    model, feature_embedder, features, training)
  File "/tf/graph_convolutional_transformer.py", line 720, in get_prediction
    embeddings, masks[:, :, None], guide, prior_guide, training)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 554, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/tf/graph_convolutional_transformer.py", line 325, in call
    v = self._layers['V'][i](features)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 554, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/layers/core.py", line 968, in call
    outputs = standard_ops.tensordot(inputs, self.kernel, [[rank - 1], [0]])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 3583, in tensordot
    ab_matmul = matmul(a_reshape, b_reshape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 2455, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 5333, in mat_mul
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(3232, 128), b.shape=(128, 128), m=3232, n=128, k=128
         [[node graph_convolutional_transformer/dense_2/Tensordot/MatMul (defined at /tf/graph_convolutional_transformer.py:325) ]]
         [[node add_9 (defined at /tf/graph_convolutional_transformer.py:758) ]]
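
The failing op is a single float32 matrix multiplication, so it should be reproducible without the model. Below is a minimal sketch of such a check. It assumes the TensorFlow 1.x graph API (tf.Session, tf.random_normal) and only reuses the shapes from the message above; the helper names gemm_dims and run_gemm_on_gpu are my own and are not part of the repository.

```python
from __future__ import print_function

# Shapes copied from the error message: a is (m, k), b is (k, n).
A_SHAPE = (3232, 128)
B_SHAPE = (128, 128)


def gemm_dims(a_shape, b_shape):
    """Return (m, n, k) for the product a * b, as printed in the error."""
    m, k = a_shape
    k_b, n = b_shape
    if k != k_b:
        raise ValueError("inner dimensions differ: %d vs %d" % (k, k_b))
    return m, n, k


def run_gemm_on_gpu(a_shape=A_SHAPE, b_shape=B_SHAPE):
    """Run one float32 matmul pinned to /gpu:0 (assumes TF 1.x)."""
    import tensorflow as tf  # lazy import so gemm_dims works without TF

    with tf.Graph().as_default():
        with tf.device("/gpu:0"):
            a = tf.random_normal(a_shape)
            b = tf.random_normal(b_shape)
            c = tf.matmul(a, b)
        with tf.Session() as sess:
            return sess.run(c).shape


if __name__ == "__main__":
    print("m=%d, n=%d, k=%d" % gemm_dims(A_SHAPE, B_SHAPE))
    try:
        print("result shape:", run_gemm_on_gpu())
    except Exception as exc:  # report the failure instead of a long traceback
        print("GEMM check failed: %s: %s" % (type(exc).__name__, exc))
```

If this small script fails with the same CUBLAS_STATUS_EXECUTION_FAILED inside the container, the problem would be in the CUDA/cuBLAS stack of the image rather than in the training code.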

nvidia-smi in docker

Thu Dec 30 08:06:33 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.86       Driver Version: 470.86       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   36C    P8    21W / 420W |     19MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
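
nvidia-smi only shows what the driver reports. To see what TensorFlow itself detects inside the container, something like the following could be used. It is a sketch that assumes TF 1.x's device_lib.list_local_devices() and the usual "compute capability: X.Y" text in physical_device_desc; parse_compute_capability and list_gpu_capabilities are hypothetical helper names.

```python
from __future__ import print_function

import re


def parse_compute_capability(device_desc):
    """Extract (major, minor) from a TF device description, or None."""
    match = re.search(r"compute capability:\s*(\d+)\.(\d+)", device_desc)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def list_gpu_capabilities():
    """Return [(device name, capability or None)] for GPUs TF can see."""
    # Lazy import so the parser above can be used without TensorFlow.
    from tensorflow.python.client import device_lib

    return [
        (dev.name, parse_compute_capability(dev.physical_device_desc))
        for dev in device_lib.list_local_devices()
        if dev.device_type == "GPU"
    ]


if __name__ == "__main__":
    try:
        for name, capability in list_gpu_capabilities():
            print(name, capability)
    except Exception as exc:  # e.g. TensorFlow missing or CUDA init failure
        print("GPU listing failed: %s: %s" % (type(exc).__name__, exc))
```

For the RTX 3090 I would expect (8, 6) here, since the card has compute capability 8.6.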

I also tried these

I searched for the "Blas GEMM" error and found these solutions:

tensorflow-gpu is not working with Blas GEMM launch failed

https://stackoverflow.com/a/65523597/17757583

tensorflow running error with cublas

However, none of these fixed the error.

So I tried the other method below.

2. Other Method

https://www.pugetsystems.com/labs/hpc/How-To-Install-TensorFlow-1-15-for-NVIDIA-RTX30-GPUs-without-docker-or-CUDA-install-2005/

I tried this method, but the resulting conda environment only provides Python 3.x, and I need Python 2.7.

Is there any other way to fix this error using Docker?

Python 3 was stable ten years ago, so please upgrade instead of trying to fix obsolete code. This may also be caused by running that old code on newer hardware. Further, it's unclear to me what you mean with "So, I thought that using docker image is essential." In any case, for a question here, you'd have to provide a [mcve]. Since you seem to be just running the code that fails, this may be better off as a bug report, so check the upstream bug tracker. Also, as a new user here, take the [tour] and read [ask]. — Ulrich Eckhardt, Dec 30 '21 at 10:15
