
I have a Python process that uses SQLAlchemy to insert some data into a MS SQL Server DB. When the Python process runs, it hangs during the insert. I turned on SQLAlchemy logging to get more information and found that it hangs at the point where SQLAlchemy seems to be requesting table schema info about the entire DB:

2020-10-30 08:12:07 [11444:6368] sqlalchemy.engine.base.Engine._execute_context(base.py:1235) INFO: SELECT [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME] 
FROM [INFORMATION_SCHEMA].[TABLES] 
WHERE [INFORMATION_SCHEMA].[TABLES].[TABLE_SCHEMA] = CAST(? AS NVARCHAR(max)) AND [INFORMATION_SCHEMA].[TABLES].[TABLE_TYPE] = CAST(? AS NVARCHAR(max)) ORDER BY [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME]
2020-10-30 08:12:07 [11444:6368] sqlalchemy.engine.base.Engine._execute_context(base.py:1240) INFO: ('dbo', 'BASE TABLE')
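
(For reference, log output along these lines can be produced with the standard logging module pointed at SQLAlchemy's `sqlalchemy.engine` logger; the format string below is illustrative, not the exact one used above.)

    import logging

    # SQL statements and parameters are logged at INFO; DEBUG additionally
    # shows result columns and rows, as in the excerpts here.
    logging.basicConfig(
        format="%(asctime)s [%(process)d] %(name)s %(levelname)s: %(message)s"
    )
    logging.getLogger("sqlalchemy.engine").setLevel(logging.DEBUG)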

I have other "stuff" going on in the DB at this time, including some open transactions, and my guess is that, for whatever reason, querying [INFORMATION_SCHEMA].[TABLES] creates a deadlock or blocks somehow.

I've also read (here) that [INFORMATION_SCHEMA].[TABLES] is a view that cannot cause a deadlock, which would contradict my guess about what is causing this issue.

My question is: Can I alter the configuration/settings of SQLAlchemy so that it does not make this query in the first place?

UPDATE 1: The Python code for the insert is like this:

    import sqlalchemy

    # params is a URL-encoded ODBC connection string
    with sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params).connect() as connection:
        # df is a Pandas DataFrame
        df.to_sql(name=my_table, con=connection, if_exists='append', index=False)

Note that the code works without any problems when I run the Python script at other times of day, when I don't have those other DB transactions going on. In those cases, the log continues immediately like this, listing all the tables in the DB:

2020-10-30 08:13:03 [11444:6368] sqlalchemy.engine.base.Engine._execute_context(base.py:1235) INFO: SELECT [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME] 
FROM [INFORMATION_SCHEMA].[TABLES] 
WHERE [INFORMATION_SCHEMA].[TABLES].[TABLE_SCHEMA] = CAST(? AS NVARCHAR(max)) AND [INFORMATION_SCHEMA].[TABLES].[TABLE_TYPE] = CAST(? AS NVARCHAR(max)) ORDER BY [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME]
2020-10-30 08:13:03 [11444:6368] sqlalchemy.engine.base.Engine._execute_context(base.py:1240) INFO: ('dbo', 'BASE TABLE')
2020-10-30 08:13:03 [11444:6368] sqlalchemy.engine.base.Engine._init_metadata(result.py:810) DEBUG: Col ('TABLE_NAME',)
2020-10-30 08:13:03 [11444:6368] sqlalchemy.engine.base.Engine.process_rows(result.py:1260) DEBUG: Row ('t_table1',)
2020-10-30 08:13:03 [11444:6368] sqlalchemy.engine.base.Engine.process_rows(result.py:1260) DEBUG: Row ('t_table2',)
...

UPDATE 2: Apparently, when a table or other object is created in an open transaction and not yet committed, querying [INFORMATION_SCHEMA].[TABLES] will block (source). Is anyone familiar enough with the internals of SQLAlchemy to suggest how to prevent it from making this query in the first place?
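
The blocking can be reproduced outside of SQLAlchemy with plain pyodbc. A rough sketch (the connection string and table name here are placeholders, not my actual setup):

    import pyodbc

    conn_str = (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=...;DATABASE=...;Trusted_Connection=yes"
    )

    # Session 1: create a table inside a transaction and leave it open.
    blocker = pyodbc.connect(conn_str, autocommit=False)
    blocker.cursor().execute("CREATE TABLE dbo.t_uncommitted_demo (id INT)")
    # (no commit or rollback yet)

    # Session 2: essentially the query from the log above; it blocks until
    # session 1 commits or rolls back.
    reader = pyodbc.connect(conn_str, autocommit=True)
    rows = reader.cursor().execute(
        "SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES "
        "WHERE TABLE_SCHEMA = 'dbo' AND TABLE_TYPE = 'BASE TABLE'"
    ).fetchall()  # hangs here while the transaction above is open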

UPDATE 3: After posting this issue on the SQLAlchemy GitHub (issue link), the SQLAlchemy devs confirmed that the query of [INFORMATION_SCHEMA].[TABLES] is in fact being caused by the Pandas function to_sql().

So, my new question is: does anyone know how to disable this behavior in the Pandas to_sql() function? I looked over the documentation and could not find anything that seemed to help.

Joe
  • Possibly related: https://dba.stackexchange.com/q/164429/21390 – Gord Thompson Oct 30 '20 at 17:53
  • Related: https://stackoverflow.com/questions/59636271/sql-server-queries-beyond-a-certain-character-length-are-timing-out – Nick ODell Oct 31 '20 at 15:36
  • Discussed on GitHub [here](https://github.com/sqlalchemy/sqlalchemy/issues/5679). – Gord Thompson Oct 31 '20 at 17:43
  • @GordThompson the SQLAlchemy dev mentions using either `ALTER DATABASE MyDatabase SET ALLOW_SNAPSHOT_ISOLATION ON` or `ALTER DATABASE MyDatabase SET READ_COMMITTED_SNAPSHOT ON` to not lock as aggressively by turning on snapshot isolation. I'm concerned about making these changes on the database level. Do you know of a setting in SQL server to make queries on `[INFORMATION_SCHEMA].[TABLES]` not block when an open transaction exists that has created a new table in that database? – Joe Oct 31 '20 at 17:52
  • You could try `create_engine(connection_uri, isolation_level="READ_UNCOMMITTED")` and see if that prevents the hang. – Gord Thompson Oct 31 '20 at 18:47
  • Unfortunately that did not prevent the query from hanging. – Joe Oct 31 '20 at 20:32
  • Hmm, it works for me. (Hangs without `isolation_level` and does not hang when I add it.) Perhaps try `engine = create_engine(connection_uri)`, omit the `with` block, and call `to_sql` with `engine` as the second parameter. That works for me, too. – Gord Thompson Oct 31 '20 at 21:01
  • That is strange; it doesn't work for me. I also just tried `engine = ...` as you suggested and it did not work either. The code hangs on `table_names = [r[0] for r in connection.execute(s)]` in `get_table_names` of `sqlalchemy.dialects.mssql.base.py`. Just to be clear: when you test this, you have an open transaction on the same database in which you have created a new table? That's what I have, and as soon as I roll back or commit that transaction in SQL the Python code continues running again. – Joe Oct 31 '20 at 22:45
  • @Joe - *"Just to be clear: when you test this, you have an open transaction on the same database in which you have created a new table?"* - Yes, exactly. – Gord Thompson Nov 01 '20 at 00:38
  • Darn. Still doesn't work for me. Perhaps there are some database-related settings that are causing the difference. I wonder if there is some way I can confirm the `isolation_level="READ_UNCOMMITTED"` option is being used in general. – Joe Nov 01 '20 at 17:11
  • `SELECT transaction_isolation_level FROM sys.dm_exec_sessions WHERE session_id = @@SPID;` returns the numeric value. `1` means READ_UNCOMMITTED. – Gord Thompson Nov 02 '20 at 14:05
  • Anyone ever figure out a fix for this? We've been stuck for a couple of days because of this. – jseals Apr 29 '21 at 19:27
  • I have not...just ended up monkey patching Pandas for the time being. – Joe May 21 '21 at 17:39
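
Summarizing the workaround attempted in the comments above as code (a sketch only; `params` is the ODBC connection string from UPDATE 1, and per the thread this prevented the hang in one environment but not in another):

    import sqlalchemy

    engine = sqlalchemy.create_engine(
        "mssql+pyodbc:///?odbc_connect=%s" % params,
        isolation_level="READ_UNCOMMITTED",
    )

    with engine.connect() as connection:
        # Confirm the session's effective isolation level, per the query
        # suggested in the comments: 1 means READ UNCOMMITTED.
        level = connection.execute(sqlalchemy.text(
            "SELECT transaction_isolation_level "
            "FROM sys.dm_exec_sessions WHERE session_id = @@SPID"
        )).scalar()
        print(level)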

1 Answer


I'm not very familiar with SQLAlchemy, but I can tell you about the Pandas side of this issue.

Pandas automatically creates a new table if the table doesn't exist. The way it figures out whether the table exists is by calling has_table() in SQLAlchemy, and has_table() works by querying the information schema. (At least, it works that way in MySQL and MSSQL.)

Implementation details

Here's what I found tracing the logic for this in Pandas and SQLAlchemy. We start in pandas/io/sql.py, inside to_sql():

        table = SQLTable(
            name,
            self,
            frame=frame,
            index=index,
            if_exists=if_exists,
            index_label=index_label,
            schema=schema,
            dtype=dtype,
        )
        table.create()

SQLTable.create() is defined here:

class SQLTable(PandasObject):
    [...]
    def create(self):
        if self.exists():
            if self.if_exists == "fail":
                raise ValueError(f"Table '{self.name}' already exists.")
            elif self.if_exists == "replace":
                self.pd_sql.drop_table(self.name, self.schema)
                self._execute_create()
            elif self.if_exists == "append":
                pass
            else:
                raise ValueError(f"'{self.if_exists}' is not valid for if_exists")
        else:
            self._execute_create()

Notice that it calls exists() unconditionally. Inside SQLTable.exists(), you'll find this:

    def exists(self):
        return self.pd_sql.has_table(self.name, self.schema)

This eventually calls has_table() in SQLAlchemy: https://docs.sqlalchemy.org/en/13/core/internals.html#sqlalchemy.engine.default.DefaultDialect.has_table

For MSSQL, this is implemented in sqlalchemy/dialects/mssql/base.py in SQLAlchemy:

    @_db_plus_owner
    def has_table(self, connection, tablename, dbname, owner, schema):
        if tablename.startswith("#"):  # temporary table
            [...]
        else:
            tables = ischema.tables

            s = sql.select(tables.c.table_name).where(
                sql.and_(
                    tables.c.table_type == "BASE TABLE",
                    tables.c.table_name == tablename,
                )
            )

            if owner:
                s = s.where(tables.c.table_schema == owner)

            c = connection.execute(s)

            return c.first() is not None

(ischema is an abbreviation of information_schema, and this code is running a SELECT against that view.)

How to fix this

I don't see a good, simple way to fix this. Pandas assumes that has_table() is a cheap operation, but MSSQL doesn't satisfy that assumption. No matter what if_exists is set to, Pandas will call has_table() during to_sql().

I can think of a hacky workaround, though. If you were to monkey-patch pandas.io.sql.SQLTable.create() so that it's a no-op, you could trick Pandas into thinking the table already exists. The downside is that Pandas will no longer create tables automatically.
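
For illustration, here is a minimal sketch of that monkey-patch. SQLTable is a private pandas internal (pandas/io/sql.py, quoted above), so this is version-dependent; treat it as a sketch, not a supported API.

    import pandas.io.sql

    def _create_noop(self):
        # Skip the exists()/has_table() check entirely. This assumes the
        # target table already exists; to_sql() then goes straight to
        # inserting the rows.
        pass

    pandas.io.sql.SQLTable.create = _create_noop

Note that, per the comment below, a similar INFORMATION_SCHEMA query can also be issued by the case-sensitivity check that runs after the insert, so this patch alone may not eliminate every such query.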

Nick ODell
  • I think you are basically on the right track, but the issue occurs after the insert. When Pandas to_sql() checks whether the table might have case-sensitivity issues (look for `# check for potentially case sensitivity issues (GH7815)` in to_sql), it ends up calling `get_table_names` in SQLAlchemy, which is ultimately what causes the block. – Joe Oct 31 '20 at 17:16