
Add dynamic buffers in GetDiagRecs and GetErrorFromHandle and prevent buffer overflow on errors longer than 1024 characters #881

Closed

Conversation

@juliuszsompolski (Contributor) commented Mar 7, 2021

What changes are proposed in this pull request?

When using Databricks with the Simba Spark ODBC driver, long error messages from Spark were occasionally truncated. We traced this to fixed buffer sizes in GetDiagRecs and GetErrorFromHandle when calling SQLGetDiagRecW.
We now allocate the buffer on the heap instead of the stack, and reallocate it to be bigger when needed.

https://docs.microsoft.com/en-us/sql/odbc/reference/syntax/sqlgetdiagrec-function?view=sql-server-ver15 documents the SQLGetDiagRec BufferLength parameter as:

BufferLength
[Input] Length of the *MessageText buffer in characters. There is no maximum length of the diagnostic message text.

suggesting that the message length should not be limited by SQL_MAX_MESSAGE_LENGTH when using this API.

Also, this changes GetSqlState to use SQLGetDiagField to retrieve the SQLSTATE. That function seems to be unused, though (it is called only by HasSqlState, which is not used anywhere in the codebase).

How was it tested?

We are now able to retrieve longer error messages using the Simba Spark ODBC driver when connecting to the Databricks Spark Runtime.

@juliuszsompolski (Contributor, Author)

Hi @mkleehammer,
We're using pyodbc at Databricks to connect Redash to the Databricks Runtime. We'd really love it if this patch could make it into a pyodbc release, so that we can depend on an official pyodbc build rather than a fork.

@v-chojas (Contributor) commented Mar 8, 2021

32k? That seems far too large... keep in mind that some platforms (e.g. Alpine Linux, popular with Docker containers and such) default to a 128k stack, and SQL_MAX_MESSAGE_LENGTH (512) is defined in the ODBC specification.

Also, this will cause OOM/stack overflow on platforms where short is also 32 bits...

@juliuszsompolski (Contributor, Author)

Thanks for the review @v-chojas .
I'll look at using a small (unchanged) buffer, and only reallocating it if cchMsg indicates that there is a longer message to be retrieved.

@juliuszsompolski (Contributor, Author) left a comment

@v-chojas I'll work on allocating the memory on the heap if the TextLengthPtr returned from the initial call indicates that the buffer was too small.
What memory allocator would you recommend for that?
PyMem_Malloc / PyMem_Free?

src/cursor.cpp (outdated review thread, resolved)
@v-chojas (Contributor)

There is pyodbc_malloc/pyodbc_free, used in the rest of the code.

10k is already a lot more than what the Windows and unixODBC DMs have; they impose a much smaller limit on maximum error message length: https://github.com/lurcher/unixODBC/blob/master/DriverManager/__info.c#L4193

@juliuszsompolski (Contributor, Author)

Thanks for pointing to pyodbc_malloc / pyodbc_free.

We have been getting repeated feedback from our customers that long error messages from Spark are truncated in Python applications that use pyodbc + the Simba Spark ODBC driver to connect to Spark.
It does not happen in native ODBC applications - Tableau and PowerBI return long error messages without a problem.

I've looked at https://github.com/lurcher/unixODBC/blob/master/DriverManager/SQLGetDiagRecW.c. Is that what is being called by pyodbc? It doesn't seem to be limited there, up to the point where it calls SQLGETDIAGRECW.

#define SQLGETDIAGRECW(con,typ,han,rn,st,nat,msg,bl,tlp)\
                                    (con->functions[77].funcW)\
                                        (typ,han,rn,st,nat,msg,bl,tlp)

I believe this calls directly into the underlying driver?
We already worked with Simba to make sure that truncation does not happen inside the driver.

@juliuszsompolski (Contributor, Author)

@v-chojas I changed it to allocate the message on the heap, and to reallocate a bigger buffer if needed, and tested with Spark + the Simba ODBC driver.
In an earlier commit I had the initial message on the stack, only allocating on the heap if a bigger buffer was needed, but I think that is too much complexity for probably little gain (one could argue it avoids some heap fragmentation, but I don't think this function is hot enough to cause that): 27cf95b
Even simpler would be to just allocate a bigger message on the heap from the get-go; heap size should not be as limited as stack size?
WDYT?

Comment on lines +267 to +268
msgLen = cchMsg + 1;
if (!pyodbc_realloc((BYTE**) &szMsg, (msgLen + 1) * sizeof(ODBCCHAR))) {
@juliuszsompolski (Contributor, Author)

Note: I intentionally did some defensive "double +1" padding here and in other places. A small cost to avoid a potential overflow.

@keitherskine (Collaborator)

Thanks for doing this work, Julius! It would certainly be an improvement to enable PRINT statements of any length.

It would be good to update the unit tests to cover this scenario though. As a suggestion, perhaps update the unit test function test_cursor_messages_with_print() in tests3\sqlservertests.py with something like:

        for message in ('hello world', 'A' * 50000):
            self.cursor.execute("PRINT '{}'".format(message))
            self.assertTrue(type(self.cursor.messages) is list)
            self.assertEqual(len(self.cursor.messages), 1)
            self.assertTrue(type(self.cursor.messages[0]) is tuple)
            self.assertEqual(len(self.cursor.messages[0]), 2)
            self.assertTrue(type(self.cursor.messages[0][0]) is str)
            self.assertTrue(type(self.cursor.messages[0][1]) is str)
            self.assertEqual('[01000] (0)', self.cursor.messages[0][0])
            self.assertTrue(self.cursor.messages[0][1].endswith(message))

or something similar? (By the way, best avoid using f-strings and underscores in numbers because they don't work in Python 3.5.)

For completeness, it would be good to make a similar change to the unit tests in these scripts too:
tests2\sqlservertests.py
tests2\pgtests.py
tests3\pgtests.py

@v-chojas (Contributor)

I did some quick testing of Windows' DM and unixODBC, and once you get beyond 512 (SQL_MAX_MESSAGE_LENGTH), the behaviour varies subtly depending on whether the driver and application are Unicode/ANSI, and it also differs between Windows and unixODBC. Depending on the exact API and state upon which the error occurs, unixODBC may extract the diag recs from the driver immediately after the call, without the application calling the diag functions first, in which case the 512 limit applies. Otherwise it is a pass-through, and it doesn't. The Windows DM does something similar.

In other words, enlarging the buffers in the application doesn't always work, and a driver should be splitting a long message into multiple diag recs of less than 512 bytes each.

@juliuszsompolski (Contributor, Author)

@v-chojas Thanks for checking various code paths in Windows DM and unixODBC.

In other words, enlarging the buffers in the application doesn't always work, and a driver should be splitting a long message into multiple diag recs of less than 512 bytes each.

Would that cause this change to break, or would it just mean that in the subset of cases where the diag recs are retrieved elsewhere, the message is truncated to 512 bytes anyway?
It would already be helpful for us if pyodbc did not truncate error messages in those cases where this code is reached.

In other words, enlarging the buffers in the application doesn't always work, and a driver should be splitting a long message into multiple diag recs of less than 512 bytes each.

Unfortunately, retrieving multiple diag recs seems to be unconditionally disabled in pyodbc on unixODBC altogether: https://github.com/mkleehammer/pyodbc/pull/881/files#diff-a776f769b92a9224299e940b7d4b8cb4de6ca70371732c05f7e5ea223be9aafcR321
Gluing multiple diag recs together would currently also pick up the extra formatting pyodbc adds to each record, which would be undesirable.

@juliuszsompolski (Contributor, Author)

@keitherskine Thanks. I added tests as suggested, but used 30000 instead of 50000, because I think the length may ultimately be limited by the SQLGetDiagRecW length parameters being shorts...
Unfortunately, I don't have Postgres or SQL Server environments handy to verify these tests.

@keitherskine (Collaborator)

@juliuszsompolski From the CI test results, it's interesting that Postgres appears to have returned 30 messages rather than 1. It looks like it might have returned the 30K test string in chunks of 1,000 or so bytes each. SQL Server returned only 1 message for the 30K test string, but it doesn't appear to be the right value - maybe it was truncated?

@juliuszsompolski (Contributor, Author)

Quite possibly Postgres is doing what @v-chojas suggested - splitting long messages into multiple records - while SQL Server (or its driver) might be truncating... The Simba Spark ODBC driver was also truncating, but they fixed it for us in their upcoming 2.6.17 release. I will take a look when I find a moment.

@v-chojas (Contributor)

Would that cause this change to break, or would that cause that in some subset of cases where the diag recs are retrieved elsewhere, the message would be truncated to 512 bytes anyway?

The latter. Look for "DEFER_" in the unixODBC source to see the cases where that occurs.

@keitherskine (Collaborator) left a comment

Looks good, @juliuszsompolski . I've updated the unit tests so they pass both AppVeyor and Travis CI.

Outdated review threads (resolved): tests2/pgtests.py, tests2/sqlservertests.py, tests3/sqlservertests.py, tests3/pgtests.py
@juliuszsompolski (Contributor, Author)

Thanks @keitherskine for the help tweaking the tests; I very much appreciate the snippets. I do not have a SQL Server environment to debug in myself; I have been testing with the Simba Spark ODBC driver, as that is the case I'm trying to solve.

@keitherskine (Collaborator)

No problem at all @juliuszsompolski , I'm glad they helped. Your updates make the ".messages" functionality much better (and more memory efficient) so many thanks indeed!

@juliuszsompolski (Contributor, Author)

Thanks. Happy to help, and I'd be happy if this could get merged in, so that we can go back to pointing our dependencies at an official pyodbc release, instead of a forked commit.

@keitherskine (Collaborator)

Yes, it will be good to get this into the master branch and then into a release. Only @mkleehammer can do that though.

@juliuszsompolski (Contributor, Author)

@mkleehammer @keitherskine
The current pyodbc release and master actually crash with SIGSEGV in GetErrorFromHandle when an error longer than 1024 characters is returned from SQLGetDiagRecW.

This is because

ret = SQLGetDiagRecW(nHandleType, h, iRecord, (SQLWCHAR*)sqlstateT, &nNativeError, (SQLWCHAR*)szMsg, (short)(_countof(szMsg)-1), &cchMsg);

fills cchMsg with the actual size of the message that is available, which may be greater than the size of the buffer.
Meanwhile, a few lines below that

Object msgStr(PyUnicode_Decode((char*)szMsg, cchMsg * sizeof(ODBCCHAR), unicode_enc, "strict"));

uses cchMsg * sizeof(ODBCCHAR) as the size of the buffer to decode, causing a buffer overflow.

I did not notice this before because I was using my own build, but I hit the crash when accidentally running in an environment with vanilla pyodbc installed.

This PR fixes that. A simpler fix that only prevents the crash would be to use min(_countof(szMsg)-1, cchMsg) in the PyUnicode_Decode call.

@juliuszsompolski changed the title from "Increase error buffer sizes in GetDiagRecs and GetErrorFromHandle" to "Add dynamic buffers in GetDiagRecs and GetErrorFromHandle and prevent buffer overflow on errors longer than 1024 characters" on May 18, 2021
Commit note: pyodbc_malloc is just a define for malloc most of the time; it does not set an exception on failure.
@mkleehammer (Owner)

I'm not sure what happened, but it wouldn't rebase cleanly, so I did it manually. I put the author names into the commit message as co-authors, which the GitHub docs say should show up. Let me know if there is a better way to do that in the future.

juliuszsompolski added a commit to databricks/pyodbc that referenced this pull request Jul 15, 2021
The function create_name_map used a static buffer `szName[300]` for the column name in its call to `SQLDescribeColW`. That function truncates the column name to the size of the provided buffer, but returns the actual length of the column name in the `&cchName` output parameter.
That `cchName` output was used to compute the `cbName` parameter for the call to `TextBufferToObject(enc, szName, cbName)`. When the actual column name was longer than the buffer, `cbName` was too big, causing a buffer overflow and resulting in crashes, invalid data, or errors.

Fix this by making `szName` a dynamic buffer. This buffer overflow is very similar to the one fixed in mkleehammer#881. I checked through the rest of the codebase, and I don't see other places with static buffers that could have a similar issue.

Upstream PR: mkleehammer#931
@sthen commented Nov 27, 2021

Several seemingly unrelated changes were included in the commit for this, and some don't seem 100% right. For example, in pyodbcmodule.cpp, qmark was changed to paramstyle without changing the text below that still talks about qmark; and in setup.py, pyodbc.pyi is installed but ends up (at least on my system) in /usr/local/pyodbc.pyi.

@juliuszsompolski (Contributor, Author)

@sthen I am not sure what happened during the merge of this PR, but those changes were not part of my branch. Did something happen when @mkleehammer force-pushed the master branch from 7b8a44f to 7c7b1b1 six months ago?
These seem to be the changes from #864, but they somehow show up in the diff of this PR.

@keitherskine (Collaborator)

@sthen I agree it looks like the #864 changes were rolled into this PR somehow, which I can't explain, but to answer your other questions:

PEP 249 specifies a series of globals that should be part of any module that implements PEP 249, including paramstyle:
https://www.python.org/dev/peps/pep-0249/#paramstyle
For pyodbc, the value of this paramstyle global is "qmark". Previously, the global itself was documented under the name "qmark", which was not correct. The global's name and value are documented correctly now, I believe.

I appreciate the location of pyodbc.pyi on deployments is not ideal. As explained here, deploying random files is not straightforward with Python. If you know of a better way of doing this, I'm open to suggestions.

@JCZuurmond

👋 Hi y'all,

I am running into the same issue: truncated error messages when using pyodbc with Spark (Databricks). The PR is a bit confusing to me: it is closed, but from the comments I understand the changes were merged. I take it error messages should no longer be truncated. Is that correct?

On GitHub the latest release is from the 8th of February 2020, which is before this PR was merged. When will a new release be made? Or am I missing something, given the last release was two years ago?

@juliuszsompolski (Contributor, Author) commented Jun 13, 2022

@JCZuurmond this is merged, and should be available in the latest release on PyPI from Aug 2021: https://pypi.org/project/pyodbc/4.0.32/

After how many characters was the message truncated?
In the Linux (and Linux only) version of the Spark ODBC driver there was another bug around truncating long error messages that was fixed last year; make sure to use the newest driver from https://databricks.com/spark/odbc-drivers-download.
Even then, the Linux Spark ODBC driver still defaults to truncating error messages after 512 characters. That limitation exists because of another bug, in some old versions of unixODBC, that caused buffer overflows for longer error messages... I don't remember when that was fixed, but it was long enough ago that you shouldn't have a version that hits it.
On Linux, you can increase this limit in the Spark ODBC driver by setting ErrMsgMaxLen=16384 (or another length, but 16k should be enough) in the ODBC config.

@juliuszsompolski (Contributor, Author)

@JCZuurmond you could also try https://pypi.org/project/databricks-sql-connector, which we developed to connect to Databricks Spark directly without going through ODBC, using the same PEP 249 interface. See https://docs.databricks.com/dev-tools/python-sql-connector.html
