Hi @markjamie, in order to avoid copying the data twice, connectorx pre-allocates the pandas dataframe through a count query before fetching the data. Therefore, if the data is updated between the count query and the real fetch query in a way that changes the number of rows, there will be an issue. For example, if there are 100 rows in the result at first (i.e., connectorx allocates a dataframe with 100 rows), and later, when it issues the query to fetch the data, 101 rows are returned, it will corrupt memory by writing the final row into the wrong place.

One way to avoid this kind of error is to alter the query to make sure the query result won't change. For example, if you are only inserting data and you have a timestamp column for the insertion, you can add a filter that fetches only the rows inserted before a fixed timestamp.

Another workaround is to avoid the count query altogether by going through arrow (unlike pandas, arrow does not require one contiguous memory block). Note that this may cause some performance degradation due to the conversion from arrow to pandas. To use this workaround, set the destination to arrow instead of pandas, and then convert the arrow table to a pandas dataframe. You need to install pyarrow for the conversion.
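For illustration, here is a minimal sketch of both workarounds. The connection string, the events table, and its inserted_at column are hypothetical placeholders; the return_type="arrow" option and the pyarrow .to_pandas() conversion are what connectorx actually provides.

```python
import connectorx as cx
from datetime import datetime, timezone

# Hypothetical MySQL connection string and table/column names.
conn = "mysql://user:password@localhost:3306/mydb"

# Workaround 1: pin the result set to a fixed timestamp so the row count
# cannot change between connectorx's count query and the actual fetch.
cutoff = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
df = cx.read_sql(conn, f"SELECT * FROM events WHERE inserted_at < '{cutoff}'")

# Workaround 2: skip the pre-allocation entirely by fetching into arrow
# (no contiguous buffer needed), then convert to pandas via pyarrow.
table = cx.read_sql(conn, "SELECT * FROM events", return_type="arrow")
df = table.to_pandas()
```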
I've recently moved away from pandas pd.read_sql() to connectorx cx.read_sql() because it is much faster. The reader was being used as part of a Python Dash dashboard. The switch was easy and all worked well... for a while.
What I am finding is that when my Dash dashboard reads from a MySQL database which is not being written to at the same time, connectorx works fine and is super fast. If, however, I start writing to the MySQL database using Python at the same time as my Dash dashboard is reading from it, my code fails with malloc(): corrupted top size. I think this is in some way related to connectorx, but I cannot be entirely sure. The reason I say this is that if I switch back to using pd.read_sql(), all works fine.
I realise that this malloc error isn't hugely helpful, but if anyone can provide some help I'd be hugely grateful. I really want (need) to use connectorx because it is so fast compared to all the other solutions.
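For reference, the switch in question looks roughly like the sketch below. The query, table, and connection details are hypothetical, and the pandas path assumes a SQLAlchemy engine.

```python
import connectorx as cx
import pandas as pd
from sqlalchemy import create_engine

query = "SELECT * FROM dashboard_data"  # hypothetical query

# Old path: pandas reading through a SQLAlchemy engine.
engine = create_engine("mysql+pymysql://user:password@localhost:3306/mydb")
df = pd.read_sql(query, engine)

# New path: connectorx reading straight from a connection string. This is
# where concurrent writes can trip the count-then-fetch race described
# in the reply above.
df = cx.read_sql("mysql://user:password@localhost:3306/mydb", query)
```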