Python segfault / corrupted double-linked list (not small) in a docker container #378

Open
t0k4rt opened this issue Oct 20, 2022 · 4 comments
Labels
bug Something isn't working

t0k4rt commented Oct 20, 2022

What language are you using?

Python

What version are you using?

0.3.0

What database are you using?

Postgresql

What dataframe are you using?

Pandas

Can you describe your bug?

I have a Python script running in Docker that loads data from SQL into a pandas DataFrame. Depending on the volume of data, it holds between 15 GB and 60 GB in memory.

This issue is not related to Docker memory limits: the script is monitored and fails well below the container's memory limit.
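
For anyone who wants to rule out the memory limit from inside the container, here is a minimal sketch that reads the cgroup accounting files. It assumes the standard cgroup v2 or v1 mount points under /sys/fs/cgroup; the function names are only illustrative and are not part of connectorx.

```python
from pathlib import Path
from typing import Optional, Tuple

# Usual cgroup file locations inside a container (v2 layout first, then v1).
# These paths are an assumption about the container setup.
CGROUP_FILES = [
    ("/sys/fs/cgroup/memory.current", "/sys/fs/cgroup/memory.max"),
    ("/sys/fs/cgroup/memory/memory.usage_in_bytes",
     "/sys/fs/cgroup/memory/memory.limit_in_bytes"),
]


def parse_limit(raw: str) -> Optional[int]:
    """Return the limit in bytes, or None when cgroup v2 reports 'max' (no limit)."""
    raw = raw.strip()
    return None if raw == "max" else int(raw)


def read_memory_usage_and_limit() -> Optional[Tuple[int, Optional[int]]]:
    """Return (usage_bytes, limit_bytes) from the first cgroup layout found, else None."""
    for usage_path, limit_path in CGROUP_FILES:
        usage, limit = Path(usage_path), Path(limit_path)
        if usage.exists() and limit.exists():
            return int(usage.read_text()), parse_limit(limit.read_text())
    return None
```

Logging the result of `read_memory_usage_and_limit()` just before and after the `read_sql` call would show how close the process is to the limit at the moment it dies.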

The failure is hard to diagnose. The script mostly fails silently, and I have to check dmesg to see that a segfault occurred.
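
One way to make the failure less silent is to launch the script from a small wrapper. This is a sketch only, assuming a POSIX host: `-X faulthandler` is a standard CPython option that dumps the Python traceback to stderr on a fatal signal, and `subprocess` reports death by signal as a negative return code. The helper names are hypothetical.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Translate a subprocess return code into a readable cause."""
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed by that signal.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


def run_with_faulthandler(script_args):
    """Run a Python script in a child process with faulthandler enabled."""
    # The child prints its Python-level traceback to stderr if it receives
    # a fatal signal such as SIGSEGV or SIGABRT.
    proc = subprocess.run([sys.executable, "-X", "faulthandler", *script_args])
    return describe_exit(proc.returncode)
```

For example, `print(run_with_faulthandler(["/test_bug.py"]))` would print "killed by SIGSEGV" for a segfault, without needing access to dmesg on the host.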

It seems to happen once the data has finished downloading from the database.

This issue seems to be Docker-specific: when I run the script on my dev machine (without Docker), everything works.

There seem to be two cases:

First case: my Python script uses 15 GB of memory

When the data transfer is finished, the Python script fails silently and triggers a segfault:

[17623143.600654] python[696237]: segfault at 0 ip 00007f746595680a sp 00007ffc7119fe60 error 6 in connectorx.cpython-38-x86_64-linux-gnu.so[7f746555d000+201c000]
[17623143.606349] Code: 41 56 53 48 83 ec 18 49 89 fe 66 48 8d 3d 9e df 68 02 66 66 48 e8 e6 6e c0 ff 48 83 38 00 74 18 48 83 c0 08 48 83 38 00 74 2e <49> 83 06 ff 74 77 48 83 c4 18 5b 41 5e c3 66 48 8d 3d 70 df 68 02
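
The kernel line above can be decoded a little further. The sketch below parses it and computes the offset of the faulting instruction inside the shared object (instruction pointer minus mapping base). It also splits the x86 page-fault error code into its low three bits. The regex and the function name are my own, written against the line format shown above.

```python
import re

# Matches the kernel's "segfault at ..." message; all numbers are hexadecimal.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line: str) -> dict:
    """Extract the faulting library, offset, and access type from a dmesg line."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    ip, base = int(m["ip"], 16), int(m["base"], 16)
    err = int(m["err"], 16)
    return {
        "library": m["lib"],
        "fault_address": int(m["addr"], 16),
        "offset_in_library": ip - base,   # usable with tools such as objdump
        "page_present": bool(err & 1),    # bit 0: 0 means the page was not mapped
        "write_access": bool(err & 2),    # bit 1: 1 means the access was a write
        "user_mode": bool(err & 4),       # bit 2: 1 means the fault came from user space
    }
```

Applied to the first log line, this gives an offset of 0x3f980a into connectorx.cpython-38-x86_64-linux-gnu.so, a fault address of 0, and error bits meaning a user-mode write to an unmapped page, which reads like a write through a null pointer inside the connectorx extension.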

My process then restarts (I have a restart policy for failed containers), and on the restart the process does not fail. This happened every time.

Second case: my script uses 30 GB of memory

When the data transfer is finished, the Python script fails with the error "corrupted double-linked list (not small)" and seems to trigger the same kind of segfault:

[17623143.600654] python[696237]: segfault at 0 ip 00007f746595680a sp 00007ffc7119fe60 error 6 in connectorx.cpython-38-x86_64-linux-gnu.so[7f746555d000+201c000]
[17623143.606349] Code: 41 56 53 48 83 ec 18 49 89 fe 66 48 8d 3d 9e df 68 02 66 66 48 e8 e6 6e c0 ff 48 83 38 00 74 18 48 83 c0 08 48 83 38 00 74 2e <49> 83 06 ff 74 77 48 83 c4 18 5b 41 5e c3 66 48 8d 3d 70 df 68 02

My process then restarts (I have a restart policy for failed containers), but the process still fails.

What are the steps to reproduce the behavior?

I cannot reproduce this on my local machine because my local Docker instance hits the container memory limit and the container is shut down.

The host runs Debian 11 (128 GB RAM, 24 cores).

Our containers use the latest Python 3.8.15, built using pyenv with these specific build flags:
RUN CONFIGURE_OPTS="--enable-shared" PYTHON_CFLAG="-march=haswell -O3 -pipe" pyenv install ${PYTHON_VERSION}

Database setup if the error only happens on specific data or data type
Example query / code

This script and query should generate the same kind of data we are using, at a high enough volume, in a Docker container:

test_bug.py

import connectorx as cx
import time
import os

query="""
WITH time as (SELECT generate_series('2022-01-01', '2022-08-01', '1 second'::interval) as timestamp_client),
ids AS (SELECT cid from (values('B45668C2-BFDC-4861-A38D-6141933F6940'),('40ABA32A-24EE-4876-8568-8E8E51D1D942'),('1837CE67-4BCC-4936-BC6D-76874BE1C4FF'),('D9194F09-7122-4EC7-AE81-FCB13A06B4EA'), ('645641C3-4E84-4475-AF85-1DFFBFE18726')) AS x(cid))
SELECT
    cid,
    timestamp_client,
    500*random() as accuracy,
    'ios' as os,
    'UTC' as timezone,
    40562 as place_id,
    random() as confidence,
    500*random() as distance
FROM ids
JOIN time ON True
ORDER BY timestamp_client ASC
"""
print(time.ctime())
print(os.environ.get('PG_CONN_URL'))
print(query)
result=cx.read_sql(os.environ.get('PG_CONN_URL'), query)
print(result.shape)
print(time.ctime())

Dockerfile

FROM python:3.8-bullseye

# install dependencies
RUN pip install connectorx pandas==1.3.5
RUN pip list
COPY ./test_bug.py /test_bug.py

ENTRYPOINT ["python", "/test_bug.py"]

docker build --pull -t connectorx-bug -f Dockerfile .
docker run --env PG_CONN_URL=your_db_conn_url connectorx-bug
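
To get a feel for the volume the query above produces, here is a rough back-of-the-envelope sketch. It assumes `generate_series` includes both endpoints and that no daylight-saving shift falls inside the range (for example, a server running in UTC). It also assumes the five numeric and timestamp columns take 8 bytes per value in pandas. The three string columns (cid, os, timezone) are not counted, so this is only a lower bound.

```python
from datetime import datetime


def estimate_rows(start: datetime, end: datetime, n_ids: int) -> int:
    """Rows produced by a 1-second generate_series cross-joined with n_ids ids."""
    timestamps = int((end - start).total_seconds()) + 1  # both endpoints included
    return timestamps * n_ids


def min_bytes_fixed_width(rows: int, n_columns: int, bytes_per_value: int = 8) -> int:
    """Lower bound on memory for fixed-width columns only."""
    return rows * n_columns * bytes_per_value


rows = estimate_rows(datetime(2022, 1, 1), datetime(2022, 8, 1), n_ids=5)
# timestamp_client, accuracy, place_id, confidence, distance
min_bytes = min_bytes_fixed_width(rows, n_columns=5)
print(f"{rows:,} rows, at least {min_bytes / 1e9:.1f} GB before string columns")
```

Under these assumptions the query yields 91,584,005 rows. The string columns will add substantially to the fixed-width lower bound, since pandas stores them as Python objects by default.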

What is the error?

Segfault

[17623143.600654] python[696237]: segfault at 0 ip 00007f746595680a sp 00007ffc7119fe60 error 6 in connectorx.cpython-38-x86_64-linux-gnu.so[7f746555d000+201c000]
[17623143.606349] Code: 41 56 53 48 83 ec 18 49 89 fe 66 48 8d 3d 9e df 68 02 66 66 48 e8 e6 6e c0 ff 48 83 38 00 74 18 48 83 c0 08 48 83 38 00 74 2e <49> 83 06 ff 74 77 48 83 c4 18 5b 41 5e c3 66 48 8d 3d 70 df 68 02

And sometimes "corrupted double-linked list (not small)"

@t0k4rt t0k4rt added the bug Something isn't working label Oct 20, 2022

t0k4rt commented Oct 24, 2022

When I have some time, I'll check whether it's related to Python being built from source.

@Babbleshack

I'm not sure whether you had time to test against different Python versions, but we are experiencing a similar issue on Python 3.9.14. Specifically, we are doing a join on large datasets (>60 GB).

t0k4rt commented Nov 8, 2022

> I'm not sure whether you had time to test against different Python versions, but we are experiencing a similar issue on Python 3.9.14. Specifically, we are doing a join on large datasets (>60 GB).

I'm working on it: I'm building some Docker images to test my code with different Python versions. I'll keep you updated when I have news!

kmatt commented Jan 12, 2023

Similar here: selecting 100,000 rows from MS SQL Server on Ubuntu 22.04.1 (5.15.0-56-generic), Python 3.10.6.

Not running in Docker, but in a VMware VM in this case:

[1191578.055637] show_signal_msg: 22 callbacks suppressed
[1191578.055642] python3[408830]: segfault at 0 ip 00007f75e99e0bca sp 00007ffc0b5e5660 error 6 in connectorx.cpython-310-x86_64-linux-gnu.so[7f75e9694000+1ee8000]
[1191578.055663] Code: 41 56 53 48 83 ec 18 48 89 fb 66 48 8d 3d 46 5e 60 02 66 66 48 e8 26 3b cb ff 48 83 38 00 74 17 48 83 c0 08 48 83 38 00 74 2d <48> ff 0b 74 75 48 83 c4 18 5b 41 5e c3 66 48 8d 3d 19 5e 60 02 66
