librav1e produces segfault #485

Open
tatref opened this issue Jul 5, 2024 · 26 comments

Comments

@tatref

tatref commented Jul 5, 2024

Trying to encode a file to AV1 with librav1e results in a segfault:

podman run -it --rm -v "$PWD:$PWD" -w "$PWD" docker.io/mwader/static-ffmpeg -v debug -i PXL_20240630_122849440.mp4 -c:v librav1e 'PXL_20240630_122849440.av1.mp4'; echo $?

There is no output, but dmesg shows:

librav1e[2506]: segfault at 0 ip 00007f7e713a5b48 sp 00007f7e6b3ac0f0 error 6 in ffmpeg[7f7e6d895000+5b11000] likely on CPU 3 (core 3, socket 0)

Copying the ffmpeg bin from container to host results in the same error.
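
For reference, the dmesg line above can be decoded a bit further. This is only a sketch: `fault_offset` and `decode_fault` are helper names made up here, and it assumes the usual x86 page-fault error code layout (bit 0 = protection violation vs. page not present, bit 1 = write, bit 2 = user mode). The computed value is the offset into the `ffmpeg[...]` mapping that dmesg reports, which is not necessarily the same as an address inside the ELF file.

```shell
# Offset of a faulting instruction pointer inside the mapping dmesg reports.
# Usage: fault_offset <ip> <mapping_start>   (hex, 0x prefix optional)
fault_offset() {
    ip=${1#0x}; base=${2#0x}
    printf '0x%x\n' $(( 0x$ip - 0x$base ))
}

# Decode the low three bits of an x86 page-fault error code.
# Usage: decode_fault <code>
decode_fault() {
    code=$1
    if [ $(( code & 1 )) -ne 0 ]; then p="protection violation"; else p="page not present"; fi
    if [ $(( code & 2 )) -ne 0 ]; then w="write"; else w="read"; fi
    if [ $(( code & 4 )) -ne 0 ]; then u="user mode"; else u="kernel mode"; fi
    echo "$w, $u, $p"
}

# Values taken from the dmesg line in this report.
fault_offset 00007f7e713a5b48 7f7e6d895000   # prints 0x3b10b48
decode_fault 6                               # prints: write, user mode, page not present
```

Together with "segfault at 0", error 6 reads as a user-mode write through a null pointer.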

@wader
Owner

wader commented Jul 5, 2024

Hey, interesting. The first step, i think, would be to see if the problem can be reproduced with an ffmpeg linked against glibc. If so, i would guess it's an ffmpeg or librav1e bug somehow; if not, we have to figure out what difference musl etc. makes.

Are you able to share PXL_20240630_122849440.mp4, some small cuts of it or some other video that reproduces the problem?

btw does -v trace give any hints about what is going on before it crashes?

@tatref
Author

tatref commented Jul 6, 2024

I have tested multiple input files (hevc, x264, and xvid), and all of them produce a crash. Encoding to x264 is OK. So I think the issue is with librav1e.

-v trace doesn't produce anything useful.

Running with gdb gives:

Thread 9 "enc0:0:librav1e" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 204808]
0x00007ffff2ba5b48 in ?? ()
(gdb) bt
#0  0x00007ffff2ba5b48 in ?? ()
#1  0x00007ffff2b44cf9 in ?? ()
#2  0x00007ffff2b43f83 in ?? ()
#3  0x00007ffff2b45c20 in ?? ()
#4  0x00007ffff2a9f612 in ?? ()
#5  0x00007ffff2a9baa0 in ?? ()
#6  0x00007ffff2a89c5b in rav1e_send_frame ()
#7  0x00007fffef9a9f4d in ?? ()
#8  0x00007fffef84c7b4 in ?? ()
#9  0x00007fffef84ca34 in avcodec_send_frame ()
#10 0x00007fffef1b60cc in ?? ()
#11 0x00007fffef1b6b81 in encoder_thread ()
#12 0x00007fffef1d1ff9 in ?? ()
#13 0x00007ffff0e9bd75 in ?? ()
#14 0x0000000000000000 in ?? ()

Do you know if it is possible to compile with debug symbols? (not sure if it can be useful)

@wader
Owner

wader commented Jul 6, 2024

Ok, could you try with Alpine's own ffmpeg, which also has librav1e?

docker run --rm -ti -v "$PWD:$PWD" -w "$PWD" alpine:edge sh -c 'apk add ffmpeg && ffmpeg -i PXL_20240630_122849440.mp4 -c:v librav1e -t 0.1s PXL_20240630_122849440.av1.mp4'

And also some glibc-based distro like debian?

docker run --rm -ti -v "$PWD:$PWD" -w "$PWD" debian:sid sh -c 'apt-get update && apt-get install ffmpeg && ffmpeg -i PXL_20240630_122849440.mp4 -c:v librav1e -t 0.1s PXL_20240630_122849440.av1.mp4'

About debug symbols: yes, it is possible. Remove --disable-debug and maybe prefix the ffmpeg configure invocation with CFLAGS, e.g. CFLAGS="-O0 -ggdb" ./configure .... etc

@tatref
Author

tatref commented Jul 6, 2024

The Alpine and Debian containers work fine.

I tried to recompile with the debug flags, but I don't get any more information.

@wader
Owner

wader commented Jul 6, 2024

Ok then. Next i would try without librsvg, which is also rust; there was some issue with duplicate symbols.

I didn't manage to reproduce locally with some files. Are you able to share a file that triggers this? It would make it a lot easier to help.

Weird about the debug symbols, there must be something more then 🤔

@wader
Owner

wader commented Jul 6, 2024

Maybe also try with just librav1e. You could also compare with how alpine does things: https://git.alpinelinux.org/aports/tree/community/rav1e/APKBUILD?h=3.19-stable

@tatref
Author

tatref commented Jul 6, 2024

Thanks for the help!

I'll do some more testing tomorrow

Also I found an old post of yours about a similar issue lu-zero/cargo-c#98

@wader
Owner

wader commented Jul 6, 2024

> Thanks for the help!

No problem!

> I'll do some more testing tomorrow

👍 A tip is to minimize the Dockerfile as much as possible first and then start digging into the details. That way there will be fewer unrelated moving parts and it is much faster to iterate and try things. But again, if you have a test file i can use, that would be great.

> Also I found an old post of yours about a similar issue lu-zero/cargo-c#98

I think that was about rust itself crashing at build time?

@tatref
Author

tatref commented Jul 6, 2024

You can download the example file I used here: https://photos.app.goo.gl/WVZ7D6giYhYFmbs36

However I face the issue with multiple files, so I suppose it makes no difference.

Also, the crash happens right at the beginning, so it's pretty easy to reproduce.

> I think that was about rust itself crashing at build time?

My bad, I did a quick search on "cargo cbuild" and "segfault". I didn't notice at first that you were the author of the issue, that's funny!

@wader
Owner

wader commented Jul 6, 2024

> You can download the example file I used here: https://photos.app.goo.gl/WVZ7D6giYhYFmbs36

> However I face the issue with multiple files, so I suppose it makes no difference.

> Also, the crash happens right at the beginning, so it's pretty easy to reproduce.

Thanks. Weirdly, it seems to work fine for me on a macbook m3 (arm64). What CPU are you using? Could it be that librav1e or ffmpeg ends up using some instruction that is not available (feature detection at build time on the build host etc)? But then it usually crashes with SIGILL, hmm.

$ docker run -it --rm -v "$PWD:$PWD" -w "$PWD" docker.io/mwader/static-ffmpeg:latest -v debug -i PXL_20240630_122849440.mp4 -c:v librav1e 'PXL_20240630_122849440.av1.mp4'; echo $?
...
0
$ ffprobe -hide_banner -i PXL_20240630_122849440.av1.mp4
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'PXL_20240630_122849440.av1.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomav01iso2mp41
    encoder         : Lavf61.1.100
  Duration: 00:00:09.78, start: 0.000000, bitrate: 11581 kb/s
  Stream #0:0[0x1](und): Video: av1 (libdav1d) (Main) (av01 / 0x31307661), yuv420p(tv, smpte170m/bt470bg/bt709, progressive), 1920x1080 [SAR 1:1 DAR 16:9], 11495 kb/s, 30 fps, 30 tbr, 15360 tbn (default)
      Metadata:
        handler_name    : ISO Media file produced by Google Inc.
        vendor_id       : [0][0][0][0]
        encoder         : Lavc61.3.100 librav1e
  Stream #0:1[0x2](eng): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 127 kb/s (default)
      Metadata:
        handler_name    : ISO Media file produced by Google Inc.
        vendor_id       : [0][0][0][0]

> > I think that was about rust itself crashing at build time?

> My bad, I did a quick search on "cargo cbuild" and "segfault". I didn't notice at first that you were the author of the issue, that's funny!

😄

@tatref
Author

tatref commented Jul 7, 2024

A few more details on my setup: the CPU is x86_64, and I'm running Debian 12 and Oracle Linux 9 (Red Hat 9) under VirtualBox.

@wader
Owner

wader commented Jul 7, 2024

Hmm interesting, could it be that the VM lacks support for SSE instructions etc? But looking at the code https://github.com/xiph/rav1e/blob/e34e772e47b01169b6f75a4589c056624ea886a4/src/cpu_features/x86.rs#L20 it seems like it does runtime detection, hmm. Maybe you can check the VM settings? If that does not help, i think we need to get a proper debug build and inspect things with gdb.
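
A quick way to see what the guest actually exposes is to look at the flags line of /proc/cpuinfo inside the VM. A minimal sketch, assuming an x86 Linux guest; `cpu_flags` is a made-up helper and the flag list is only a sample, not the exact set rav1e checks.

```shell
# List which of a few SIMD flags a cpuinfo-style file advertises.
# Usage: cpu_flags [cpuinfo_file]   (defaults to /proc/cpuinfo)
cpu_flags() {
    file=${1:-/proc/cpuinfo}
    # Take the first "flags" line, split it into one token per line,
    # and keep only exact matches for the flags of interest.
    grep -m1 '^flags' "$file" | tr ' \t' '\n\n' \
        | grep -x -E 'sse2|ssse3|sse4_1|avx|avx2|avx512f' | LC_ALL=C sort
}

# Only query the real file where it exists (Linux).
if [ -r /proc/cpuinfo ]; then
    cpu_flags
fi
```

If a flag shows up on the host but not in the guest, the VM settings would be the first thing to look at.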

@wader
Owner

wader commented Jul 12, 2024

Hey, did you get anywhere with this?

@tatref
Author

tatref commented Jul 16, 2024

Sorry for the delay

I managed to reproduce the issue on bare metal. The CPU is AMD Ryzen 7 5800X

It would be great if someone else could reproduce it on different hardware.

@wader
Owner

wader commented Jul 26, 2024

Ok! That is strange. If you have time it would be great to try to minimize the Dockerfile. Maybe something like: remove everything except building rav1e and ffmpeg, verify it still crashes, and after that maybe try changing the build to be more like alpine https://git.alpinelinux.org/aports/tree/community/rav1e/APKBUILD?h=3.19-stable#n34 ? ... i see that they use some newer cargo-c stuff. Not sure if RUSTFLAGS="-C target-feature=+crt-static" is more or less the same as --library-type staticlib. As for the "fixes static linking flags" thing, i recognise the -lgcc_eh issue but am not sure about -lssp_nonshared and -lc. I would probably try doing it the alpine way and patch the pkgconfig file.
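
Before patching anything, it might help to look at which static link flags the installed rav1e.pc carries. A sketch only: `pc_private_libs` is a hypothetical helper, and the path in the usage comment is an assumption about where cargo cinstall put the file.

```shell
# Print the Libs.private flags from a pkg-config .pc file.
# Usage: pc_private_libs /path/to/rav1e.pc
pc_private_libs() {
    sed -n 's/^Libs\.private:[[:space:]]*//p' "$1"
}

# Example (path is an assumption, adjust to the actual install prefix):
#   pc_private_libs /usr/local/lib/pkgconfig/rav1e.pc
```

`pkg-config --static --libs rav1e` should show the same flags merged with the regular Libs line, which is what a static ffmpeg link ends up seeing.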

@wader
Owner

wader commented Jul 29, 2024

btw it might be worth looking through the rav1e issues to see if something looks similar. Things like:
https://github.com/xiph/rav1e/issues?q=Ryzen
https://github.com/xiph/rav1e/issues?q=illegal
https://github.com/xiph/rav1e/issues?q=segfault

@wader
Owner

wader commented Jul 29, 2024

One suggestion in the issues is to try with --no-default-features, so maybe try changing cargo cinstall --release to cargo cinstall --release --no-default-features? Not a fix, but it might give some clue.

@tatref
Author

tatref commented Jul 29, 2024

I did some testing today

rav1e works fine if I only enable x264 and rav1e. I kept all the compilation flags as they are.

I'm still not sure why it crashes with the full Dockerfile.

@wader
Owner

wader commented Jul 30, 2024

That is very interesting! Could you try re-adding librsvg and see if it starts to crash again? That is my main suspect: statically linking two rust-based libraries causes some symbol conflict/mixing that is bad... but if so, why it would only affect a certain type of cpu is a bit of a mystery. But i've seen weirder things :)

@wader
Owner

wader commented Jul 30, 2024

I tried to add a rav1e sanity test and the CI job segfaulted in a similar way #490 🤔

@tatref
Author

tatref commented Jul 30, 2024

I stripped the Dockerfile of everything except glib, harfbuzz, cairo, pango, librsvg, fdk_aac, x264, and rav1e, this reproduces the issue!

@wader
Owner

wader commented Jul 30, 2024

Hey, yeap! i also managed to reproduce it myself now on my old intel macbook, and it only seems to happen when linking with both rav1e and librsvg. The stack trace suggests it crashes inside the rust rayon crate, somewhere here https://github.com/rayon-rs/rayon/blob/main/rayon-core/src/registry.rs#L329-L338 ...my guess is that the crash is somehow related to two rust runtimes being linked together (librsvg and rav1e both use rayon but different versions, so i think it should be fine, but not sure). But it's still weird that it works on arm64; maybe for some reason symbols resolve differently and it happens to work?

Some progress at least! Will do more digging tomorrow or so.

@wader
Owner

wader commented Aug 3, 2024

Update: i tried to recreate the setup with two staticlib rust crates that use rayon with the same dependencies and a c program that static-pie links them, but there was no crash on either arm64 or amd64. I'll keep digging from time to time.

BTW, for your use case, would using libsvtav1 be an option?

@tatref
Author

tatref commented Aug 6, 2024

In the end, I used a different image with a non-static ffmpeg that works for me.

I tried to recompile rav1e with rayon 1.0, it works, but ffmpeg still crashes.

Could it be related to symbol mangling? I suppose different versions should have different names for proper linking?

@wader
Owner

wader commented Aug 6, 2024

> In the end, I used a different image with a non-static ffmpeg that works for me.

👍

> I tried to recompile rav1e with rayon 1.0, it works, but ffmpeg still crashes.

The rav1e cli tool works but not ffmpeg?

> Could it be related to symbol mangling? I suppose different versions should have different names for proper linking?

Yeap, i'm not sure what is going on, but i suspect there is some issue with mismatching rust runtime symbols etc., e.g. that the runtime is compiled a little bit differently between libs and then gets mixed up. But it's a bit of a mystery why arm64 seems to work but not amd64... maybe just by chance.
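
One way to get a first look at the symbol theory would be to list the globally defined symbols of both static libraries and see where they overlap. A rough sketch, not something that was run as part of this issue: `dup_symbols` is a made-up helper, and the library paths in the usage comment are assumptions. Some overlap between two rust staticlibs is probably expected, so a hit is a lead to follow rather than proof of a bug.

```shell
# Print symbol names that are defined in both of two `nm` listings.
# Usage: dup_symbols a.syms b.syms
# Each input file holds the output of: nm -g --defined-only <lib>.a
dup_symbols() {
    a=$(mktemp) && b=$(mktemp) || return 1
    # Symbol lines have three fields (address, type, name);
    # archive member headers and blank lines do not.
    awk 'NF == 3 { print $3 }' "$1" | LC_ALL=C sort -u > "$a"
    awk 'NF == 3 { print $3 }' "$2" | LC_ALL=C sort -u > "$b"
    LC_ALL=C comm -12 "$a" "$b"
    rm -f "$a" "$b"
}

# Example (library paths are assumptions):
#   nm -g --defined-only /usr/local/lib/librav1e.a > rav1e.syms
#   nm -g --defined-only /usr/local/lib/librsvg-2.a > rsvg.syms
#   dup_symbols rav1e.syms rsvg.syms
```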

@tatref
Author

tatref commented Aug 6, 2024

Yes rav1e works. The workflow is a bit different, because input files have to be in y4m format, but it works fine.
