Optimize rgb565 serialization #317

Merged


hchargois
Contributor

For context: I'm trying to drive a rev A display from a (gen 1) Raspberry Pi. On a weak CPU like that, the sample program simple-program.py is extremely slow (~30 s to paint the background image) and the CPU usage is very high (constantly ~80 %).

So I did some profiling and found that the serialization to little-endian RGB565 was the main culprit for CPU usage. Using numpy yields a huge improvement.

I've added some timing debug messages to the simple-program, here's before:

2023-09-05 23:28:01.624 [DEBUG] background picture set (took 29.658 s)
2023-09-05 23:28:09.804 [DEBUG] refresh done (took 4.121 s)
2023-09-05 23:28:13.989 [DEBUG] refresh done (took 4.178 s)
2023-09-05 23:28:18.138 [DEBUG] refresh done (took 4.142 s)
2023-09-05 23:28:22.278 [DEBUG] refresh done (took 4.134 s)

and after:

2023-09-05 23:29:01.502 [DEBUG] background picture set (took 6.218 s)
2023-09-05 23:29:03.760 [DEBUG] refresh done (took 1.265 s)
2023-09-05 23:29:04.970 [DEBUG] refresh done (took 1.204 s)
2023-09-05 23:29:06.182 [DEBUG] refresh done (took 1.205 s)
2023-09-05 23:29:07.392 [DEBUG] refresh done (took 1.204 s)

Also, the CPU usage goes from ~80% to ~40%.

So, it's still not fast, but a ~5x speedup in showing the BG and a >3x speedup in refreshing the info make it much more usable.

On my desktop with a more powerful CPU, the timings only show a slight improvement (BG paint goes from 1.49s to 1.33s, refresh from 0.22s to 0.19s). However, there's a 2x difference in CPU usage, from ~30% to ~15%.

Sure, there's the downside of adding numpy as a dep, but I think that's not a big problem, lots of things depend on it, the Raspberry Pi OS even comes with it pre-installed.
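A minimal sketch of what a vectorized RGB565 conversion can look like, for readers who want to see the idea. It is not the code merged in this PR: the function name `rgb888_to_rgb565_le` is hypothetical, and the input is assumed to be a `(height, width, 3)` `uint8` array (for a PIL image, something like `np.asarray(image.convert("RGB"))` would produce one).

```python
import numpy as np


def rgb888_to_rgb565_le(pixels: np.ndarray) -> bytes:
    """Convert a (height, width, 3) uint8 RGB array to little-endian RGB565 bytes.

    Hypothetical helper for illustration; not the project's actual function.
    """
    # Widen to 16 bits first so the left shifts below cannot overflow.
    r = pixels[..., 0].astype(np.uint16) >> 3  # keep the top 5 bits of red
    g = pixels[..., 1].astype(np.uint16) >> 2  # keep the top 6 bits of green
    b = pixels[..., 2].astype(np.uint16) >> 3  # keep the top 5 bits of blue

    # Pack as RRRRRGGG GGGBBBBB, the same layout as the struct.pack('<H', ...) loop.
    rgb565 = (r << 11) | (g << 5) | b

    # '<u2' forces little-endian 16-bit words regardless of the host byte order.
    return rgb565.astype("<u2").tobytes()
```

The whole image is converted in a handful of array operations instead of one Python-level loop iteration per pixel, which is where the time goes on a slow CPU.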

@ghost

ghost commented Sep 6, 2023

If CPU usage is this much of a concern, then the screen I am working on writing alternative software for may be a better option. All the drawing is done on the device end. See the main README.md. It is under the section called Fuldho.

@mathoudebine
Owner

Thanks for optimizing this code! I agree adding numpy is not a concern as it is a well known and maintained package

@hchargois
Contributor Author

> If CPU usage is this much of a concern, then the screen I am working on writing alternative software for may be a better option. All the drawing is done on the device end. See the main README.md. It is under the section called Fuldho.

Thanks, that's interesting. However, I'd rather use the screen I already have, and I like how the protocol is straightforward (simply pasting bitmaps).

@hchargois
Contributor Author

> Thanks for optimizing this code! I agree adding numpy is not a concern as it is a well known and maintained package

I only have a rev A screen, so I couldn't test the other, but I see that at least rev B also uses RGB565, so the same kind of optimization could be done. Rotating/reversing should also be easily done with numpy transposing/rolling operations, I think (I'm not a numpy expert at all).
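As a rough illustration of the rotating/reversing idea, here is a sketch using numpy slicing and `np.rot90` on a `(height, width, 3)` pixel array. The helper names are hypothetical, and whether the other revisions' protocols expect the data in exactly this orientation is not something this thread confirms.

```python
import numpy as np


def rotate_180(pixels: np.ndarray) -> np.ndarray:
    """Reverse both image axes, i.e. rotate by 180 degrees (hypothetical helper)."""
    return pixels[::-1, ::-1]


def rotate_90_ccw(pixels: np.ndarray) -> np.ndarray:
    """Rotate 90 degrees counter-clockwise; height and width swap (hypothetical helper)."""
    return np.rot90(pixels)
```

Both operations return views where possible, so the cost of a rotation stays small compared with a per-pixel Python loop.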

I did some further testing and I noticed that my first post is totally misrepresenting the performance gains, which are much bigger than I thought!

In the current code, the serialization to RGB565 and the sending of data to the serial port are interleaved. To better measure the performance gain of the serialization alone, I split the current pure-Python code into two steps: serializing all the lines first, then sending them:

        # Step 1: serialize the whole image to little-endian RGB565
        start = time.perf_counter()

        pix = image.load()
        line = bytes()
        lines = []
        for h in range(image_height):
            for w in range(image_width):
                R = pix[w, h][0] >> 3
                G = pix[w, h][1] >> 2
                B = pix[w, h][2] >> 3

                rgb = (R << 11) | (G << 5) | B
                line += struct.pack('<H', rgb)

                # Send image data by multiple of "display width" bytes
                if len(line) >= self.get_width() * 8:
                    lines.append(line)
                    line = bytes()
        # Write last line if needed
        if len(line) > 0:
            lines.append(line)

        end = time.perf_counter()
        logger.debug(f"serialization done (took {end-start:.3f} s)")

        # Lock queue mutex then queue all the requests for the image data
        start = time.perf_counter()

        with self.update_queue_mutex:
            for line in lines:
                self.SendLine(line)

        end = time.perf_counter()
        logger.debug(f"sending lines done (took {end-start:.3f} s)")

On my Raspberry Pi, here are the results of the pure-python serialization for a fullscreen image:

serialization done (took 21.458 s)
sending lines done (took 5.921 s)

And here's the numpy version:

serialization done (took 0.160 s)
sending lines done (took 5.957 s)

So, it's more than 100x faster! It seems that numpy is crazy efficient for this kind of thing.

Now, the paint speed is basically limited by the speed of writing the data to the serial port, and I don't know why that is so slow on the Raspberry Pi...

@ghost

ghost commented Sep 6, 2023

> Now, the paint speed is basically limited by the speed of writing the data to serial, which I don't know why it's so slow on the raspberry pi...

You have to keep in mind most of the screens supported by this project are using a very slow USB 1.1 interface. A whole screen blit will take 2-3 seconds. But there's also a possibility that python is still acting as a noticeable bottleneck on your slower ARM board. Python is very slow after all, especially at raw number crunching as you are seeing here.

Perhaps in time there will be a better solution available. I've been thinking of starting another project to handle USB screens like the Turing ones at some point once my current hidss project is finished. The focus of that project would be to improve the performance of the driver program by designing a new driver program with a focus on performance optimization. This won't make much difference for PCs but it may make the screens usable by lower performing hardware, which would be my primary motivation.

@hchargois
Contributor Author

hchargois commented Sep 6, 2023

> You have to keep in mind most of the screens supported by this project are using a very slow USB 1.1 interface. A whole screen blit will take 2-3 seconds.

Sure, USB 1.1 is slow, but it's still 12 Mbit/s, and the screen is recognized at that speed in both my Raspberry Pi and desktop PC.

Ignoring some USB overhead, at a full 12 Mbit/s, sending the 320*480*16 = 2,457,600 bits of a fullscreen paint would take about 0.2 s, which is much lower than the numbers I'm seeing, so we're not really limited by the USB bandwidth itself here.
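The back-of-the-envelope estimate can be reproduced in a few lines. This is only a lower bound under the stated assumptions (a 320x480 frame at 16 bits per pixel, the nominal 12 Mbit/s full-speed rate, and no protocol overhead at all); the constant and function names are made up for the example.

```python
# Assumed frame geometry and nominal USB 1.1 full-speed signalling rate.
WIDTH_PX = 320
HEIGHT_PX = 480
BITS_PER_PIXEL = 16  # RGB565
USB_FULL_SPEED_BITS_PER_S = 12_000_000


def min_transfer_time_s(width: int, height: int, bpp: int, bits_per_s: int) -> float:
    """Ideal time to push one full frame, ignoring all USB and protocol overhead."""
    return (width * height * bpp) / bits_per_s


frame_bits = WIDTH_PX * HEIGHT_PX * BITS_PER_PIXEL
ideal_s = min_transfer_time_s(WIDTH_PX, HEIGHT_PX, BITS_PER_PIXEL, USB_FULL_SPEED_BITS_PER_S)
print(f"{frame_bits} bits -> {ideal_s:.4f} s at full speed")
```

Real throughput will be lower than the nominal rate, but the measured 1.2 s and 6 s paints are still several times above this bound.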

> But there's also a possibility that python is still acting as a noticeable bottleneck on your slower ARM board. Python is very slow after all, especially at raw number crunching as you are seeing here.

After the serialization to RGB565, there's not really any "number crunching", it's just writing bytes to a file descriptor, which shouldn't really be impacted by Python's speed (or rather slowness).

But I did think the same thing as you! So to test that, I implemented a minimal Go program to paint a background, and I got the exact same results as with Python: ~1.2 s on my desktop PC, ~6 s on the RPi. Some further stracing shows that the individual write system calls themselves are indeed slower on the RPi than on the PC.
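For anyone wanting to repeat this kind of measurement without strace, here is a rough Python sketch that times each write on a file descriptor. It is neither the Go program nor the strace session mentioned above; `time_writes` is a hypothetical helper, and the chunk size in the usage note is an assumption borrowed from the `self.get_width() * 8` threshold in the code quoted earlier, with a width of 320.

```python
import os
import time


def time_writes(fd: int, payload: bytes, chunk_size: int) -> list[float]:
    """Write payload to fd in chunks and return the duration of each chunk in seconds."""
    durations = []
    for offset in range(0, len(payload), chunk_size):
        view = memoryview(payload)[offset:offset + chunk_size]
        start = time.perf_counter()
        # os.write may accept fewer bytes than requested, so loop until the chunk is gone.
        while len(view) > 0:
            written = os.write(fd, view)
            view = view[written:]
        durations.append(time.perf_counter() - start)
    return durations
```

Calling it with a frame-sized payload, for example `time_writes(fd, bytes(320 * 480 * 2), 320 * 8)` on the descriptor of the opened serial device, gives one duration per chunk, which makes it easy to compare per-write latency between two machines.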

> Perhaps in time there will be a better solution available. I've been thinking of starting another project to handle USB screens like the Turing ones at some point once my current hidss project is finished. The focus of that project would be to improve the performance of the driver program by designing a new driver program with a focus on performance optimization. This won't make much difference for PCs but it may make the screens usable by lower performing hardware, which would be my primary motivation.

Yep, a "driver" backend in a fast language would be nice, I thought about doing the same.

@ghost

ghost commented Sep 6, 2023

> Yep, a "driver" backend in a fast language would be nice, I thought on doing the same.

I would be using C most likely if I end up doing it. I also wanted to implement a "separation of concerns" of sorts so the whole program doesn't have to run as root. For hidss I am doing this by making a minimal controller program that will be setuid root and the rest of the programs will be unprivileged. Unfortunately you can't do this with a python program because most kernels require a setuid program to be a native executable and not a script that requires an interpreter.

@hchargois
Contributor Author

OK so this message has nothing to do with turing-smart-screen-python per se, but I'm happy to say that I've managed to make my screen work well with my Raspberry Pi. In the chance that someone else stumbles upon this thread trying to do the same thing, I'll share how I made it work.

As per the info in the Raspberry Pi forums [1] I simply added dwc_otg.speed=1 to the /boot/cmdline.txt to set the USB host controller to USB 1.1 speed. My RPi is a 1 B, but AFAIK the same controller is used in RPIs 2 and 3 so I expect they will exhibit the same problem & solution.

With that parameter, I have fullscreen paints in around 1.5 s, so it's just marginally slower than my desktop PC (1.2 s).

Of course, the Ethernet is now very slow by today's standards (~7 Mbit/s), since on these boards it is attached through the same USB controller, but that's more than enough for my use case.
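The cmdline.txt change described above can be sketched as a small string transformation. This is an illustrative helper, not part of the project: `add_kernel_param` is a hypothetical name, and the sample command line in the usage note is made up. The one thing it takes care of is that cmdline.txt has to remain a single line, so the parameter is appended to that line rather than added on a new one.

```python
def add_kernel_param(cmdline: str, param: str) -> str:
    """Return the kernel command line with `param` set, kept on a single line.

    Hypothetical helper: any existing setting with the same key is replaced.
    """
    key = param.split("=", 1)[0]
    # Split on any whitespace, which also drops stray newlines.
    tokens = [tok for tok in cmdline.split() if tok.split("=", 1)[0] != key]
    tokens.append(param)
    return " ".join(tokens) + "\n"
```

For example, `add_kernel_param("console=tty1 root=/dev/mmcblk0p2 rootwait\n", "dwc_otg.speed=1")` returns the same line with `dwc_otg.speed=1` appended; writing the result back to /boot/cmdline.txt requires root, and the change takes effect after a reboot.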

[1] https://forums.raspberrypi.com/viewtopic.php?t=53832

Repository owner locked and limited conversation to collaborators Sep 8, 2023
Repository owner unlocked this conversation Sep 8, 2023
@mathoudebine
Owner

Thanks @hchargois I added this info to the Troubleshooting page https://github.com/mathoudebine/turing-smart-screen-python/wiki/Troubleshooting#raspberry-pi-zero--1--2--3-display-refresh-is-too-slow

@mathoudebine mathoudebine merged commit 7b112cb into mathoudebine:main Sep 14, 2023
@hchargois hchargois deleted the optimize-rgb565-serialization branch September 16, 2023 22:28