@jafingerhut jafingerhut commented Jan 21, 2025

This PR also adds the --verbosity DEBUG option to CI builds, which I expect will help diagnose future CI build failures.

timeout 10800 ./run_p4_tests.sh -p ${TESTNAME} --arch tofino |& sed 's/^/tests: /'

echo "Killing bf_switchd and tofino-model processes ..."
sudo killall bf_switchd
Contributor

This will break any parallel setup. I would only kill the process returned by the shell.

Python has better facilities for this, IMHO, and ChatGPT and similar tools work really well for this kind of boilerplate script.

Contributor Author

@vgurevich Have you run multiple Tofino models and bf_switchd processes on the same base OS in parallel successfully? Without containers or VMs or other things like that to separate them from each other?

If that works, great.

I tried using bash mechanisms to capture the PID of the sudo run_*.sh runs, but killing that PID killed only the sudo process, not the run_*.sh script running as root, so it was not effective.
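One possible way around the "only the parent dies" problem is to start the command in its own process group and signal the whole group instead of a single PID. The sketch below uses only the Python standard library. The helper names `run_in_own_group` and `kill_group` are hypothetical, and this has not been tried against bf_switchd or tofino-model; for a command started through sudo, the signalling side would presumably also need sufficient privileges.

```python
import os
import signal
import subprocess


def run_in_own_group(cmd):
    """Start cmd as the leader of a new session (and so a new process group)."""
    return subprocess.Popen(cmd, start_new_session=True)


def kill_group(proc, grace_seconds=5.0):
    """Send SIGTERM to every process in proc's group, then SIGKILL if the
    group leader has not exited within grace_seconds.

    Returns the leader's return code. Only the leader is waited for;
    descendants that ignore SIGTERM could outlive this call.
    """
    # With start_new_session=True, the child's process group ID equals its PID.
    pgid = proc.pid
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        # The group is already gone; just collect the exit status.
        return proc.wait()
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)
        return proc.wait()
```

Compared with killall, this only touches processes descended from the command that was started, so it would not interfere with an unrelated bf_switchd on the same machine.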

Contributor Author

Note: The existing tests that I am aware of expect that a set of veth interfaces has been created before the test starts running, and each test assumes that whichever of those veth interfaces it wants to use is available for its sole use.

Thus any two tests that both use veth2 will conflict with each other if you attempt to run them both without Linux network namespaces or other tricks like that.

And that assumes the Tofino model and driver processes don't conflict with each other in other ways besides this, which they might. Running parallel tests on the same system seems like a trip down the rabbit hole. If you really want to test in parallel, it would be far easier to set up and maintain if you divided the tests N ways and ran the subsets on N different systems in parallel.
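The "divide the tests N ways" idea can be sketched as a small round-robin split. The function name `shard` is hypothetical, and the test names in the usage note are only illustrative. Each system would call it with the same test list and its own index.

```python
def shard(tests, num_shards, shard_index):
    """Return the tests assigned to shard_index out of num_shards.

    The list is sorted first so that every system computes the same
    assignment regardless of the order in which it discovered the tests.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    if not 0 <= shard_index < num_shards:
        raise ValueError("shard_index must be in range(num_shards)")
    # Round-robin: shard i takes items i, i + N, i + 2N, ... of the sorted list.
    return sorted(tests)[shard_index::num_shards]
```

For example, with the four tests chksum, basic_switching, default_entry and bri_handle split two ways, shard 0 gets basic_switching and chksum, and shard 1 gets bri_handle and default_entry. Round-robin balances the number of tests, not their run time, so a few long-running tests could still leave one system finishing much later than the others.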

Contributor Author

@fruffy I am fairly sure it is possible to run multiple tests in parallel on one system, but enabling that would take significantly more development time than what is in this PR. It isn't just a matter of replacing the uses of killall. I added comments to these scripts to make it very explicit that they only support running one test at a time on a system.

I added another script that runs all Tofino1 and Tofino2 tests that take at most about 5 minutes of time each, which is most of them. They take about 90 minutes to run. If I enabled all of the tests, the longest 5 or 6 would push the total up to about 5 hours. This seems like a reasonable length of test suite to run nightly or weekly, rather than pre-commit.

Contributor

Great!

We definitely had parallel scripts for the Tofino PTF tests using network namespaces, which is why I was concerned about this command (if we end up using these commands for the tests).

Also add a script that runs all Tofino1 and Tofino2 tests.

Signed-off-by: Andy Fingerhut <andy_fingerhut@alum.wustl.edu>
@jafingerhut changed the title from "Add script to run one tofino1 test" to "Add scripts to run one Tofino1 or 2 test, and to run all tests that take less than 5 mins each" Feb 27, 2025
failing the run-one-test.sh script if that does happen.

@@ -0,0 +1,318 @@
#! /bin/bash
Contributor

Can we try to combine this with #94? I am curious to see whether we can now run tests in parallel.

Contributor Author

I made a commit a moment ago to update this Bash script to use your new run-test.py program to run individual tests. We'll see how it goes.

…gerhut/open-p4studio into add-script-to-run-one-tofino1-test
…l tests

@jafingerhut
Contributor Author

@fruffy You can download the logs for the ubuntu-22.04 test run in CI and then run: egrep '(ERROR|exit_status)' log-file. When I do that, I see output like this:

2025-04-11T02:07:55.2865294Z Test ARCH=tofino exit_status=0 p4_14 basic_switching
2025-04-11T02:08:19.9937195Z Test ARCH=tofino exit_status=0 p4_16 bri_handle
2025-04-11T02:08:24.5327170Z 2025-04-11 02:08:24,532 - INFO - [switchdOutputThread] - switchd: 2025-04-11 02:08:24.532227 BF_SWITCHD ERROR - ERROR: bf_sys_dma_pool_create failed(-1) for dev_id 0 subdev_id 0 pool BF_DMA_CPU_PKT_RECEIVE_0_dev_0_0_Pool
2025-04-11T02:08:24.8915089Z 2025-04-11 02:08:24,891 - ERROR - [MainThread] - Process [switchd] (PID: 119763) is not alive (rc: 1).
2025-04-11T02:08:28.9256753Z Test ARCH=tofino exit_status=1 p4_16 bri_with_pdfixed_thrift
2025-04-11T02:08:34.5193524Z 2025-04-11 02:08:34,518 - INFO - [switchdOutputThread] - switchd: 2025-04-11 02:08:34.518823 BF_SWITCHD ERROR - ERROR: bf_sys_dma_pool_create failed(-1) for dev_id 0 subdev_id 0 pool BF_DMA_CPU_PKT_RECEIVE_0_dev_0_0_Pool
2025-04-11T02:13:35.8594007Z 2025-04-11 02:13:35,859 - ERROR - [MainThread] - Tests failed with exit code 1.
2025-04-11T02:13:35.8597081Z 2025-04-11 02:13:35,859 - ERROR - [MainThread] - Process [switchd] (PID: 121989) is not alive (rc: 1).
2025-04-11T02:13:39.8877294Z Test ARCH=tofino exit_status=1 p4_14 chksum
2025-04-11T02:13:44.2712849Z 2025-04-11 02:13:44,270 - INFO - [switchdOutputThread] - switchd: 2025-04-11 02:13:44.270755 BF_SWITCHD ERROR - ERROR: bf_sys_dma_pool_create failed(-1) for dev_id 0 subdev_id 0 pool BF_DMA_CPU_PKT_RECEIVE_0_dev_0_0_Pool
2025-04-11T02:13:44.8150662Z 2025-04-11 02:13:44,814 - ERROR - [MainThread] - Process [switchd] (PID: 125311) is not alive (rc: 1).
2025-04-11T02:13:48.8496136Z Test ARCH=tofino exit_status=1 p4_14 default_entry
2025-04-11T02:13:53.3384067Z 2025-04-11 02:13:53,338 - INFO - [switchdOutputThread] - switchd: 2025-04-11 02:13:53.337969 BF_SWITCHD ERROR - ERROR: bf_sys_dma_pool_create failed(-1) for dev_id 0 subdev_id 0 pool BF_DMA_CPU_PKT_RECEIVE_0_dev_0_0_Pool
2025-04-11T02:13:53.9365409Z 2025-04-11 02:13:53,936 - ERROR - [MainThread] - Process [switchd] (PID: 127537) is not alive (rc: 1).
2025-04-11T02:13:57.9648178Z Test ARCH=tofino exit_status=1 p4_14 deparse_zero

Every line with exit_status is output by my bash script named run-multiple-tests.sh, after each run of one test.

I never see the other lines, the ones with errors about bf_sys_dma_pool_create failing, when I run these same tests on my local Ubuntu 22.04 VM. I do not know yet what causes those errors, but from the log files here, it appears that whenever one occurs, the test fails.

Most of these tests pass on my local system. I see 11 out of 109 tests failing on my local system, vs. 107 out of 109 failing in CI.
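Pass and fail counts like the ones above could be tallied from the exit_status lines with a short script. This is a sketch that assumes every status line has the shape shown in the log excerpt ("Test ARCH=... exit_status=N <group> <name>"). The name `summarize` is hypothetical, and calling the p4_14/p4_16 field "group" is a guess at its meaning.

```python
import re
from collections import Counter

# Matches lines such as:
#   2025-04-11T02:08:19.9937195Z Test ARCH=tofino exit_status=0 p4_16 bri_handle
STATUS_RE = re.compile(r"Test ARCH=(\S+) exit_status=(\d+) (\S+) (\S+)\s*$")


def summarize(lines):
    """Count passed and failed tests and list the failed ones.

    Lines that are not exit_status lines (for example the BF_SWITCHD
    ERROR lines) are ignored.
    """
    counts = Counter()
    failed = []
    for line in lines:
        match = STATUS_RE.search(line)
        if match is None:
            continue
        arch, status, group, name = match.groups()
        if status == "0":
            counts["passed"] += 1
        else:
            counts["failed"] += 1
            failed.append((arch, group, name))
    return counts, failed
```

Passing an open log file works, since iterating over a file yields its lines: `counts, failed = summarize(open("log-file"))`, where log-file is the downloaded CI log.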

@jafingerhut changed the title from "Add scripts to run one Tofino1 or 2 test, and to run all tests that take less than 5 mins each" to "Add script to run all tests that take less than 5 mins each" Apr 11, 2025
@jafingerhut changed the title from "Add script to run all tests that take less than 5 mins each" to "Add script to run all tests that take less than 1 min each" Apr 16, 2025