
The execution of the project not only crashes in main_ga but also hangs midway in other scripts such as main_baseline #5

Closed
lilejin322 opened this issue Mar 28, 2024 · 13 comments

Comments

@lilejin322

lilejin322 commented Mar 28, 2024

Sorry for bothering you again. We've encountered some problems while running this project and don't know how to solve them.

Describe the issue

  The script test_main.py completes execution smoothly because it runs a single iteration. However, when switching to scripts like main_baseline.py and main_ga.py, which require long-running processes, the system appears to be unstable. Specifically, after running the scenario several times, it eventually crashes with a segmentation fault error reported by the shell.
  We've diagnosed the project using tools like pdb, but still couldn't pinpoint where the error comes from. A screenshot is shown below.

shell segmentation fault

Environment

  1. Hardware configuration
    CPU: Intel Core i9 14900K (24-core)
    Memory: 128GB
    Graphics Card: None
  2. Software configuration
    OS: Ubuntu 18.04
    Docker-CE: version 24.0.2
    Python: version 3.9.18
  3. The requirements listed in README.md are met

To reproduce

  1. Run python main_ga.py
  2. Open Dreamview in the browser
  3. After several iterations of the genetic cycle (g0s0, g0s1, …), the shell gets stuck
  4. From observation in the browser, the system appears to freeze whenever one of the ADCs arrives at its destination
  5. Tens of minutes later, the shell reports “Segmentation fault” and the Python script exits
  6. Run ~$ docker kill $(docker ps -q)
  7. Change the DoppelTest map data to san_mateo in config.py
  8. Update {ApolloROOT}/modules/common/data/global_flagfile.txt as well (see the sketch after this list)
  9. Run python main_baseline.py
  10. Steps 2-5 recur
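
For step 8, the flagfile update amounts to pointing the map flag at the new map folder. A minimal sketch of that edit, assuming global_flagfile.txt already carries a --map_dir entry and that the san_mateo map data was copied to Apollo's default map directory (adjust the path to wherever the DoppelTest map assets actually live):

# Assumption: global_flagfile.txt contains a --map_dir line; APOLLO_ROOT is the path from config.py
sed -i 's|^--map_dir=.*|--map_dir=/apollo/modules/map/data/san_mateo|' \
    "${APOLLO_ROOT}/modules/common/data/global_flagfile.txt"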

Current Result

  The script might report an error and exit after running for several generations.

Expected Result

  The system should exit normally after the timeout specified by the RUN_FOR_HOUR parameter defined in config.py.

Debugging Endeavors

  1. Checked whether the versions in the conda env meet requirements.txt
  2. Tried different map asset bundles
  3. Forced the containers to restart on each iteration within main() by moving the container start-up code into the while loop
    ctn = ApolloContainer(APOLLO_ROOT, 'ROUTE_0')
    ctn.start_instance()
    ctn.start_dreamview()
  4. In framework/scenario/ScenarioRunner.py, disabled all uses of MessageBroker by commenting them out with #
    mbk = MessageBroker(self.__runners)
    mbk.spin()

    mbk.broadcast(Topics.TrafficLight, tld.SerializeToString())

@YuqiHuai
Collaborator

YuqiHuai commented Mar 28, 2024

@lilejin322 Hi, this is a very interesting outcome of running DoppelTest and I have never seen this before. Do you have any screenshots displaying the segmentation fault?


Adding on top of that, DreamView does appear to be frozen after long hours of experiments. I reported this to Apollo about a year ago, but I could not provide enough context to help the developers debug (see Apollo Issue #13134). Since DreamView is frozen, sim_control can no longer be enabled/disabled, causing the entire container to appear frozen.

In my later projects, e.g., scenoRITA, I decided to separate SimControl from DreamView as a separate module (See sim_control_standalone), which (1) eliminated the need to run DreamView and (2) attempted to solve the SimControl teleporting issue (See link). I have migrated DoppelTest to Apollo 8.0 using this separated SimControl, but issues remain so the migration is still in a private repository.

Please let me know if you can verify the problem is actually from DreamView.

@lilejin322
Author

lilejin322 commented Mar 29, 2024

The bash shell only throws the segmentation fault error without any additional traceback details, which is weird.
shell segmentation fault
Furthermore, we pressed Ctrl+C to capture the call stack once the ScenarioRunner - INFO - progress output got stuck.
(screenshots of the interrupted call stacks)
In the run shown above, execution was stuck inside container.stop_recorder(), so next we disabled the recorder by changing

runners = srunner.run_scenario(g_name, s_name, True)

to

runners = srunner.run_scenario(g_name, s_name, False)

We ran this modified version twice and, as before, sent a keyboard interrupt whenever the shell got stuck; the screenshots are shown below.
First run
Second run

To exclude bugs from other auxiliary modules, we slightly modified test_main.py to make the scenario loop indefinitely.

from datetime import datetime
from framework.scenario.ad_agents import ADAgent, ADSection
from framework.scenario.pd_agents import PDSection
from framework.scenario.tc_config import TCSection
from apollo.ApolloContainer import ApolloContainer
from config import (APOLLO_ROOT, MAX_ADC_COUNT, RUN_FOR_HOUR)
from framework.scenario import Scenario
from framework.scenario.ScenarioRunner import ScenarioRunner

def main():

    start_time = datetime.now()
    index = 0

    # A fixed scenario with two ADCs on mirrored routes
    one = Scenario(
        ad_section=ADSection(
            [
                ADAgent(['lane_19', 'lane_25'], 40, 105, 0),
                ADAgent(['lane_25', 'lane_19'], 115, 40, 0),
            ]
        ),
        pd_section=PDSection([]),
        tc_section=TCSection.get_one())

    one.gid = 0
    containers = [ApolloContainer(APOLLO_ROOT, f'ROUTE_{x}') for x in range(2)]

    for ctn in containers:
        ctn.start_instance()
        ctn.start_dreamview()
        print(f'Dreamview at http://{ctn.ip}:{ctn.port}')
    
    srunner = ScenarioRunner(containers)
    
    while True:
        
        one.cid = index
        index += 1
        g_name = f'Generation_{one.gid:05}'
        s_name = f'Scenario_{one.cid:05}'

        srunner.set_scenario(one)
        srunner.init_scenario()
        runners = srunner.run_scenario(g_name, s_name, False)
        
        tdelta = (datetime.now() - start_time).total_seconds()
        if tdelta / 3600 > RUN_FOR_HOUR:
            break

if __name__ == '__main__':
    main()

It still gets stuck, which is bizarre. Furthermore, it is worth emphasizing that we don't have a graphics card, so we built the modified Apollo in CPU-only mode.

@YuqiHuai
Copy link
Collaborator

@lilejin322 Hi! Thank you for providing additional information. The graphics card should not be a problem since I was able to run DoppelTest on a server without a graphics card. Here are some of my thoughts based on this information:

  1. Personally, I have not seen a segmentation fault in Python before. What is the hardware specification of your machine? My initial guess would be that your machine somehow does not have sufficient memory.

  2. The remaining screenshots show the stack trace when you use Ctrl+C to send a KeyboardInterrupt; it seems like the problem is around the lines involving subprocess.run. My suggestion for debugging would be to add an additional log statement before subprocess.run, for example, here and here, to figure out which container is causing DoppelTest to hang, and then manually run the docker exec ...... command in another terminal to see what is going on.

  3. It looks like once the hanging occurs, you can stop DoppelTest and restart it to reproduce the hanging behavior, without needing to run it for another couple of hours. Is that correct?

I apologize for not being able to directly solve this issue, as I have not seen it on any of the 4 machines available to me. If the problem persists, I am happy to talk with you over Zoom and try to figure out the issue on your end 😄

@lilejin322
Copy link
Author

lilejin322 commented Apr 1, 2024

  1. I believe our machine should be sufficient to run this project: CPU: Intel Core i9 14900K (24-core); Memory: 128 GB. Details grabbed from bash:
    cpu_mod
    Furthermore, we've also configured a memory usage limit for each Docker container, as shown below:
    docker_lmt memory

  2. We've implemented a file logger to record the subprocess.run() results. Specifically:

In the main script:

def get_my_logger() -> logging.Logger:
    """
    The distinct logger to diagnose what's wrong with multi-thread tasks
    """
    my_logger = logging.getLogger("my_logger")
    my_logger.setLevel(logging.DEBUG)
    file_handler = logging.FileHandler("my.log")
    file_handler.setLevel(logging.DEBUG)
    formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
    file_handler.setFormatter(formatter)
    my_logger.addHandler(file_handler)
    return my_logger

Then, with the logger attached to each container instance as self.my_logger, we log the result of each subprocess.run(). The original call in stop_recorder:

subprocess.run(
    cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE
)

We modified it as:

result = subprocess.run(
    cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
)
self.my_logger.info(f'In container {self.container_name} stop_recorder command, has stdout {result.stdout.strip()}')
self.my_logger.error(f'In container {self.container_name} stop_recorder command, has error {result.stderr.strip()}')

And the original call in stop_sim_control_standalone:

subprocess.run(
    cmd.split(),
    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
)

We modified it as:

result = subprocess.run(
    cmd.split(),
    stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
)
self.my_logger.info(f'In container {self.container_name} stop_sim_control_standalone command, has stdout {result.stdout.strip()}')
self.my_logger.error(f'In container {self.container_name} stop_sim_control_standalone command, has error {result.stderr.strip()}')

Then the log file is available:
my.log

@YuqiHuai
Copy link
Collaborator

YuqiHuai commented Apr 1, 2024

@lilejin322 thanks for providing more details. Looks like the issue is around stopping the recorder and stopping sim control. Your machine has almost equivalent hardware to one of mine.

Regarding the logs you attached: is this just the end of the file, from when the error started occurring?

@lilejin322
Copy link
Author

No, this is the complete output, not just the tail end. Clearly, the system will crash after a few iterations. 😂


@YuqiHuai
Copy link
Collaborator

YuqiHuai commented Apr 2, 2024

@lilejin322 It is very unusual to have this kind of error, and it would prevent scenarios from being generated correctly. Can we set up a Zoom meeting so I can look at the issue on your machine remotely?

@lilejin322
Copy link
Author

To further investigate the error messages from stop_sim_control_standalone() mentioned in #5 (comment) ("Then the log file is available: my.log"), we slightly modified /apollo/modules/sim_control/script.sh:

function stop() {
  # It seems that we need to print the output step by step in the pipe to see what is going on.
  echo -e "ps -ef is: \n $(ps -ef)"
  echo -e "ps -ef | grep -E \"sim_control_main\" is: \n $(ps -ef | grep -E "sim_control_main")"
  echo -e "ps -ef | grep -E \"sim_control_main\" | grep -v 'grep' is: \n $(ps -ef | grep -E "sim_control_main" | grep -v 'grep')"
  echo -e "ps -ef | grep -E \"sim_control_main\" | grep -v 'grep' | awk '{print \$2}' is: \n $(ps -ef | grep -E "sim_control_main" | grep -v 'grep' | awk '{print $2}')"
  
  ps -ef | grep -E "sim_control_main" | grep -v 'grep' | awk '{print $2}' | xargs kill -9
}

And here is what the file logger recorded:

2024-04-02 20:56:40,148 - INFO - In container apollo_dev_ROUTE_1 stop_sim_control_standalone command, has stdout ps -ef is: 
 UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 20:56 pts/2    00:00:00 /bin/bash
root       142     1  0 20:56 ?        00:00:00 /usr/bin/python3 /apollo/bazel-bin/cyber/tools/cyber_launch/cyber_launch.runfiles/apollo/cyber/tools/cyber_launch/cyber_launch.py start /apollo/modules/monitor/launch/monitor.launch
root       150   142 11 20:56 ?        00:00:00 mainboard -d /apollo/modules/monitor/dag/monitor.dag -p monitor -s CYBER_DEFAULT
root       213     1  1 20:56 ?        00:00:00 /usr/bin/python3 /apollo/bazel-bin/cyber/tools/cyber_launch/cyber_launch.runfiles/apollo/cyber/tools/cyber_launch/cyber_launch.py start /apollo/modules/dreamview/launch/dreamview.launch
root       221   213 22 20:56 ?        00:00:00 /apollo/bazel-bin/modules/dreamview/dreamview --flagfile=/apollo/modules/common/data/global_flagfile.txt
root       299     0  0 20:56 ?        00:00:00 bash /apollo/modules/sim_control/script.sh stop
root       308   299  0 20:56 ?        00:00:00 ps -ef
ps -ef | grep -E "sim_control_main" is: 
 root       311   309  0 20:56 ?        00:00:00 grep -E sim_control_main
ps -ef | grep -E "sim_control_main" | grep -v 'grep' is: 
 
ps -ef | grep -E "sim_control_main" | grep -v 'grep' | awk '{print $2}' is:
2024-04-02 20:56:40,148 - ERROR - In container apollo_dev_ROUTE_1 stop_sim_control_standalone command, has error Usage:
 kill [options] <pid> [...]

Options:
 <pid> [...]            send signal to every <pid> listed
 -<signal>, -s, --signal <signal>
                        specify the <signal> to be sent
 -l, --list=[<signal>]  list all signal names, or convert one to a name
 -L, --table            list all signal names in a nice table

 -h, --help     display this help and exit
 -V, --version  output version information and exit

For more details see kill(1).
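
As a side note, the final ERROR entry is just kill printing its usage text because it received no PID at all: the grep pipeline matched nothing, so xargs invoked kill with an empty argument list. A guarded variant of stop() would avoid the spurious message (a sketch only; GNU xargs' -r/--no-run-if-empty flag would achieve the same), although it only hides the symptom; the real question is why sim_control_main is not running in the first place.

function stop() {
  # Collect matching PIDs first so that "nothing to kill" is handled explicitly.
  local pids
  pids=$(ps -ef | grep -E "sim_control_main" | grep -v 'grep' | awk '{print $2}')
  if [ -z "${pids}" ]; then
    echo "sim_control_main is not running, nothing to stop"
    return 0
  fi
  echo "${pids}" | xargs kill -9
}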

@YuqiHuai
Copy link
Collaborator

YuqiHuai commented Apr 2, 2024

@lilejin322 This is consistent with my understanding. SimControl isn't running at all.

Which Apollo are you using and can you give this https://github.com/YuqiHuai/BaiduApollo/tree/DoppelTest a try?

@lilejin322
Copy link
Author


We tried the Apollo DOI repository provided in README.md:
https://zenodo.org/records/7622977

@YuqiHuai
Copy link
Collaborator

YuqiHuai commented Apr 3, 2024

@lilejin322 That shouldn't have an issue either. Currently, I am not sure what is happening and I'll wait for our meeting to take a closer look.

@lilejin322
Copy link
Author

lilejin322 commented Apr 11, 2024

Conclusion
  If you're using an Intel i9-13900K or i9-14900K processor, make sure to enter the BIOS and disable any settings related to overclocking.
  Because if this hasn’t been done, then when the Apollo stacks run inside those Docker containers, the host machine will attempt to overclock to speed up processing, which can lead to memory leaks, and the segmentation fault shows up in the shell.

Heuristic Source
  Currently, Intel's 13th and 14th generation CPUs with a K suffix have stability issues during overclocking operations. For related news, please refer to the link: Intel investigating games crashing on 13th and 14th Gen Core i9 processors

Owners of Intel’s latest 13th and 14th Gen Core i9 desktop processors have been noticing an increase in game crashes in recent months. It’s happening in games like The Finals, Fortnite, and Tekken 8, and has even led Epic Games to issue a support notice to encourage Intel Core i9 13900K and 14900K owners to adjust BIOS settings.

@YuqiHuai
Copy link
Collaborator

Thank you for the detailed investigation and for figuring out the issue! I am very surprised that the root cause of the issue is CPU overclocking; indeed, none of the machines I have ever used had overclocking turned on.

P.S. The link in your comment isn't working, but I found the media report Intel investigating games crashing on 13th and 14th Gen Core i9 processors
