The execution of the project not only crashes in main_ga, but also hangs midway in other scripts such as main_baseline #5
@lilejin322 Hi, this is a very interesting outcome of running DoppelTest and I have never seen this before. Do you have any screenshots displaying the issue?

Adding on top of that, DreamView does appear to freeze after long hours of experimentation. I reported this to Apollo about a year ago, but I could not provide enough context to help the developers debug (see Apollo Issue #13134). Since DreamView is frozen, sim_control can no longer be enabled/disabled, causing the entire container to appear frozen. In my later projects, e.g., scenoRITA, I decided to separate SimControl from DreamView into a standalone module (see sim_control_standalone), which (1) eliminated the need to run DreamView and (2) attempted to solve the SimControl teleporting issue (see link). I have migrated DoppelTest to Apollo 8.0 using this separated SimControl, but issues remain, so the migration is still in a private repository. Please let me know if you can verify the problem is actually from DreamView.
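(For anyone who wants to check whether DreamView itself has become unresponsive, a minimal probe is sketched below. It assumes the third-party websocket-client package and DreamView's default /websocket endpoint; the helper name dreamview_responsive is our own, not part of DoppelTest.)

# Sketch: probe DreamView's websocket with a short timeout to detect a freeze.
# Assumes `pip install websocket-client` and DreamView's default endpoint path.
import websocket

def dreamview_responsive(ip: str, port: int, timeout: float = 5.0) -> bool:
    url = f'ws://{ip}:{port}/websocket'
    try:
        conn = websocket.create_connection(url, timeout=timeout)
        conn.close()
        return True
    except Exception:
        # Connection refused or timed out: DreamView is down or wedged.
        return False

# Usage: check dreamview_responsive(ctn.ip, ctn.port) before each scenario run.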
The bash shell could only give the thrown segmentation fault error without any additional traceback details, which is weird (a sketch of capturing more detail with faulthandler follows the script below).

Line 33 in 264a06e:

runners = srunner.run_scenario(g_name, s_name, False)

We ran this modification twice, and similarly applied a hardware interrupt to the system when the shell got stuck; screenshots are shown below.

In order to exclude bugs from other auxiliary modules, we slightly modified test_main.py to make the scenario loop infinitely:

from datetime import datetime
from framework.scenario.ad_agents import ADAgent, ADSection
from framework.scenario.pd_agents import PDSection
from framework.scenario.tc_config import TCSection
from apollo.ApolloContainer import ApolloContainer
from config import (APOLLO_ROOT, MAX_ADC_COUNT, RUN_FOR_HOUR)
from framework.scenario import Scenario
from framework.scenario.ScenarioRunner import ScenarioRunner
def main():
    start_time = datetime.now()
    index = 0
    one = Scenario(
        ad_section=ADSection(
            [
                ADAgent(['lane_19', 'lane_25'], 40, 105, 0),
                ADAgent(['lane_25', 'lane_19'], 115, 40, 0),
            ]
        ),
        pd_section=PDSection([]),
        tc_section=TCSection.get_one())
    one.gid = 0
    containers = [ApolloContainer(APOLLO_ROOT, f'ROUTE_{x}') for x in range(2)]
    for ctn in containers:
        ctn.start_instance()
        ctn.start_dreamview()
        print(f'Dreamview at http://{ctn.ip}:{ctn.port}')
    srunner = ScenarioRunner(containers)
    # Re-run the same scenario indefinitely until the time budget expires.
    while True:
        one.cid = index
        index += 1
        g_name = f'Generation_{one.gid:05}'
        s_name = f'Scenario_{one.cid:05}'
        srunner.set_scenario(one)
        srunner.init_scenario()
        runners = srunner.run_scenario(g_name, s_name, False)
        tdelta = (datetime.now() - start_time).total_seconds()
        if tdelta / 3600 > RUN_FOR_HOUR:
            break

if __name__ == '__main__':
    main()

It still gets stuck as well, which is bizarre. Furthermore, it is worth emphasizing that we don't have a graphics card, so we built the modified Apollo in CPU-only mode.
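(Since the shell only prints "Segmentation fault" with no Python traceback, one low-effort way to capture more context is Python's standard-library faulthandler module, which dumps each thread's Python stack when the interpreter receives a fatal signal. The sketch below is a general technique, not DoppelTest code; the log file name is arbitrary, and it only helps if the crash happens inside this Python process, since a segfault inside an Apollo subprocess would need faulthandler enabled there.)

# Sketch: dump Python tracebacks on fatal signals via the stdlib faulthandler.
# Add near the top of test_main.py; the log file name is arbitrary.
import faulthandler

fault_log = open('faulthandler.log', 'w')
faulthandler.enable(file=fault_log, all_threads=True)

# Also dump all thread stacks every 10 minutes as a heartbeat, so a silent
# hang (rather than a crash) leaves a trace in the log as well.
faulthandler.dump_traceback_later(timeout=600, repeat=True, file=fault_log)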
@lilejin322 Hi! Thank you for providing additional information. The graphics card should not be a problem, since I was able to run DoppelTest on a server without a graphics card. I have a few thoughts based on this information, but I apologize for not being able to directly solve this issue, as I have not seen it on any of the 4 machines available to me. If the problem persists, I am happy to talk with you over Zoom and try to figure out the issue on your end 😄
In the main script:

import logging

def get_my_logger() -> logging.Logger:
    """
    A distinct logger to diagnose what's wrong with multi-threaded tasks
    """
    my_logger = logging.getLogger("my_logger")
    my_logger.setLevel(logging.DEBUG)
    file_handler = logging.FileHandler("my.log")
    file_handler.setLevel(logging.DEBUG)
    formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
    file_handler.setFormatter(formatter)
    my_logger.addHandler(file_handler)
    return my_logger

This logger is then attached to each ApolloContainer instance as self.my_logger, so we can log something during subprocess.run():

DoppelTest/apollo/ApolloContainer.py, lines 232 to 234 in 264a06e
We modified it as:

result = subprocess.run(
    cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
)
self.my_logger.info(f'In container {self.container_name} stop_recorder command, has stdout {result.stdout.strip()}')
self.my_logger.error(f'In container {self.container_name} stop_recorder command, has error {result.stderr.strip()}')

DoppelTest/apollo/ApolloContainer.py, lines 253 to 256 in 264a06e
We modified it as:

result = subprocess.run(
    cmd.split(),
    stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
)
self.my_logger.info(f'In container {self.container_name} stop_sim_control_standalone command, has stdout {result.stdout.strip()}')
self.my_logger.error(f'In container {self.container_name} stop_sim_control_standalone command, has error {result.stderr.strip()}')

Then the log file is available:
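(Since the run appears to block inside these subprocess.run calls, a defensive variant is to bound them with a timeout so a wedged command cannot hang the whole experiment. A minimal sketch; the helper name run_logged and the 30-second budget are our own choices, not DoppelTest code.)

# Sketch: run a stop command with a hard time budget and log its output.
import logging
import subprocess

def run_logged(cmd: str, logger: logging.Logger, timeout: float = 30.0) -> None:
    try:
        result = subprocess.run(
            cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE,
            text=True, timeout=timeout,
        )
        logger.info(f'{cmd!r} stdout: {result.stdout.strip()}')
        logger.error(f'{cmd!r} stderr: {result.stderr.strip()}')
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising,
        # so the experiment loop can continue after logging the timeout.
        logger.error(f'{cmd!r} timed out after {timeout}s')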
@lilejin322 Thanks for providing more details. It looks like the issue is around stopping the recorder and stopping sim control. Your machine has almost equivalent hardware to one of mine. Regarding the logs you attached: is this just the end of the file from when the error started occurring?
No, this is the complete output, not just the tail end. Clearly, the system will crash after a few iterations. 😂
@lilejin322 It is very unusual to have this kind of error, and this error would prevent scenarios from being generated correctly. Can we set up a Zoom meeting so I can look at the issue on your machine remotely?
To further investigate the error messages from the stop command, we slightly modified the stop() function:

stop() {
    # It seems that we need to print the output at each stage of the pipe
    # to see what is going on.
    echo -e "ps -ef is: \n $(ps -ef)"
    echo -e "ps -ef | grep -E \"sim_control_main\" is: \n $(ps -ef | grep -E "sim_control_main")"
    echo -e "ps -ef | grep -E \"sim_control_main\" | grep -v 'grep' is: \n $(ps -ef | grep -E "sim_control_main" | grep -v 'grep')"
    echo -e "ps -ef | grep -E \"sim_control_main\" | grep -v 'grep' | awk '{print \$2}' is: \n $(ps -ef | grep -E "sim_control_main" | grep -v 'grep' | awk '{print $2}')"
    ps -ef | grep -E "sim_control_main" | grep -v 'grep' | awk '{print $2}' | xargs kill -9
}

And here is what the file logger recorded:
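(As a side note, the same find-and-kill pipeline can be written in Python with the third-party psutil package, which avoids matching the grep process itself. This is an illustrative sketch, not what Apollo's script does, and it must run in the same environment as the sim_control_main processes.)

# Sketch: Python equivalent of
#   ps -ef | grep sim_control_main | grep -v grep | awk '{print $2}' | xargs kill -9
# Requires `pip install psutil`.
import psutil

def kill_by_cmdline(needle: str) -> int:
    killed = 0
    for proc in psutil.process_iter(['pid', 'cmdline']):
        try:
            cmdline = ' '.join(proc.info['cmdline'] or [])
            if needle in cmdline:
                proc.kill()  # sends SIGKILL, matching `kill -9`
                killed += 1
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue  # process exited or is not ours; skip it
    return killed

# kill_by_cmdline('sim_control_main')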
@lilejin322 This is consistent with my understanding. SimControl isn't running at all. Which Apollo are you using, and can you give https://github.com/YuqiHuai/BaiduApollo/tree/DoppelTest a try?
We tried the Apollo DOI repository provided in README.md.
@lilejin322 That shouldn't have an issue either. Currently, I am not sure what is happening, and I'll wait for our meeting to take a closer look.
Conclusion

The root cause turned out to be CPU overclocking; the heuristic source was a report of games crashing on Intel 13th/14th Gen Core i9 processors (original link broken; see the next comment).
Thank you for the detailed investigation and for figuring out the issue! I am very surprised that the root cause of the issue is CPU overclocking; indeed, none of the machines I have ever used had overclocking turned on. P.S. The link in your comment isn't working, but I found the media report: Intel investigating games crashing on 13th and 14th Gen Core i9 processors.
Sorry for bothering you again. We've encountered some problems while running this project and don't know how to solve them.
Describe the issue
The script test_main.py completes execution smoothly, since it performs only a single iteration. However, when switching to scripts like main_baseline.py and main_ga.py, which require long-running processes, the system appears to be unstable. Specifically, after running the scenario several times, it eventually crashes with a segmentation fault reported in the shell.

We've diagnosed the project using tools like pdb, but still couldn't find where the error comes from. The screenshot is shown below.
Environment
CPU: Intel Core i9 14900K (24-core)
Memory: 128GB
Graphics Card: None
OS: Ubuntu 18.04
Docker-CE: version 24.0.2
Python: version 3.9.18
To reproduce
~$ docker kill $(docker ps -q)

Set the map to san_mateo in config.py, and in {APOLLO_ROOT}/modules/common/data/global_flagfile.txt as well (a sketch of the flag-file entry follows).
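(For reference, the relevant entry in the global flag file would typically look like the line below; --map_dir is a standard Apollo flag, but the exact map path is an assumption based on Apollo's default layout and may differ on your build.)

# Assumed entry in {APOLLO_ROOT}/modules/common/data/global_flagfile.txt;
# the path depends on where the san_mateo map is installed.
--map_dir=/apollo/modules/map/data/san_mateo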
Current Result
The script might report an error and exit after running for several generations.
Expected Result
The system shall exit normally after the timeout specified by the RUN_FOR_HOUR parameter defined in config.py.
Debugging Endeavors
We traced the failure through the following locations:

DoppelTest/main_baseline.py, lines 63 to 65 in 264a06e
DoppelTest/framework/scenario/ScenarioRunner.py, lines 119 to 120 in 264a06e
DoppelTest/framework/scenario/ScenarioRunner.py, line 133 in 264a06e
DoppelTest/framework/scenario/ScenarioRunner.py, line 161 in 264a06e