Skip to content

Conversation

@17-sugiyama
Copy link
Contributor

I tested the recently added hwp_pcu agent, and found that 3 seconds of timeout is sometimes too short to communicate with the HWP PCU module.
This is a request to extend timeout for sending commands/receiving messages.

Description

I extended the timeout for send_command from 3 sec to 10 sec; and for get_status to 10 sec.

Motivation and Context

Improving the operation stability

How Has This Been Tested?

I tested this agent in the lab and it worked.
The recently added agent worked fine except for the timeout issue.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

Copy link
Member

@BrianJKoopman BrianJKoopman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit unusual to me. I believe it could be due to the timeout in the 'acq' process being set to a non-zero value. Try changing the timeout on line 162 to 0 and leaving these timeouts modified here as they were and see if that helps:

with self.lock.acquire_timeout(timeout=3, job='acq') as acquired:

@jlashner
Copy link
Contributor

@BrianJKoopman a bit confused about your comment (I think it was me who suggested non-zero timeouts). Why would a timeout of zero help here?

@jlashner
Copy link
Contributor

I think the problem is the long acq loop time:

time.sleep(5)

While this is sleeping, the lock cannot be acquired by other processes. I think to fix we should make this a fast loop (.1 sec or so) and only take and publish data on iterations where 5 sec have elapsed.

Or we could also restructure the agent to avoid TimeoutLocks.

@BrianJKoopman
Copy link
Member

I think the problem is the long acq loop time:

time.sleep(5)

While this is sleeping, the lock cannot be acquired by other processes. I think to fix we should make this a fast loop (.1 sec or so) and only take and publish data on iterations where 5 sec have elapsed.

Ah, yup, agreed this 5 second sleep is the issue, the task timeouts would need to be longer than that. Sorry for the noise. I've just seen non-zero acq timeouts cause issues in the past, and the example in the docs has a 0 timeout in long running processes.

@jlashner
Copy link
Contributor

I realize it might not be clear what I'm talking about when I say lockless agent restructure, so I threw together what I have in mind in this draft PR for reference which should fix the locking issues. Feel free to use that fix if you would like, but may need some testing:

#600

@17-sugiyama
Copy link
Contributor Author

Thank you so much for your comments and suggestions.
I prefer to use Jack's lockless agent. I tested Jack's agent with the PCU instrument and it worked.
I left some comments to #600.
I'd like to close this pull request.

@17-sugiyama 17-sugiyama deleted the hwp_pcu-agent branch December 20, 2023 14:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants