test(robot): add test case Test Longhorn components recovery #2143
base: master
longhorn/longhorn#9536 Signed-off-by: Chris <chris.chien@suse.com>
cd756b4 to 7597939
Walkthrough

This pull request introduces multiple enhancements across various resource management functionalities in the Longhorn system. New keywords related to backing images, Longhorn components, and share managers have been added to facilitate operations such as deletion, waiting for operational status, and ensuring recovery. Additionally, modifications to existing methods improve their flexibility by allowing namespace specifications. A new test suite has been created to validate the resilience of Longhorn components and volumes under failure conditions, incorporating various test scenarios.
Actionable comments posted: 27
🧹 Outside diff range and nitpick comments (24)
e2e/libs/sharemanager/base.py (1)
21-23: Consider adding type hints and docstring for `wait_for_restart`.
The `wait_for_restart` method would benefit from type hints and documentation, especially for the `last_creation_time` parameter, as its type and format might not be immediately obvious to implementers.
@abstractmethod
- def wait_for_restart(self, name, last_creation_time):
+ def wait_for_restart(self, name: str, last_creation_time: str) -> bool:
+ """Wait for a share manager to restart after the specified creation time.
+
+ Args:
+ name: Name of the share manager
+ last_creation_time: Previous creation timestamp to compare against
+
+ Returns:
+ bool: True if the share manager has restarted, False otherwise
+ """
return NotImplemented
e2e/libs/sharemanager/rest.py (4)
14-15: Add type hints and docstring to `get` method.
The `get` method lacks type hints and documentation. This information is crucial for maintainability and usage understanding.
- def get(self, name):
+ def get(self, name: str) -> dict:
+ """Get share manager details by name.
+
+ Args:
+ name: Name of the share manager
+
+ Returns:
+ Share manager details
+
+ Raises:
+ ApiException: If the API request fails
+ """
17-18: Add type hints and docstring to `delete` method.
The `delete` method lacks type hints and documentation.
- def delete(self, name):
+ def delete(self, name: str) -> None:
+ """Delete a share manager.
+
+ Args:
+ name: Name of the share manager to delete
+
+ Raises:
+ ApiException: If the deletion fails
+ """
20-21: Add type hints and docstring to `wait_for_running` method.
The `wait_for_running` method lacks type hints and documentation.
- def wait_for_running(self, name):
+ def wait_for_running(self, name: str, timeout: int = 300) -> None:
+ """Wait for share manager to reach running state.
+
+ Args:
+ name: Name of the share manager
+ timeout: Maximum time to wait in seconds
+
+ Raises:
+ TimeoutError: If the share manager doesn't reach running state
+ ApiException: If the status check fails
+ """
23-24: Add type hints and docstring to `wait_for_restart` method.
The `wait_for_restart` method lacks type hints and documentation.
- def wait_for_restart(self, name, last_creation_time):
+ def wait_for_restart(self, name: str, last_creation_time: str, timeout: int = 300) -> None:
+ """Wait for share manager to restart.
+
+ Args:
+ name: Name of the share manager
+ last_creation_time: Previous creation timestamp
+ timeout: Maximum time to wait in seconds
+
+ Raises:
+ TimeoutError: If the share manager doesn't restart
+ ApiException: If the status check fails
+ """
e2e/libs/sharemanager/sharemanager.py (1)
20-31: Consider adding docstrings for better maintainability.
While the implementation is clean and follows the strategy pattern correctly, adding docstrings would improve maintainability and help other developers understand the purpose and expected behavior of each method.
Example improvement:
def delete(self, name):
+ """Delete a share manager instance.
+
+ Args:
+ name (str): Name of the share manager to delete
+
+ Returns:
+ The result from the underlying implementation
+ """
return self.sharemanager.delete(name)
e2e/keywords/backing_image.resource (1)
29-30: Consider adding documentation for the keyword.
The keyword is well-implemented, but adding documentation would help users understand its purpose and expected behavior.
-Wait backing image managers running
+Wait backing image managers running
+ [Documentation] Waits until all backing image managers are in running state.
e2e/libs/backing_image/base.py (1)
38-40: Consider using plural form in method name for consistency.
The method `list_backing_image_manager` returns a collection of managers, so consider renaming it to `list_backing_image_managers` for consistency with typical naming conventions for methods returning collections.
- def list_backing_image_manager(self):
+ def list_backing_image_managers(self):
return NotImplemented
e2e/keywords/sharemanager.resource (3)
24-27: Add documentation for the new keyword.
The implementation looks good and aligns with the PR objectives. Consider adding documentation to describe the purpose, parameters, and expected behavior of this keyword.
Delete sharemanager of deployment ${deployment_id} and wait for recreation
+ [Documentation] Deletes the sharemanager associated with the given deployment ID and waits for it to be recreated.
+ ...
+ ... Arguments:
+ ... - deployment_id: The ID of the deployment whose sharemanager should be deleted
${deployment_name} = generate_name_with_suffix deployment ${deployment_id}
${volume_name} = get_workload_volume_name ${deployment_name}
delete_sharemanager_and_wait_for_recreation ${volume_name}
29-32: Add documentation and consider timeout parameter.
The implementation looks good but could benefit from some enhancements:
- Add documentation to describe the keyword's purpose and parameters
- Consider adding a timeout parameter to control how long to wait
Wait for sharemanager of deployment ${deployment_id} running
+ [Documentation] Waits for the sharemanager associated with the given deployment ID to be in running state.
+ ...
+ ... Arguments:
+ ... - deployment_id: The ID of the deployment whose sharemanager should be monitored
+ [Arguments] ${deployment_id} ${timeout}=300
${deployment_name} = generate_name_with_suffix deployment ${deployment_id}
${volume_name} = get_workload_volume_name ${deployment_name}
- wait_for_share_manager_running ${volume_name}
+ wait_for_share_manager_running ${volume_name} timeout=${timeout}
24-32: Implementation aligns well with PR objectives.
The new keywords provide essential functionality for testing sharemanager recovery scenarios:
- `Delete sharemanager of deployment` enables testing recovery by triggering sharemanager recreation
- `Wait for sharemanager of deployment running` allows verification of successful recovery
These additions will effectively support the automation of Longhorn components recovery test cases as outlined in the PR objectives.
Consider adding error handling keywords to handle cases where recovery fails or times out, which would make the test suite more robust.
e2e/libs/keywords/backing_image_keywords.py (1)
23-42: Consider implementing a retry mechanism for resilience testing.
Since this code is part of component recovery testing, consider implementing a retry mechanism with exponential backoff for the wait operations. This would make the tests more resilient and better simulate real-world recovery scenarios.
Key recommendations:
- Create a common retry decorator/utility for all wait operations
- Add configurable retry parameters (max attempts, backoff factor)
- Implement detailed logging of retry attempts for test debugging
- Consider adding assertions about the time taken for recovery
Would you like me to provide an example implementation of the retry mechanism?
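The retry recommendation above can be sketched as a small decorator. This is a minimal sketch in plain Python, not part of the existing framework; the names (`retry`, `backoff_factor`, `initial_delay`) are illustrative assumptions:

```python
import time
import functools

def retry(max_attempts=5, backoff_factor=2.0, initial_delay=0.1, exceptions=(Exception,)):
    """Retry a wait operation with exponential backoff.

    Hypothetical helper; parameter names are illustrative,
    not part of the existing test framework.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as e:
                    if attempt == max_attempts:
                        raise TimeoutError(
                            f"{func.__name__} failed after {max_attempts} attempts: {e}")
                    # log each failed attempt for test debugging, then back off
                    print(f"Attempt {attempt} failed ({e}); retrying in {delay:.1f}s")
                    time.sleep(delay)
                    delay *= backoff_factor
        return wrapper
    return decorator
```

A wait operation could then be decorated with `@retry(max_attempts=10, backoff_factor=1.5)` to get configurable, logged retries without duplicating loop logic.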
e2e/libs/keywords/sharemanager_keywords.py (3)
51-52: Add documentation and error handling.
Consider adding a docstring and basic error handling to improve test maintainability and debugging:
def delete_sharemanager(self, name):
+ """Delete a share manager instance by name.
+
+ Args:
+ name (str): Name of the share manager to delete
+
+ Returns:
+ The result of the deletion operation
+
+ Raises:
+ Exception: If deletion fails
+ """
+ try:
return self.sharemanager.delete(name)
+ except Exception as e:
+ logging(f"Failed to delete share manager {name}: {str(e)}")
+ raise
60-61: Add documentation and error handling for wait operation.
Consider adding a docstring and error handling to improve test reliability:
def wait_for_share_manager_running(self, name):
+ """Wait for a share manager to reach running state.
+
+ Args:
+ name (str): Name of the share manager
+
+ Raises:
+ TimeoutError: If the share manager doesn't reach running state
+ ValueError: If name is empty
+ """
+ if not name:
+ raise ValueError("Share manager name cannot be empty")
+
+ try:
return self.sharemanager.wait_for_running(name)
+ except Exception as e:
+ logging(f"Failed waiting for share manager {name} to run: {str(e)}")
+ raise
50-61: Consider architectural improvements for better maintainability.
The new methods are well-integrated, but consider these improvements:
- Extract common timeout and wait logic into a base method to avoid duplication
- Add integration tests to verify the recovery scenarios
- Consider using a configuration object for timeouts and retry settings
Example of a base wait method:
def _wait_with_timeout(self, operation, timeout=300, interval=2):
"""Base method for wait operations with timeout.
Args:
operation (callable): Function to execute
timeout (int): Maximum wait time in seconds
interval (int): Sleep interval between retries
"""
start_time = time.time()
while time.time() - start_time < timeout:
try:
return operation()
except Exception as e:
if time.time() - start_time >= timeout:
raise TimeoutError(f"Operation timed out: {str(e)}")
time.sleep(interval)
e2e/libs/sharemanager/crd.py (1)
63-66: Consider extracting timestamp comparison logic.
The datetime parsing and comparison logic could be moved to a utility function for reuse across other test cases, especially since this PR involves multiple recovery test scenarios.
Consider creating a utility function like:
from datetime import datetime

def is_newer_timestamp(new_time: str, old_time: str, fmt: str = "%Y-%m-%dT%H:%M:%SZ") -> bool:
    return datetime.strptime(new_time, fmt) > datetime.strptime(old_time, fmt)
e2e/libs/keywords/k8s_keywords.py (1)
83-84: Consider adding a docstring for better maintainability.
The method implementation looks good and follows the class's pattern of wrapping k8s module functions. However, adding a docstring would improve maintainability by documenting the method's purpose and parameters.
Consider adding documentation like this:
def wait_for_namespace_pods_running(self, namespace):
+ """Wait for all pods in the specified namespace to be in running state.
+
+ Args:
+ namespace (str): The namespace to check for running pods
+
+ Returns:
+ bool: True if all pods are running, False otherwise
+ """
return wait_for_namespace_pods_running(namespace)
e2e/libs/backing_image/rest.py (1)
113-113: Fix whitespace consistency.
There are extra blank lines around the new methods.
Apply this diff to maintain consistent spacing:
-
def delete_backing_image_manager(self, name):
e2e/libs/keywords/workload_keywords.py (2)
64-70: LGTM: Enhanced pod selection capabilities.
The addition of namespace and label_selector parameters improves the flexibility of pod selection and deletion.
Consider adding docstring to document the parameters, especially the format expected for label_selector:
def delete_workload_pod_on_node(self, workload_name, node_name, namespace="default", label_selector=""):
"""Delete workload pod on specific node.
Args:
workload_name (str): Name of the workload
node_name (str): Name of the node
namespace (str, optional): Kubernetes namespace. Defaults to "default"
label_selector (str, optional): Kubernetes label selector (e.g. "app=nginx"). Defaults to ""
"""
49-51: LGTM: Consistent namespace parameter implementation.
The addition of namespace parameters across methods follows a consistent pattern and improves the test framework's flexibility while maintaining backward compatibility.
Consider creating a base class or configuration object to store common parameters like default namespace. This would make it easier to modify defaults across all methods and reduce parameter repetition.
Example:
class WorkloadConfig:
DEFAULT_NAMESPACE = "default"
class workload_keywords:
def __init__(self):
self.config = WorkloadConfig()
# ... rest of init ...
def delete_pod(self, pod_name, namespace=None):
namespace = namespace or self.config.DEFAULT_NAMESPACE
# ... rest of method ...
Also applies to: 64-70, 71-72
e2e/keywords/workload.resource (3)
190-201: Consider improving maintainability and consistency.
A few suggestions to enhance the code:
- Remove unnecessary empty lines for consistency with the rest of the file.
- Consider using a mapping for label selectors to improve maintainability.
Apply this diff to implement the suggestions:
Delete Longhorn ${workload_kind} ${workload_name} pod on node ${node_id}
-
${node_name} = get_node_by_index ${node_id}
-
IF '${workload_name}' == 'engine-image'
${label_selector} = Set Variable longhorn.io/component=engine-image
ELSE IF '${workload_name}' == 'instance-manager'
${label_selector} = Set Variable longhorn.io/component=instance-manager
ELSE
${label_selector} = Set Variable ${EMPTY}
END
delete_workload_pod_on_node ${workload_name} ${node_name} longhorn-system ${label_selector}
Additionally, consider creating a variable at the top of the file to map workload names to their label selectors:
*** Variables ***
&{LONGHORN_COMPONENT_LABELS} engine-image=longhorn.io/component=engine-image instance-manager=longhorn.io/component=instance-manager
Then simplify the keyword:
Delete Longhorn ${workload_kind} ${workload_name} pod on node ${node_id}
${node_name} = get_node_by_index ${node_id}
${label_selector} = Get From Dictionary ${LONGHORN_COMPONENT_LABELS} ${workload_name} ${EMPTY}
delete_workload_pod_on_node ${workload_name} ${node_name} longhorn-system ${label_selector}
202-205: Add documentation and error handling.
The keyword would benefit from:
- Documentation explaining its purpose and usage.
- Error handling for cases where the pod doesn't exist.
- Verification that the pod was successfully deleted.
Apply this diff to implement the suggestions:
Delete Longhorn ${workload_kind} ${workload_name} pod
+ [Documentation] Deletes a Longhorn pod of specified workload kind and name from the longhorn-system namespace.
+ ... Logs the pod name before deletion and verifies successful deletion.
+ ...
+ ... Arguments:
+ ... - workload_kind: The kind of workload (e.g., deployment, statefulset)
+ ... - workload_name: The name of the workload to delete
${pod_name} = get_workload_pod_name ${workload_name} longhorn-system
+ Should Not Be Empty ${pod_name} msg=No pod found for workload ${workload_name}
Log ${pod_name}
delete_pod ${pod_name} longhorn-system
+ Wait Until Keyword Succeeds 30s 5s Should Not Exist pod ${pod_name} longhorn-system
190-205: Consider adding more verification steps for component recovery testing.
Given that these keywords are part of automating test cases for Longhorn components recovery, consider adding more verification steps:
- Verify that all associated resources are cleaned up after pod deletion.
- Add wait conditions to ensure the system is in a known state before proceeding with recovery tests.
Would you like me to provide examples of additional verification steps that could be added?
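The post-deletion verification discussed above can be reduced to a small polling helper. This is a hedged sketch: `wait_until_gone` and the `get_resource` callable are illustrative stand-ins, not the framework's existing API; in practice the callable would wrap the real pod lookup.

```python
import time

def wait_until_gone(get_resource, timeout=30, interval=1):
    """Poll until get_resource() reports the resource is gone.

    get_resource should return None (or raise a not-found error)
    once the pod no longer exists; names here are illustrative.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if get_resource() is None:
                return True
        except Exception:
            # a not-found error also means deletion completed
            return True
        time.sleep(interval)
    raise TimeoutError("resource still present after deletion")
```

A test could call this right after `delete_pod` to ensure the system is in a known state before the recovery assertions run.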
e2e/libs/workload/workload.py (1)
Line range hint 24-45: Consider standardizing namespace handling across all functions.
While the core pod retrieval functions now support custom namespaces, several other functions in this file still hardcode the 'default' namespace (e.g., write_pod_random_data, write_pod_large_data). Consider:
- Adding namespace parameters consistently across all pod-related functions
- Creating a module-level default namespace configuration
- Updating all exec/stream operations to use the specified namespace
Example pattern to consider:
# At module level
DEFAULT_NAMESPACE = "default"
def write_pod_random_data(pod_name, size_in_mb, file_name,
data_directory="/data", namespace=DEFAULT_NAMESPACE):
# ... use namespace parameter in api calls ...
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (19)
- e2e/keywords/backing_image.resource (1 hunks)
- e2e/keywords/longhorn.resource (1 hunks)
- e2e/keywords/sharemanager.resource (1 hunks)
- e2e/keywords/workload.resource (1 hunks)
- e2e/libs/backing_image/backing_image.py (2 hunks)
- e2e/libs/backing_image/base.py (1 hunks)
- e2e/libs/backing_image/crd.py (1 hunks)
- e2e/libs/backing_image/rest.py (1 hunks)
- e2e/libs/k8s/k8s.py (3 hunks)
- e2e/libs/keywords/backing_image_keywords.py (1 hunks)
- e2e/libs/keywords/k8s_keywords.py (2 hunks)
- e2e/libs/keywords/sharemanager_keywords.py (1 hunks)
- e2e/libs/keywords/workload_keywords.py (2 hunks)
- e2e/libs/sharemanager/base.py (1 hunks)
- e2e/libs/sharemanager/crd.py (2 hunks)
- e2e/libs/sharemanager/rest.py (1 hunks)
- e2e/libs/sharemanager/sharemanager.py (1 hunks)
- e2e/libs/workload/workload.py (1 hunks)
- e2e/tests/negative/component_resilience.robot (1 hunks)
🧰 Additional context used
🪛 Ruff
e2e/libs/backing_image/crd.py
57-57: Loop control variable `i` not used within loop body; rename unused `i` to `_i` (B007)
69-69: Do not `assert False` (`python -O` removes these calls); raise `AssertionError()` instead (B011)
69-69: f-string without any placeholders; remove extraneous `f` prefix (F541)
72-72: Loop control variable `i` not used within loop body; rename unused `i` to `_i` (B007)
91-91: Do not `assert False` (`python -O` removes these calls); raise `AssertionError()` instead (B011)
e2e/libs/k8s/k8s.py
8-8: `workload.pod.wait_for_pod_status` imported but unused; remove unused import (F401)
9-9: `workload.pod.get_pod` imported but unused; remove unused import (F401)
178-178: Loop control variable `i` not used within loop body; rename unused `i` to `_i` (B007)
195-195: Do not `assert False` (`python -O` removes these calls); raise `AssertionError()` instead (B011)
e2e/libs/sharemanager/crd.py
44-44: Loop control variable `i` not used within loop body; rename unused `i` to `_i` (B007)
52-52: Do not `assert False` (`python -O` removes these calls); raise `AssertionError()` instead (B011)
55-55: Loop control variable `i` not used within loop body; rename unused `i` to `_i` (B007)
68-68: Do not `assert False` (`python -O` removes these calls); raise `AssertionError()` instead (B011)
🔇 Additional comments (20)
e2e/libs/sharemanager/base.py (1)
8-23: LGTM! Well-structured abstract interface for share manager operations.
The new abstract methods provide a clean and comprehensive interface for share manager lifecycle operations, which aligns well with the PR's objective of testing component recovery. The method signatures are clear and follow consistent patterns.
e2e/libs/sharemanager/sharemanager.py (3)
21-22: LGTM! Clean delegation to strategy implementation.
The delete method follows the strategy pattern correctly and maintains a clean interface.
24-25: LGTM! Consistent with recovery testing requirements.
The wait_for_running method aligns well with the PR's objective of testing component recovery.
27-28: LGTM! Simple and focused getter implementation.
The get method provides a clean interface to retrieve share manager instances.
e2e/keywords/backing_image.resource (1)
26-27: LGTM! The keyword follows Robot Framework conventions.
The keyword is well-named and properly maps to its underlying implementation for testing backing image manager recovery.
e2e/libs/backing_image/base.py (1)
33-48: Well-structured additions for recovery testing!
The new abstract methods form a comprehensive interface for managing backing image managers, which aligns well with the PR's objective of testing Longhorn components recovery. The methods provide the necessary operations for:
- Monitoring manager status (`wait_all_backing_image_managers_running`)
- Managing lifecycle (`delete_backing_image_manager`, `wait_backing_image_manager_restart`)
- Retrieving state (`list_backing_image_manager`)
These additions will enable thorough testing of recovery scenarios.
e2e/keywords/sharemanager.resource (1)
23-23: LGTM! Good spacing.
The added empty line improves readability by properly separating keyword definitions.
e2e/libs/keywords/backing_image_keywords.py (1)
36-42: Improve robustness of the manager deletion and recreation process.
The current implementation has several potential issues:
- No error handling for API responses
- Possible race conditions during iteration
- No timeout for the complete operation
- Lack of logging for test debugging
Here's a suggested improvement:
def delete_all_backing_image_managers_and_wait_for_recreation(self):
+ """Delete all backing image managers and ensure they are recreated.
+
+ Raises:
+ TimeoutError: If operation exceeds timeout
+ RuntimeError: If API calls fail
+ """
+ try:
backing_image_managers = self.backing_image.list_backing_image_manager()
+ if not backing_image_managers.get("items"):
+ return
+
+ # Store all managers first to avoid race conditions
+ managers_to_delete = [
+ (m["metadata"]["name"], m["metadata"]["creationTimestamp"])
+ for m in backing_image_managers["items"]
+ ]
+
- for backing_image in backing_image_managers["items"]:
- name = backing_image["metadata"]["name"]
- last_creation_time = backing_image["metadata"]["creationTimestamp"]
+ for name, last_creation_time in managers_to_delete:
+ print(f"Deleting backing image manager: {name}")
self.backing_image.delete_backing_image_manager(name)
self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)
+ except Exception as e:
+ raise RuntimeError(f"Failed to manage backing image managers: {str(e)}")
Let's verify if there are any existing error handling patterns in the codebase:
✅ Verification successful
Let me gather more information about the logging patterns and timeout handling in the codebase.
Based on the analysis of the codebase, I can now provide the final response:
Add error handling and logging to backing image manager operations
The codebase shows consistent patterns of error handling and logging across other components. Let's align with these patterns:
def delete_all_backing_image_managers_and_wait_for_recreation(self):
+ """Delete all backing image managers and wait for their recreation.
+
+ Raises:
+ RuntimeError: If API calls fail
+ """
+ logging(f'Cleaning up backing image managers')
+ try:
backing_image_managers = self.backing_image.list_backing_image_manager()
+ if not backing_image_managers.get("items"):
+ return
+
for backing_image in backing_image_managers["items"]:
name = backing_image["metadata"]["name"]
last_creation_time = backing_image["metadata"]["creationTimestamp"]
+ logging(f'Deleting backing image manager: {name}')
self.backing_image.delete_backing_image_manager(name)
self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)
+ except Exception as e:
+ raise RuntimeError(f"Failed to manage backing image managers: {str(e)}")
The codebase shows:
- Consistent error handling with try-except blocks raising RuntimeError
- Extensive use of logging for operations and their progress
- Null checks before processing lists
- No explicit timeout handling in similar wait operations
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for error handling patterns in similar test files
rg -A 3 "try:|except:|raise" "e2e/libs/keywords/"
Length of output: 1531
Script:
#!/bin/bash
# Search for logging patterns
rg -A 2 "logging\(" "e2e/libs/keywords/"
# Search for timeout patterns and wait operations
rg -A 2 "wait.*timeout|timeout.*wait" "e2e/libs/keywords/"
Length of output: 25649
e2e/libs/backing_image/backing_image.py (1)
3-3: LGTM: Import statement is correctly placed.
The CRD import is appropriately grouped with other backing_image related imports.
e2e/libs/sharemanager/crd.py (1)
2-14: LGTM: Well-structured initialization with configurable retries.
The addition of retry configuration through `get_retry_count_and_interval()` is a good practice for test automation, allowing for environment-specific tuning.
e2e/libs/keywords/k8s_keywords.py (1)
12-12: LGTM!
The import follows the established pattern and is properly placed with other k8s imports.
e2e/keywords/longhorn.resource (1)
70-84: Verify test coverage for recovery scenarios.
The new keywords align well with the PR objectives for testing Longhorn components recovery. Let's verify the test coverage:
✅ Verification successful
Test coverage for recovery scenarios is comprehensive
The implementation provides thorough test coverage for Longhorn components recovery:
- Test cases cover all key recovery scenarios:
- Component recovery (CSI plugins, managers, UI, etc.)
- Volume recovery for different volume types (RWO, RWX)
- Backing image volume recovery
- Recovery during replica rebuilding
- Share manager recovery for RWX volumes
The new keywords are effectively utilized in well-structured test cases that verify both the failure injection and recovery verification, with data integrity checks.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if there are corresponding test files using these keywords
# and verify the test scenarios cover all recovery cases mentioned in PR objectives
echo "Checking for test files using the new keywords..."
rg -l "Delete instance-manager of (volume|deployment)" --type robot
echo "Verifying test scenarios for different volume types..."
rg -l "Test.*recovery.*(RWO|RWX|backing image)" --type robot
Length of output: 406
Script:
#!/bin/bash
# Let's examine the actual test scenarios in the component_resilience.robot file
echo "Checking test scenarios in component_resilience.robot..."
cat e2e/tests/negative/component_resilience.robot
# Also check for any other related test files that might be using these keywords
echo -e "\nChecking for other potential test files..."
rg -l "longhorn.resource" --type robot
Length of output: 11001
e2e/libs/backing_image/rest.py (1)
114-124: Implement backing image manager methods following existing patterns.
The newly added methods are currently returning `NotImplemented`. Based on the existing patterns in the class and the PR objectives for testing Longhorn components recovery, these methods should be implemented.
Here's a suggested implementation following the class patterns:
def delete_backing_image_manager(self, name):
- return NotImplemented
+ logging(f"Deleting backing image manager {name}")
+ manager = get_longhorn_client().by_id_backing_image_manager(name)
+ get_longhorn_client().delete(manager)
def wait_all_backing_image_managers_running(self):
- return NotImplemented
+ for i in range(self.retry_count):
+ logging(f"Waiting for all backing image managers running... ({i})")
+ managers = get_longhorn_client().list_backing_image_manager()
+ all_running = True
+ for manager in managers:
+ if manager.status.state != "running":
+ all_running = False
+ break
+ if all_running:
+ return
+ time.sleep(self.retry_interval)
+ assert False, "Timeout waiting for all backing image managers running"
def wait_backing_image_manager_restart(self, name, last_creation_time):
- return NotImplemented
+ for i in range(self.retry_count):
+ logging(f"Waiting for backing image manager {name} restart... ({i})")
+ manager = get_longhorn_client().by_id_backing_image_manager(name)
+ if manager.metadata.creationTimestamp > last_creation_time:
+ return
+ time.sleep(self.retry_interval)
+ assert False, f"Timeout waiting for backing image manager {name} restart"
def list_backing_image_manager(self):
- return NotImplemented
+ return get_longhorn_client().list_backing_image_manager()
The implementation:
- Follows existing error handling and logging patterns
- Uses the retry mechanism consistently
- Maintains similar assertion patterns for timeouts
- Utilizes the Longhorn client methods for operations
Let's verify the Longhorn client API methods exist:
e2e/libs/keywords/workload_keywords.py (2)
49-51: LGTM: Namespace parameter addition is well-implemented.
The addition of the namespace parameter with a default value maintains backward compatibility while enabling better resource isolation for tests.
71-72: Consider handling multiple pods scenario.
While the namespace parameter addition is good, returning the first pod from the list might be problematic if there are multiple pods and the order matters.
Consider either:
- Documenting that this method should only be used with single-pod workloads, or
- Adding a parameter to specify which pod to return
Let's verify if this method is used with multi-pod workloads:
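The second option (a parameter to specify which pod to return) might look like the following sketch. The helper name, the `index` parameter, and the pod-list shape (dicts carrying `metadata.name`) are assumptions for illustration, not the framework's actual data model:

```python
def select_workload_pod_name(pods, index=0):
    """Select a pod name deterministically from a workload's pod list.

    Sorting by name makes the choice stable when more than one pod
    exists; `pods` is assumed to be a list of dicts with metadata.name.
    """
    if not pods:
        raise ValueError("workload has no pods")
    ordered = sorted(pods, key=lambda p: p["metadata"]["name"])
    if index >= len(ordered):
        raise IndexError(f"pod index {index} out of range ({len(ordered)} pods)")
    return ordered[index]["metadata"]["name"]
```

With a default of `index=0`, existing single-pod callers keep working, while multi-pod tests can address a specific pod explicitly.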
e2e/libs/workload/workload.py (3)
17-18: LGTM: Clean namespace parameter addition.
The namespace parameter addition with a default value maintains backward compatibility while enabling cross-namespace pod operations.
Line range hint 24-45: Verify RBAC permissions for cross-namespace operations.
The addition of cross-namespace support requires appropriate RBAC permissions. Please ensure:
- The test service account has necessary permissions to access pods across namespaces
- The test framework validates namespace accessibility before operations
#!/bin/bash
# Description: Check for RBAC configurations and namespace-related permissions
# Search for RBAC definitions
rg "Role|ClusterRole|ServiceAccount" --type yaml -B 5 -A 5
# Look for namespace-related test configurations
rg "namespace.*test" --type py -B 2 -A 2
Line range hint 24-45: LGTM: Namespace support properly implemented.
The namespace parameter is correctly integrated with proper error handling and logging.
Let's verify that all callers of this function handle the namespace parameter correctly:
✅ Verification successful
The previous search might have failed due to syntax. Let's try a different approach to find the callers and verify the namespace parameter usage.
Based on the shell script results, I can now provide a final response:
All callers properly handle the namespace parameter
The code analysis shows that all callers of `get_workload_pods()` properly handle the namespace parameter:
- `get_workload_pod_names()` correctly propagates the namespace parameter
- `wait_for_workload_pods_running()` properly passes the namespace parameter
- `wait_for_workload_pods_stable()` correctly uses the namespace parameter
- `wait_for_workload_pod_kept_in_state()` properly handles the namespace parameter
- `is_workload_pods_has_annotations()` correctly uses the namespace parameter
The function's default namespace value of "default" maintains backward compatibility while allowing explicit namespace specification when needed. All callers either use the default value or explicitly pass the namespace parameter.
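As a hedged illustration of the pattern being verified here (the helper bodies and pod names below are hypothetical stand-ins, not the actual e2e library code):

```python
def list_pods(namespace):
    # Hypothetical stand-in for core_v1_api.list_namespaced_pod(namespace)
    pods = {
        "default": ["web-0", "web-1", "db-0"],
        "longhorn-system": ["longhorn-manager-abc", "longhorn-ui-xyz"],
    }
    return pods.get(namespace, [])

def get_workload_pods(workload_name, namespace="default"):
    # Default value keeps existing callers working; an explicit namespace
    # enables cross-namespace operations (e.g. "longhorn-system").
    return [p for p in list_pods(namespace) if p.startswith(workload_name)]

def get_workload_pod_names(workload_name, namespace="default"):
    # Callers propagate the namespace rather than hard-coding it.
    return list(get_workload_pods(workload_name, namespace))

print(get_workload_pod_names("web"))  # → ['web-0', 'web-1']
print(get_workload_pod_names("longhorn-manager", "longhorn-system"))  # → ['longhorn-manager-abc']
```

The key point is that every wrapper forwards `namespace` instead of pinning it, which is what the caller audit above confirms.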
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any direct calls to get_workload_pods that might need namespace parameter updates
# Search for function calls
rg "get_workload_pods\(" --type py -B 2 -A 2
# Search for potential namespace-related issues in test files
rg "namespace.*=.*default" --type py -B 2 -A 2
Length of output: 6894
Script:
#!/bin/bash
# Let's try a more specific search for get_workload_pods calls
rg "get_workload_pods\(" --type py -l | xargs cat
# Also search for the function definition to understand its full context
rg "def get_workload_pods" --type py -B 5 -A 20
Length of output: 27502
e2e/tests/negative/component_resilience.robot (2)
121-123: Consistent use of replica node variable
Ensure that the variable replica node, used when deleting volume replicas, is consistently defined and accessible within the test scope.
Confirm that replica node refers to a valid and correctly assigned node in the test environment.
Also applies to: 146-148, 174-176, 202-204
81-91: Ensure node selection is appropriately defined
In the steps where Longhorn component pods are deleted (lines 81-91), the node is specified as node 1. Verify that node 1 accurately references an existing node in your cluster.
Run the following script to list all nodes in the cluster and confirm node 1 exists:
@abstractmethod
def wait_for_running(self, name):
    return NotImplemented
🛠️ Refactor suggestion
Consider adding a timeout parameter to wait_for_running.
For robustness in test scenarios, consider adding an optional timeout parameter to control how long the method should wait before giving up.
 @abstractmethod
-def wait_for_running(self, name):
+def wait_for_running(self, name: str, timeout: int = 300) -> bool:
+    """Wait for a share manager to reach running state.
+
+    Args:
+        name: Name of the share manager
+        timeout: Maximum time to wait in seconds (default: 300)
+
+    Returns:
+        bool: True if running state is reached, False if timeout occurs
+    """
     return NotImplemented
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
@abstractmethod
def wait_for_running(self, name: str, timeout: int = 300) -> bool:
    """Wait for a share manager to reach running state.

    Args:
        name: Name of the share manager
        timeout: Maximum time to wait in seconds (default: 300)

    Returns:
        bool: True if running state is reached, False if timeout occurs
    """
    return NotImplemented
def get(self, name):
    return NotImplemented

def delete(self, name):
    return NotImplemented

def wait_for_running(self, name):
    return NotImplemented

def wait_for_restart(self, name, last_creation_time):
    return NotImplemented
Implementation required: Methods return NotImplemented
All newly added methods currently return NotImplemented. This seems inconsistent with the PR objectives, which state that the tests have been executed locally. These methods are crucial for testing Longhorn components recovery and should be properly implemented.
Please implement these methods using the longhorn_client. The implementation should:
- Handle API interactions with the share manager
- Include appropriate error handling
- Follow similar patterns as the existing list method

Example implementation structure for get:
def get(self, name: str) -> dict:
    """Get share manager details by name.

    Args:
        name: Name of the share manager

    Returns:
        Share manager details

    Raises:
        ApiException: If the API request fails
    """
    return self.longhorn_client.get_share_manager(name)
💡 Codebase verification
Error handling strategy needs to be implemented in Rest class
The review comment is valid. The code inspection reveals:
- The base class defines abstract methods without error handling guidance
- The Rest class inherits from Base and uses longhorn_client but has unimplemented methods returning NotImplemented
- The CRD implementation shows error handling patterns with try-except blocks and logging
- No error handling exists for API calls, timeouts, or resource states in the Rest class
Key areas needing error handling:
- The list() method's longhorn_client.list_share_manager() call
- Unimplemented methods should handle API errors and resource states
- Wait operations need timeout handling and state validation
- Consistent error logging similar to the CRD implementation
🔗 Analysis chain
Define error handling strategy
The methods need a clear error handling strategy for various scenarios:
- API errors from longhorn_client
- Timeouts during wait operations
- Resource not found cases
- Invalid state transitions
Let's verify the error handling in the base class:
Consider implementing a common error handling decorator or base method that can:
- Handle common API exceptions
- Implement retries for transient failures
- Provide consistent error messages
- Log relevant details for debugging
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check error handling patterns in the base class
rg -A 5 "raise|except" "e2e/libs/sharemanager/base.py"
Length of output: 54
Script:
#!/bin/bash
# Let's first check if the file exists and see its content
fd "base.py" "e2e/libs/sharemanager" --exec cat {}
# Also check the rest.py implementation for context
fd "rest.py" "e2e/libs/sharemanager" --exec cat {}
# Look for any error handling patterns in related files
rg -A 5 "raise|except|error" "e2e/libs/sharemanager/"
Length of output: 1736
def wait_for_restart(self, name, last_creation_time):
    return self.sharemanager.wait_for_restart(name, last_creation_time)
💡 Codebase verification
Based on the results, I can now generate the final response since I have sufficient information about the restart detection implementations across different classes.
Critical: REST strategy has unimplemented restart detection
The verification revealed important implementation gaps:
- The REST strategy (rest.py) returns NotImplemented for wait_for_restart
- Only the CRD strategy (crd.py) has a concrete implementation with retry logic and error handling
- The base class (base.py) defines the interface but returns NotImplemented
This could lead to runtime failures when using the REST strategy for recovery testing. The CRD implementation looks solid with proper retry and error handling, but the REST strategy needs to be implemented to match this behavior.
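In spirit, the CRD-side restart detection described above polls the object's creationTimestamp until it differs from the recorded one. A minimal sketch, assuming a pluggable getter and illustrative retry/interval values (not the actual Longhorn implementation):

```python
import time

def wait_for_restart(get_creation_time, name, last_creation_time,
                     retries=10, interval=0.01):
    """Return True once the resource is recreated (new creationTimestamp).

    get_creation_time: callable returning the current creationTimestamp,
    raising KeyError while the object is temporarily absent.
    """
    for _ in range(retries):
        try:
            if get_creation_time(name) != last_creation_time:
                return True  # deleted and recreated
        except KeyError:
            pass  # object missing mid-restart; keep polling
        time.sleep(interval)
    raise TimeoutError(f"{name} did not restart within {retries * interval:.2f}s")

# Simulated sequence: the old timestamp twice, then a new one after recreation
timestamps = iter(["2024-01-01T00:00:00Z", "2024-01-01T00:00:00Z", "2024-01-01T00:05:00Z"])
print(wait_for_restart(lambda name: next(timestamps), "backing-image-manager-0",
                       "2024-01-01T00:00:00Z"))  # → True
```

Porting this loop shape into the REST strategy would bring it in line with the CRD behavior the review found solid.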
🔗 Analysis chain
Verify restart detection logic in strategy implementations.
The wait_for_restart method seems critical for recovery testing. Let's verify the underlying implementation handles edge cases properly.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check the restart detection implementation in strategy classes
# Look for wait_for_restart implementations in CRD and Rest classes
echo "Checking CRD implementation:"
ast-grep --pattern 'def wait_for_restart($_, $_) { $$$ }'
echo -e "\nChecking error handling patterns:"
rg -A 5 "wait_for_restart"
Length of output: 1552
def delete_backing_image_manager(self, name):
    self.backing_image.delete_backing_image_manager(name)

def wait_all_backing_image_managers_running(self):
    self.backing_image.wait_all_backing_image_managers_running()

def wait_backing_image_manager_restart(self, name, last_creation_time):
    self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)

def list_backing_image_manager(self):
    return self.backing_image.list_backing_image_manager()
🛠️ Refactor suggestion
Add docstrings and error handling to the new methods.
These test automation keywords would benefit from:
- Docstrings explaining the purpose, parameters, and expected behavior
- Timeout parameters for the wait methods to prevent hanging tests
- Basic input validation and error handling
Here's a suggested improvement for one of the methods as an example:
 def wait_backing_image_manager_restart(self, name, last_creation_time):
+    """Wait for a backing image manager to restart after deletion.
+
+    Args:
+        name (str): Name of the backing image manager
+        last_creation_time (str): Previous creation timestamp
+
+    Raises:
+        TimeoutError: If manager doesn't restart within timeout
+        ValueError: If name or timestamp is invalid
+    """
+    if not name or not last_creation_time:
+        raise ValueError("Name and last_creation_time must be provided")
     self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)
Committable suggestion was skipped due to low confidence.
def delete_backing_image_manager(self, name):
    self.backing_image = CRD()
    return self.backing_image.delete_backing_image_manager(name)

def wait_all_backing_image_managers_running(self):
    self.backing_image = CRD()
    return self.backing_image.wait_all_backing_image_managers_running()

def wait_backing_image_manager_restart(self, name, last_creation_time):
    self.backing_image = CRD()
    self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)

def list_backing_image_manager(self):
    self.backing_image = CRD()
    return self.backing_image.list_backing_image_manager()
💡 Codebase verification
The shell script results provide clear evidence about the strategy pattern implementation across the codebase. Let me generate the final response:
Strategy pattern implementation needs revision to match codebase standards
The current implementation in backing_image.py deviates from the established pattern used consistently across other modules. All other modules:
- Initialize the strategy object only once in __init__
- Handle both REST and CRD cases properly with if-else blocks
- Use the initialized strategy object throughout their methods
Here's the correct implementation matching the codebase standards:
def __init__(self):
    if self._strategy == LonghornOperationStrategy.CRD:
        self.backing_image = CRD()
    else:
        self.backing_image = Rest()
Remove the redundant CRD instantiation from:
- delete_backing_image_manager
- wait_all_backing_image_managers_running
- wait_backing_image_manager_restart
- list_backing_image_manager
🔗 Analysis chain
Strategy pattern implementation needs revision
The current implementation has several architectural concerns:
- The new methods bypass the strategy pattern by directly instantiating CRD, which violates the pattern's purpose.
- Creating a new CRD instance per method call is inefficient and could lead to resource management issues.
- Overwriting self.backing_image could affect other methods expecting the REST implementation.
Consider refactoring to:
- Respect the strategy pattern:
 class BackingImage(Base):
     _strategy = LonghornOperationStrategy.REST

     def __init__(self):
         if self._strategy == LonghornOperationStrategy.REST:
             self.backing_image = Rest()
+        else:
+            self.backing_image = CRD()

     def delete_backing_image_manager(self, name):
-        self.backing_image = CRD()
         return self.backing_image.delete_backing_image_manager(name)
- If both REST and CRD implementations are needed, consider using separate attributes:
def __init__(self):
    self.rest_client = Rest()
    self.crd_client = CRD()
The wait_backing_image_manager_restart method doesn't return the result from the delegate call:
 def wait_backing_image_manager_restart(self, name, last_creation_time):
     self.backing_image = CRD()
-    self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)
+    return self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)
Consider renaming list_backing_image_manager to list_backing_image_managers for consistency with other method names.
Let's verify the strategy pattern usage across the codebase:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if other classes follow the same pattern
rg -l "LonghornOperationStrategy" | xargs rg -A 5 "_strategy\s*="
Length of output: 5660
backing_image_manager = self.obj_api.get_namespaced_custom_object(
    group="longhorn.io",
    version="v1beta2",
    namespace="longhorn-system",
    plural="backingimagemanagers",
    name=name
)
except Exception as e:
    logging(f"Finding backing image manager {name} failed with error {e}")
    continue
🛠️ Refactor suggestion
Handle exceptions more specifically and improve logging
Catching all exceptions with a broad except Exception can obscure unexpected errors; it's better to catch specific exceptions. Additionally, consider logging the stack trace for better debugging.
Apply this diff to catch specific exceptions and log the stack trace:
- except Exception as e:
- logging(f"Finding backing image manager {name} failed with error {e}")
+ except client.exceptions.ApiException as e:
+ logging(f"Failed to find backing image manager {name}: {e}")
+ except Exception as e:
+ logging(f"An unexpected error occurred while finding backing image manager {name}: {e}", exc_info=True)
📝 Committable suggestion
backing_image_manager = self.obj_api.get_namespaced_custom_object(
    group="longhorn.io",
    version="v1beta2",
    namespace="longhorn-system",
    plural="backingimagemanagers",
    name=name
)
except client.exceptions.ApiException as e:
    logging(f"Failed to find backing image manager {name}: {e}")
except Exception as e:
    logging(f"An unexpected error occurred while finding backing image manager {name}: {e}", exc_info=True)
    continue
Resource    ../keywords/backing_image.resource
Resource    ../keywords/storageclass.resource
Resource    ../keywords/persistentvolumeclaim.resource
Resource    ../keywords/k8s.resource
Remove duplicate resource import
The resource file ../keywords/k8s.resource is imported twice on lines 11 and 17. This duplication is unnecessary and could lead to confusion or maintenance issues.
Apply this diff to remove the duplicate import:
- Resource ../keywords/k8s.resource
Also applies to: 17-17
And Wait for volume 0 healthy
And Check volume 0 data is intact
And Wait for deployment 0 pods stable
And And Check deployment 0 data in file data.txt is intact
Correct duplicate 'And' keywords in test steps
There are typographical errors on lines 96 and 100 where the keyword And is duplicated.
Apply this diff to fix the typos:
- And And Check deployment 0 data in file data.txt is intact
+ And Check deployment 0 data in file data.txt is intact
- And And Check deployment 1 data in file data.txt is intact
+ And Check deployment 1 data in file data.txt is intact
Also applies to: 100-100
IF    '${DATA_ENGINE}' == 'v1'
    When Create backing image bi with    url=https://longhorn-backing-image.s3-us-west-1.amazonaws.com/parrot.qcow2
    And Create volume 1 with    backingImage=bi    dataEngine=${DATA_ENGINE}
    And Attach volume 1
    And Wait for volume 1 healthy
    And Write data to volume 1

    When Create storageclass longhorn-test-1 with    dataEngine=${DATA_ENGINE}
    And Create persistentvolumeclaim 1 using RWX volume with longhorn-test-1 storageclass
    And Create deployment 1 with persistentvolumeclaim 1
    And Write 100 MB data to file data.txt in deployment 1
END
🛠️ Refactor suggestion
Simplify conditional checks for ${DATA_ENGINE}
The variable ${DATA_ENGINE} is set to v1 at line 27, so the conditional IF '${DATA_ENGINE}' == 'v1' will always evaluate to true throughout the test cases.
Consider removing the conditional checks or making ${DATA_ENGINE} configurable if different engine versions are intended to be tested. This will simplify the code and improve readability.
- IF '${DATA_ENGINE}' == 'v1'
...
- END
- IF '${DATA_ENGINE}' == 'v1'
...
- END
- IF '${DATA_ENGINE}' == 'v1'
...
- END
- IF '${DATA_ENGINE}' == 'v1'
...
- END
Also applies to: 97-102, 138-153, 167-183
And Attach volume 0
And Wait for volume 0 healthy
And Write data to volume 0
Then Delete instance-manager of volume 0 and wait for recover
Handle potential race conditions during instance manager deletion
Deleting the instance manager while a volume is rebuilding may lead to inconsistent states or test flakiness.
Consider adding a wait or verification step to ensure the rebuild process has properly initiated before deleting the instance manager. This can help in making the test more reliable.
+ And Wait for replica rebuilding progress
Also applies to: 123-123, 145-145, 148-148, 172-172, 176-176, 200-200, 204-204
Which issue(s) this PR fixes:
Issue #9536
What this PR does / why we need it:
Automate the manual test case "Test Longhorn components recovery" into the below sub test cases
Special notes for your reviewer:
Tested on my local env
Additional documentation or context
Summary by CodeRabbit
Release Notes
New Features
Improvements
Bug Fixes