fix(dashmate): invalid drive status check #2248

Merged
shumkov merged 4 commits into v1.4-dev from fix/invalid-drive-status-check on Oct 18, 2024

Conversation

Member

@shumkov shumkov commented Oct 16, 2024

Issue being fixed or feature implemented

Drive's and Platform's statuses in dashmate don't work correctly due to multiple issues:

  • The Drive status command doesn't work if Prometheus metrics are disabled
  • The actual Drive status check result is ignored
  • A Drive that had not been started was treated as errored

What was done?

  • Platform status now respects both the Tenderdash and Drive statuses
  • If the Drive container is stopped, we report the Drive status as stopped rather than errored
  • Use the gRPC getStatus endpoint instead of Prometheus metrics in Drive

How Has This Been Tested?

  • Running commands locally against local network
  • Updated existing tests

Breaking Changes

None

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added or updated relevant unit/integration/functional/e2e tests
  • I have added "!" to the title and described breaking changes in the corresponding section if my code contains any
  • I have made corresponding changes to the documentation if needed

For repository code-owners and collaborators only

  • I have assigned this pull request to a milestone

Summary by CodeRabbit

  • New Features

    • Introduced a new environment variable, PLATFORM_DRIVE_ABCI_METRICS_URL, enhancing configuration options.
    • Added a new property, drive, to the platform overview, providing more detailed service information.
  • Bug Fixes

    • Improved status determination logic for better accuracy in reporting service statuses.
    • Enhanced error handling for service status checks, particularly for the drive_abci service.
  • Documentation

    • Enhanced logging during server initialization and shutdown for better observability.
  • Tests

    • Updated test suite to improve error handling and reflect changes in service status handling.

@shumkov shumkov linked an issue Oct 16, 2024 that may be closed by this pull request
@shumkov shumkov modified the milestone: v1.5.0 Oct 17, 2024
@shumkov shumkov marked this pull request as ready for review October 17, 2024 08:20
Contributor

coderabbitai bot commented Oct 17, 2024

Walkthrough

The pull request introduces several modifications across multiple files to enhance platform status determination and error handling. Key changes include the enhancement of the StatusCommand class for improved service status consolidation, the addition of a new environment variable in the generateEnvsFactory function, and the implementation of improved error handling in the getPlatformScopeFactory function. Additionally, the check_status function in main.rs has been refactored to support asynchronous operations, improving server functionality.

Changes

| File Path | Change Summary |
| --- | --- |
| packages/dashmate/src/commands/status/index.js | Modified StatusCommand to use a new platformStatus variable for improved service status determination. |
| packages/dashmate/src/config/generateEnvsFactory.js | Added PLATFORM_DRIVE_ABCI_METRICS_URL environment variable based on configuration option. |
| packages/dashmate/src/status/determineStatus.js | Updated platform method to return ServiceStatusEnum.stopped when dockerStatus is DockerStatusEnum.not_started. |
| packages/dashmate/src/status/scopes/overview.js | Added drive property in the platform object returned by getOverviewScope. |
| packages/dashmate/src/status/scopes/platform.js | Enhanced error handling and logic in getPlatformScopeFactory; updated getDriveInfo for container status handling. |
| packages/dashmate/test/unit/status/scopes/platform.spec.js | Updated tests to reflect changes in service status handling and added import for ContainerIsNotPresentError. |
| packages/rs-drive-abci/src/main.rs | Refactored check_status function to be asynchronous; improved logging during server operations. |
| packages/rs-drive-abci/Cargo.toml | Updated package version and modified dependencies, including removing ureq and adding metrics-exporter-prometheus. |

Possibly related PRs

Suggested labels

bug

Suggested reviewers

  • QuantumExplorer

🐰 "In the code we hop and play,
Improving status day by day.
With new checks and clearer views,
Our platform shines with vibrant hues!
Errors caught, and logs so bright,
Together we code, in pure delight!" 🐇


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between b082afe and c482a37.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (1)
  • packages/rs-drive-abci/Cargo.toml (0 hunks)
💤 Files with no reviewable changes (1)
  • packages/rs-drive-abci/Cargo.toml

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (9)
packages/dashmate/src/status/scopes/overview.js (1)

60-62: LGTM! Consider adding a comment for clarity.

The changes look good and align with the PR objectives. The addition of the drive property to the platform object allows for more comprehensive status reporting.

Consider adding a brief comment explaining the purpose of the drive property, similar to how tenderdash is documented in the initial platform object declaration. This would improve code readability and maintainability.

 if (config.get('platform.enable')) {
   const { drive, tenderdash } = await getPlatformScope(config);

+  // Add drive status to the platform object
   platform.drive = drive;
   platform.tenderdash = tenderdash;
 }
packages/dashmate/src/status/determineStatus.js (1)

52-55: Approved: Good addition to handle non-started status.

The new conditional block correctly addresses the case when the docker status is not_started, returning a stopped status instead of error. This aligns well with the PR objectives and improves the accuracy of status reporting.

For consistency with the rest of the method, consider using a single-line return statement:

 if (dockerStatus === DockerStatusEnum.not_started) {
-  return ServiceStatusEnum.stopped;
+  return ServiceStatusEnum.stopped;
 }

This minor change would make the new block consistent with the style used in the rest of the method.

packages/dashmate/src/config/generateEnvsFactory.js (1)

70-73: LGTM! Consider future configurability for metrics URL.

The implementation of the PLATFORM_DRIVE_ABCI_METRICS_URL environment variable looks good and aligns with the PR objectives. It correctly sets the metrics URL only when the feature is enabled.

For future consideration: The hardcoded IP and port (0.0.0.0:29090) for the metrics URL might limit flexibility. Consider making these values configurable in the future if there's a possibility that different setups might require different IP/port combinations.

Also applies to: 89-89
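
A minimal sketch of the configurability suggested above. The helper name `buildDriveMetricsEnvs`, the config keys (`platform.drive.abci.metrics.enabled`, `host`, `port`), and the `http://` scheme are assumptions made for illustration and may not match dashmate's actual config schema; the defaults mirror the `0.0.0.0:29090` value mentioned in this comment.

```javascript
// Hypothetical helper: builds metrics-related envs from a plain config object.
// The config keys are assumptions for illustration, not dashmate's real schema.
function buildDriveMetricsEnvs(config) {
  const metrics = config?.platform?.drive?.abci?.metrics ?? {};
  const envs = {};

  // Only expose the URL when metrics are enabled
  if (metrics.enabled) {
    // Fall back to the currently hardcoded values when nothing is configured
    const host = metrics.host ?? '0.0.0.0';
    const port = metrics.port ?? 29090;
    envs.PLATFORM_DRIVE_ABCI_METRICS_URL = `http://${host}:${port}`;
  }

  return envs;
}
```

With metrics disabled the variable is simply absent from the returned object, so a consumer can treat "unset" as "metrics are off" rather than parsing an empty URL.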

packages/dashmate/src/commands/status/index.js (1)

111-116: Approved: Good consolidation of platform status

The introduction of the platformStatus variable effectively consolidates the status of tenderdash and drive services, prioritizing tenderdash status. This change aligns well with the PR objectives and improves the clarity of the status determination logic.

Consider adding a brief comment explaining the priority given to tenderdash status over drive status. This would enhance code readability and make the intent clearer for future maintainers. For example:

// Prioritize tenderdash status over drive status for overall platform status
const platformStatus = platform.tenderdash.serviceStatus !== ServiceStatusEnum.up
  ? platform.tenderdash.serviceStatus
  : platform.drive.serviceStatus;
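
To illustrate how this expression resolves, here is a small self-contained sketch. The `ServiceStatusEnum` object is a stand-in with assumed string values, and `resolvePlatformStatus` is a hypothetical wrapper name, not a function from dashmate.

```javascript
// Stand-in enum; the real ServiceStatusEnum may have more members and other values
const ServiceStatusEnum = Object.freeze({
  up: 'up',
  stopped: 'stopped',
  error: 'error',
});

// Hypothetical wrapper around the expression from the suggestion above
function resolvePlatformStatus(platform) {
  // Tenderdash takes priority: Drive is consulted only when Tenderdash is up
  return platform.tenderdash.serviceStatus !== ServiceStatusEnum.up
    ? platform.tenderdash.serviceStatus
    : platform.drive.serviceStatus;
}

const cases = [
  { tenderdash: 'up', drive: 'up' },         // platform: up
  { tenderdash: 'up', drive: 'error' },      // platform: error (from Drive)
  { tenderdash: 'stopped', drive: 'up' },    // platform: stopped (from Tenderdash)
  { tenderdash: 'error', drive: 'stopped' }, // platform: error (from Tenderdash)
];

for (const { tenderdash, drive } of cases) {
  const status = resolvePlatformStatus({
    tenderdash: { serviceStatus: tenderdash },
    drive: { serviceStatus: drive },
  });
  console.log(`tenderdash=${tenderdash} drive=${drive} -> platform=${status}`);
}
```

The Drive status only surfaces when Tenderdash reports `up`, which is why a failing Drive is no longer hidden behind a healthy Tenderdash.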
packages/dashmate/src/status/scopes/platform.js (1)

166-193: Improved error handling and status checking.

The changes enhance the robustness of the getDriveInfo function:

  1. Distinguishing between a missing container and other errors.
  2. Determining service status more accurately by checking if the service is responding.
  3. Consistent use of the newly imported DockerComposeError.

These improvements align well with the PR objectives of addressing issues related to drive status checks.

Consider adding a comment explaining the significance of the exit code check on line 187:

// Non-zero exit code indicates an error in the drive-abci status command
if (e instanceof DockerComposeError
    && e.dockerComposeExecutionResult
    && e.dockerComposeExecutionResult.exitCode !== 0) {
  info.serviceStatus = ServiceStatusEnum.error;
} else {
  throw e;
}
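
The three improvements listed above can be sketched end to end as follows. This is a simplified illustration, not the actual `getDriveInfo` implementation: the error classes and enums are stand-ins with assumed shapes, and `inspectContainer` and `runStatusCommand` are hypothetical injected functions rather than dashmate APIs.

```javascript
// Stand-ins for dashmate's error classes and enums (shapes assumed for illustration)
class ContainerIsNotPresentError extends Error {}
class DockerComposeError extends Error {
  constructor(dockerComposeExecutionResult) {
    super('docker compose command failed');
    this.dockerComposeExecutionResult = dockerComposeExecutionResult;
  }
}
const DockerStatusEnum = Object.freeze({ running: 'running', not_started: 'not_started' });
const ServiceStatusEnum = Object.freeze({ up: 'up', stopped: 'stopped', error: 'error' });

// Hypothetical simplified version of getDriveInfo; both arguments are
// injected async functions so the sketch runs without Docker
async function getDriveInfoSketch(inspectContainer, runStatusCommand) {
  const info = { dockerStatus: null, serviceStatus: null };

  try {
    info.dockerStatus = await inspectContainer('drive_abci');
  } catch (e) {
    if (!(e instanceof ContainerIsNotPresentError)) {
      throw e;
    }
    // A missing container means Drive was never started, which is not an error
    info.dockerStatus = DockerStatusEnum.not_started;
    info.serviceStatus = ServiceStatusEnum.stopped;
    return info;
  }

  try {
    // Resolves when the status command exits with code 0
    await runStatusCommand('drive_abci');
    info.serviceStatus = ServiceStatusEnum.up;
  } catch (e) {
    if (e instanceof DockerComposeError
      && e.dockerComposeExecutionResult
      && e.dockerComposeExecutionResult.exitCode !== 0) {
      // The container exists but Drive did not answer the status check
      info.serviceStatus = ServiceStatusEnum.error;
    } else {
      // Anything else is unexpected and should surface to the caller
      throw e;
    }
  }

  return info;
}
```

Injecting the two functions keeps the classification logic testable with plain stubs, which is roughly what the updated unit tests do with a stubbed container lookup.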
packages/rs-drive-abci/src/main.rs (4)

229-229: Adjust logging syntax for error reporting

In the tracing::error! macro, using the {e} formatting inside the message is acceptable, but you can enhance clarity by using the provided syntax of the tracing crate.

Consider modifying the error logging for consistency:

-                tracing::error!(error = e, "drive-abci failed: {e}");
+                tracing::error!(%e, "drive-abci failed");

This way, the error is logged both as structured data (%e uses the Display implementation) and the message remains clear.


446-446: Remove unnecessary leading colon in use statement

The leading :: in the use statement is redundant since paths in use statements are already absolute from the crate root.

Apply this diff to simplify the import:

-use ::drive::{drive::Drive, query::Element};
+use drive::{drive::Drive, query::Element};

447-448: Organize imports for clarity

The imports in the test module can be grouped and organized for better readability.

Consider reorganizing the imports:

-use dpp::block::epoch::Epoch;
-use drive::drive::credit_pools::epochs::epoch_key_constants;
+use drive::{
+    drive::Drive,
+    drive::credit_pools::epochs::{epoch_key_constants, paths::EpochProposers},
+    query::Element,
+};
+use dpp::block::epoch::Epoch;

112-112: Log server start completion for better observability

An additional log statement indicating that the server has started successfully can improve observability.

Consider adding a log statement after the server has started:

 server::start(runtime, Arc::new(platform), config, cancel);

+tracing::info!("drive-abci server started successfully");
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 908d0b3 and b082afe.

📒 Files selected for processing (7)
  • packages/dashmate/src/commands/status/index.js (1 hunks)
  • packages/dashmate/src/config/generateEnvsFactory.js (1 hunks)
  • packages/dashmate/src/status/determineStatus.js (1 hunks)
  • packages/dashmate/src/status/scopes/overview.js (1 hunks)
  • packages/dashmate/src/status/scopes/platform.js (2 hunks)
  • packages/dashmate/test/unit/status/scopes/platform.spec.js (3 hunks)
  • packages/rs-drive-abci/src/main.rs (8 hunks)
🧰 Additional context used
🔇 Additional comments (11)
packages/dashmate/src/status/determineStatus.js (2)

Line range hint 1-55: Summary: Improved status determination with accurate handling of non-started status.

The changes made to the platform method in this file effectively address one of the key objectives of the PR. By adding a specific condition to handle the DockerStatusEnum.not_started case, the code now correctly reports a stopped status instead of an error status for non-started drives. This improvement enhances the accuracy of the status determination process and aligns well with the overall goals of the PR.

The changes are well-integrated with the existing code structure and logic. They do not introduce any breaking changes and maintain the overall readability and consistency of the method.

To further ensure the robustness of this change, consider:

  1. Adding a unit test specifically for this new condition.
  2. Documenting this new behavior in any relevant documentation or comments.

Overall, these changes represent a positive improvement to the status determination logic.
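
As a starting point for the suggested unit test, here is a dependency-free sketch. `determinePlatformStatus` is a hypothetical, heavily simplified stand-in for the `platform` method in determineStatus.js: only the `not_started` branch is taken from this PR, while the other branches and the enum values are assumptions.

```javascript
// Stand-in enums; values are assumed for illustration
const DockerStatusEnum = Object.freeze({
  running: 'running',
  not_started: 'not_started',
  exited: 'exited',
});
const ServiceStatusEnum = Object.freeze({
  up: 'up',
  stopped: 'stopped',
  error: 'error',
});

// Hypothetical simplified stand-in for the platform status method
function determinePlatformStatus(dockerStatus) {
  // Branch described in this review: a container that was never started
  // is reported as stopped, not as errored
  if (dockerStatus === DockerStatusEnum.not_started) {
    return ServiceStatusEnum.stopped;
  }
  // Assumed behavior for the remaining cases, for the purposes of this sketch
  if (dockerStatus === DockerStatusEnum.running) {
    return ServiceStatusEnum.up;
  }
  return ServiceStatusEnum.error;
}

// Table-driven checks; in a real suite each row would be its own test case
const expectations = [
  [DockerStatusEnum.not_started, ServiceStatusEnum.stopped],
  [DockerStatusEnum.running, ServiceStatusEnum.up],
  [DockerStatusEnum.exited, ServiceStatusEnum.error],
];

for (const [dockerStatus, expected] of expectations) {
  const actual = determinePlatformStatus(dockerStatus);
  if (actual !== expected) {
    throw new Error(`${dockerStatus}: expected ${expected}, got ${actual}`);
  }
}
```

Keeping the `not_started` row next to an errored row makes the regression explicit: before this change both would have mapped to `error`.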


52-55: Verify consistency of DockerStatusEnum.not_started usage.

The addition of handling for DockerStatusEnum.not_started is a good improvement. To ensure consistency across the codebase, let's verify how this status is used in other parts of the system.

Run the following script to check the usage of DockerStatusEnum.not_started:

This will help us ensure that the new handling of not_started status is consistent with its usage elsewhere in the codebase.

✅ Verification successful

Consistency of DockerStatusEnum.not_started usage confirmed.

No issues found regarding the usage of DockerStatusEnum.not_started across the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check usage of DockerStatusEnum.not_started across the codebase

# Search for DockerStatusEnum.not_started in all JavaScript files
echo "Occurrences of DockerStatusEnum.not_started:"
rg --type js "DockerStatusEnum\.not_started" -n

# Search for other methods that determine docker status
echo "\nOther methods determining docker status:"
rg --type js "docker.*status" -n

Length of output: 1485

packages/dashmate/src/status/scopes/platform.js (3)

2-2: LGTM: New import for improved error handling.

The addition of the DockerComposeError import is appropriate for the new error handling implemented in the getDriveInfo function.


194-195: LGTM: Consistent return statement placement.

Moving the return statement outside of the try-catch block ensures that the info object is always returned, regardless of whether an error occurred. This is a good practice that improves the function's consistency and readability.


Line range hint 1-324: Summary: Improved drive status checking aligns with PR objectives.

The changes in this file, primarily in the getDriveInfo function, successfully address the PR objectives:

  1. The function now correctly handles cases where the Drive container is stopped, representing its status as "stopped" instead of "errored".
  2. The improved error handling ensures more accurate status reporting, particularly when the gRPC getStatus endpoint is used.

These modifications enhance the reliability of the platform status determination without introducing breaking changes. The overall structure of the file remains intact, maintaining code readability and organization.

packages/dashmate/test/unit/status/scopes/platform.spec.js (2)

1-2: LGTM: New import for error handling

The addition of ContainerIsNotPresentError import is appropriate. This new import suggests improved error handling for scenarios where a container is not present, which aligns with the PR objectives of addressing issues related to drive status checks.


382-382: LGTM: Improved handling of non-present drive container

These changes correctly implement the PR objective of addressing the incorrect treatment of a non-started drive. By throwing a ContainerIsNotPresentError for the 'drive_abci' container and updating the expected dockerStatus to not_started and serviceStatus to stopped, the test now accurately represents the scenario where the drive is not running. This improvement enhances the robustness of the test suite and ensures more accurate status reporting.

Also applies to: 435-436

packages/rs-drive-abci/src/main.rs (4)

6-9: Imported modules are appropriately added

The new imports from dapi_grpc are necessary for the gRPC client implementation in the check_status function. They are correctly specified and used in the code.


149-149: Ensure asynchronous check_status is properly awaited

The check_status function is now asynchronous. Using runtime.block_on is appropriate for synchronously executing this asynchronous function.


419-419: Verify conversion from verbose level to log level

When converting cli.verbose using try_into(), ensure that all possible verbosity levels are correctly handled by the implementation of TryFrom<u8> for your log level type.

To confirm that all verbosity levels are supported, review the implementation:

#!/bin/bash
# Description: Check the implementation of TryFrom<u8> for LogLevel.

# Search for the TryFrom implementation in the codebase.
rg --type rust 'impl.*TryFrom<u8>.*for.*LogLevel'

14-14: Confirm necessity of Prometheus import

The Prometheus import is retained, and it is still being used in the start_prometheus function. Ensure that retaining Prometheus metrics aligns with the PR objectives of moving away from relying on Prometheus for status checks.

To verify if Prometheus is still required, check for its usage across the codebase:

✅ Verification successful

Prometheus import is still necessary

The Prometheus import is actively used in the start_prometheus function and other parts of the codebase. Retaining Prometheus metrics aligns with the current implementation.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for usages of Prometheus in the codebase.

# Search for 'Prometheus' import statements and usages.
rg --type rust 'Prometheus'

Length of output: 127


Script:

#!/bin/bash
# Description: Check for definitions and usages of Prometheus and start_prometheus in the codebase.

# Search for 'Prometheus' definitions and usages.
rg --type rust 'Prometheus'

# Search for 'start_prometheus' function definitions and usages.
rg --type rust 'start_prometheus'

Length of output: 1808

Comment on lines +106 to +111
    tracing::info!(
        version = env!("CARGO_PKG_VERSION"),
        features = list_enabled_features().join(","),
        rust = env!("CARGO_PKG_RUST_VERSION"),
        "drive-abci server initializing",
    );
Contributor

@coderabbitai coderabbitai bot Oct 17, 2024


⚠️ Potential issue

Potential undefined environment variable CARGO_PKG_RUST_VERSION

The use of env!("CARGO_PKG_RUST_VERSION") may result in a compilation error because CARGO_PKG_RUST_VERSION is not a standard Cargo-provided environment variable. This could cause the build to fail if the environment variable isn't defined.

Consider using the rustc_version crate to obtain the Rust compiler version:

  1. Add rustc_version to your Cargo.toml dependencies:

    [dependencies]
    rustc_version = "0.4"
  2. Modify the code to use rustc_version:

        tracing::info!(
            version = env!("CARGO_PKG_VERSION"),
            features = list_enabled_features().join(","),
    -       rust = env!("CARGO_PKG_RUST_VERSION"),
    +       rust = rustc_version::version().unwrap().to_string(),
            "drive-abci server initializing",
        );

Member Author

@shumkov shumkov Oct 17, 2024


This is a standard env.

Contributor


@shumkov, understood. Thank you for the clarification.

(・ω・)?


✏️ Learnings added
Learnt from: shumkov
PR: dashpay/platform#2248
File: packages/rs-drive-abci/src/main.rs:106-111
Timestamp: 2024-10-17T08:52:54.159Z
Learning: In this project, the environment variable `CARGO_PKG_RUST_VERSION` is defined and can be used safely with the `env!` macro.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment on lines +311 to +331
async fn check_status(config: &PlatformConfig) -> Result<(), String> {
    // Convert the gRPC bind address string to a Uri
    let uri = Uri::from_str(&format!("http://{}", config.grpc_bind_address))
        .map_err(|e| format!("invalid url: {e}"))?;

    // Connect to the gRPC server
    let mut client = PlatformClient::connect(uri.clone())
        .await
        .map_err(|e| format!("can't connect to grpc server {uri}: {e}"))?;

    // Make a request to the server
    let request = dapi_grpc::platform::v0::GetStatusRequest {
        version: Some(get_status_request::Version::V0(GetStatusRequestV0 {})),
    };

    // Should return non-zero error code if Drive is not responding
    client
        .get_status(request)
        .await
        .map(|_| ())
        .map_err(|e| format!("can't request status: {e}"))
Contributor

@coderabbitai coderabbitai bot Oct 17, 2024


🛠️ Refactor suggestion

Enhance error handling in check_status function

The current error handling in the check_status function could be improved to provide more context and robustness. Adding timeouts and handling specific error cases can make the function more reliable.

Consider implementing timeouts and more detailed error messages:

use tokio::time::{timeout, Duration};

async fn check_status(config: &PlatformConfig) -> Result<(), String> {
    // Convert the gRPC bind address string to a Uri
    let uri = Uri::from_str(&format!("http://{}", config.grpc_bind_address))
        .map_err(|e| format!("Invalid gRPC URL '{}': {}", config.grpc_bind_address, e))?;

    // Set a timeout duration
    let timeout_duration = Duration::from_secs(5);

    // Connect to the gRPC server within the timeout period
    let mut client = timeout(
        timeout_duration,
        PlatformClient::connect(uri.clone())
    )
    .await
    .map_err(|_| format!("Timeout connecting to gRPC server at {}", uri))?
    .map_err(|e| format!("Can't connect to gRPC server {}: {}", uri, e))?;

    // Create the request
    let request = dapi_grpc::platform::v0::GetStatusRequest {
        version: Some(get_status_request::Version::V0(GetStatusRequestV0 {})),
    };

    // Send the request within the timeout period
    timeout(timeout_duration, client.get_status(request))
        .await
        .map_err(|_| "Timeout requesting status from gRPC server".to_string())?
        .map(|_| ())
        .map_err(|e| format!("Can't request status: {}", e))
}

This adds a timeout to both the connection and the request, providing better control over potential network issues.

Member Author


This connection happens on the local network. Either the port is listening or it isn't. We do not expect timeouts.

Contributor


@shumkov, understood. Since the connection is within the local network and timeouts are not expected, the current implementation should suffice.


✏️ Learnings added
Learnt from: shumkov
PR: dashpay/platform#2248
File: packages/rs-drive-abci/src/main.rs:311-331
Timestamp: 2024-10-17T08:52:46.706Z
Learning: In our codebase, connections to the gRPC server happen within the local network, so timeouts are not expected, and adding timeouts is unnecessary.


@shumkov shumkov merged commit 00df5a6 into v1.4-dev Oct 18, 2024
29 checks passed
@shumkov shumkov deleted the fix/invalid-drive-status-check branch October 18, 2024 09:41
@shumkov shumkov added this to the 1.4.2 milestone Oct 18, 2024

Successfully merging this pull request may close these issues.

Drive check status failing when metrcs are disabled
2 participants