[TE][Store] refactor: centralize environment variable handling across components#1538
[TE][Store] refactor: centralize environment variable handling across components#1538stmatengss wants to merge 4 commits intomainfrom
Conversation
Refactored environment variable access across mooncake-store and mooncake-transfer-engine to use centralized GetEnv() functions from mooncake-common/environ.h. This improves code maintainability and reduces duplication. Changes: - Extended mooncake-common/environ.h with additional env var getters - Updated config.cpp to use centralized env var functions (342 insertions, 404 deletions) - Migrated all components to use GetEnv() instead of direct getenv() calls - Affected components: client_metric, client_service, ha_helper, master, transfer_engine, topology, and all transport implementations
Summary of ChangesHello @stmatengss, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a significant refactoring effort to centralize how environment variables are accessed throughout the Mooncake codebase. By moving all environment variable retrieval into a dedicated Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request refactors environment variable access across the codebase by introducing a new Environ singleton class. The changes involve adding numerous getter methods and corresponding member variables to mooncake-common/include/environ.h and initializing them in mooncake-common/src/environ.cpp. Subsequently, all direct calls to std::getenv() for MC_ prefixed environment variables are replaced with calls to Environ::Get().Get<EnvVarName>() in various files, including mooncake-store/src/client_metric.cpp, client_service.cpp, ha_helper.cpp, master.cpp, transfer_task.cpp, and mooncake-transfer-engine/src/config.cpp, topology.cpp, transfer_engine.cpp, transfer_engine_impl.cpp, transfer_metadata.cpp, transfer_metadata_plugin.cpp, transport/cxl_transport/cxl_transport.cpp, transport/hip_transport/hip_transport.cpp, transport/nvlink_transport/nvlink_transport.cpp, and transport/rpc_communicator/rpc_communicator.cpp. Review comments highlight a missing closing brace in client_metric.cpp causing a compilation error, a broken fallback logic for MC_GID_INDEX to NCCL_IB_GID_INDEX in config.cpp due to the new Environ class's default return value, a risky C-style cast hiding a const_cast in cxl_transport.cpp when assigning cxl_dev_path, and a suggestion to change MC_STORE_MEMCPY and MC_STORE_USE_HUGEPAGE to return bool instead of std::string for consistency.
| } else { | ||
| LOG(WARNING) |
| int gid_index = env.GetGidIndex(); | ||
| if (gid_index == 0) { | ||
| const char *nccl_gid = std::getenv("NCCL_IB_GID_INDEX"); | ||
| if (nccl_gid) gid_index = atoi(nccl_gid); | ||
| } | ||
|
|
||
| const char *mtu_length_env = std::getenv("MC_MTU"); | ||
| if (mtu_length_env) { | ||
| size_t val = atoi(mtu_length_env); | ||
| if (val == 512) | ||
| config.mtu_length = IBV_MTU_512; | ||
| else if (val == 1024) | ||
| config.mtu_length = IBV_MTU_1024; | ||
| else if (val == 2048) | ||
| config.mtu_length = IBV_MTU_2048; | ||
| else if (val == 4096) | ||
| config.mtu_length = IBV_MTU_4096; | ||
| else { | ||
| LOG(ERROR) << "Ignore value from environment variable MC_MTU, it " | ||
| "should be 512|1024|2048|4096"; | ||
| exit(EXIT_FAILURE); | ||
| } | ||
| if (gid_index >= 0 && gid_index < 256) | ||
| config.gid_index = gid_index; |
There was a problem hiding this comment.
The fallback logic for MC_GID_INDEX to NCCL_IB_GID_INDEX appears to be broken with this change. env.GetGidIndex() returns a default value of 3 if MC_GID_INDEX is not set, so the if (gid_index == 0) block is not executed, and the fallback to NCCL_IB_GID_INDEX is missed. Also, if a user explicitly sets MC_GID_INDEX=0, the current logic would incorrectly fall back to NCCL_IB_GID_INDEX.
The original logic of checking MC_GID_INDEX first, then NCCL_IB_GID_INDEX, should be preserved. A proper fix would be to handle this fallback logic inside the Environ class itself for better encapsulation.
| if (!env_cxl_dev_path.empty()) { | ||
| LOG(INFO) << "MC_CXL_DEV_PATH: " << env_cxl_dev_path; | ||
| cxl_dev_path = (char *)env_cxl_dev_path; | ||
| cxl_dev_path = (char *)env_cxl_dev_path.c_str(); |
There was a problem hiding this comment.
The C-style cast (char *) hides a const_cast. The member cxl_dev_path is a char*, but it's being assigned from env_cxl_dev_path.c_str() which returns a const char*. Modifying the string literal returned by c_str() is undefined behavior. It would be safer to change the type of cxl_dev_path to std::string or const char* to avoid this risky cast.
| std::string GetStoreMemcpy() const { return store_memcpy_; } | ||
| std::string GetStoreUseHugepage() const { return store_use_hugepage_; } |
There was a problem hiding this comment.
The environment variables MC_STORE_MEMCPY and MC_STORE_USE_HUGEPAGE seem to be boolean flags. However, they are defined here to return std::string. For consistency with other boolean flags like GetUseHipIpc and to simplify usage, consider changing them to return bool and use GetBool in the implementation. This would also require changing the corresponding member variables store_memcpy_ and store_use_hugepage_ to bool.
- Fix undefined variable 'value' in rpc_communicator.cpp, replace with 'rpc_protocol' - Fix malformed InitializeMetricsConfig() function in transfer_engine_impl.cpp - Fix malformed parseMetricsInterval() function in client_metric.cpp - Add mooncake_common library link to transfer_engine CMakeLists.txt - Add mooncake:: namespace qualifier to Environ::Get() calls in master.cpp All compilation and linking errors resolved. Build completes successfully.
Fix CXL hardware-dependent tests to skip instead of failing when CXL devices are not available in CI environments. Changes: - mooncake-transfer-engine/tests/cxl_transport_test.cpp: Use GTEST_SKIP() when CXL transport installation fails - mooncake-store/tests/cxl_client_integration_test.cpp: Skip test suite when CXL client creation fails, preventing segfault This allows tests to pass (as skipped) in environments without CXL hardware while still running properly on systems with actual devices.
f2389e3 to
b220000
Compare
|
Glad that we now have a unified place to set and retrieve environment variables. I believe this will make things much more convenient for users. |
|
LGTM for TE part. |
mooncake-transfer-engine/src/transport/hip_transport/hip_transport.cpp
Outdated
Show resolved
Hide resolved
alogfans
left a comment
There was a problem hiding this comment.
Before merging it, pls fix the build issue of Mooncake TE.
Refactored environment variable access across mooncake-store and mooncake-transfer-engine to use centralized GetEnv() functions from mooncake-common/environ.h. This improves code maintainability and reduces duplication.
Changes:
Description
Type of Change
How Has This Been Tested?
Checklist
./scripts/code_format.shbefore submitting.