Skip to content

Conversation

@cwschilly
Copy link

@cwschilly cwschilly commented Dec 8, 2025

Work in progress -- do not merge (I'm still testing locally on a sample application, and we should consider adding some unit testing for the connector)

These changes are required to compile Caliper with LDMS 4.5.1. In addition to general updates to the API, I encountered several bugs in the implementation of the pipeline.

API Updates

  • ldms_xprt_put() now requires a name. As a placeholder, I'm using "tst"
  • ldms_xprt_new_with_auth has a new function signature--I removed the second nullptr from the args

Summary of Bug Fixes

1. CMake

The LDMS_PREFIX cmake variable was not being used anywhere, so Caliper could not link with a custom LDMS installation. I added a few lines in FindLDMS.cmake to use LDMS_PREFIX.

2. Env Variables

In LdmsForwarder.cpp, if one of the env variables is not set, all are set to the defaults. I broke up that if-block so I could set just one var and the rest fall back to defaults.

3. JSON

In LdmsForwarder.cpp (write_ldms_record()), the JSON string is missing a key for the "caliper-perf-data" value. I added the "stream" key so the JSON went from:

{
    ...,
    nodelist: XX,
    caliper-perf-data,
    duration: YY,
    ...
}

to

{
    ...,
    nodelist: XX,
    stream: caliper-perf-data,
    duration: YY,
    ...
}

4. Connection

I was getting errors running on more than 6 ranks. I think that the issue was that write_ldms_record() was calling caliper_ldms_connector_initialize() on every write, so as the number of ranks increased I was creating too many connections.

As a possible fix, I made the LDMS connection into a member variable in the LdmsForwarder class, initializing this once per rank and reusing it for every write on that rank. I'm not sure if this has consequences beyond my narrow use-case, but it did seem to resolve the problem for me.

FYI @ppebay

@daboehme
Copy link
Member

Hi @cwschilly, thanks, that looks good to me and fixes the compilation issues I had.

There are a few more improvements we can do, such as moving the global variables as static members into the LdmsConnector class or at least into the anonymous namespace. The ldms_cali variable also seems redundant now. I'll go ahead and write up the code to customize the internal profiling channel.

Let me know when you think this is ready to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants