Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove runtime config update for FLEX_COUNTER_TABLE #20302

Open
wen587 opened this issue Sep 19, 2024 · 7 comments
Open

Remove runtime config update for FLEX_COUNTER_TABLE #20302

wen587 opened this issue Sep 19, 2024 · 7 comments
Assignees
Labels
NVIDIA Triaged this issue has been triaged

Comments

@wen587
Copy link
Contributor

wen587 commented Sep 19, 2024

Description

FLEX_COUNTER_TABLE config will be updated some time later after load_minigraph. This will result in the FLEX_COUNTER_TABLE is inconsistent between the desired config and running config.
The ask it to make it consistent and not more run time update. The fix should also consider the backward compatibility.

Steps to reproduce the issue:

  1. Run config load_minigraph -y
  2. Record the config right after the command finish.
    The FLEX_COUNTER_TABLE are the same as the init_cfg.json
admin@bjw-can-2700-2:~$ show run all  | grep FLEX_COUNTER_TABLE -A6
    "FLEX_COUNTER_TABLE": {
        "ACL": {
            "FLEX_COUNTER_STATUS": "enable",
            "FLEX_COUNTER_DELAY_STATUS": "true",
            "POLL_INTERVAL": "10000"
        }
    },

3.Record the config 5 min later after the command finish.
The FLEX_COUNTER_TABLE was updated at runtime, which make it inconsistent with the initial config.

admin@bjw-can-2700-2:~$ show run all | grep FLEX_COUNTER_TABLE -A35
    "FLEX_COUNTER_TABLE": {
        "ACL": {
            "FLEX_COUNTER_DELAY_STATUS": "false",
            "FLEX_COUNTER_STATUS": "enable",
            "POLL_INTERVAL": "10000"
        },
        "BUFFER_POOL_WATERMARK": {
            "FLEX_COUNTER_STATUS": "enable"
        },
        "PFCWD": {
            "FLEX_COUNTER_STATUS": "enable"
        },
        "PG_DROP": {
            "FLEX_COUNTER_STATUS": "enable"
        },
        "PG_WATERMARK": {
            "FLEX_COUNTER_STATUS": "enable"
        },
        "PORT": {
            "FLEX_COUNTER_STATUS": "enable"
        },
        "PORT_BUFFER_DROP": {
            "FLEX_COUNTER_STATUS": "enable"
        },
        "QUEUE": {
            "FLEX_COUNTER_STATUS": "enable"
        },
        "QUEUE_WATERMARK": {
            "FLEX_COUNTER_STATUS": "enable"
        },
        "RIF": {
            "FLEX_COUNTER_STATUS": "enable"
        }
    },

Describe the results you received:

FLEX_COUNTER_TABLE was updated with new entry.

Describe the results you expected:

We expect the FLEX_COUNTER_TABLE to be consitent and no longer have run time update. This will help us manage the running config of the device.
Currently if we refer the init_cfg.json as the source of truth, it will be wrong.

Output of show version:

SONiC Software Version: SONiC.20231110.21
SONiC OS Version: 11
Distribution: Debian 11.10
Kernel: 5.10.0-23-2-amd64
Build commit: e868b9ca70

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

](https://github.com/sonic-net/sonic-buildimage/issues/new)

@bingwang-ms
Copy link
Contributor

Discussed in meeting, @volodymyrsamotiy Please help assign the issue to the right owner. Thanks!

@bingwang-ms bingwang-ms added Triaged this issue has been triaged NVIDIA labels Sep 25, 2024
@stepanblyschak
Copy link
Collaborator

stepanblyschak commented Oct 10, 2024

  1. The script that modifies CONFIG_DB counters configuration after config has been loaded is enable_counters.py
  2. It is not the only place where CONFIG_DB is modified at runtime. See featured and featured. I created a separate issue - Remove runtime config update for FEATURE table  #20462
  3. The CONFIG_DB modification at runtime is part of the original design for both applications, so this is not a bug but a feature request.
  4. This issue has two parts:
  5. Since this is essentially a new requirement the question is - how do we know other applications aren't doing the same? How we ensure other applications aren't modifying CONFIG_DB in the future? A test case is required.

@stepanblyschak
Copy link
Collaborator

@wen587 @bingwang-ms @qiluo-msft Could you please describe with a more specific example why this causes issues for you?
I would like to better understand the use case.

First part of the issue is about having non-user initiated modifications of CONFIG_DB, however you are also referring to init_cfg.json as a source of truth of configuration. Are you expecting init_cfg.json values to be present in CONFIG_DB? What is the use case of checking init_cfg.json in general?

I am asking because one potential reason to update CONFIG_DB at runtime is lack of information available during initial config generation time. E.g. in init_cfg.json we don't know what the device type would be, hence we don't know which features to enable (dhcp_relay only on ToR) and which counters to enable (additional counters needed for DPU).

Do you think we need to redefining what init_cfg.json is? It can't be generic default configuration because different switch types, skus, platforms will have their own defaults.

@shiraez
Copy link

shiraez commented Oct 14, 2024

@stepanblyschak I think you need to remove FLEX_COUNTER_DELAY_STATUS also from sonic-utilities: counterpoll and db_migrator

@stepanblyschak
Copy link
Collaborator

@shiraez PR is WIP

@bingwang-ms
Copy link
Contributor

bingwang-ms commented Oct 14, 2024

Discussed in issue triage meeting, there are other entries being updated in runtime.

For this issue, PR already raised for reviewing.

@bingwang-ms
Copy link
Contributor

@wen587 Can you please help review the PR?

qiluo-msft pushed a commit that referenced this issue Nov 13, 2024
Why I did it
Simplify approach to delaying counters on warm boot and fast boot. Removed FLEX_COUNTER_DELAY_STATUS_FIELD and instead postpone all FC processing to happen after apply view to not delay data plane configuration.

The CONFIG_DB should not be updated in runtime anymore for counters to be delayed.

To address #20302.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NVIDIA Triaged this issue has been triaged
Projects
None yet
Development

No branches or pull requests

5 participants