Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow source interface binding for Graphite output #14614

Closed
phuntik opened this issue Jan 23, 2024 · 9 comments · Fixed by #14628 or #15224
Closed

Allow source interface binding for Graphite output #14614

phuntik opened this issue Jan 23, 2024 · 9 comments · Fixed by #14628 or #15224
Assignees
Labels
feature request Requests for new plugin and for new features to existing plugins

Comments

@phuntik
Copy link

phuntik commented Jan 23, 2024

Use Case

Hello, Dears!
While using VRF routing it is not always obvious which interface will be chosen by kernel to route connection through.
There was a kinda similar issue years ago - 4287. And this functionality was not implemented, though it sounds like can be done easily.

I've found this simple piece of code and build Telegraf with it. It worked just as needed. Here you can see the changes: master...phuntik:telegraf:master
Please, consider such a functionality. Would be glad if this feature will appear in any form possible eventually.

p.s. Please assume I'm not a programmer. Just tested it out in this way and it worked for me, so I am opened for any critic. Cause even if it won't be implemented, seems like I will stick with permanent patching Telegraf by myself in future to be able to use binding.

Thanks for attention!

Expected behavior

be able to bind specific interface which Telegraf will send metrics through

Actual behavior

routing is ruled totally by kernel, while with VRF it is not always applicable

Additional info

No response

@phuntik phuntik added the feature request Requests for new plugin and for new features to existing plugins label Jan 23, 2024
@srebhan srebhan self-assigned this Jan 24, 2024
@srebhan
Copy link
Member

srebhan commented Jan 25, 2024

@phuntik can you please test the binary in PR #14628, available as soon as CI finished the tests!? Let me know if that work for you! Head-up: I called the setting local_address instead of bind.

@phuntik
Copy link
Author

phuntik commented Jan 25, 2024

Hello, @srebhan
Thanks a lot for your rapid reaction!

Sadly but it didn't work. Telegraf is in bootloop with this error:

telegraf[1732643]: 2024-01-25T22:02:26Z E! [agent] Failed to connect to [outputs.graphite], retrying in 15s, error was "cannot resolve local address: unknown network tcp"  

I've tried to rebuild your branch changing back

net.ResolveIPAddr("tcp", g.LocalAddr)

to

net.ResolveIPAddr("ip", g.LocalAddr)

It started without errors now but no connection established for this IP.
before:

root@sgw-ixia:/etc/telegraf/telegraf.d# netstat -nap | grep telegraf                                                                                   
tcp        0      0 127.0.0.1:57668         127.0.0.1:9555          ESTABLISHED 1728904/telegraf                                                       
tcp        0      0 127.0.0.1:57694         127.0.0.1:9555          ESTABLISHED 1728904/telegraf                                                       
tcp        0      0 192.168.125.73:35313    192.168.109.237:42001   ESTABLISHED 1728904/telegraf                                                       
tcp        0      0 192.168.125.73:33775    192.168.109.237:42000   ESTABLISHED 1728904/telegraf                                                       
tcp        0      0 127.0.0.1:57684         127.0.0.1:9555          ESTABLISHED 1728904/telegraf                                                       
tcp        0      0 127.0.0.1:57698         127.0.0.1:9555          ESTABLISHED 1728904/telegraf                                                       
unix  2      [ ]         DGRAM      CONNECTED     8053161  1728904/telegraf                                                                            
unix  3      [ ]         STREAM     CONNECTED     8051363  1728904/telegraf

after:

root@sgw-ixia:/etc/telegraf/telegraf.d# netstat -nap | grep telegraf
tcp        0      0 127.0.0.1:39328         127.0.0.1:9555          ESTABLISHED 1741668/telegraf    
tcp        0      0 127.0.0.1:39284         127.0.0.1:9555          ESTABLISHED 1741668/telegraf    
tcp        0      0 127.0.0.1:39316         127.0.0.1:9555          ESTABLISHED 1741668/telegraf    
tcp        0      0 192.168.125.73:46904    192.168.109.237:42001   ESTABLISHED 1741668/telegraf    
tcp        0      0 127.0.0.1:39300         127.0.0.1:9555          ESTABLISHED 1741668/telegraf    
unix  2      [ ]         DGRAM      CONNECTED     8089798  1741668/telegraf     
unix  3      [ ]         STREAM     CONNECTED     8089785  1741668/telegraf

my config for reference:

      [[outputs.graphite]]
  servers = ["192.168.109.237:42000"]
  local_address = "192.168.125.73"
  prefix = ""
  graphite_tag_support = true
  timeout = 2
      [[outputs.graphite]]
  servers = ["192.168.109.237:42001"]
  prefix = ""
  graphite_tag_support = true
  timeout = 2

Seems like definition of net.TCPAddr{} type might be mandatory. Honestly idk, just suspect. Lines 113-115 of my branch:

	localTCPAddr := net.TCPAddr{
			IP: localAddr.IP,
		}

I will check few things and make more tests today, and get back to you with more info.

@srebhan
Copy link
Member

srebhan commented Jan 26, 2024

@phuntik can you please try again with the binary in the updated PR?!?

@phuntik
Copy link
Author

phuntik commented Jan 29, 2024

@srebhan thanks for update.
It is working now!
Below are a few tests I've performed FYI:

  1. local_address = "172.30.31.113"
    port chosen automatically, connection established
  2. local_address = "172.30.31.113:"
    port chosen automatically, connection established
  3. local_address = "172.30.31.113:539" (privileged port < 1024)
    no connection, no error. other outputs' connections are ok
  4. local_address = "172.30.31.113:5390"
    all is ok, works as expected
  5. local_address = "172.30.31.113:53905"
    app in bootloop with "invalid port: strconv.ParseInt: parsing \"53905\": value out of range".
  6. local_address = "83.149.63.1"
    (existing interface on node but no connectivity between src and dst IPs)
    no connection from this interface, no errors. other outputs' connections are ok

My belief regarding 5th is that if we allow user to define src port it should be full range of ports. Then it should be UInt16 (ParseUint) check, not Int16. But I'm almost fully sure it's just a typo :)

Other cases are good for me, I just think you might want to add some explanation errors to stdout for 3rd and 6th if the same approach with src port binding might be used with other TCP-based outputs in future, but it is not mandatory for sure. Up to you. Especially if consider that 6th might be not so easy to check, so maybe user should be self-aware of using wrong IPs.

@phuntik
Copy link
Author

phuntik commented Jan 29, 2024

There might be a potential problem, which I'd like to point your attention at.

Lets say we use graphite cluster as dst listener.
TCP/HTTP cannot use same tuple(IP,port) for multiple src connections.
How will dialer act then?
Bet it will fail. I will try to create graphite cluster at my lab and test it to be sure.

But even if the only regime of cluster connection is load-balancing, meaning there won't be a situation when two connections are being active simultaneously, I don't know exactly how much time will dialer need to fully close one connection before creating new one from the same port without scodling about already used port. And anyways it seems that connections will always be reinitialized, without ability to reuse existing connection. IDK for sure about all these but thought it is at least worse mentioning.

Also I think it might be difficult to handle this all together in code an maybe even unnecessary if for the years I'm the only one who asked for this feature. So maybe just mentioning this in readme will be enough: that user should not use port defenition while using clusters. Or should use relay then, if there is a strong need in src port defenition.

@srebhan
Copy link
Member

srebhan commented Jan 29, 2024

@phuntik thanks for testing. Yeah ParseInt() was a typo 🤦...

Regarding graphite cluster, I guess the solution would be to only provide a local IP address and let the dialer choose the port...

@srebhan
Copy link
Member

srebhan commented Jan 29, 2024

@phuntik would appreciate if you can take another look...

@srebhan
Copy link
Member

srebhan commented Feb 1, 2024

@phuntik ping! Can you please test the binary in the updated PR and let me know if this now works as intended!?

@phuntik
Copy link
Author

phuntik commented Feb 19, 2024

Dear @srebhan
Sorry for late response, I was away from github and not able to test.
Can confirm now everything is working.
Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request Requests for new plugin and for new features to existing plugins
Projects
None yet
2 participants