Support remote log level #4413

Draft: wants to merge 4 commits into base: master


europaul
Contributor

This PR is a work in progress, but I would appreciate it if somebody could take a look at the code and the general approach and give me feedback.

Please see the commit messages (especially the second commit) for the overview of the changes.

Before merging this I need to update the pillar first to include the dependency in newlogd. I also need to update eve-api to include the new metrics.

I'm also planning to add more metrics as well as proper tests for newlogd to at least ensure that there is no regression.

@europaul
Contributor Author

the build is gonna break of course because the dependencies are missing in pillar

Comment on lines 203 to 210
	// create the necessary directories upfront
	if _, err := os.Stat(collectDir); os.IsNotExist(err) {
		if err := os.MkdirAll(collectDir, 0755); err != nil {
			log.Fatal(err)
		}
	}

	if _, err := os.Stat(uploadDevDir); os.IsNotExist(err) {
		if err := os.Mkdir(uploadDevDir, 0755); err != nil {
			log.Fatal(err)
		}
	}

	if _, err := os.Stat(uploadAppDir); os.IsNotExist(err) {
		if err := os.Mkdir(uploadAppDir, 0755); err != nil {
			log.Fatal(err)
		}
	}

	if _, err := os.Stat(keepSentDir); os.IsNotExist(err) {
		if err := os.MkdirAll(keepSentDir, 0755); err != nil {
			log.Fatal(err)
		}
	}

	if _, err := os.Stat(panicFileDir); os.IsNotExist(err) {
		if err := os.MkdirAll(panicFileDir, 0755); err != nil {
			log.Error(err)
			return
		}
	}
Contributor

for loop?
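
One way to read this suggestion, as a minimal sketch rather than the code from the PR: `ensureDirs` is a hypothetical helper that loops over the directories and relies on `os.MkdirAll` returning nil for a directory that already exists, so the preceding `os.Stat` checks go away. It deliberately flattens two differences in the snippet above (`os.Mkdir` vs `os.MkdirAll`, and `log.Error` plus `return` for `panicFileDir` instead of `log.Fatal`), so the caller would still have to decide how to treat each failure.

```go
package main

import (
	"fmt"
	"os"
)

// ensureDirs is a hypothetical helper, not part of newlogd.
// It creates every directory in dirs, including missing parents.
// os.MkdirAll does nothing and returns nil if the directory already
// exists, so no separate os.Stat check is needed.
func ensureDirs(dirs ...string) error {
	for _, dir := range dirs {
		if err := os.MkdirAll(dir, 0755); err != nil {
			return fmt.Errorf("create %s: %w", dir, err)
		}
	}
	return nil
}
```

A call site could then be `ensureDirs(collectDir, uploadDevDir, uploadAppDir, keepSentDir)` followed by a single error check, with `panicFileDir` handled separately if its failure should stay non-fatal.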

// app.log...gz
// dev.log.keep...gz
// dev.log.upload....gz
// TODO: declare a clear order of removal by directory
Member

What is the format of the saved file names? Alphabetical order is not always the same as chronological order: if we don't put a zero in front of the day/month, it won't work, e.g. '9' > '11', since the comparison is done character by character.

Contributor Author

the files are named like this:
app.6656f860-7563-4bbf-8bba-051f5942982b.log.1730464687367.gz
dev.log.keep.1730404601953.gz
dev.log.upload.1730404601953.gz

so we use timestamps instead of dates and the alphabetical order should also represent the time order.
However I understand that alphabetical order is not ideal and that's why I put that todo for myself :)
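
A possible direction for that todo, sketched under the assumption that every file name ends in `.<millisecond timestamp>.gz` as in the three examples above. `fileTimestamp` and `sortOldestFirst` are hypothetical names, not functions from newlogd. Comparing the parsed numbers avoids relying on string order, which only works while all timestamps have the same number of digits (13 digits for millisecond timestamps until roughly the year 2286) and which also interleaves the `app.` and `dev.` prefixes.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// fileTimestamp extracts the millisecond timestamp from a name such as
// "dev.log.keep.1730404601953.gz": the second-to-last dot-separated field.
func fileTimestamp(name string) (int64, error) {
	parts := strings.Split(name, ".")
	if len(parts) < 3 || parts[len(parts)-1] != "gz" {
		return 0, fmt.Errorf("unexpected log file name %q", name)
	}
	return strconv.ParseInt(parts[len(parts)-2], 10, 64)
}

// sortOldestFirst reorders names in place by their numeric timestamp,
// regardless of prefix. It fails if any name cannot be parsed.
func sortOldestFirst(names []string) error {
	stamps := make(map[string]int64, len(names))
	for _, name := range names {
		ts, err := fileTimestamp(name)
		if err != nil {
			return err
		}
		stamps[name] = ts
	}
	sort.SliceStable(names, func(i, j int) bool {
		return stamps[names[i]] < stamps[names[j]]
	})
	return nil
}
```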

Put the correct name for the per-agent log level setting in the
CONFIG-PROPERTIES.md file.

Signed-off-by: Paul Gaiduk <paulg@zededa.com>

As opposed to the default log level, which sets which logs are
produced by EVE's microservices, the remote log level sets which
device logs are uploaded to the cloud. So it's assumed that the remote
log level is always higher than the default log level.

There are no changes in how newlogd collects the logs, only in how it
handles the log files.
For the logs that are uploaded:
- we create a separate file with prefix dev.log.upload in collect
- that file is gzipped when it reaches a certain size or by timer - the
  same as before
- once gzipped the file is moved to devUpload - the same as before
- once the file is uploaded successfully, it's deleted instead of being
  moved to keepSentQueue

For the logs that stay on the device:
- we create a separate file with prefix dev.log.keep in collect
- that file is gzipped when it reaches a certain size - no timer
- once gzipped the file is moved to keepSentQueue - the name of the
  folder is preserved
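
The split described above could be pictured roughly as follows. This is an illustrative sketch, not the newlogd code: the `Level` type and `destinations` function are hypothetical, the numeric values follow the syslog convention (a smaller number is more severe), and it assumes that every entry is written to the keep file while entries that pass the remote log level are additionally written to the upload file.

```go
package main

// Level follows the syslog convention: a smaller number is more severe.
// The type and constants are illustrative, not taken from newlogd.
type Level int

const (
	LevelError   Level = 3
	LevelWarning Level = 4
	LevelInfo    Level = 6
	LevelDebug   Level = 7
)

// File name prefixes used in the collect directory, per the commit message.
const (
	keepPrefix   = "dev.log.keep"
	uploadPrefix = "dev.log.upload"
)

// destinations returns the prefixes of the collect files an entry is
// appended to. Assumption: every entry is kept locally, and entries at
// least as severe as the remote level are also queued for upload.
func destinations(entry, remote Level) []string {
	dests := []string{keepPrefix}
	if entry <= remote {
		dests = append(dests, uploadPrefix)
	}
	return dests
}
```

With the remote level set to warning, a debug entry would land only in dev.log.keep, while an error entry would land in both files.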

The commit also contains some structural changes to the newlogd code:
- init fileinfo inside trigMoveToGzip instead of passing it as parameter
- add initNewLogfile function

Signed-off-by: Paul Gaiduk <paulg@zededa.com>

Add new metrics for newlogd:
- LatestAvailableLog - the timestamp of the latest log available on the
    device
- TotalSizeLogs - the total size of all logs on the device

Signed-off-by: Paul Gaiduk <paulg@zededa.com>

defer devStatsKeep.file.Close()
devStatsKeep.notUpload = true

logmetrics.DevMetrics.LatestAvailableLog = time.Now()
Contributor

@naiming-zededa naiming-zededa Nov 1, 2024

This one will change on every reboot of the device, but older log entries can still be on the disk, so it won't reflect the earliest one. Also, 'Latest' can be read as the most recent timestamp; maybe name it something like 'oldest_saved'.

Contributor Author

that's true, very nice catch!!
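
One possible way to make the metric survive reboots, as a sketch only: derive it from the file names already on disk instead of `time.Now()`. `oldestSaved` is a hypothetical function, and it assumes the `.<millisecond timestamp>.gz` naming shown earlier in this conversation. It does nothing about devices whose clock is not yet set when the files are created.

```go
package main

import (
	"strconv"
	"strings"
	"time"
)

// oldestSaved returns the earliest timestamp found in the given log file
// names. Names whose second-to-last dot-separated field is not a number
// are ignored. The second return value is false if nothing was parsed.
func oldestSaved(names []string) (time.Time, bool) {
	var oldest int64
	found := false
	for _, name := range names {
		parts := strings.Split(name, ".")
		if len(parts) < 2 {
			continue
		}
		ms, err := strconv.ParseInt(parts[len(parts)-2], 10, 64)
		if err != nil {
			continue
		}
		if !found || ms < oldest {
			oldest, found = ms, true
		}
	}
	if !found {
		return time.Time{}, false
	}
	return time.UnixMilli(oldest), true
}
```

The names would come from listing the keep directory at startup, so the value reflects what is actually stored rather than when newlogd was started.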

Contributor

Something else to watch out for is devices (like RPi4) with no battery-backed clock, which always boot at Jan 1st 1970 UTC. ntpd/chronos sets the clock, but if the network is down, the agents, including newlogd, start before the time is set.

Contributor Author

@eriknordmark I'm not sure how to address that, because the logs will then also be dated Jan 1st 1970... we cannot wait until the time is set correctly and only then start logging

@@ -65,6 +70,7 @@ type NewlogMetrics struct {
	NumKmessages          uint64            // total input kmessages
	NumSyslogMessages     uint64            // total input syslog message
	DevTop10InputBytesPCT map[string]uint32 // top 10 sources device log input in percentage
	TotalSizeLogs         uint64            // total size of logs on device
Contributor

we have like 'NumSkipUploadAppFile', maybe we can also record the number of uploaded files.

Contributor Author

I think it's already recorded in NumGZipFilesSent

Contributor

@naiming-zededa naiming-zededa left a comment

This patch looks good to me in general. I have some comments:

  • in collect-info.sh, for the 'newlog' directory, we should skip the 'devUpload' directory, since the files not yet uploaded will duplicate those in the keep directory
  • similarly, in the edgeview walkLogDirs() function, we need to skip the 'devUpload' directory to avoid duplicates
  • in loguploader.go, there is 'failSendDir' and logic to move log files that failed to be sent to the controller there and keep them. With this change, I don't think it is needed anymore, since everything will be in the keep directory. This can simplify some logic
  • when doing a Kibana search, I search for the 'Alpine' string to locate when the device rebooted; now we may not have this. Maybe always log at least the first 5 device log entries?
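
For the second bullet, skipping a directory during a walk could look roughly like the sketch below. It is not the edgeview walkLogDirs() code: `listLogFiles` and `sampleTree` are hypothetical, and the walk runs over an `fs.FS` so that it can be tried against an in-memory tree. Returning `fs.SkipDir` from the callback for a directory makes `fs.WalkDir` skip that whole subtree.

```go
package main

import (
	"io/fs"
	"testing/fstest"
)

// listLogFiles walks fsys from its root and returns the paths of all
// files, skipping the entire subtree of any directory named skipDir.
func listLogFiles(fsys fs.FS, skipDir string) ([]string, error) {
	var files []string
	err := fs.WalkDir(fsys, ".", func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			if d.Name() == skipDir {
				return fs.SkipDir // do not descend into this directory
			}
			return nil
		}
		files = append(files, path)
		return nil
	})
	return files, err
}

// sampleTree is an in-memory stand-in for the newlog directory, with the
// same log present in both the keep and the upload directory.
func sampleTree() fs.FS {
	return fstest.MapFS{
		"keepSentQueue/dev.log.keep.1730404601953.gz": {Data: []byte("x")},
		"devUpload/dev.log.upload.1730404601953.gz":   {Data: []byte("x")},
	}
}
```

Calling `listLogFiles(sampleTree(), "devUpload")` would then report only the file under keepSentQueue. Against a real directory, `os.DirFS` could provide the `fs.FS`.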

@@ -108,8 +109,9 @@ var (
	// from different goroutines, so in order to push the changes out of the
	// goroutines local caches and correctly observe changed values in another
	// goroutine sync/atomic synchronization is used. You've been warned.
	syslogPrio    = types.SyslogKernelLogLevelNum[types.SyslogKernelDefaultLogLevel]
	kernelPrio    = types.SyslogKernelLogLevelNum[types.SyslogKernelDefaultLogLevel]
	remotelogPrio = types.SyslogKernelLogLevelNum[types.SyslogKernelDefaultLogLevel]
Contributor

Once eve-api has a new config for the device remote logging settings, it would be good to handle a 'do not upload any device logs' setting, which is not one of the log-level strings.

Contributor Author

that's a good point!

tbh I was even thinking that we might need yet another log level - the local log level. because the user might not wanna store any logs on the device and only send them to the controller and that's not supported at the moment.
right now we store everything on the device because our default.log.level == local.log.level. and we also assume that the remote log level is going to be higher than default.log.level but we don't have any checks for it

Contributor

Agreed. We should also include some settings through eve-api for defining local logging (at project level, but it can be overridden by a device config item). The remote logging level cannot be higher than the local setting.
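
The missing check mentioned in this thread could be sketched as below. Everything here is hypothetical: the level names, the `none` value for "do not upload any device logs", and the `clampRemote` function are illustrations, not EVE config properties. "Higher" is read as "more verbose", and a remote level that is more verbose than the local one is clamped to the local level instead of being rejected.

```go
package main

import "fmt"

// verbosity ranks illustrative level names from least to most verbose.
var verbosity = map[string]int{
	"panic": 0, "fatal": 1, "error": 2, "warning": 3,
	"info": 4, "debug": 5, "trace": 6,
}

// remoteNone is a hypothetical setting meaning "upload no device logs".
const remoteNone = "none"

// clampRemote returns the remote level to apply for the given local level.
// A remote level more verbose than the local one is reduced to the local
// level, because entries that are never stored cannot be uploaded.
func clampRemote(local, remote string) (string, error) {
	if remote == remoteNone {
		return remoteNone, nil
	}
	l, ok := verbosity[local]
	if !ok {
		return "", fmt.Errorf("unknown local log level %q", local)
	}
	r, ok := verbosity[remote]
	if !ok {
		return "", fmt.Errorf("unknown remote log level %q", remote)
	}
	if r > l {
		return local, nil
	}
	return remote, nil
}
```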

Contributor

@naiming-zededa naiming-zededa Nov 1, 2024

And from an end-to-end feature point of view, the 'local' config should be per device, while the remote one can be project scope. The controller can also add an optional timeout for the local settings, e.g. turn on 'debug' for deviceA for 3 hours; the controller side will remove this config after it expires.

Contributor

@europaul I think we always need to store/queue the logs locally since the device might not be connected to the controller when the log is generated. But a notion of which log levels to keep stored on the device (after sent to the controller) might be what makes sense here.

Contributor Author

@eriknordmark the logs that are meant to be sent to the controller are queued until they either

  • are sent to the controller - then they are removed
  • fail too many times - then we put them in failedUpload

Contributor

@eriknordmark eriknordmark left a comment

Can you update the description in https://github.com/lf-edge/eve/blob/master/docs/LOGGING.md with the intended new state of things?

Add a chapter explaining how different log levels
are handled in the system.

Signed-off-by: Paul Gaiduk <paulg@zededa.com>
@europaul
Contributor Author

europaul commented Nov 6, 2024

Can you update the description in https://github.com/lf-edge/eve/blob/master/docs/LOGGING.md with the intended new state of things?

@eriknordmark I made the changes to the documentation, please have a look. I'm not sure whether debug.local.loglevel is overkill: is it a feature that nobody wants and that only makes the system harder to understand?
