Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

logcli: Add parallel flags #7543

Closed
wants to merge 0 commits into from
Closed

logcli: Add parallel flags #7543

wants to merge 0 commits into from

Conversation

angaz
Copy link
Contributor

@angaz angaz commented Oct 28, 2022

What this PR does / why we need it:
Requesting a large range, a day, for example, takes a very long time to download. You can download in parallel by starting many processes, but this just made my job so much easier.

Special notes for your reviewer:
TBH, I am not good at testing, so IDK how such a thing can be tested.

Checklist

  • Reviewed the CONTRIBUTING.md guide
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md

@angaz angaz requested a review from a team as a code owner October 28, 2022 15:58
@CLAassistant
Copy link

CLAassistant commented Oct 28, 2022

CLA assistant check
All committers have signed the CLA.

@angaz angaz changed the title [logcli] Add parallel flags logcli: Add parallel flags Oct 28, 2022
@pull-request-size pull-request-size bot added size/L and removed size/M labels Oct 28, 2022
@github-actions github-actions bot added the type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories label Oct 28, 2022
@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0.1%
+               iter	0%
+            storage	0%
+           chunkenc	0%
-              logql	-0.4%
+               loki	0%

@MichelHollands
Copy link
Contributor

@SN9NV I'm not sure we can accept this as is. The log lines will be returned out of order which will be very confusing for users. Is there a way you can add ordering? Also this needs rebasing.

@angaz
Copy link
Contributor Author

angaz commented Nov 8, 2022

@SN9NV I'm not sure we can accept this as is. The log lines will be returned out of order which will be very confusing for users. Is there a way you can add ordering? Also this needs rebasing.

Hi, yeah, I know it would be confusing, that is why it specifically is mentioned in the help text that using the flag will result in the logs being out of order.

@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@angaz
Copy link
Contributor Author

angaz commented Nov 8, 2022

One of the ways I thought we could add ordering is if we created an output file for each part. Then the user can order the logs using the filename.

But I guess that would be some work. Adding new flags and stuff.
But if that is what you want, I can give it a try.

@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@angaz
Copy link
Contributor Author

angaz commented Dec 18, 2022

I have pushed a new bit of functionality which allows the user to create output files, one for each partition range, then the user can order file files as they please. I think we can leave that up to them. Not so sure on the naming. I couldn't think of another tool which does this to get some naming inspiration.

The reason for the prefix is so that you can place the files in a specific directory, or "label" a specific download set so many in the same directory don't collide.

Copy link
Contributor

@jeschkies jeschkies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the ist but I'd wie the results into one file. As I wouldn't expect all results to arrive at once you could use a channel to the writer that synchronizes the writes. Maybe a simple mutex will do.

}

start = end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checkout the comment in

// Batching works by taking the timestamp of the last query and using it in the next query,
I think you need to specify a limit.

@angaz
Copy link
Contributor Author

angaz commented Jan 18, 2023

I like the ist but I'd wie the results into one file. As I wouldn't expect all results to arrive at once you could use a channel to the writer that synchronizes the writes. Maybe a simple mutex will do.

I'm not quite sure what you are saying.

Once you start the workers in parallel, sorting the results into one file would mean that you would need all the memory to hold all the messages until the worker with the lowest value is done. This is why I have saved one file per section.

The previous comment said we should try to keep the ordering of the log messages. And you comment seems to contradict that.

@jeschkies
Copy link
Contributor

Once you start the workers in parallel, sorting the results into one file would mean that you would need all the memory to hold all the messages until the worker with the lowest value is done. This is why I have saved one file per section.

You don't have to keep them in memory. You could insert them into a file. That's a little bit more complicated though. It's like insertion sort with a file.

@jeschkies
Copy link
Contributor

@SN9NV I've done a quick research. What I mean is called External Sorting. However, we probably do not need it this complicated. I know this is some extra work but what do you think about naming the part files with timestamps and then merge them afterwards? It would be an append only operation. The difficult part is to figure out if we need deduplication.

@angaz angaz requested a review from JStickler as a code owner February 4, 2023 15:14
@github-actions github-actions bot removed the type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories label Feb 6, 2023
@angaz angaz closed this Feb 7, 2023
@pull-request-size pull-request-size bot added size/XS and removed size/L labels Feb 7, 2023
@jeschkies
Copy link
Contributor

@SN9NV I'm sorry that you closed the PR. I liked the idea.

@angaz
Copy link
Contributor Author

angaz commented Feb 7, 2023

I moved the code to another branch. I guess that's why it was closed. I'll open another PR from the new branch.

@angaz angaz mentioned this pull request Feb 13, 2023
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants