Plumb --samples_per_plugin to Rust data server #4689


Merged
psybuzz merged 6 commits into tensorflow:master on Feb 22, 2021

Conversation

psybuzz (Contributor) commented on Feb 18, 2021

Introduces a --samples-per-plugin flag on the Rust data server, plumbed through from standard TensorBoard's --samples_per_plugin CLI flag.
For now, this change interprets a value of 0 as "create a reservoir with capacity 0", whereas traditionally the value is interpreted as "create a reservoir with unbounded capacity". This discrepancy will be fixed in a separate PR.

Test plan:
Added unit tests and ran them with `cargo test`.
Ran `bazel run tensorboard:dev --define=link_data_server=true -- --load_fast --logdir <logdir> --bind_all --samples_per_plugin=scalars=5,images=0` and observed that scalar charts that normally show lots of points now only show 5, while images do not appear at all.
[screenshot: scalar dashboard with charts limited to 5 points]

Associated issue: "samples per plugin" item in the Rustboard task list #4422
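
For reference, a minimal, hypothetical sketch of parsing a `scalars=5,images=0` style value into a per-plugin sample-count map (the data server's actual parser differs and has richer error handling, as discussed in the review below):

```rust
use std::collections::HashMap;

/// Parses a `scalars=5,images=0` style value into a map from plugin name to
/// requested sample count. A simplified sketch only.
fn parse_samples_per_plugin(value: &str) -> Result<HashMap<String, usize>, String> {
    let mut hint = HashMap::new();
    for pair in value.split(',').filter(|s| !s.is_empty()) {
        let mut parts = pair.splitn(2, '=');
        let plugin = parts.next().expect("splitn yields at least one item");
        let count_str = parts
            .next()
            .ok_or_else(|| format!("missing '=' in {:?}", pair))?;
        let count: usize = count_str
            .parse()
            .map_err(|e| format!("bad sample count in {:?}: {}", pair, e))?;
        hint.insert(plugin.to_string(), count);
    }
    Ok(hint)
}

fn main() {
    let hint = parse_samples_per_plugin("scalars=5,images=0").unwrap();
    assert_eq!(hint.get("scalars"), Some(&5));
    assert_eq!(hint.get("images"), Some(&0));
}
```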

@psybuzz psybuzz added the core:rustboard //tensorboard/data/server/... label Feb 18, 2021
@google-cla google-cla bot added the cla: yes label Feb 18, 2021
@psybuzz psybuzz marked this pull request as ready for review February 18, 2021 10:36
@psybuzz psybuzz requested a review from wchargin February 18, 2021 10:38
@psybuzz psybuzz changed the title from "Plumb --samples_per_plugin to Rust data server." to "Plumb --samples_per_plugin to Rust data server" Feb 18, 2021
wchargin (Contributor) left a comment

This looks excellent. Apart from the error handling for the parser, I’m
quite happy with this; nice work.

@wchargin wchargin mentioned this pull request Feb 19, 2021
psybuzz (Author) left a comment

Ready for another look

Comment on lines 119 to 123
if let Some(ref pd) = metadata.plugin_data {
if let Some(hint) = &*plugin_sampling_hint {
if let Some(&num_samples) = hint.0.get(&pd.plugin_name) {
// TODO(psybuzz): if the hint prescribes 0 samples, the reservoir should ideally
// be unbounded. For now, it simply creates a reservoir with capacity 0.
psybuzz (Author):

> What do you think about making the plugin sampling hint required to remove one layer of conditionals?

Sounds good!

> What do you think about encapsulating this logic into a method on PluginSamplingHint?

Hm, if we extracted the logic into a method that accepted metadata, it would be able to compute the capacity entirely, so there would be less reason to keep the DataClass match around in StageTimeSeries::new. In that case, PluginSamplingHint::capacity(metadata) should contain all the capacity logic.

If the concern is that the capacity logic is a bit much to have in the constructor, would it suffice to just move the code into a method StageTimeSeries::capacity?

Aside: I also switched the `if capacity > 0` to `if data_class != Unknown`, since even if a known data class had a default capacity of 0, the user hint should still override it.
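
To make the capacity discussion concrete, here is a hedged sketch of the selection logic described in this thread. The free-function shape and the per-data-class defaults are hypothetical (the PR keeps this logic in StageTimeSeries); only the hint-overrides-known-data-class behavior is taken from the comment above.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq)]
enum DataClass {
    Unknown,
    Scalar,
    Tensor,
    BlobSequence,
}

struct PluginSamplingHint(HashMap<String, usize>);

fn capacity(data_class: DataClass, plugin_name: &str, hint: &PluginSamplingHint) -> usize {
    // Hypothetical per-data-class defaults, for illustration only.
    let default = match data_class {
        DataClass::Unknown => 0,
        DataClass::Scalar => 1_000,
        DataClass::Tensor => 100,
        DataClass::BlobSequence => 10,
    };
    // Key the override on the data class being known rather than on the
    // default being non-zero, so a user hint still applies even if a known
    // data class had a zero default (the point raised in the aside above).
    if data_class != DataClass::Unknown {
        if let Some(&n) = hint.0.get(plugin_name) {
            return n;
        }
    }
    default
}

fn main() {
    let mut overrides = HashMap::new();
    overrides.insert("scalars".to_string(), 5);
    let hint = PluginSamplingHint(overrides);
    assert_eq!(capacity(DataClass::Scalar, "scalars", &hint), 5);
    assert_eq!(capacity(DataClass::Scalar, "histograms", &hint), 1_000);
    assert_eq!(capacity(DataClass::Unknown, "scalars", &hint), 0);
}
```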

part: pair_str.to_string(),
});
}
let num_samples = pair[1].parse::<usize>()?;
wchargin (Contributor):

Nice: good use of the implicit error type conversion.
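
The "implicit error type conversion" refers to the `?` operator applying a `From<ParseIntError>` impl to turn the `str::parse` error into the function's own error type. A minimal illustration of the idiom, with hypothetical names rather than the PR's actual error type:

```rust
use std::num::ParseIntError;

// Illustrative error type: the `From` impl is what lets `?` convert a
// `ParseIntError` from `str::parse` into this error automatically.
#[derive(Debug)]
enum HintParseError {
    BadSampleCount(ParseIntError),
}

impl From<ParseIntError> for HintParseError {
    fn from(e: ParseIntError) -> Self {
        HintParseError::BadSampleCount(e)
    }
}

fn parse_count(raw: &str) -> Result<usize, HintParseError> {
    // `?` applies `From<ParseIntError>` here, so no explicit map_err is needed.
    let n = raw.parse::<usize>()?;
    Ok(n)
}

fn main() {
    assert_eq!(parse_count("5").unwrap(), 5);
    assert!(parse_count("five").is_err());
}
```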

Comment on lines 119 to 123
if let Some(ref pd) = metadata.plugin_data {
if let Some(hint) = &*plugin_sampling_hint {
if let Some(&num_samples) = hint.0.get(&pd.plugin_name) {
// TODO(psybuzz): if the hint prescribes 0 samples, the reservoir should ideally
// be unbounded. For now, it simply creates a reservoir with capacity 0.
wchargin (Contributor):

> Hm, if we extracted the logic into a method that accepted metadata, it would be able to compute the capacity entirely, so there would be less reason to keep the DataClass match around in StageTimeSeries::new. In that case, PluginSamplingHint::capacity(metadata) should contain all the capacity logic.

Agreed.

> If the concern is that the capacity logic is a bit much to have in the constructor, would it suffice to just move the code into a method StageTimeSeries::capacity?

I’m not too worried by the number of lines of code in the constructor. I was wondering whether it makes sense, from a conceptual standpoint, for the plugin sampling hint to know how to determine the capacity. For instance, that would provide a natural place to unit-test that logic if we so desired.

That said, I don’t feel strongly about it; the question was genuine. This is fine with me, and I do think that having it in a separate function (as you’ve done) is nice, because you can return early.

Comment on lines 196 to 198
data: RunLoaderData {
plugin_sampling_hint,
..RunLoaderData::default()
wchargin (Contributor):

Do you think that it would make sense to remove the Default
implementation and instead provide RunLoaderData::new that requires
an Arc<PluginSamplingHint>? No strong preference, and happy to
defer to you; just curious for your thoughts.

psybuzz (Author):

After thinking about this, I realize I have no strong preference, but weakly think dropping the Default is a good idea.

Since RunLoaderData passes the hint to StageTimeSeries::new, which requires it, it's a bit more obvious that a default Arc<PluginSamplingHint> might not be appropriate for the caller. Introducing a ::new() with only the hint parameter also looks more consistent with, say, LogdirLoader::new().

Done.
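
A rough sketch of the constructor shape being agreed on here, replacing the `Default` impl with a `new` that requires the hint. Apart from `plugin_sampling_hint` and the type names mentioned in the thread, the fields are hypothetical placeholders:

```rust
use std::collections::HashMap;
use std::sync::Arc;

struct PluginSamplingHint(HashMap<String, usize>);

struct RunLoaderData {
    plugin_sampling_hint: Arc<PluginSamplingHint>,
    // Hypothetical placeholder for other per-run state.
    start_time: Option<f64>,
}

impl RunLoaderData {
    /// Requiring the hint at construction makes it harder to accidentally
    /// build a loader with a default (empty) hint, mirroring constructors
    /// like `LogdirLoader::new`.
    fn new(plugin_sampling_hint: Arc<PluginSamplingHint>) -> Self {
        Self {
            plugin_sampling_hint,
            start_time: None,
        }
    }
}

fn main() {
    let hint = Arc::new(PluginSamplingHint(HashMap::new()));
    let data = RunLoaderData::new(Arc::clone(&hint));
    assert!(data.start_time.is_none());
    assert!(data.plugin_sampling_hint.0.is_empty());
}
```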

args = [
server_binary,
"--logdir=%s" % (self._logdir,),
"--reload=%s" % reload,
"--port=0",
"--port-file=%s" % (port_file_path,),
"--die-after-stdin",
"--samples-per-plugin=%s" % samples_per_plugin,
wchargin (Contributor):

I’m noticing that while this works fine with the data server at head, it
will fail for the v0.3.0 data server (no such flag), which is what’s
currently released. What do you think about either:

  • only including this flag when samples_per_plugin is non-empty, as
    you had previously? then --load_fast would still be broken with
    v0.3.0 with explicit --samples_per_plugin, but at least that’s not
    in the default path; or

  • explicitly gating the flag on tensorboard_data_server.__version__?
    you could use pkg_resources.parse_version to ensure that it’s at
    least 0.4.0a0 (cf. version_test.py)

I appreciate that this kind of dependency constraint is less than ideal,
and can think about ways to improve it in the general case.

psybuzz (Author):

> only including this flag when samples_per_plugin is non-empty, as you had previously?

Done.

Agreed that it's not ideal, but hopefully the --undefok Clap change may go through before we do the next release of tensorboard-data-server.

Explicitly gating the flag on version # would work for us, but I'm not sure what the story would be for users writing their own custom data server using it via TENSORBOARD_DATA_SERVER_BINARY.

Besides --undefok, the only thing I have thought of is trying to detect support for the flag by running <binary> --help and parsing the output to see whether --samples-per-plugin is there. Which is also very non-ideal!

wchargin (Contributor):

> Explicitly gating the flag on version # would work for us, but I'm not sure what the story would be for users writing their own custom data server using it via TENSORBOARD_DATA_SERVER_BINARY.

Ah, good point, yes. You could invoke `rustboard --version` and see what it prints, which is not too hard but certainly less convenient.

> Besides --undefok, the only thing I have thought of is trying to detect support for the flag by running <binary> --help and parsing the output to see whether --samples-per-plugin is there. Which is also very non-ideal!

Also a good point. Certainly not ideal, and only really works for feature additions (e.g., would be harder to detect if we add support for something like --samples-per-plugin=scalars=100,images=all). But I don’t mind it that much if we want to go that route.

psybuzz (Author) left a comment

Thanks for the review!



Comment on lines 119 to 123
if let Some(ref pd) = metadata.plugin_data {
if let Some(hint) = &*plugin_sampling_hint {
if let Some(&num_samples) = hint.0.get(&pd.plugin_name) {
// TODO(psybuzz): if the hint prescribes 0 samples, the reservoir should ideally
// be unbounded. For now, it simply creates a reservoir with capacity 0.
psybuzz (Author):

Gotcha. In that case, keeping the logic here in StageTimeSeries still seems preferable to moving it to PluginSamplingHint::capacity, IMO, so I'll keep it as-is for now.

wchargin (Contributor) left a comment

LGTM; thanks!


@psybuzz psybuzz merged commit 319db49 into tensorflow:master Feb 22, 2021
wchargin added a commit that referenced this pull request Feb 25, 2021
Summary:
Nothing critical: some unnecessary `.into()`s from #4677 and some
unnecessary `&Box<T>` double-indirection from #4689.

wchargin-branch: rust-clippy-fixes-20210225
wchargin-source: 27bd0fee8a91ed809d78b07bbcaf60dd0c276fef
wchargin added a commit that referenced this pull request Feb 26, 2021
Summary:
Nothing critical: some unnecessary `.into()`s from #4677 and some
unnecessary `&Box<T>` double-indirection from #4689.

wchargin-branch: rust-clippy-fixes-20210225
wchargin added a commit that referenced this pull request Mar 18, 2021
Summary:
In #4689, we decided to only pass `--samples-per-plugin` to the data
server conditionally, because at the time the latest public release of
the data server did not support that flag. Since then, v0.4.0 has been
published, so we can pass the flag unconditionally.

Test Plan:
Verified that with latest `tb-nightly`’s dependencies, the server
behaves as expected both with and without `--samples_per_plugin`.

wchargin-branch: data-always-pass-samples
wchargin-source: 04f7394f218d395db4366dceeed9b5dfb0d1dcef
wchargin added a commit that referenced this pull request Mar 18, 2021
Summary:
In #4689, we decided to only pass `--samples-per-plugin` to the data
server conditionally, because at the time the latest public release of
the data server did not support that flag. Since then, v0.4.0 has been
published, so we can pass the flag unconditionally.

Test Plan:
Verified that with latest `tb-nightly`’s dependencies, the server
behaves as expected both with and without `--samples_per_plugin`.

wchargin-branch: data-always-pass-samples
wchargin added a commit that referenced this pull request Mar 18, 2021
Summary:
Under `--load_fast=auto`, if the data server exits without writing a
port file, we now gracefully degrade to the multiplexer reading path.
This subsumes and expands upon the checks from #4786, with semantics now
defined by the data server as in #PARENT.

This depends on the `--error-file` data server flag added in #PARENT,
which is not yet released. To handle this, we add a mechanism for gating
flags against data server versions. As [discussed in #4689][1], this
raises a concern of what to do when the data server is chosen by a
mechanism other than the Python package. We resolve this by simply
treating such data servers as bleeding-edge. If you link in a data
server yourself, you should use it with a copy of `tb-nightly` built
from the same Git commit, or your mileage may vary.

See also #4786, which first proposed this functionality and offered some
alternative mechanisms.

[1]: #4689 (comment)

Test Plan:
Set your `GOOGLE_APPLICATION_CREDENTIALS` to an invalid JSON file, and
try running with `--load_fast=auto` with release v0.4.0 of the data
server and also with latest head (`//tensorboard/data/server:install`).
Note with `--verbosity 0` that with the old server, `--error-file` is
not passed and the data server has to fall back to anonymous creds,
whereas with the new server, `--error-file` is passed and TensorBoard
correctly falls back to the multiplexer.

wchargin-branch: data-server-fallback
wchargin-source: ac77fa64fc6b10b6dc80f9962ee291e8084dba80
wchargin added a commit that referenced this pull request Mar 19, 2021
Summary:
Under `--load_fast=auto`, if the data server exits without writing a
port file, we now gracefully degrade to the multiplexer reading path.
This subsumes and expands upon the checks from #4786, with semantics now
defined by the data server as in #4793.

This depends on the `--error-file` data server flag added in #4793,
which is not yet released. To handle this, we add a mechanism for gating
flags against data server versions. As [discussed in #4689][1], this
raises a concern of what to do when the data server is chosen by a
mechanism other than the Python package. We resolve this by simply
treating such data servers as bleeding-edge. If you link in a data
server yourself, you should use it with a copy of `tb-nightly` built
from the same Git commit, or your mileage may vary.

See also #4786, which first proposed this functionality and offered some
alternative mechanisms.

[1]: #4689 (comment)

Test Plan:
Set your `GOOGLE_APPLICATION_CREDENTIALS` to an invalid JSON file, and
try running with `--load_fast=auto` with release v0.4.0 of the data
server and also with latest head (`//tensorboard/data/server:install`).
Note with `--verbosity 0` that with the old server, `--error-file` is
not passed and the data server has to fall back to anonymous creds,
whereas with the new server, `--error-file` is passed and TensorBoard
correctly falls back to the multiplexer.

wchargin-branch: data-server-fallback