
TensorBoard in TFP and TF v2 #356

Closed · janosh opened this issue Apr 8, 2019 · 19 comments

@janosh (Contributor) commented Apr 8, 2019

There don't appear to be any docs on how to use TensorBoard with TensorFlow Probability. I'm specifically interested in a guide for the 2.0 release. Is this planned or am I missing something?

@csuter (Member) commented Apr 9, 2019

We don't have any explicit TB features in TFP, but you should be able to monitor anything you're interested in using tf.summary and friends. You can pass any Tensor you want to those.

Is there something in particular you're trying to do? Maybe we can help a bit with idioms.
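For instance, a minimal sketch of the TF2-style summary API (untested; the log directory and some_tensor are placeholders):

writer = tf.compat.v2.summary.create_file_writer('/tmp/tfp_logs')
with writer.as_default():
  tf.compat.v2.summary.scalar('target_log_prob', some_tensor, step=0)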

@janosh (Contributor, Author) commented Apr 9, 2019

Yes, I'm trying to monitor the progress and final results of training a Bayesian NN with HMC. I tried writing a trace_fn and passing that to tfp.mcmc.sample_chain, i.e. something like

def trace_fn(weights, kernel_results):
    print("weights", weights)
    print("kernel_results", kernel_results)

@tf.function
def run_hmc(
    num_results=100,
    num_burnin_steps=0,
    step_size=0.01,
    current_state=get_initial_state(),
    num_steps_between_results=0,
):
    hmc_kernel = tfp.mcmc.SimpleStepSizeAdaptation(
        tfp.mcmc.HamiltonianMonteCarlo(
            target_log_prob_fn=joint_log_prob_fn,
            num_leapfrog_steps=2,
            step_size=step_size,
            state_gradients_are_stopped=True,
        ),
        num_adaptation_steps=num_results + num_burnin_steps,
    )
    weights, kernel_results = tfp.mcmc.sample_chain(
        num_results=num_results,
        num_burnin_steps=num_burnin_steps,
        current_state=current_state,
        kernel=hmc_kernel,
        trace_fn=trace_fn,
    )
    print("Acceptance rate:", kernel_results.inner_results.is_accepted.numpy().mean())

but whatever signature I use or action I take in that function, it causes the whole operation to come crashing down. Some docs or guidance on this would be much appreciated!

@csuter (Member) commented Apr 9, 2019

Ah yeah, maybe this is a documentation bug -- check the docs on trace_fn in sample_chain and let me know if you think we could improve the verbiage there.

Basically, trace_fn gets to look at the current chain states and "kernel results" structures at each step, and decide which values to create traces of. These traces are what are returned in the kernel_results return value from sample_chain. So, e.g. if you wanted to keep track of is_accepted, but throw away everything else, you could do

def trace_fn(current_state, kernel_results):
  return kernel_results.inner_results.is_accepted

weights, kernel_results = tfp.mcmc.sample_chain(...)

kernel_results would then be a single Tensor with shape [num_results], containing the value of is_accepted at each of the num_results steps at which a result was computed.

You can also return more complicated nested structures (tuples, namedtuples, dicts [I think...]) from trace_fn.
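For instance, a trace_fn returning a dict might look like this (untested sketch; the field names follow the SimpleStepSizeAdaptation-wrapped HMC kernel used above):

def trace_fn(current_state, kernel_results):
  return {
      'is_accepted': kernel_results.inner_results.is_accepted,
      'log_accept_ratio': kernel_results.inner_results.log_accept_ratio,
      'step_size': kernel_results.new_step_size,
  }

Each entry in the returned structure is then stacked along a leading [num_results] axis in the second return value of sample_chain.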

I guess you could also make calls to tf.summary in that function (I'm not sure this won't badly degrade performance), but you do need to return a valid Tensor (or structure of Tensors), otherwise there'll definitely be some crashiness like you're seeing.

@SiegeLordEx may have something to add to what I've said.

@SiegeLordEx (Member)

What @csuter said is correct. Indeed, if you want to track your weights over time on TensorBoard, you'd place tf.summary calls inside trace_fn, something like this (untested):

def trace_fn(weights, results):
  with tf.compat.v2.summary.record_if(tf.equal(results.step % 100, 0)):
    tf.compat.v2.summary.histogram("weights", weights, step=tf.cast(results.step, tf.int64))
  return ()

Note how I set it up to record every 100 steps, for efficiency, but you can do whatever suits your needs.

It might also make sense to run sample_chain without summaries, and then iterate over the return values of sample_chain (I can imagine this playing nicer on the GPU), but obviously you'd lose the in-progress display of your statistics.
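That post-hoc variant might look like this (untested sketch, reusing hmc_kernel, current_state, num_results, and a summary writer from the surrounding examples):

weights, _ = tfp.mcmc.sample_chain(
    kernel=hmc_kernel,
    current_state=current_state,
    num_results=num_results,
    trace_fn=lambda state, results: (),
)
with summary_writer.as_default():
  for step in range(int(weights.shape[0])):
    tf.compat.v2.summary.histogram('weights', weights[step], step=step)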

@brianwa84 (Contributor) commented Apr 10, 2019 via email

@SiegeLordEx (Member) commented Apr 10, 2019

That's true only of V1 summaries; V2 summaries are just regular ops with a side effect of writing to a file. Here's a complete working example:

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

dist = tfd.Normal(0., 1.)

kernel = tfp.mcmc.SimpleStepSizeAdaptation(
    tfp.mcmc.HamiltonianMonteCarlo(
        dist.log_prob, step_size=0.1, num_leapfrog_steps=3),
    num_adaptation_steps=100)

summary_writer = tf.compat.v2.summary.create_file_writer('/tmp/summary_chain', flush_millis=10000)

def trace_fn(state, results):
  with tf.compat.v2.summary.record_if(tf.equal(results.step % 10, 1)):
    tf.compat.v2.summary.scalar("state", state, step=tf.cast(results.step, tf.int64))
  return ()
    
with summary_writer.as_default():
  chain, _ = tfp.mcmc.sample_chain(kernel=kernel, current_state=0., num_results=200, trace_fn=trace_fn)
  
summary_writer.close()

There is a bit of an annoyance in that the summaries use the name scope of their call site as their name, which leaks a whole bunch of internal implementation details of sample_chain... I don't have a solution for this yet.

@janosh (Contributor, Author) commented Apr 10, 2019

@SiegeLordEx I found the same thing, creating summaries in trace_fn seems to work well. I also didn't notice any slow-down but I'll check that more carefully later. However, both with my own implementation and your code, I'm unable to open the summary in TensorBoard. In both cases tensorboard --logdir ./tmp/summary_chain throws

Exception in thread Reloader:
AttributeError: module 'tensorflow._api.v2.compat.v1' has no attribute 'pywrap_tensorflow'

followed by

W0410 17:26:13.712886 123145489154048 core_plugin.py:172] Unable to get first event timestamp for run .: No event timestamp could be found

and an empty TB dashboard. I'm running the latest tb-nightly. Any ideas what's causing this?

@janosh (Contributor, Author) commented Apr 10, 2019

@brianwa84 That's a great suggestion. I'll try that as soon as I have a working implementation.

@SiegeLordEx (Member) commented Apr 10, 2019

@janosh Not sure, my TensorBoard works okay. I'd try things out without TFP, just:

summary_writer = tf.compat.v2.summary.create_file_writer(...)
with summary_writer.as_default():
  tf.compat.v2.summary.scalar(...)
summary_writer.close()

And make sure that works. Maybe it's just some TF2 incompatibility nonsense which has nothing to do with TFP.

@janosh (Contributor, Author) commented Apr 10, 2019

Same problem without tfp. I'll file another issue in the main repo.

@janosh closed this as completed Apr 12, 2019

@janosh (Contributor, Author) commented Apr 12, 2019

@brianwa84 What would be the best way of resuming the calculation? Just pass the last state of the previous run into the next one and then concatenate the results of all runs for final diagnostics? E.g.

hmc_kernel = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn, step_size=step_size, num_leapfrog_steps=num_leapfrog_steps
)
adaptive_kernel = tfp.mcmc.SimpleStepSizeAdaptation(
    hmc_kernel, num_adaptation_steps=num_adaptation_steps
)

chain1, (_, kernel_results1) = tfp.mcmc.sample_chain(
    kernel=adaptive_kernel,
    current_state=current_state,
    num_results=num_results,
    num_steps_between_results=num_steps_between_results,
    trace_fn=partial(trace_fn, summary_freq=5),
)

# Some mid-execution diagnostics

chain2, (_, kernel_results2) = tfp.mcmc.sample_chain(
    kernel=adaptive_kernel,
    current_state=chain1[-1],
    num_results=num_results,
    num_steps_between_results=num_steps_between_results,
    trace_fn=partial(trace_fn, summary_freq=5),
)

chain = tf.concat((chain1, chain2), 0)

But then how to merge the kernel results kernel_results1 and kernel_results2? They are each result structures (from SimpleStepSizeAdaptation), and it appears as though I would have to merge their attributes like adaptation_rate, new_step_size, inner_results.is_accepted, inner_results.log_accept_ratio, etc. individually. That seems like a lot of manual work and not so much like "supported well", so I'm guessing I'm doing something wrong?

@janosh reopened this Apr 12, 2019

@brianwa84 (Contributor)
Something like that:

state, kernel_results = tfp.mcmc.sample_chain(
    kernel=adaptive_kernel,
    current_state=current_state,
    num_results=num_results,
    num_steps_between_results=num_steps_between_results,
    trace_fn=partial(trace_fn, summary_freq=5),
)
chain1, (_, kernel_results1) = state, kernel_results

# Some mid-execution diagnostics
state, kernel_results = tfp.mcmc.sample_chain(
    kernel=adaptive_kernel,
    current_state=state[-1],  # or tf.[compat.v2.]nest.map_structure(lambda x: x[-1], state)
    previous_kernel_results=kernel_results,   # This line is new.
    num_results=num_results,
    num_steps_between_results=num_steps_between_results,
    trace_fn=partial(trace_fn, summary_freq=5),
)
chain2, (_, kernel_results2) = state, kernel_results

chain = tf.concat((chain1, chain2), 0)

@brianwa84 (Contributor)

Re: how to merge the kernel results
You can use tf.nest.map_structure to map tf.concat over everything in there.
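For instance (untested sketch), concatenating the two kernel-results structures from above field by field:

merged_kernel_results = tf.nest.map_structure(
    lambda a, b: tf.concat((a, b), axis=0),
    kernel_results1, kernel_results2)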

@brianwa84 (Contributor)

@SiegeLordEx should what I put above work?

@SiegeLordEx (Member) commented Apr 12, 2019

Thanks @brianwa84. Yes, it's something like that. Here's a 'loop' version of the above:

kernel_results = kernel.bootstrap_results(current_state)
chain_blocks = []
trace_blocks = []
for i in range(num_blocks):
    chain, trace, kernel_results = tfp.mcmc.sample_chain(
        kernel=kernel,
        current_state=current_state,
        previous_kernel_results=kernel_results,
        num_results=num_results,
        trace_fn=...,
        return_final_kernel_results=True,
    )

    # Do your partial analysis here.

    current_state = tf.nest.map_structure(lambda x: x[-1], chain)
    chain_blocks.append(chain)
    trace_blocks.append(trace)

full_chain = tf.nest.map_structure(lambda *parts: tf.concat(parts, axis=0), *chain_blocks)
full_trace = tf.nest.map_structure(lambda *parts: tf.concat(parts, axis=0), *trace_blocks)

# full_chain/full_trace now contain num_blocks * num_results elements.

@janosh (Contributor, Author) commented Apr 12, 2019

@SiegeLordEx Why do you need kernel_results = kernel.bootstrap_results(current_state)? Wouldn't kernel_results = None work?

Also, what's the advantage of

current_state = tf.nest.map_structure(lambda x: x[-1], chain)

over

current_state = chain[-1]

@SiegeLordEx (Member)

kernel_results = None will work, but I wanted to illustrate the loop such that it has no data-dependent Python control flow in it (no special case for the first iteration). Eschewing Python control flow lets us use tf.function efficiently to speed up that computation. It's a minor point as far as the example goes, but it's just more natural to me to write it that way.

tfp.mcmc supports list-valued chain states, so current_state might actually be a list of Tensors, each of which needs to be indexed separately. It's just a bit more general that way.
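To make the tf.function point concrete, here is an untested sketch that compiles the per-block sampling step once and reuses it across blocks (kernel, current_state, num_results, and num_blocks as in the loop example above):

@tf.function
def sample_block(current_state, previous_kernel_results):
  return tfp.mcmc.sample_chain(
      kernel=kernel,
      current_state=current_state,
      previous_kernel_results=previous_kernel_results,
      num_results=num_results,
      trace_fn=lambda state, results: (),
      return_final_kernel_results=True,
  )

kernel_results = kernel.bootstrap_results(current_state)
chain_blocks = []
for _ in range(num_blocks):
  chain, _, kernel_results = sample_block(current_state, kernel_results)
  current_state = tf.nest.map_structure(lambda x: x[-1], chain)
  chain_blocks.append(chain)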

@viotemp1 commented Apr 6, 2020

For logging the loss to TensorBoard:

################################################################
# Assumes: from tensorflow import summary, name_scope
def write_TB_metrics(metric={}, step=0, metrics_file_writer=None):
    with metrics_file_writer.as_default():
        with name_scope(tb_metrics_name_scope):
            for key in metric.keys():
                value = metric[key]
                summary.scalar(key, value, step=step)
    metrics_file_writer.flush()

metrics_file_writer = summary.create_file_writer(LOG_DIR_METRICS)
################################################################
#@tf.function()
def trace_fn(traceable_quantities):
    if write_metrics_tb:
        write_TB_metrics(
            metric={'loss': traceable_quantities.loss},
            step=traceable_quantities.step,
            metrics_file_writer=metrics_file_writer,
        )
    #print("step", traceable_quantities.step)
    #print("loss", traceable_quantities.loss)
    return traceable_quantities.loss
################################################################
...
loss_curve = tfp.vi.fit_surrogate_posterior(
    target_log_prob_fn=target_log_prob_fn,
    surrogate_posterior=variational_posteriors,
    optimizer=optimizer,
    num_steps=num_variational_steps,
    trace_fn=trace_fn,
    seed=42,
)

[Screenshot: loss curve displayed in TensorBoard]

@merplumander

About resuming:

I had hoped that, when setting random seeds, resuming and running the full chain from the beginning would produce the same results, but it doesn't. Is this expected behavior or am I doing something wrong?

Here's a minimal example building on the code that @SiegeLordEx provided (Python 3.6.5; tensorflow==2.3.1; tensorflow-probability==0.11.1):

def target_log_prob(x):
    return -x - x ** 2.0


# A minimal trace_fn, as in the earlier examples:
def trace_fn(state, results):
    return ()


current_state = 1.0
tf.random.set_seed(0)
kernel = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn=target_log_prob, step_size=0.01, num_leapfrog_steps=5
)
kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
    kernel, num_adaptation_steps=0
)

kernel_results = kernel.bootstrap_results(current_state)
chain_blocks = []
for i in range(2):
    chain, trace, kernel_results = tfp.mcmc.sample_chain(
        num_results=3,
        current_state=current_state,
        previous_kernel_results=kernel_results,
        trace_fn=trace_fn,
        return_final_kernel_results=True,
        kernel=kernel,
    )

    current_state = tf.nest.map_structure(lambda x: x[-1], chain)
    chain_blocks.append(chain)


full_chain = tf.nest.map_structure(
    lambda *parts: tf.concat(parts, axis=0), *chain_blocks
)
full_chain
==> <tf.Tensor: shape=(6,), dtype=float32, numpy=
array([ 0.95076746,  0.12316042,  0.5397935 , -0.21367444, -0.21657643,
       -1.0244453 ], dtype=float32)>

# Let's do it all again but now without a break in between:

current_state = 1.0
tf.random.set_seed(0)
kernel = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn=target_log_prob, step_size=0.01, num_leapfrog_steps=5
)
kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
    kernel, num_adaptation_steps=0
)

kernel_results = kernel.bootstrap_results(current_state)
chain_blocks = []
chain, trace, kernel_results = tfp.mcmc.sample_chain(
    num_results=6,
    current_state=current_state,
    previous_kernel_results=kernel_results,
    trace_fn=trace_fn,
    return_final_kernel_results=True,
    kernel=kernel,
)

chain_blocks.append(chain)

full_chain = tf.nest.map_structure(
    lambda *parts: tf.concat(parts, axis=0), *chain_blocks
)
full_chain
==> <tf.Tensor: shape=(6,), dtype=float32, numpy=
array([0.95076746, 0.12316042, 0.5397935 , 1.1745309 , 0.37639475,
       0.19865556], dtype=float32)>

So the two chains produce the same samples up to step three (as they must since I set a random seed), but produce different samples after resuming. Is there a way to make these two produce equivalent results by setting some internal seeds?

Any feedback is appreciated :)
