
TensorBoard in TFP and TF v2 #356

Closed · janosh opened this issue Apr 8, 2019 · 19 comments

@janosh (Contributor) commented Apr 8, 2019

There don't appear to be any docs on how to use TensorBoard with TensorFlow Probability. I'm specifically interested in a guide for the 2.0 release. Is this planned or am I missing something?

@csuter (Member) commented Apr 9, 2019

We don't have any explicit TB features in TFP, but you should be able to monitor anything you're interested in using tf.summary and friends. You can pass any Tensor you want to those.

Is there something in particular you're trying to do? Maybe we can help a bit with idioms.
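For instance, a minimal sketch of the TF2-style summary API (untested; the log directory and some_tensor are placeholders):

writer = tf.compat.v2.summary.create_file_writer('/tmp/tfp_logs')
with writer.as_default():
  tf.compat.v2.summary.scalar('target_log_prob', some_tensor, step=0)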

@janosh (Contributor, Author) commented Apr 9, 2019

Yes, I'm trying to monitor the progress and final results of training a Bayesian NN with HMC. I tried writing a trace_fn and passing that to tfp.mcmc.sample_chain, i.e. something like

def trace_fn(weights, kernel_results):
    print("weights", weights)
    print("kernel_results", kernel_results)

@tf.function
def run_hmc(
    num_results=100,
    num_burnin_steps=0,
    step_size=0.01,
    current_state=get_initial_state(),
    num_steps_between_results=0,
):
    hmc_kernel = tfp.mcmc.SimpleStepSizeAdaptation(
        tfp.mcmc.HamiltonianMonteCarlo(
            target_log_prob_fn=joint_log_prob_fn,
            num_leapfrog_steps=2,
            step_size=step_size,
            state_gradients_are_stopped=True,
        ),
        num_adaptation_steps=num_results + num_burnin_steps,
    )
    weights, kernel_results = tfp.mcmc.sample_chain(
        num_results=num_results,
        num_burnin_steps=num_burnin_steps,
        current_state=current_state,
        kernel=hmc_kernel,
        trace_fn=trace_fn,
    )
    print("Acceptance rate:", kernel_results.inner_results.is_accepted.numpy().mean())

but whatever signature I use or action I take in that function, it causes the whole operation to come crashing down. Some docs or guidance on this would be much appreciated!

@csuter (Member) commented Apr 9, 2019

Ah yeah, maybe this is a documentation bug -- check the docs on trace_fn in sample_chain and let me know if you think we could improve the verbiage there.

Basically, trace_fn gets to look at the current chain states and "kernel results" structures at each step, and decide which values to create traces of. These traces are what are returned in the kernel_results return value from sample_chain. So, e.g. if you wanted to keep track of is_accepted, but throw away everything else, you could do

def trace_fn(current_state, kernel_results):
  return kernel_results.inner_results.is_accepted

weights, kernel_results = tfp.mcmc.sample_chain(...)

kernel_results would then be a single Tensor with shape [num_results], containing the value of is_accepted at each of the num_results steps at which a result was computed.

You can also return more complicated nested structures (tuples, namedtuples, dicts [I think...]) from trace_fn.
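For instance, a trace_fn returning a dict might look like this (untested sketch; the field names follow the SimpleStepSizeAdaptation-wrapped HMC kernel used above):

def trace_fn(current_state, kernel_results):
  return {
      'is_accepted': kernel_results.inner_results.is_accepted,
      'log_accept_ratio': kernel_results.inner_results.log_accept_ratio,
      'step_size': kernel_results.new_step_size,
  }

Each entry in the returned structure is then stacked along a leading [num_results] axis in the second return value of sample_chain.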

I guess you could also make calls to tf.summary in that function (I'm not sure this won't badly degrade performance), but you do need to return a valid Tensor (or structure of Tensors), otherwise there'll definitely be some crashiness like you're seeing.

@SiegeLordEx may have something to add to what I've said.

@SiegeLordEx (Member)

What @csuter said is correct. Indeed, if you want to track your weights over time on TensorBoard, you'd place tf.summary calls inside trace_fn, something like this (untested):

def trace_fn(weights, results):
  with tf.compat.v2.summary.record_if(tf.equal(results.step % 100, 0)):
    tf.compat.v2.summary.histogram("weights", weights, step=tf.cast(results.step, tf.int64))
  return ()

Note how I set it up to record every 100 steps, for efficiency, but you can do whatever suits your needs.

It might also make sense to run sample_chain without summaries, and then iterate over the return values of sample_chain (I can imagine this playing nicer on the GPU), but obviously you'd lose the in-progress display of your statistics.
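That post-hoc variant might look like this (untested sketch, reusing hmc_kernel, current_state, num_results, and a summary writer from the surrounding examples):

weights, _ = tfp.mcmc.sample_chain(
    kernel=hmc_kernel,
    current_state=current_state,
    num_results=num_results,
    trace_fn=lambda state, results: (),
)
with summary_writer.as_default():
  for step in range(int(weights.shape[0])):
    tf.compat.v2.summary.histogram('weights', weights[step], step=step)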

@brianwa84 (Contributor) commented Apr 10, 2019 via email

@SiegeLordEx (Member) commented Apr 10, 2019

That's true only of V1 summaries; V2 summaries are just regular ops with a side effect of writing to a file. Here's a complete working example:

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

dist = tfd.Normal(0., 1.)

kernel = tfp.mcmc.SimpleStepSizeAdaptation(
    tfp.mcmc.HamiltonianMonteCarlo(
        dist.log_prob, step_size=0.1, num_leapfrog_steps=3),
    num_adaptation_steps=100)

summary_writer = tf.compat.v2.summary.create_file_writer('/tmp/summary_chain', flush_millis=10000)

def trace_fn(state, results):
  with tf.compat.v2.summary.record_if(tf.equal(results.step % 10, 1)):
    tf.compat.v2.summary.scalar("state", state, step=tf.cast(results.step, tf.int64))
  return ()
    
with summary_writer.as_default():
  chain, _ = tfp.mcmc.sample_chain(kernel=kernel, current_state=0., num_results=200, trace_fn=trace_fn)
  
summary_writer.close()

There is a bit of an annoyance in that the summaries use the name scope of their call site as their name, which leaks a whole bunch of internal implementation details of sample_chain... I don't have a solution for this yet.

@janosh (Contributor, Author) commented Apr 10, 2019

@SiegeLordEx I found the same thing, creating summaries in trace_fn seems to work well. I also didn't notice any slow-down but I'll check that more carefully later. However, both with my own implementation and your code, I'm unable to open the summary in TensorBoard. In both cases tensorboard --logdir ./tmp/summary_chain throws

Exception in thread Reloader:
AttributeError: module 'tensorflow._api.v2.compat.v1' has no attribute 'pywrap_tensorflow'

followed by

W0410 17:26:13.712886 123145489154048 core_plugin.py:172] Unable to get first event timestamp for run .: No event timestamp could be found

and an empty TB dashboard. I'm running the latest tb-nightly. Any ideas what's causing this?

@janosh (Contributor, Author) commented Apr 10, 2019

@brianwa84 That's a great suggestion. I'll try that as soon as I have a working implementation.

@SiegeLordEx (Member) commented Apr 10, 2019

@janosh Not sure, my TensorBoard works okay. I'd try things out without TFP, just:

summary_writer = tf.compat.v2.summary.create_file_writer(...)
with summary_writer.as_default():
  tf.compat.v2.summary.scalar(...)
summary_writer.close()

And make sure that works. Maybe it's just some TF2 incompatibility nonsense which has nothing to do with TFP.

@janosh (Contributor, Author) commented Apr 10, 2019

Same problem without tfp. I'll file another issue in the main repo.

@janosh closed this as completed Apr 12, 2019

@janosh (Contributor, Author) commented Apr 12, 2019

@brianwa84 What would be the best way of resuming the calculation? Just pass the last state of the previous run into the next one and then concatenate the results of all runs for final diagnostics? E.g.

hmc_kernel = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn, step_size=step_size, num_leapfrog_steps=num_leapfrog_steps
)
adaptive_kernel = tfp.mcmc.SimpleStepSizeAdaptation(
    hmc_kernel, num_adaptation_steps=num_adaptation_steps
)

chain1, (_, kernel_results1) = tfp.mcmc.sample_chain(
    kernel=adaptive_kernel,
    current_state=current_state,
    num_results=num_results,
    num_steps_between_results=num_steps_between_results,
    trace_fn=partial(trace_fn, summary_freq=5),
)

# Some mid-execution diagnostics

chain2, (_, kernel_results2) = tfp.mcmc.sample_chain(
    kernel=adaptive_kernel,
    current_state=chain1[-1],
    num_results=num_results,
    num_steps_between_results=num_steps_between_results,
    trace_fn=partial(trace_fn, summary_freq=5),
)

chain = tf.concat((chain1, chain2), 0)

But then how to merge the kernel results kernel_results1 and kernel_results2? They are each result structures (from SimpleStepSizeAdaptation), and it appears as though I would have to merge their attributes like adaptation_rate, new_step_size, inner_results.is_accepted, inner_results.log_accept_ratio, etc. individually. That seems like a lot of manual work and not so much like "supported well", so I'm guessing I'm doing something wrong?

@janosh reopened this Apr 12, 2019

@brianwa84 (Contributor)
Something like that:

state, kernel_results = tfp.mcmc.sample_chain(
    kernel=adaptive_kernel,
    current_state=current_state,
    num_results=num_results,
    num_steps_between_results=num_steps_between_results,
    trace_fn=partial(trace_fn, summary_freq=5),
)
chain1, (_, kernel_results1) = state, kernel_results

# Some mid-execution diagnostics
state, kernel_results = tfp.mcmc.sample_chain(
    kernel=adaptive_kernel,
    current_state=state[-1],  # or tf.[compat.v2.]nest.map_structure(lambda x: x[-1], state)
    previous_kernel_results=kernel_results,   # This line is new.
    num_results=num_results,
    num_steps_between_results=num_steps_between_results,
    trace_fn=partial(trace_fn, summary_freq=5),
)
chain2, (_, kernel_results2) = state, kernel_results

chain = tf.concat((chain1, chain2), 0)

@brianwa84 (Contributor)

Re: how to merge the kernel results
You can use tf.nest.map_structure to map tf.concat over everything in there.
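For instance (untested sketch), concatenating the two kernel-results structures from above field by field:

merged_kernel_results = tf.nest.map_structure(
    lambda a, b: tf.concat((a, b), axis=0),
    kernel_results1, kernel_results2)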

@brianwa84 (Contributor)

@SiegeLordEx should what I put above work?

@SiegeLordEx (Member) commented Apr 12, 2019

Thanks @brianwa84. Yes, it's something like that. Here's a 'loop' version of the above:

kernel_results = kernel.bootstrap_results(current_state)
chain_blocks = []
trace_blocks = []
for i in range(num_blocks):
    chain, trace, kernel_results = tfp.mcmc.sample_chain(
        kernel=kernel,
        current_state=current_state,
        previous_kernel_results=kernel_results,
        num_results=num_results,
        trace_fn=...,
        return_final_kernel_results=True,
    )

    # Do your partial analysis here.

    current_state = tf.nest.map_structure(lambda x: x[-1], chain)
    chain_blocks.append(chain)
    trace_blocks.append(trace)

full_chain = tf.nest.map_structure(lambda *parts: tf.concat(parts, axis=0), *chain_blocks)
full_trace = tf.nest.map_structure(lambda *parts: tf.concat(parts, axis=0), *trace_blocks)

# full_chain/full_trace now contain num_blocks * num_results elements.

@janosh (Contributor, Author) commented Apr 12, 2019

@SiegeLordEx Why do you need kernel_results = kernel.bootstrap_results(current_state)? Wouldn't kernel_results = None work?

Also, what's the advantage of

current_state = tf.nest.map_structure(lambda x: x[-1], chain)

over

current_state = chain[-1]

@SiegeLordEx (Member)

kernel_results = None will work, but I wanted to illustrate the loop such that it has no data-dependent Python control flow in it (no special case for the first iteration). Eschewing Python control flow lets us use tf.function efficiently to speed up that computation. It's a minor point as far as the example goes, but it's just more natural to me to write it that way.

tfp.mcmc supports list-valued chain states, so current_state might actually be a list of Tensors, each of which needs to be indexed separately. It's just a bit more general that way.
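To make the tf.function point concrete, here is an untested sketch that compiles the per-block sampling step once and reuses it across blocks (kernel, current_state, num_results, and num_blocks as in the loop example above):

@tf.function
def sample_block(current_state, previous_kernel_results):
  return tfp.mcmc.sample_chain(
      kernel=kernel,
      current_state=current_state,
      previous_kernel_results=previous_kernel_results,
      num_results=num_results,
      trace_fn=lambda state, results: (),
      return_final_kernel_results=True,
  )

kernel_results = kernel.bootstrap_results(current_state)
chain_blocks = []
for _ in range(num_blocks):
  chain, _, kernel_results = sample_block(current_state, kernel_results)
  current_state = tf.nest.map_structure(lambda x: x[-1], chain)
  chain_blocks.append(chain)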

@viotemp1 commented Apr 6, 2020

For logging the loss to TensorBoard:

################################################################
# Assumes: from tensorflow import summary, name_scope
def write_TB_metrics(metric={}, step=0, metrics_file_writer=None):
    with metrics_file_writer.as_default():
        with name_scope(tb_metrics_name_scope):
            for key in metric.keys():
                value = metric[key]
                summary.scalar(key, value, step=step)
    metrics_file_writer.flush()

metrics_file_writer = summary.create_file_writer(LOG_DIR_METRICS)
################################################################
#@tf.function()
def trace_fn(traceable_quantities):
    if write_metrics_tb:
        write_TB_metrics(
            metric={'loss': traceable_quantities.loss},
            step=traceable_quantities.step,
            metrics_file_writer=metrics_file_writer,
        )
    #print("step", traceable_quantities.step)
    #print("loss", traceable_quantities.loss)
    return traceable_quantities.loss
################################################################
...
loss_curve = tfp.vi.fit_surrogate_posterior(
    target_log_prob_fn=target_log_prob_fn,
    surrogate_posterior=variational_posteriors,
    optimizer=optimizer,
    num_steps=num_variational_steps,
    trace_fn=trace_fn,
    seed=42,
)

[Screenshot: loss curve displayed in TensorBoard]

@merplumander

About resuming:

I had hoped that, when setting random seeds, resuming and running the full chain from the beginning would produce the same results, but it doesn't. Is this expected behavior or am I doing something wrong?

Here's a minimal example building on the code that @SiegeLordEx provided (Python 3.6.5; tensorflow==2.3.1; tensorflow-probability==0.11.1):

def target_log_prob(x):
    return -x - x ** 2.0


# A minimal trace_fn, as in the earlier examples:
def trace_fn(state, results):
    return ()


current_state = 1.0
tf.random.set_seed(0)
kernel = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn=target_log_prob, step_size=0.01, num_leapfrog_steps=5
)
kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
    kernel, num_adaptation_steps=0
)

kernel_results = kernel.bootstrap_results(current_state)
chain_blocks = []
for i in range(2):
    chain, trace, kernel_results = tfp.mcmc.sample_chain(
        num_results=3,
        current_state=current_state,
        previous_kernel_results=kernel_results,
        trace_fn=trace_fn,
        return_final_kernel_results=True,
        kernel=kernel,
    )

    current_state = tf.nest.map_structure(lambda x: x[-1], chain)
    chain_blocks.append(chain)


full_chain = tf.nest.map_structure(
    lambda *parts: tf.concat(parts, axis=0), *chain_blocks
)
full_chain
==> <tf.Tensor: shape=(6,), dtype=float32, numpy=
array([ 0.95076746,  0.12316042,  0.5397935 , -0.21367444, -0.21657643,
       -1.0244453 ], dtype=float32)>

# Let's do it all again but now without a break in between:

current_state = 1.0
tf.random.set_seed(0)
kernel = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn=target_log_prob, step_size=0.01, num_leapfrog_steps=5
)
kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
    kernel, num_adaptation_steps=0
)

kernel_results = kernel.bootstrap_results(current_state)
chain_blocks = []
chain, trace, kernel_results = tfp.mcmc.sample_chain(
    num_results=6,
    current_state=current_state,
    previous_kernel_results=kernel_results,
    trace_fn=trace_fn,
    return_final_kernel_results=True,
    kernel=kernel,
)

chain_blocks.append(chain)

full_chain = tf.nest.map_structure(
    lambda *parts: tf.concat(parts, axis=0), *chain_blocks
)
full_chain
==> <tf.Tensor: shape=(6,), dtype=float32, numpy=
array([0.95076746, 0.12316042, 0.5397935 , 1.1745309 , 0.37639475,
       0.19865556], dtype=float32)>

So the two chains produce the same samples up to step three (as they must since I set a random seed), but produce different samples after resuming. Is there a way to make these two produce equivalent results by setting some internal seeds?

Any feedback is appreciated :)
