
A demo is 1.5x faster in Flux than in TensorFlow when both use the CPU, but 3.0x slower when using CUDA #1694

Closed
deahhh opened this issue Aug 19, 2021 · 5 comments


@deahhh

deahhh commented Aug 19, 2021

As the title says, I use Julia 1.6.2, TensorFlow 2.3.0, and CUDA 11.0. My code is as follows:

Flux:

using Flux
using CUDA

data = randn(Float32, 2, 100000) |> gpu
y = reshape(sin.(data[1, :] .* data[2, :]), (1, size(data)[2])) |> gpu
model = Chain(
    Dense(2, 10, relu),
    Dense(10, 10, relu),
    Dense(10, 10, relu),
    Dense(10, 10, relu),
    Dense(10, 10, relu),
    Dense(10, 10, relu),
    Dense(10, 10, relu),
    Dense(10, 1),
) |> gpu
opt = ADAM(0.001, (0.9, 0.999))
loss(x, y) = Flux.Losses.mse(model(x), y)
ps = Flux.params(model)
dl = Flux.DataLoader((data, y), batchsize=500, shuffle=true) |> gpu
Flux.@epochs 100 Flux.train!(loss, ps, dl, opt; cb = Flux.throttle(() -> @show(loss(data, y)), 10))
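
For reference, the per-epoch time on the Julia side can be measured the same way as in the Python script below; this is only a rough sketch of such a measurement, not part of the original run:

# Rough overall timing, analogous to the time.time() measurement in the Python version.
t0 = time()
Flux.@epochs 100 Flux.train!(loss, ps, dl, opt)
println("average time per epoch: ", (time() - t0) / 100, " seconds")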

TensorFlow:

def test_tf():
    import tensorflow as tf
    import numpy as np
    from tensorflow import keras
    # tf.config.experimental.set_visible_devices(gpu[0], 'GPU')
    with tf.device("/gpu:0"):
        model = tf.keras.Sequential([
            keras.layers.Dense(units=10, activation='relu', input_shape=[2]),
            keras.layers.Dense(units=10, activation='relu'),
            keras.layers.Dense(units=10, activation='relu'),
            keras.layers.Dense(units=10, activation='relu'),
            keras.layers.Dense(units=10, activation='relu'),
            keras.layers.Dense(units=10, activation='relu'),
            keras.layers.Dense(units=10, activation='relu'),
            keras.layers.Dense(units=1),
        ])
        model.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mean_squared_error")
        xs = np.random.randn(100000, 2).astype(np.float32)
        ys = np.sin(xs[:, 0] * xs[:, 1]).astype(np.float32)
        model.fit(xs, ys, epochs=100, batch_size=500)

if __name__ == "__main__":
    import time
    t0 = time.time()
    test_tf()
    print("average time per epoch is {}".format((time.time() - t0) / 100))

@DhairyaLGandhi
Member

You'd want to avoid globals, and perhaps turn off the logging. The output of CUDA.versioninfo() would be good to know too.
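
A rough sketch of what avoiding the globals could look like (the function and its name here are only illustrative, not an official pattern):

# Wrapping the loop in a function keeps `model`, `dl`, and `opt` out of untyped global scope,
# so the loss closure no longer captures globals.
function train_demo(model, dl, opt; epochs = 100)
    ps = Flux.params(model)
    loss(x, y) = Flux.Losses.mse(model(x), y)
    for _ in 1:epochs
        Flux.train!(loss, ps, dl, opt)
    end
end

CUDA.versioninfo()  # please include this output in the report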

If you want to amortize the cost of copying data to the GPU with every iteration, which can add up quickly and silently, you may want to check out #1530
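
For example, a sketch that keeps the raw data on the CPU and moves only one batch at a time to the device (CuIterator comes from CUDA.jl; this assumes the data starts on the CPU):

cpu_data = randn(Float32, 2, 100000)
cpu_y = reshape(sin.(cpu_data[1, :] .* cpu_data[2, :]), 1, :)
dl = Flux.DataLoader((cpu_data, cpu_y), batchsize = 500, shuffle = true)
# CuIterator uploads each batch to the GPU as it is consumed and frees it afterwards,
# instead of piping the whole DataLoader through `gpu`.
for (x, y) in CUDA.CuIterator(dl)
    Flux.train!(loss, ps, [(x, y)], opt)
end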

Besides that, these questions may be better suited to the JuliaLang Slack or Discourse, perhaps?

@ToucheSir
Member

ToucheSir commented Aug 19, 2021

Yes, there is a lot we can talk about, but the issue tracker isn't a great place for it. Please open a thread on Discourse (see https://discourse.julialang.org/t/psa-how-to-quote-code-with-backticks/7530 about formatting) and we can pick up there.

@deahhh
Author

deahhh commented Aug 20, 2021

@DhairyaLGandhi
Thanks for your quick answers. I think I know what the problem is: in Julia the parallelism is controlled programmatically, while in Python much of the work happens in precompiled binary packages that already run in parallel in the background.
What I need to do is provide a hook so the training loop runs in parallel, and pin the host memory to the corresponding region of GPU memory to reduce the data-preparation time.

Also, when I test "X = randn(Float32, 2000, 100000) |> gpu", the memory for X seems to be allocated in local (host) memory, not GPU memory.
Thanks again.
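
A quick way to check where such an array actually lives (a sketch relying on standard CUDA.jl queries):

X = randn(Float32, 2000, 100000) |> gpu
@show typeof(X)       # a CuArray type indicates the data is on the device
CUDA.memory_status()  # reports how much device memory the pool is using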

@deahhh
Author

deahhh commented Aug 20, 2021

@ToucheSir
Thanks, I will open a thread on Discourse next time.

@ToucheSir
Member

Flux layers should be exploiting parallelism through BLAS and other libraries as well, so I don't believe that is the culprit. Anyhow, the offer still stands to open a Discourse thread if you're having trouble closing the performance gap. I can provide a whole laundry list of recommendations :)
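
For instance, a quick sanity check that CPU-side parallelism is actually available (a sketch using standard Julia/BLAS queries; BLAS.get_num_threads needs Julia 1.6 or newer):

using LinearAlgebra
@show Threads.nthreads()        # number of Julia threads
@show BLAS.get_num_threads()    # threads used by the matrix multiplications inside Dense layers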
