How to integrate returnn_common.nn into Sisyphus pipelines #63
I like this approach. A mistake in the building of the config which was seen too late can then be fixed by calling …
I don't see how this is relevant. The RETURNN version should not have any influence on the construction. Also the TensorFlow version should not have any influence on it.
We already discussed that here: rwth-i6/returnn_common#27
So, it's possible, although this would require some work. But also, it sounds a bit like an artificial problem, which we could just directly solve (just make the TF import fast), instead of putting lots of effort into working around it. We also already discussed how to solve this directly, i.e. how to make the TF import fast. There are many solutions to that.
I don't think the net construction really adds any noticeable overhead in runtime. This should not be the case. It should be just as fast as before. If this is not the case, we can easily fix that. There is no reason why this should not be fast. I also disagree that it is complex, but this is maybe subjective. Anyway, complexity does not imply slowness. Do you have any numbers on that? It really should not be slow. Not even with 2000 calls, and not even with 1M calls. If you think this is slow, again, we can fix this. Please open an issue on this with some numbers.
No, it cannot be influenced by the RETURNN version. Anyway, to your main proposal here: What I don't understand is: when you now have the construction code somewhere else (where exactly?), why do you need it at all as a separate job or task? Why not directly have this code in the RETURNN config and run it on-the-fly when RETURNN starts? What is the benefit of separating this? To be able to see the net dict more directly? So you make this single thing a bit easier (it would also be easy to dump it on-the-fly if you want to see it) while making the whole setup overall much more complicated?
Setting the other comments aside (because we do not need to fix this if we cannot work like this anyway).
What "before"? There is no "before". We constructed the network dicts by merging some pre-defined dictionaries together, which is of course much faster than the new construction logic. I just did the timing on one Transformer construction and it is definitely too slow (not in general, but to do it in the manager). Just one single construction was 2.3 seconds.
I am not sure how you read this out of my proposal, the location does not change at all.
This is one way of doing it, using it with the `ReturnnTrainFromFileJob`, which of course would make all of the mentioned issues non-existent. But this means you have limited interaction with Sisyphus, your own defined modules always have to be copy&pasted in there (unless they are part of returnn_common), and you can have no additional helpers. But this is in a way saying "do not use Sisyphus for anything RETURNN config related". Which is a valid approach, but not what should be discussed. So referring to the title of this issue, your answer is "don't do it".
Before = before you used `returnn_common.nn`. No, this is what I'm saying: it should not make any noticeable difference. Why should it? If there is really any noticeable slowdown, we can fix this. There is really no reason why there should be any noticeable slowdown due to the new construction. Even if you think it is complex.
What exactly did you measure? Did you maybe count the TensorFlow import? The construction itself really should not be slow.
Then I don't understand your proposal. You proposed to do the construction in a job? So that means the location is now somewhere different, i.e. in the job. If you don't mean that, what do you mean then?
I don't see the difference to having the construction in a job, what you proposed here. What exactly is the difference?
No, of course you don't need to copy&paste anything. Why? I don't understand. Also, again, I don't understand how this is different to your proposal here.
Sure you can have. Why not? And again I don't understand the difference to your proposal here.
No. At the moment I just try to understand how your proposal really has any benefit, or what you are actually proposing, and how that is different to what I'm saying.
It seems this topic is too complicated to be discussed in an issue here, as we are lacking some common ground; we should take this offline. In short:
Because executing hundreds of lines of code is certainly slower than merging dicts.
```python
with nn.NameCtx.new_root() as name_ctx_network:
    # build the model and run it once to construct the network
    net = BLSTMDownsamplingTransformerASR(...)
    out = net(...)
    out.mark_as_default_output()
    # apply weight decay to all parameters
    for param in net.parameters():
        param.weight_decay = 0.1

serializer = nn.ReturnnConfigSerializer(name_ctx_network)
base_string = serializer.get_base_extern_data_py_code_str()
network_string = serializer.get_ext_net_dict_py_code_str(net, ref_base_dims_via_global_config=True)
```
You cannot import from the recipes in a RETURNN config file. A job (as in: a generated job folder in the work dir) has to be independent of any outside influence that is not part of its inputs.
Just because the construction is called in a job does not mean the code that does the construction is in a different location (i.e. a different physical file).
I don't see why this issue here is complicated.
Sure, we can do that, but it would be good to have it in written form anyway here such that others can also join the discussion or understand decisions later on. So it would be anyway good if you could answer my questions. I probably will not be able to come to the office today in person. But probably tomorrow.
But this is what I mean about noticeable. Hundreds of lines of code should still execute in the nano-seconds range. This is still very fast. Unless sth is wrong.
This might involve imports. Otherwise, I don't see why this should not be fast. If there is anything slow, we can certainly fix this. Some profiling would be good.
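For instance, a quick way to separate the import time from the construction time itself could look like this (just a sketch; `construct()` stands in for the actual net construction, e.g. the snippet above):

```python
import time
import cProfile

t0 = time.perf_counter()
from returnn_common import nn  # noqa: E402  # measure the import separately from the construction
t1 = time.perf_counter()


def construct():
    # put the actual net construction here, e.g. the snippet from above
    ...


t2 = time.perf_counter()
construct()
t3 = time.perf_counter()

print(f"returnn_common import: {t1 - t0:.2f}s, net construction: {t3 - t2:.2f}s")

# to see where the construction time is actually spent:
cProfile.run("construct()", sort="cumtime")
```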
I don't understand. Sure you can import from the recipes, or from anywhere else. What stops you from doing that? Sure, a job is influenced by many other things, like the content of the filesystem, Sisyphus, Python, the env, the Python env, etc. The recipes can just be part of the Python env. I don't really understand your argument here. You say you cannot have helpers because of some artificial constraint you are making up that there should be no shared common code, i.e. there should be no helpers? So you say you cannot have helpers because there must not be helpers. But why? This does not make sense to me. I also still don't understand how the situation is different to what you propose. You propose to do the net construction inside a job, or not? So where does the code for the net construction come from?
Yes, exactly as I'm saying. You can do it as you want. It does not matter. Although I'm not really sure if you propose to change that or not, or where exactly you would want the code to be. What do you actually mean when you write "constructor location"? What changes is where the code is executed, e.g. inside a job or in Sisyphus, e.g. inside the net construction job or in the `ReturnnTrainingJob`. And again, how does the situation change when this is inside a new custom net construction job, or just directly in the `ReturnnTrainingJob`?
I called this repeatedly to make sure this is not the case.
All accessed content should be local in the job folder or from the input folders. Sure, if you manually edit those it will break, but this is not something commonly done, so the first thing can be excluded. Functions from Sisyphus are usually not (and should not be) used inside job tasks, except for path resolving. Python, yes, Python can have an influence, but people usually do not change their major Python version while working with the same setup folder. The env, yes, the env can change, but optimally (this is unfortunately not the case yet) you have a job for the env, and then this is not an issue. So all the things you listed are not something you alter on a daily basis. Your personal recipe code is edited on a daily basis, as this is what you work with. Importing that externally is a very bad idea.
I did not say that. This is what
The difference is that this job would need a different kind of parameter passing: it would need fixed parameters for returnn and returnn_common, and it would need to make a local copy of the files that were used to build the network.
I think I don't really understand what code you are talking about when you say net construction code, and where you would have this code. I was not talking about … In any case, wherever this code is located (it does not matter), what difference does it make whether it is executed in …?
I do not want to alter the
But also I do not fully know how a …
Why not? This could simply be an extension (if any change/extension is really needed, but see below). So, just because you don't want to alter it? But regardless, I don't think any change or extension is needed (see below).
Wrong. Sure you can do all of that. You can simply put it into the `ReturnnTrainingJob`. Also, I still don't know where you actually would have your construction code located. Inside recipes? Outside recipes? Where exactly?
I don't understand. In this issue here, you wrote:
So, I just gave this job a name. So, what do you mean? Do you propose to add such a job or not? Or what is the problem? Or do you say you don't know yet how to implement it?
Right. That is what I am saying all the time. Actually not just similar but exactly the same. That is why I don't understand why you want to have a separate job. I don't understand the advantage. I keep asking about this but so far you have not answered why it should be or must be a separate job, or why it can't be together in the `ReturnnTrainingJob`.
Ah ok. I thought this issue is not about discussing that, but about actually proposing to have it external. On the aspect whether the complete within-Sisyphus net construction is feasible: again, I don't see why not. Whatever you think is slow can (and should) be fixed. There is no reason why it should be slow. On the question of how to do it externally (no matter if inside the `ReturnnTrainingJob` or a separate job):
So you suggest to extend the
It is adding code and parameters to the job that, from my point of view, do not belong there. If there is an extra job, the interface for the Job and the …
I am raising the possibility this could be a solution (I am not sure myself). There is no problem, I just do not see any reason yet to start implementing this as long as there is no full concept to have in mind.
I think there is, but let's exclude that here. This is also only one aspect.
Somewhere under
One possibility of many would be providing a …
No, you also don't need an extension to
I don't understand. I did not say the job cannot do that. You said that, and I said this is wrong, because the job can do that, as I described.
About understandable/user-friendly: Sure, but this can easily be solved.
There is no change needed to
Ok. As said, this was not clear from the issue description to me. It sounds like you were proposing this specific solution.
I don't understand. I thought this is the one and only reason why you need this at all? Assume that this can be fixed. Then the whole discussion here in this issue is obsolete or not? As I understand you, you would anyway even prefer this?
So, inside the recipes. So it means the job accesses the recipes. But this contradicts what you wrote earlier that you don't want that?
I don't exactly understand. Can you be more specific? Maybe give an example? Python executable, you mean …?
We can leave the rest for now; it does not make sense to discuss it, but:
If there is a solution that would need no code changes at all, this would be a good starting point to do any experiments. So please tell me how this should work.
I don't understand. Surely all the rest is very relevant here as well, and very important to discuss? Especially, most importantly, maybe this whole issue here is actually obsolete, as I explained? And I thought this is the main question this issue is actually about, as you explained?
But I thought first the question is whether this is needed at all, and whether a within-Sisyphus construction is not possible? Now you seem to assume again that it is needed. There are many trivial ways how you can get whatever data/code with whatever dependencies into a job. But the more relevant question is, which I keep asking, which is still not clear to me: where exactly is the net construction code? And what do you want to pass exactly to the job? The code itself (a copy of it or so?), or a …? Maybe you can just give an example of that? Ignore the part of how exactly it ends up in the job.
Haven't read the whole discussion, but I did try to use returnn_common.nn with my setup and experienced similar problems @JackTemaki mentioned here. What I did was: 1) test the network construction via … Steps 1 & 2 could be done by some sort of … I like the idea of a net construction job; in my view this would solve the following problems:
In this case the output for the … I am not using returnn_common in my setup currently (due to some time constraints and not being too familiar with it).
Note from personal discussion with @JackTemaki: We came to the conclusion that such a net construction job is indeed not needed and we can directly do it in the `ReturnnTrainingJob`.

The RC code with the model definition would be inside the user's recipe dir, i.e. somewhere in the recipes. There was the suggestion to automatically copy the model definition file into the job work dir locally such that any modifications on it afterwards do not interfere with the job. But it's not clear whether this is a good idea, or whether this is even so simple, because the model definition might not be a single file but could also access other files from the recipes.

What remains are many small details. E.g. as said, we could pass a `Delayed` object, sth like:

```python
from typing import Any, Dict


class ReturnnCommonModelSerializer:
    def __init__(self, module_name: str, *, func_name: str = "main", func_kwargs: Dict[str, Any] = None):
        self.module_name = module_name
        self.func_name = func_name
        self.func_kwargs = func_kwargs or {}

    def get(self) -> str:
        """
        This is called by Delayed.get.
        """
        from returnn_common import nn
        import importlib
        module = importlib.import_module(self.module_name)
        func = getattr(module, self.func_name)
        root_module = func(**self.func_kwargs)
        return nn.get_returnn_config().get_complete_py_code_str(root_module)

    def py_code_direct(self) -> str:
        return "\n".join([
            "from returnn_common import nn",
            f"import {self.module_name}",
            f"root_module = {self.module_name}.{self.func_name}({', '.join(f'{k}={v!r}' for k, v in self.func_kwargs.items())})",
            "py_code = nn.get_returnn_config().get_complete_py_code_str(root_module)",
            "eval(py_code, globals())",
            ""
        ])
```

This is a very simplistic draft. It ignores pretraining for now. Also, maybe we should separate extern data from the model definition logic more anyway? Extern data might also partially be auto-generated from the dataset?

In this example, the hash would be defined via … The usage in Sis would maybe look like this:

```python
model_def = ReturnnCommonModelSerializer("i6_experiments.users.zeyer.model.my_best_model_123")
config = ReturnnConfig(..., python_epilog=[Delayed(model_def), ...])
train_job = ReturnnTrainingJob(config, ...)
```

Or if we don't want `Delayed`:

```python
model_def = ReturnnCommonModelSerializer("i6_experiments.users.zeyer.model.my_best_model_123")
config = ReturnnConfig(..., python_epilog=[model_def.py_code_direct(), ...])
train_job = ReturnnTrainingJob(config, ...)
```

This example would not copy the file. In the first case with `Delayed`, …

Again, as said, many details. It also depends on how your work-flow should look, e.g. during debugging. Currently (before RC) you could just go to the work dir and run …
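For illustration, with the module name from the usage example above and the defaults (`func_name="main"`, no kwargs), `py_code_direct()` would render to this epilog code:

```python
from returnn_common import nn
import i6_experiments.users.zeyer.model.my_best_model_123
root_module = i6_experiments.users.zeyer.model.my_best_model_123.main()
py_code = nn.get_returnn_config().get_complete_py_code_str(root_module)
eval(py_code, globals())  # draft detail: exec() rather than eval() would be needed for multi-statement code
```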
I don't understand what you need this for. You already constructed the code, so you already have it, so why do you need to obtain it so indirectly again?
@tbscode As I currently have other deadlines I postponed working on this, but the returnn_common integration will be available in 3-4 weeks I guess (after some more extensive testing by @Atticus1806 maybe). I already have partial code / concept on how to extend the above mentioned idea by @albertz to also support pre-training and using tk.Variables or tk.Paths as flexible parameters. Also note that …
Ok I see. Yes
It was convenient to use
I will not use …
I thought a bit further. Some loose thoughts: I want to decouple things more, like the … More specifically: I definitely want to decouple the … The model definition should also be decoupled from the …

I want to have a common model API for hybrid HMM and CTC-like setups, sth like:

```python
class Model(nn.Module):
    def __init__(self, out_dim: nn.Dim, in_dim: nn.Dim, **kwargs):
        ...

    @nn.scoped
    def __call__(self, x: nn.Tensor, *, in_spatial_dim: nn.Dim) -> Tuple[nn.Tensor, nn.Dim]:
        ...
        return y, out_spatial_dim
```

Or actually maybe just … I would expect that logits come out of this, and … Then separately I would have the loss definition. This could look like:

```python
# this is elsewhere:
model = Model(out_dim=output_dim + 1, in_dim=input_dim)  # +1 for blank
inputs = nn.get_extern_data(nn.Data("data", dim_tags=[nn.batch_dim, time_dim, input_dim]))
logits, out_spatial_dim = model(inputs, in_spatial_dim=time_dim)
targets = nn.get_extern_data(nn.Data("classes", dim_tags=[nn.batch_dim, targets_time_dim], sparse_dim=output_dim))

# loss:
loss = nn.ctc_loss(logits=logits, targets=targets)
loss.mark_as_loss()
decoded, decoded_spatial_dim = nn.ctc_greedy_decode(logits, in_spatial_dim=out_spatial_dim)
error = nn.edit_distance(a=decoded, a_spatial_dim=decoded_spatial_dim, b=targets, b_spatial_dim=targets_time_dim)
error.mark_as_loss(as_error=True, custom_inv_norm_factor=nn.length(targets, axis=targets_time_dim))
```

The model (network) def maybe could be split further, e.g. by having some preprocessing like SpecAugment. All that would go into the …

Maybe I would introduce specifically:

```python
class NonhashedCode:
    def __init__(self, code: str):
        self.code = code
```

And then it could look like:

```python
# Define some training exp:
extern_data = ReturnnCommonModelSerializer(...)
model = ReturnnCommonModelSerializer(...)
boilerplate1 = NonhashedCode(...)
loss = ReturnnCommonModelSerializer(...)
boilerplate2 = NonhashedCode(...)
epilog = [extern_data, model, boilerplate1, loss, boilerplate2]


# common function:
def train(epilog, version=1):
    epilog_hash = (version,) + tuple(obj for obj in epilog if not isinstance(obj, NonhashedCode))
    epilog = [obj.py_code_direct() if isinstance(obj, ReturnnCommonModelSerializer) else obj for obj in epilog]
    epilog = [obj.code if isinstance(obj, NonhashedCode) else obj for obj in epilog]
    config = ReturnnConfig(
        ...,
        python_epilog=epilog, python_epilog_hash=epilog_hash,
        ...)
    job = ReturnnTrainingJob(config, ...)
    ...
```

There are still some further details to be clarified, or maybe slightly fixed in the suggestions above.
This is not too far away from my current code, just some comments: The data handling obviously needs to be separated from the model construction, but be aware that calling … The manual handling of epilog/epilog_hash is also not needed; this should be done automatically by making … So basically the code can be similar to what you wrote, just the three lines after …
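The mechanism is elided above, but one way such "automatic" handling could look is sketched below, assuming that Sisyphus's `sis_hash_helper` honors a `_sis_hash` method and that the epilog objects derive from `DelayedBase`; the class names here are made up for illustration:

```python
from sisyphus.delayed_ops import DelayedBase
from sisyphus.hash import sis_hash_helper


class SerializerObject(DelayedBase):
    """Anything that can go into python_epilog; get() returns the serialized code."""

    use_for_hash = True  # boilerplate subclasses would set this to False

    def __init__(self):
        super().__init__(None)

    def get(self) -> str:
        raise NotImplementedError


class Collection(DelayedBase):
    """Wraps a list of serializer objects and defines its own Sisyphus hash."""

    def __init__(self, objects):
        super().__init__(None)
        self.objects = objects

    def get(self) -> str:
        return "\n".join(obj.get() for obj in self.objects)

    def _sis_hash(self):
        # only hash-relevant objects enter the hash; no manual epilog_hash handling needed
        return sis_hash_helper([obj for obj in self.objects if obj.use_for_hash])
```

With sth like this, `python_epilog=[Collection([...])]` would be enough, and adding or changing pure boilerplate objects would not change the job hash.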
Right. But this would only be for the extern data part.
It depends on how you pass it. When you pass it as …
This was intended as a convention to be used specifically for the boilerplate code in … Btw, according to @michelwi in rwth-i6/i6_core#266 (comment), I anyway could not set …
This is the only valid solution I think; otherwise it will be a little bit messy not to treat all "objects" passed to the epilog the same. At least I would not understand why some things have to be converted to a string outside and others can be passed as is.
While this is certainly not the way it was intended, this is fine to change. I will spend Monday with @Atticus1806 to fully build and test a pipeline that covers all your and our needs, which are:
Optionally we can look at:
So far I do not see any conflict in having a solution that works for all. Well, except maybe for the network hashing part, but I think this is something we can live with not having for now; everyone just has to be careful with his/her … @albertz please intervene here if something is unclear or wrong in your eyes. The only thing that is not fully clear is how some of the more independent dataset/datastream helpers are going to be integrated, but this is only about personal structure/style preference and not about any technical limitation.
Well, there is at least the difference that some of these are "boilerplate" which should not influence the hash in any way, by intention, and then the others which influence the hash, but only in the explicitly defined way, e.g. via an explicit …
It's still not totally clear to me how you map this to the model inputs and the loss. Maybe we would have the standard case of just
I want the boilerplate behavior to be just the same as if it would not be there. I want to be able to add further such objects into it without any change in the hash. So this means this is different to a boilerplate object where
Sure. We should put any such relevant helpers then maybe to …
We now have something (partially) working, but I am quite sure that in some details this does not meet exactly what you imagined, so we should continue the discussion. My current `get_config` looks roughly like this:

```python
def get_config(
        returnn_common_root,
        training_datasets,
        **kwargs):
    """
    :param tk.Path returnn_common_root:
    :param TrainingDatasets training_datasets:
    """
    # changing these does not change the hash
    post_config = {
        [....],
        'debug_print_layer_output_template': True,
    }
    config = {
        [...]
        'optimizer': {'class': 'Adam', 'epsilon': 1e-8},
        'accum_grad_multiple_step': 2,
        'batch_size': 10000,
        'max_seqs': 200,
        [...]
    }

    from i6_experiments.users.rossenbach.returnn.nnet_constructor import ReturnnCommonSerializer,\
        ReturnnCommonExternData, ReturnnCommonDynamicNetwork, NonhashedCode

    network_file = Path("get_transformer_network.py")

    extern_data = [
        datastream.as_nnet_constructor_data(key) for key, datastream in training_datasets.datastreams.items()]

    config["train"] = training_datasets.train.as_returnn_opts()
    config["dev"] = training_datasets.cv.as_returnn_opts()
    config["eval_datasets"] = {'devtrain': training_datasets.devtrain.as_returnn_opts()}

    recursionlimit = NonhashedCode(code=RECURSION_LIMIT_CODE)

    rc_extern_data = ReturnnCommonExternData(
        extern_data=extern_data
    )

    rc_network = ReturnnCommonDynamicNetwork(
        network_file=network_file,
        data_map={"source_data": "audio_features",
                  "target_data": "bpe_labels"},
        parameter_dict={}
    )

    serializer = ReturnnCommonSerializer(
        delayed_objects=[recursionlimit, rc_extern_data, rc_network],
        returnn_common_root=returnn_common_root,
    )

    returnn_config = ReturnnConfig(
        config=config,
        post_config=post_config,
        python_epilog=[serializer],
    )

    return returnn_config
```

So this is close to what you imagined, except that we now have one serializer which takes a list of custom objects/code etc. that do the returnn_common serialization for us. This allows users to use quite strict modules like the ones I presented here (which do exactly what I would like to have), but you can also write just any custom code you want and have fewer things managed by helpers.

This code still uses the data helpers I used before, which we might want to replace. I do like the idea though of having the data helpers more Sisyphus specific, as this is where the data comes from. This of course does not prohibit us from moving the "Dataset" (not Datastream) helpers to … Another thing is that here the …
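For illustration, the `get_transformer_network.py` referenced above could look roughly like the sketch below. This is not the actual code: the model import path is made up, the `construct_network` entry point follows the template convention described in a later comment, and how extern data and dims become available inside it depends on the extern-data serialization and is only indicated.

```python
# get_transformer_network.py -- referenced as network_file in get_config above
from returnn_common import nn

# the user's own model code, imported from elsewhere in the recipes (hypothetical path)
from i6_experiments.users.someone.models.transformer import TransformerASR


def construct_network(epoch, num_layers=6, dropout=0.1, **kwargs):
    """
    Entry point expected by the serialization template; it gets called as
    construct_network(epoch, **network_parameters).
    """
    net = TransformerASR(num_layers=num_layers, dropout=dropout)
    # "source_data" corresponds to the data_map in get_config; how the nn.Data/nn.Dim
    # objects are obtained here is up to the extern-data part and left open
    source = nn.get_extern_data(...)
    out = net(source)
    out.mark_as_default_output()
    return net
```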
I strongly oppose this idea. You should always give understandable names to your inputs, and when using custom names you never trigger any custom behavior that was specifically written for those two entries. (Yes, we could remove this custom behavior with a new behavior version, but still it is better to just give good names and not ones which only make sense for basic ASR tasks).
As additional information, this is the template that is currently used to wrap the network code:

```python
import textwrap

template = textwrap.dedent("""\
network_parameters = ${NETWORK_PARAMETERS}

${NETWORK_CODE}

def get_network(epoch, **kwargs):
    nn.reset_default_root_name_ctx()
    net = construct_network(epoch, **network_parameters)
    return nn.get_returnn_config().get_net_dict_raw_dict(net)
""")
```

@albertz maybe you have a good idea to make this more flexible, because otherwise things like a custom … I also do not like the "conventions" that are not obvious to the user, meaning that you need to have a `construct_network` function …
Where can I see your code? Why is it not pushed? How is …? What is …? What is …?
Why would you hardcode that? As I wrote before in my example, next to the module name (…) … Where is the code which does the

```python
f"root_module = {self.module_name}.{self.func_name}({', '.join(f'{k}={v!r}' for k, v in self.func_kwargs.items())})"
```

How does the …
In your example, you also did just the same, but now it is called … Where and how do you define the loss, and connect the model output and extern data with the loss? How is …? How is …? How is …? The …
You probably refer to #55. I still think this should all be in returnn_common.
I don't really understand. How do you get the |
I cannot just push it to common without review or making an extra branch, so I first put it under my user. Otherwise my Hiwis cannot use it easily. You can of course copy it there, make your own adjustments and then start a PR for discussion. I just did not have any capacity for that yet.
Why not? I thought we discussed this already? We would put a README saying "this is work in progress" or so. I think this would be the easiest workflow. Sure, we could also have it in a separate extra branch. But I am not sure if there is really any benefit.
Some initial code was merged as part of PR #66. The code is now in … I added a summary from the discussion here to the README.
The dataset stuff is still somewhat an open question. We had some initial discussion in the (unmerged, closed) PR #55. My opinion was that such dataset helpers belong to returnn_common as well, at least the definition of the dataset config dicts for RETURNN. Only recipe pipeline related dataset logic should be here in i6_experiments/common. |
How should we continue with discussions on the helper code? Here in this issue? In separate issues? Just directly (Slack, in person) (but then others cannot read it afterwards or join the discussion)?
You should never use … If the dir is somehow relative to the module, you can use sth like:

```python
_module_dir = os.path.dirname(os.path.abspath(__file__))
```

Or if this is the Sisyphus base dir, we can use …
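For instance, applied to the relative `network_file` from the `get_config` snippet above (a sketch; whether the recipe module is the right anchor dir here is an assumption):

```python
import os
from sisyphus import tk

_module_dir = os.path.dirname(os.path.abspath(__file__))

# refer to the network file relative to this recipe module instead of the current working dir
network_file = tk.Path(os.path.join(_module_dir, "get_transformer_network.py"))
```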
For a dict it is not well defined what parameters should be defined.
But this returns something completely different. I do not see how those two are related.
Why does it need to be? It's whatever …
First of all, again, you should never use …
Where else? I asked you about that, see above. Or do you say you don't want to discuss it at all and I should just change it? I don't understand your statement.
I don't understand? I just want to know what Python code standards you want to follow here. I'm not used to the code style in i6_core and i6_experiments/common. I see a lot of Python code style violations everywhere in there. So I wondered whether you just don't care about that and want it to be this way, or not. Why do you not want me to give comments on the code? Why is that related to pushing work-in-progress code? Why do you want to refrain from pushing WIP code? This does not make sense.
Again, I asked you where to post these things, and you did not answer, and we already said earlier that we might just discuss it here in the issue, and I did not have a better idea where to post them, so I just posted them here. Where exactly is the problem now? Where else should I post them? Or do you just want that I don't give any comments on this and we do not discuss it further? Also, I just wanted to write them down, to not forget about them. So you don't want to see my comments and I should better keep them private?
No, I did not say that. I said I want that we work directly in the master branch because I also directly want to work with the code and not just look at it in a PR, or work in another branch. I did not say that we should stop discussions on the code. I wonder, how do you want to work on the code together if you don't want that we discuss the code?
But then please at least refrain from adding unnecessary statements to code comments that could be counted as a personal attack. Nobody writes perfect code right away, and of course you can comment on that, but then do this in a proper, professional way without personal accusations, especially when you know that this was just some arbitrary WIP code that was pushed here. So let's keep this discussion first about relevant high-level things:
But this is exactly what I do not want when working with RETURNN: that I have to look up in another code base or documentation what parameters are valid. The code itself should clearly define what is valid. This was the fundamental problem when working with dict structures for layer definitions, and this is also something that should not happen here.
This would bring back the TensorFlow and RETURNN dependency to the Sisyphus manager thread, which I still strongly disagree with. Then regarding comments I consider rather off-topic at this moment: w.r.t. the …
In most cases this is just because we cannot break with old behavior (importing locations from …
Where are you reading anything like that? Why do you interpret things like that into what I wrote? I'm totally confused. Are you referring to my question on what code style you want to follow here? So I should not ask about what code style we want to follow? |
My suggestion was actually to just use `nn.Data`/`nn.Dim` directly. Note that for layers, returnn_common obviously creates a wrapper around them, but this is for other technical reasons, and I'm also not super happy with it. However, to really avoid the problem that the wrapper and the layer interface diverge (this is one of the big problems with wrappers, when the wrapper even in the beginning does not fully match the real interface), we create them automatically. Although this might be a bit overkill of course now for …
I still see this as very arbitrary. We also have e.g. Numpy and other libs. And why not? Your only argument here is that the file system is slow and TF import is slow? Such technical administrative issues should really not be a reason to make the code more complicated and introduce unnecessary wrappers and layers of abstraction. In any case, this is why I bring this up here. Such things need to be discussed. Some opinion from others would also be helpful. |
Let's talk about this offline; it is definitely not a question regarding code style (or in case you wanted to ask only that, then you did not formulate it that way).
But about this I do not argue, this is exactly what we can and should discuss.
And I do not see this as arbitrary. Yes, some modules unfortunately import numpy on the global level, but in most cases it is only used in tasks. But here you could argue that most people kind of consider numpy an elemental part of Python. Again, this is not (only) about the TF import. This is about having a fixed TF and RETURNN dependency in my Sisyphus manager in the first place. The manager should not have that. We can rather argue to remove numpy from the global imports than to add TensorFlow and RETURNN. (E.g. you can run all jobs with a specified singularity/apptainer image, which would specifically define which numpy to use for job tasks, independent of what happens in the manager.) Also, the wrapper has the advantage that you are not showing many parameters that are completely irrelevant. If you expose nn.Dim and nn.Data, you have many options that are not relevant at all and are more confusing than helpful to the user. The additional code here is really minimal, it is not about re-implementing anything. It is really just a dataclass object holding some parameters, that's it. With your argument, you could also say we should import modules directly from RASR with C/C++-to-Python bindings instead of re-implementing some interface helpers (like bliss format implementations or RASR flow definitions).
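To make the "just a dataclass object holding some parameters" point concrete, such a wrapper could look roughly like this (a sketch, not the actual helper code; the class name and fields are made up):

```python
from dataclasses import dataclass


@dataclass
class FeatureDatastreamSketch:
    """Holds only the parameters relevant for this setup; no RETURNN/TF import in the manager."""
    available_for_inference: bool
    feature_size: int

    def as_returnn_extern_data_opts(self) -> dict:
        # only at serialization time (or inside the job) is this turned into the actual
        # extern_data entry / nn.Data object
        return {
            "available_for_inference": self.available_for_inference,
            "shape": (None, self.feature_size),
            "dim": self.feature_size,
        }
```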
Note on the dataset questions, there has been some progress: |
Regarding the main points, this issue became obsolete now, as we are no longer passing constructed network dicts for training. For both RETURNN-frontend and RETURNN-pure-PyTorch setups, we are linking code directly from separate files into the RETURNN configs using e.g. https://github.com/rwth-i6/i6_experiments/blob/main/common/setups/serialization.py. As importing those separate files containing model code can be avoided if necessary, importing PyTorch/TensorFlow in the manager is also no longer an issue.
When trying to create RETURNN networks with returnn_common.nn, the following issues or questions appeared:

If returnn_common.nn would be used for construction within the manager: the returnn_common version can be "pinned" before updating by using the git-hash based Python import call from RETURNN, but the RETURNN version itself cannot be pinned. Nevertheless, the construction might be influenced by the RETURNN version, making the correct handling of network hashes as we are used to very difficult up to impossible.

As we will not solve the hashing issue without spending way too much time not doing experiments, I would propose to go with a completely different approach than we did before, which is:
-> Create a job which takes an executable and some pre-defined parameters and does the network construction as a task instead of in the manager.
This of course would mean that the construction result is not hashed, but only the constructor location and the parameters. To be able to find potentially unwanted changes and to be able to reconstruct the networks exactly, the job should create a local copy of the used construction code at runtime (returnn and returnn_common can of course correctly be identified by their git hash).
While this would require some manual management of the network construction to not make logic changing mistakes, I currently do not see any suitable alternative to this.