Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix passing str variable to readFrame/readMatrix #598

Conversation

MarcusParadies
Copy link
Contributor

Fixes #582

I have also added some end2end tests, but don't really know where to put them. Any suggestions?

Copy link
Collaborator

@corepointer corepointer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clever move, adding type inference to do the type inference ;-) I like it. Concerning the test cases - I don't see any in the PR yet. You could put them to the read kernel tests, to the function call tests or to the type inference tests? Since you're mainly fixing the read call I'd say it would be ok to put it to the first option but I don't really have a strong opinion about it.

Comment on lines +303 to +301
// If an individual value type was specified per column
// (fmd.isSingleValueType == false), then this silently uses the
// type of the first column.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment about isSingleValueType is not correct and can be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just moved it over, but I can also remove it alltogether.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment referred to reading a matrix from a file that has a schema defined. So it is not utterly wrong. You could move it a few lines up to the code that handles reading matrices.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@@ -188,6 +188,7 @@ namespace {
std::multimap<std::string, func::FuncOp> specializedVersions;
std::set<func::FuncOp> visited;
std::set<func::FuncOp> called;
std::set<func::FuncOp> templateFunctions;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This extra set for templateFunctions (and its unconditional use in line 299) is not clear to me right away (I did not test-run the code, maybe it would become clear then). Could you explain why this is needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially, I thought the check isFunctionTemplate would already do the trick, but apparantely it does something else. I basically remove the "template function", which is specialized from considerations in later passes (i.e., doing a type inference on such a function doesn't make much sense imho). Template functions are those, which are specialized, that's why I add them unconditionally. According to the test suite this should be fine, but I feel this will require more thorough testing (independent from the reported bug).

In general there are quite a few question marks in this pass. For example the constant parameter propagation leads to X functions with identical code (except for the constant parameter of course) when invoking the UDF X times.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've played around with the code a bit now and see what you mean. The generation of a separate function for each invocation (even if it is the same parameter) is a bit weird and could be improved.

@@ -408,7 +410,7 @@ void SpecializeGenericFunctionsPass::runOnOperation() {
entryFunctions.push_back(entry.second);
}
for(const auto &function : entryFunctions) {
if(isFunctionTemplate(function) || visited.count(function))
if(isFunctionTemplate(function) || visited.count(function) || templateFunctions.count(function))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the invocations of count() were there before, so doing it the same is quite alright and tbh I'm unsure if using empty() is just a matter of taste or if it would yield a (minor?) performance benefit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

count has logarithmic complexity whereas empty has constant complexity, so there should be a slight advantage here. In general, I tried to change as little as possible although the entire pass probably needs a bigger rewrite anyways.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The count actually makes sense here, as we're checking for that specific function in the set and not if there are any functions at all.

@MarcusParadies
Copy link
Contributor Author

Clever move, adding type inference to do the type inference ;-) I like it. Concerning the test cases - I don't see any in the PR yet. You could put them to the read kernel tests, to the function call tests or to the type inference tests? Since you're mainly fixing the read call I'd say it would be ok to put it to the first option but I don't really have a strong opinion about it.

Done.

@pdamme pdamme self-requested a review September 5, 2023 14:37
Copy link
Collaborator

@pdamme pdamme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this fix @MarcusParadies. Overall, it looks very good to me. Moving the retrieval of the file meta data from the DaphneDSL parser to the type/property inference is a very good idea, since it gives us more flexibility.

Required before the merge:

  • DaphneInferTypesOpInterface.cpp: daphne::ReadOp::inferTypes() does not only infer the (value) type, but also the column labels in case of a frame. However, frame label inference is done by another interface (DaphneInferFrameLabelsOpInterface). I recommend implementing that interface for ReadOp as well. Even though it means reading the metadata file once more, I think it would be better from a division-of-concerns point of view. Note that we followed the same principle with DaphneInferShapeOpInterface and DaphneInferSparsityOpInterface.

Optional improvements:

  • While the immediate problem of passing str variable to readFrame/readMatrix does not work #582 is fixed by this PR, the problem is slightly more general: the file name used in readMatrix()/readFrame() may remain unknown until some point. With the current code, the error reported in passing str variable to readFrame/readMatrix does not work #582 can still be caused, e.g., by the following script:
    i = 1;
    filename = "test/api/cli/io/ReadCsv" + i + ".csv";
    m = readMatrix(filename);
    print(m);
    
    This is because here, the file name is not known yet when the first inference pass starts, but only after the first round of constant propagation. I think we could easily tolerate this case by making the type inference for ReadOp return an unknown value type if the file name is not constant (yet). For the script above, the type inference should be successful after a few passes then.
  • It would be great to add a comment line to testReadFrame.daphne and testReadMatrix.daphne that clarifies what is the crucial point that is tested here. We do that in many/most script-level test cases as it may help to understand why a test case was written that way later. I.e., it's about passing the path to the UDF, such that it is not known at parse-time.

Regarding the changes to SpecializeGenericFunctionsPass.cpp: I was also a bit surprised, like @corepointer. But yes, this pass anyway needs a rewrite.

@pdamme
Copy link
Collaborator

pdamme commented Sep 5, 2023

By the way, the place for the script-level test cases is fine :) .

@MarcusParadies
Copy link
Contributor Author

I incorporated everything so far. It might be a good idea to think about limiting what can be passed as path name to the read functions. We can always have cases where we cannot compile-time-know the string path, which would lead currently to errors since no type deduction etc. can be performed.

For now it might be most reasonable to restrict the usage of string literals representing paths to be "compile-time-known" (whatever that specifically means). For trivial cases, like the one posted by @pdamme, the compiler might be able to do that, but think about dynamic path generation in loops... my 2 cents: restrict that (for now) and throw a proper, meaningful error message.

@corepointer
Copy link
Collaborator

Generating paths in a loop to read files with sequence numbers is the prime concern of this issue to be resolved for one of our use case partners.

Copy link
Collaborator

@pdamme pdamme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the improvements, @MarcusParadies, looks good to me. The required point about the frame label inference has been addressed. So I think, given the initial issue description, this is a valid fix and could be merged.

Regarding (non-)compile-time known paths: The example script above could be made working (*). But, yes, there will always be cases where we cannot know the path at compile-time, and we will need a way to cope with that. For now, the idea of an understandable error message sounds good to me.

Indeed, generating paths in loops has been hinted by @m-birke in #582. I assume that means something like

for(i in 1:10) {
  m = readMatrix("path/to/my/file_" + i + ".csv");
  print(m);
}

That would be a bit more difficult to get working, since the value of i is not a compile-time constant. Some possible solutions include:

  • passing a hint on the value type to readMatrix() (could mean inconsistencies with the actual file meta data)
  • using another abstraction for loading data sets split into multiple files, such that it is done by just one read-call
  • dynamic recompilation for each loop iteration

But I think all of that is beyond the scope of this PR.


(*) How to make the example in testReadMatrix_DynamicPath.daphne work: The expression "test/api/cli/io/ReadCsv" + i + ".csv" should be resolved at compile-time through constant folding, which is implemented for various DaphneIR ops in src/ir/daphneir/DaphneDialect.cpp.

Running the scripts as bin/daphne --explain property_inference test/api/cli/io/testReadMatrix_DynamicPath.daphne yields the following IR output:

IR after inference:
module {
  func.func @main() {
    %0 = "daphne.constant"() {value = false} : () -> i1
    %1 = "daphne.constant"() {value = true} : () -> i1
    %2 = "daphne.constant"() {value = ".csv"} : () -> !daphne.String
    %3 = "daphne.constant"() {value = 1 : si64} : () -> si64
    %4 = "daphne.constant"() {value = "test/api/cli/io/ReadCsv"} : () -> !daphne.String
    %5 = "daphne.cast"(%3) : (si64) -> !daphne.String
    %6 = "daphne.concat"(%4, %5) : (!daphne.String, !daphne.String) -> !daphne.String
    %7 = "daphne.concat"(%6, %2) : (!daphne.String, !daphne.String) -> !daphne.String
    %8 = "daphne.read"(%7) : (!daphne.String) -> !daphne.Matrix<?x?x!daphne.Unknown>
    "daphne.print"(%8, %1, %0) : (!daphne.Matrix<?x?x!daphne.Unknown>, i1, i1) -> ()
    "daphne.return"() : () -> ()
  }
}

Constant folding is implemented for ConcatOp (daphne.concat). However, constant folding only works if all inputs are already constants. In this case, %5 is not a constant, but the result of a daphne.cast. CastOp (daphne.cast) also supports constant folding, but the case for casting to a string hasn't been implemented yet. I think this part of the constant folding is the missing piece to make the example work.

@pdamme
Copy link
Collaborator

pdamme commented Sep 8, 2023

Nevertheless, making read with a generated file name inside a loop work is something we should address soon, maybe using one of the possible approaches mentioned above.

@corepointer
Copy link
Collaborator

Another workaround that just came to my mind to mitigate the problem when generating path names in a loop would be to unroll the loop at compile time. In the example above all required knowledge to generate paths 1 to 10 is already in the code.

MarcusParadies and others added 2 commits September 11, 2023 09:39
This commit fixes a crash when passing "dynamic strings" that have not yet been materialized in the paring stage to the getMetaData() function. The current solution is to move this functionality out of the parsing stage and into the type inference stage of the DAPHNE compiler.

Incorporated review suggestions

- Implement DaphneInferFrameLabelsOpInterface for ReadOp
- Added comment lines to test scripts
- Fixed IsConstant Helper function for string types
- Added further example for dynamic path computation and fixed null string problem - still does not work since the string cannot be (yet) computed at compile time

Fixes daphne-eu#582
Closes daphne-eu#598
This commit fixes two compiler warnings.
- an unnecessary declaration
- a control flow issue that would resultin a missing return value
@corepointer corepointer force-pushed the 582-fix-passing-str-variable-to-readFrame-and-readMatrix branch from 6c5a2c0 to 96fbecb Compare September 11, 2023 08:03
@corepointer corepointer merged commit 96fbecb into daphne-eu:main Sep 11, 2023
2 checks passed
@corepointer
Copy link
Collaborator

LGTM - all tests passing here.
Thx @MarcusParadies for fixing this and @pdamme for the additional review.

@MarcusParadies MarcusParadies deleted the 582-fix-passing-str-variable-to-readFrame-and-readMatrix branch September 11, 2023 09:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

passing str variable to readFrame/readMatrix does not work
3 participants