
[dupcol_action::presolve method performance optimization help] Efficiency of the dupcol_action::presolve method on large-scale MILP problems #221

Open
wozhendeshuai opened this issue Mar 26, 2024 · 2 comments



wozhendeshuai commented Mar 26, 2024

@jjhforrest
I ran into a performance bottleneck while trying to solve a large MILP problem with tens of millions of variables. The solver I am using is CBC, and the preprocessing phase alone took up to 6 hours.
After careful analysis, I found that the time is mainly spent processing duplicate columns during preprocessing: the dupcol_action::presolve method runs for more than 3 hours and is the main bottleneck of the entire preprocessing stage. Although I have tried enabling multi-threaded concurrent processing, it has produced no significant speedup in practice.

In summary:

  • MILP model size: tens of millions of variables
  • Preprocessing time: concentrated in the dupcol_action::presolve method, more than 3 hours
  • Attempted optimization: enabling multithreading, with no significant speedup observed

My current understanding of the dupcol_action::presolve method is as follows:
This method finds and processes duplicate columns, i.e. columns whose coefficients are identical. When all the non-zero elements of two or more columns, and the rows they appear in, match exactly, the presolver merges those columns into one and updates the corresponding lower bound, upper bound, column elements, and, where applicable, the column solution. During merging, the lower bounds are summed and the upper bounds are summed to obtain the new bound range; if a bound falls outside a reasonable range, it is set to infinity to mark the variable as unbounded. The status of the merged column is also determined from the status information of the originals (for example, one of the columns being a basic variable). In some cases, where the columns' sparsity patterns differ but their contribution to the rows is the same (up to a tolerance), one of the columns may be deleted and the right-hand side of the remaining column adjusted to keep the problem equivalent. The code also handles situations where, after merging, inequality constraints become contradictory or overlapping, and the resulting infeasibility checks.
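To make my understanding concrete, here is a rough, self-contained sketch of the grouping-and-merging logic as I read it. This is not the actual CoinPresolve code; all names (SparseColumn, patternKey, the sample data) are my own, exact coefficient comparison stands in for the tolerance-based comparison the real code uses, and only the bound merging is shown.

```cpp
// Rough sketch of duplicate-column detection and bound merging,
// NOT the actual CoinPresolve code. Exact coefficient equality is
// used here where the real code applies a tolerance.
#include <cstdio>
#include <limits>
#include <map>
#include <utility>
#include <vector>

struct SparseColumn {
    std::vector<int> rows;      // row indices of nonzeros (sorted)
    std::vector<double> values; // matching coefficients
    double lower, upper;        // variable bounds
};

// Bucket key: the full (row, coefficient) pattern of a column.
static std::vector<std::pair<int, double>> patternKey(const SparseColumn &c)
{
    std::vector<std::pair<int, double>> key;
    for (size_t i = 0; i < c.rows.size(); ++i)
        key.emplace_back(c.rows[i], c.values[i]);
    return key;
}

int main()
{
    const double inf = std::numeric_limits<double>::infinity();

    // Columns 0 and 1 are exact duplicates; column 2 is distinct.
    std::vector<SparseColumn> cols = {
        { { 0, 2 }, { 1.0, 3.0 }, 0.0, 1.0 },
        { { 0, 2 }, { 1.0, 3.0 }, 0.0, 1.0 },
        { { 1 }, { 5.0 }, 0.0, inf },
    };

    // Group columns that have identical patterns.
    std::map<std::vector<std::pair<int, double>>, std::vector<int>> buckets;
    for (int j = 0; j < (int)cols.size(); ++j)
        buckets[patternKey(cols[j])].push_back(j);

    // Merge each bucket into its first column: the surviving variable
    // stands for the sum of the originals, so lower bounds are added and
    // upper bounds are added; anything involving infinity stays unbounded.
    for (const auto &bucket : buckets) {
        const std::vector<int> &members = bucket.second;
        if (members.size() < 2)
            continue;
        SparseColumn &keep = cols[members[0]];
        for (size_t i = 1; i < members.size(); ++i) {
            const SparseColumn &drop = cols[members[i]];
            keep.lower = (keep.lower == -inf || drop.lower == -inf)
                             ? -inf : keep.lower + drop.lower;
            keep.upper = (keep.upper == inf || drop.upper == inf)
                             ? inf : keep.upper + drop.upper;
            std::printf("column %d merged into column %d\n", members[i], members[0]);
        }
        std::printf("merged bounds: [%g, %g]\n", keep.lower, keep.upper);
    }
    return 0;
}
```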

If there is any deviation in my understanding above, I hope you can help me correct it.

I would appreciate some professional guidance on whether there are specific optimization strategies for problems of this scale. If you have any suggestions on tuning, parameter settings, or algorithmic improvements for large-scale scenarios, please let me know.

If more details about the problem instances or the computing environment are needed, I am happy to provide them.

Looking forward to your reply and support, thanks!

@jjhforrest
Contributor

If there are not many duplicate columns, then checking for them can be switched off in preprocessing. If you are using stand-alone cbc (or calling CbcMain as in some of the examples), this can be switched off with -tunePreprocess, changing it from 7 to 7|4096 (4103). However, initialSolve will also try to find duplicate columns; to switch that off you will need -presolve off.
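If you are calling CbcMain yourself, something along these lines should do it. This is an untested sketch: big_model.mps is just a placeholder file name, the entry points are the CbcMain0/CbcMain1 ones used in the shipped examples, and the exact option spellings can be checked with cbc -?.

```cpp
// Untested sketch of switching the duplicate-column search off when
// driving Cbc through CbcMain, as in the shipped examples.
// "big_model.mps" is only a placeholder file name.
#include "CbcModel.hpp"
#include "CbcSolver.hpp"
#include "OsiClpSolverInterface.hpp"

int main()
{
    OsiClpSolverInterface solver;
    solver.readMps("big_model.mps"); // placeholder

    CbcModel model(solver);
    CbcMain0(model);

    // 7 | 4096 = 4103 keeps the default preprocessing (7) but skips the
    // duplicate-column pass; "-presolve off" stops initialSolve from
    // searching for duplicate columns as well.
    const char *args[] = { "cbc",
                           "-presolve", "off",
                           "-tunePreprocess", "4103",
                           "-solve", "-quit" };
    return CbcMain1((int)(sizeof(args) / sizeof(args[0])), args, model);
}
```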

If you just want good solutions, not the optimal solution, it may be useful to look at papers on aircrew scheduling problems, as these have zillions of possible columns.

If you like, you can send me a sample problem.

@wozhendeshuai
Author

@jjhforrest
Thank you very much for your help. Due to some constraints we are unable to share the data with you, but we are glad to share our current progress.
We have reduced the overall solving time from 40000 seconds to 20000 seconds by removing the duplicate-column handling code shown below.
[screenshot: the duplicate-column handling code that was removed]

When analyzing the dataset in depth, we found that our tens of millions of values are concentrated in the range 0 to 1, which means the right-hand-side adjustment in the preprocessing step becomes unnecessary, further simplifying our process.
As for why removing this code speeds up solving so much, we currently lack a clear theoretical explanation; this is a puzzle we are eager to explore.
We would sincerely appreciate any analytical direction or approach that could help us understand where the efficiency improvement comes from.
Thanks again.
