Skip to content

[WIP: Implementation of a New Algorithm] Extension of MixtureFinder to morphological data #35

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 58 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
58 commits
Select commit Hold shift + click to select a range
f2c802b
allow mAIC calculation for 0 or 1 "overlapping" sequences.
HuaiyanRen Mar 4, 2025
96d6b53
allow mAIC calculation for 2 "overlapping" sequences.
HuaiyanRen Mar 7, 2025
7357f83
method: add one gappy sequence for "2-overlapping" cases.
HuaiyanRen Mar 19, 2025
aa60951
fix bugs for old method of "2-overlapping" cases. Although we don't u…
HuaiyanRen Mar 19, 2025
219fd3f
Fix protein ambiguous preblem for mAIC. Only compute mAIC for DNA and…
HuaiyanRen Mar 21, 2025
687002e
Allow mAIC calculation for binary and codon data. Make mAIC calculati…
HuaiyanRen Mar 26, 2025
6df0555
extend MixtureFinder to codon, binary, multistate, and amino acid data
Apr 6, 2025
d3e3e59
fix frequency types
Apr 8, 2025
3a4de2b
allow FU in mixture models
Apr 15, 2025
13b6892
Allow to fix the parameters for RHAS when using mixture finder. Also …
thomaskf Apr 16, 2025
81f4e7a
Merge pull request #19 from thomaskf/mixFinder_update
thomaskf Apr 16, 2025
4710559
Fixed the issue happened when user specifies the RHAS model when runn…
thomaskf Apr 16, 2025
dd0c411
Merge pull request #20 from thomaskf/mixFinder_update
thomaskf Apr 16, 2025
39c8a32
delete the class ModelMultistate
HS6986 Apr 28, 2025
4af564f
Merge branch 'master' into feature/HS6986/extend-MixtureFinder
HS6986 Apr 28, 2025
477b8b0
Changed the wording of the GTRX warnings; disabled the GTRX warnings …
HS6986 Apr 29, 2025
5e95d0e
Fix indentation issues
HS6986 Apr 29, 2025
f262dbd
Merge branch 'master' into mixFinder_update
thomaskf May 9, 2025
53ec25a
Merge branch 'master' into huaiyan
HuaiyanRen May 9, 2025
06fffe7
Only generateNestNetwork for DNA models.
HuaiyanRen May 9, 2025
988531d
Allow the option -m MIX+MF, MIX+MFP, MF+MIX, MFP+MIX to run MixtureFi…
thomaskf May 9, 2025
16fc417
create a function to initialise MixtureFinder frequencies
HuaiyanRen May 9, 2025
61342be
Allow mixtureFinder on an alignment with a single partition
thomaskf May 9, 2025
016a6b8
Fixed the double commas inside the best_model.nex file
thomaskf May 9, 2025
6bac548
Fix conflicts
HS6986 May 11, 2025
0b417c2
Restore a comment I accidentally deleted
HS6986 May 11, 2025
ce2b9a9
Refine the code according to the advice by StefanFlaumberg and bqminh…
HS6986 May 11, 2025
727a934
Delete free(init_state_freq_set)
HS6986 May 12, 2025
623874a
Relocate the misplaced name = "GTRX";
HS6986 May 12, 2025
3f01fd9
Restrict isRateTypeNested() only to DNA data
HS6986 May 12, 2025
0442a7a
Set the best model reported by mixturefinder to the first partition
thomaskf May 12, 2025
0ec0308
Save the model checkpoint to the file when mixtureFinder finishes
thomaskf May 12, 2025
d4aebb4
Update mixtureFinder for alignment with single partition
thomaskf May 12, 2025
17eecd6
Merge remote-tracking branch 'origin/mixFinder_update' into mixFinder…
HuaiyanRen May 12, 2025
e0e397c
report the mixture model in single partition in mixture model format.
HuaiyanRen May 12, 2025
98779b8
Fixed the issue of using these models LG4X,LG4M in -madd
thomaskf May 12, 2025
d29ff88
report the mixture model in only in single-partition alignment in mix…
HuaiyanRen May 12, 2025
64423b1
Copy the resulting final tree to the first partition for mixfinder wi…
thomaskf May 12, 2025
0c58dc4
Implement a new MixtureFinder-like algorithm for selecting mixture mo…
HS6986 May 13, 2025
1e073ec
update the checkpoint file to keep the model parameters of the best m…
thomaskf May 14, 2025
0a9f5a5
Merge branch 'huaiyan' into mixFinder_update
HuaiyanRen May 15, 2025
c6b25ab
report the mixture model in only in single-partition alignment STILL …
HuaiyanRen May 15, 2025
66c5278
introduce an option -mfopt to further optimise nucl/aa frequencies in…
thomaskf May 15, 2025
6fc4e70
Add an option --writefreqchkpt to force writing nucl/aa freq to check…
thomaskf May 15, 2025
fdf0476
update the version number
thomaskf May 15, 2025
418a483
Fix algorithm
HS6986 May 15, 2025
83e6826
Merge master of thomaskf/iqtree3 and fix conflicts
HS6986 May 18, 2025
786bcb5
Fix algorithm
HS6986 May 18, 2025
d3b7290
minor fix
HS6986 May 18, 2025
89dac41
Fix
HS6986 May 18, 2025
af06267
Fix
HS6986 May 18, 2025
1388f73
Delete MIX{MK+FQ,MK+FQ}
HS6986 May 18, 2025
d2389ed
Revert "Merge master of thomaskf/iqtree3 and fix conflicts"
HS6986 May 18, 2025
aca3c94
Merge branch 'master' into feature/HS6986/MorphMixtureFinder
HS6986 May 21, 2025
005eead
Set warnings according to the advice by StefanFlaumberg; Set an error…
HS6986 May 21, 2025
077f45e
Change the wording of an error
HS6986 May 21, 2025
4bc5ef9
Change the wording of an error
HS6986 May 21, 2025
b1a757d
Merge remote-tracking branch 'upstream/master' into feature/HS6986/Mo…
HS6986 May 26, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
174 changes: 153 additions & 21 deletions main/phylotesting.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -6207,6 +6207,16 @@ void addModel(string model_str, string& new_model_str, string new_subst) {
}
}

// initialise model frequency set in MixtureFinder for different sequence types
char* initFreqSet(SeqType seq_type) {
switch (seq_type) {
case SEQ_CODON: return ",F1X4,F3X4";
case SEQ_MORPH: return "FQ";
case SEQ_PROTEIN: return ",FO";
default: return "FO";
}
}

/**
@brief Find the best component of a Q-mixture model while fixing other components
@note This function is similar to runModelFinder, but it is adapted for optimisation of Q-Mixture model
Expand Down Expand Up @@ -6338,6 +6348,8 @@ CandidateModel findMixtureComponent(Params &params, IQTree &iqtree, ModelCheckpo

skip_all_when_drop = false;
} else if (mixture_action == MA_NUMBER_CLASS) {
// 2025/05/14 @HS6986
// I don't know when MA_NUMBER_CLASS is used and what should be expected in it for morphology!
params.ratehet_set = iqtree.getModelFactory()->site_rate->name;
// generate candidate models for the possible mixture models
multi_class_str = "";
Expand All @@ -6355,8 +6367,10 @@ CandidateModel findMixtureComponent(Params &params, IQTree &iqtree, ModelCheckpo
}
}
skip_all_when_drop = true;
} else if (mixture_action == MA_FIND_CLASS) {
char init_state_freq_set[] = "FO";
} else if (mixture_action == MA_FIND_CLASS) {
// 2025/05/14 @HS6986
// I don't know when MA_FIND_CLASS is used and what should be expected in it for morphology!
char* init_state_freq_set = initFreqSet(iqtree.aln->seq_type);
if (!params.state_freq_set) {
params.state_freq_set = init_state_freq_set;
}
Expand Down Expand Up @@ -6404,6 +6418,7 @@ CandidateModel findMixtureComponent(Params &params, IQTree &iqtree, ModelCheckpo

getModelSubst(iqtree.aln->seq_type, iqtree.aln->isStandardGeneticCode(), params.model_name,
params.model_set, params.model_subset, model_names);
StrVector model_names_orig = model_names;

if (model_names.empty())
return best_model; // empty
Expand Down Expand Up @@ -6432,11 +6447,54 @@ CandidateModel findMixtureComponent(Params &params, IQTree &iqtree, ModelCheckpo
}
}
}

for (i=0; i<model_names.size(); i++) {
string new_model_str;
addModel(model_str, new_model_str, model_names[i]);
candidate_models.push_back(CandidateModel(new_model_str, best_rate_name, iqtree.aln, 0));

if(getClassNum(model_str) == 1 && params.morph_mix_finder) {
auto isGTRXIncluded = [&]() -> bool {
return std::any_of(model_names_orig.begin(), model_names_orig.end(),
[](const std::string& s) { return s == "GTRX" || s == "GTR"; });
};
auto isFOIncluded = [&]() -> bool {
return std::any_of(freq_names.begin(), freq_names.end(),
[](const std::string& s) { return s == "+FO"; });
};
size_t new_model_strs_two_classes_len = (isGTRXIncluded() && isFOIncluded()) ? 3 : 1;
const char *new_model_strs_two_classes_GTRX_FO[] = {"MIX{MK+FO,MK+FO}", "MIX{GTRX+FQ,GTRX+FQ}", "MIX{GTRX+FO,GTRX+FO}"};
const char *new_model_strs_two_classes_GTRX[] = {"MIX{GTRX+FQ,GTRX+FQ}"};
const char *new_model_strs_two_classes_FO[] = {"MIX{MK+FO,MK+FO}"};
const char **new_model_strs_two_classes =
(isGTRXIncluded() && isFOIncluded())
? new_model_strs_two_classes_GTRX_FO
: (isGTRXIncluded())
? new_model_strs_two_classes_GTRX
: new_model_strs_two_classes_FO;

for (i=0; i<new_model_strs_two_classes_len; i++) {
string new_model_str = new_model_strs_two_classes[i];
candidate_models.push_back(CandidateModel(new_model_str, best_rate_name, iqtree.aln, 0));
}
} else {
for (i=0; i<model_names.size(); i++) {
string new_model_str;
addModel(model_str, new_model_str, model_names[i]);
if (params.morph_mix_finder) {
int newModelCountGTRX = 0;
size_t newModelPosGTRX = 0;
while (((newModelCountGTRX = new_model_str.find("GTR", newModelPosGTRX)) != string::npos)) {
++newModelCountGTRX;
newModelPosGTRX += 3;
}
int newModelCountFO = 0;
size_t newModelPosFO = 0;
while (((newModelPosFO = new_model_str.find("+FO", newModelPosFO)) != string::npos)) {
++newModelCountFO;
newModelPosFO += 3;
}
if (newModelCountGTRX == 1 || newModelCountFO == 1) {
continue;
}
}
candidate_models.push_back(CandidateModel(new_model_str, best_rate_name, iqtree.aln, 0));
}
}

skip_all_when_drop = false;
Expand Down Expand Up @@ -6526,8 +6584,8 @@ void runMixtureFinderMain(Params &params, IQTree* &iqtree, ModelCheckpoint &mode
bool better_model;
double LR, df_diff, pvalue;
string criteria_str;

char init_state_freq_set[] = "FO";
char* init_state_freq_set = initFreqSet(iqtree->aln->seq_type);
if (!params.state_freq_set) {
params.state_freq_set = init_state_freq_set;
}
Expand All @@ -6538,18 +6596,71 @@ void runMixtureFinderMain(Params &params, IQTree* &iqtree, ModelCheckpoint &mode

// Step 0: (reorder candidate DNA models when -mset is used) build the nest-relationship network
map<string, vector<string> > nest_network;
if (iqtree->aln->seq_type == SEQ_DNA) {
StrVector model_names, freq_names;
getModelSubst(iqtree->aln->seq_type, iqtree->aln->isStandardGeneticCode(), params.model_name,
params.model_set, params.model_subset, model_names);
getStateFreqs(iqtree->aln->seq_type, params.state_freq_set, freq_names);
StrVector model_names, freq_names;
getModelSubst(iqtree->aln->seq_type, iqtree->aln->isStandardGeneticCode(), params.model_name,
params.model_set, params.model_subset, model_names);
getStateFreqs(iqtree->aln->seq_type, params.state_freq_set, freq_names);

if (params.morph_mix_finder) {
auto areRatesJustified = [&]() -> bool {
return std::all_of(model_names.begin(), model_names.end(),
[](const std::string& s) { return s == "MK" || s == "GTRX" || s == "GTR"; });
};
if (!areRatesJustified()) {
outError("Error! Specifying models other than MK and GTRX for MixtureFinder for morphological data is not justified");
}

auto areFreqsJustified = [&]() -> bool {
return std::all_of(freq_names.begin(), freq_names.end(),
[](const std::string& s) { return s == "+FQ" || s == "+FO"; });
};
if (!areFreqsJustified()) {
outError("Error! Specifying frequencies other than +FQ and +FO for MixtureFinder for morphological data is not justified");
}

auto isMKIncluded = [&]() -> bool {
return std::any_of(model_names.begin(), model_names.end(),
[](const std::string& s) { return s == "MK"; });
};
if (!isMKIncluded()) {
outError("Error! Not specifying the MK model for MixtureFinder for morphological data is not justified");
}

auto isFQIncluded = [&]() -> bool {
return std::any_of(freq_names.begin(), freq_names.end(),
[](const std::string& s) { return s == "+FQ"; });
};
if (!isFQIncluded()) {
outError("Error! Not specifying the FQ frequency for MixtureFinder for morphological data is not justified");
}
}

auto isOnlyMKAndFQ = [&]() -> bool {
bool onlyMK = std::all_of(model_names.begin(), model_names.end(),
[](const std::string& s) { return s == "MK"; });
bool onlyFQ = std::all_of(freq_names.begin(), freq_names.end(),
[](const std::string& s) { return s == "+FQ"; });
return onlyMK && onlyFQ;
};
if (isOnlyMKAndFQ()) {
outError("Error! Running MixtureFinder only with the MK model and the FQ frequency is completely meaningless.\nPlease provide additional models and/or frequencies, such as GTRX, +F, and +FO, using -mset and/or -mfreq, if you really want to use MixtureFinder for your multistate data.");
}

if (iqtree->aln->seq_type == SEQ_DNA) {
nest_network = generateNestNetwork(model_names, freq_names);
}

// Step 1: run ModelFinder
params.model_name = "";
bool under_mix_finder = true;

string orig_model_set = params.model_set;
char *orig_state_freq_set = params.state_freq_set;
if (params.morph_mix_finder) {
params.model_set = "MK";
params.state_freq_set = "+FQ";
}

runModelFinder(params, *iqtree, model_info, best_subst_name, best_rate_name, nest_network, under_mix_finder);

// (cancel) Step 2: do tree search for this single-class model
Expand All @@ -6571,6 +6682,11 @@ void runMixtureFinderMain(Params &params, IQTree* &iqtree, ModelCheckpoint &mode
model_info.getString("best_model_list_" + criteria_str, best_model_pre_list);

// Step 3: keep adding a new class until no further improvement
if (params.morph_mix_finder) {
params.model_set = orig_model_set;
params.state_freq_set = orig_state_freq_set;
}

if (params.opt_qmix_criteria == 1) {
cout << endl << "Keep adding an additional class until the p-value from the likelihood ratio test > " << params.opt_qmix_pthres << endl;
} else {
Expand Down Expand Up @@ -6649,13 +6765,29 @@ void runMixtureFinder(Params &params, IQTree* &iqtree, ModelCheckpoint &model_in

if (iqtree->isSuperTree())
outError("Error! The option -m '" + params.model_name + "' cannot work on data set with partitions");

if (iqtree->aln->seq_type != SEQ_DNA)
outError("Error! The option -m '" + params.model_name + "' can only work on DNA data set");

cout << "--------------------------------------------------------------------" << endl;
cout << "| Running MixtureFinder |" << endl;
cout << "--------------------------------------------------------------------" << endl;
outWarning("MixtureFinder has not been tested for non-DNA data types. Be cautious about interpreting the results");

if (iqtree->aln->getMaxNumStates() > 6)
outWarning("Running MixtureFinder for the given data type can take much time. Please consider restricting the set of the models to test as much as possible");

if (iqtree->aln->seq_type == SEQ_PROTEIN && !params.force_aa_mix_finder)
outError("Error! We already have the mixture frequency vectors C10–C60 which are effective for modeling amino acid data.\nPlease make sure running MixtureFinder for your amino acid data makes sense.\nIf you are determined to do that, please add an option --force-aa-mix-finder to the command line.");

if (params.morph_mix_finder && iqtree->aln->seq_type != SEQ_MORPH) {
outError("Error! The option -morph-mix can only work on morphological data set");
}

if (params.morph_mix_finder) {
cout << "--------------------------------------------------------------------" << endl;
cout << "| Running MixtureFinder for morphological data |" << endl;
cout << "--------------------------------------------------------------------" << endl;
} else {
cout << "--------------------------------------------------------------------" << endl;
cout << "| Running MixtureFinder |" << endl;
cout << "--------------------------------------------------------------------" << endl;
}

// disable the bootstrapping
int orig_gbo_replicates = params.gbo_replicates;
Expand Down
2 changes: 2 additions & 0 deletions model/modelmixture.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1403,6 +1403,8 @@ void ModelMixture::initMixture(string orig_model_name, string model_name, string
} else {
outError("The user defined frequency model is incorrect");
}
} else if (fstr == "+FU" || fstr == "+Fu") {
model_freq = FREQ_USER_DEFINED;
} else if (fstr == "+FO" || fstr == "+Fo") {
model_freq = FREQ_ESTIMATE;
} else if (fstr == "+FQ" || fstr == "+Fq") {
Expand Down
26 changes: 15 additions & 11 deletions model/modelmorphology.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -30,9 +30,11 @@ void ModelMorphology::init(const char *model_name, string model_params, StateFre
}
num_params = 0;
} else if (name == "GTR" || name == "GTRX") {
outWarning("GTRX multistate model will estimate " + convertIntToString(getNumRateEntries()-1) + " substitution rates that might be overfitting!");
outWarning("Please only use GTRX with very large data and always test for model fit!");
name = "GTRX";
if (num_states > 6 || phylo_tree->aln->getNPattern() < 100) {
outWarning("GTRX multistate model will estimate " + convertIntToString(getNumRateEntries()-1) + " substitution rates that might be overfitting for the length of your alignment/partition!");
outWarning("Please only use GTRX with large data and always test for model fit!");
}
name = "GTRX";
} else {
// if name does not match, read the user-defined model
readParameters(model_name);
Expand Down Expand Up @@ -122,20 +124,22 @@ string ModelMorphology::getName() {
}

string ModelMorphology::getNameParams(bool show_fixed_params) {
if (num_params == 0) return name;
ostringstream retname;
size_t pos_plus = name.find('+');
if (pos_plus != string::npos) {
retname << name.substr(0, pos_plus) << '{';
retname << name.substr(0, pos_plus);
} else {
retname << name << '{';
retname << name;
}
int nrates = getNumRateEntries();
for (int i = 0; i < nrates; i++) {
if (i>0) retname << ',';
retname << rates[i];
if (num_params > 0 || show_fixed_params) {
retname << '{';
int nrates = getNumRateEntries();
for (int i = 0; i < nrates; i++) {
if (i > 0) retname << ',';
retname << rates[i];
}
retname << '}';
}
retname << '}';
if (pos_plus != string::npos) {
retname << name.substr(pos_plus);
} else {
Expand Down
17 changes: 17 additions & 0 deletions utils/tools.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1487,6 +1487,7 @@ void parseArg(int argc, char *argv[], Params &params) {
params.ignore_checkpoint = false;
params.checkpoint_dump_interval = 60;
params.force_unfinished = false;
params.force_aa_mix_finder = false;
params.print_all_checkpoints = false;
params.suppress_output_flags = 0;
params.ufboot2corr = false;
Expand Down Expand Up @@ -3729,6 +3730,17 @@ void parseArg(int argc, char *argv[], Params &params) {
params.opt_qmix_criteria = 2; // using information critera instead of likelihood-ratio test for estimation of number of classes for Q-Mixture model
*/
continue;
}
if (strcmp(argv[cnt], "-morph-mix") == 0 || strcmp(argv[cnt], "--morph-mix") == 0) {
cnt++;
if (cnt >= argc)
throw "Use -morph-mix <0|1>";
int in_option = convert_int(argv[cnt]);
if (in_option < 0 || in_option > 1)
throw "Wrong option for -morpho-mix. Only 0 or 1 is allowed.";
if (in_option == 1)
params.morph_mix_finder = true;
continue;
}
if (strcmp(argv[cnt], "-a") == 0) {
cnt++;
Expand Down Expand Up @@ -5475,6 +5487,11 @@ void parseArg(int argc, char *argv[], Params &params) {
params.force_unfinished = true;
continue;
}

if (strcmp(argv[cnt], "-force-aa-mix-finder") == 0 || strcmp(argv[cnt], "--force-aa-mix-finder") == 0) {
params.force_aa_mix_finder = true;
continue;
}

if (strcmp(argv[cnt], "-cptime") == 0 || strcmp(argv[cnt], "--cptime") == 0) {
cnt++;
Expand Down
8 changes: 8 additions & 0 deletions utils/tools.h
Original file line number Diff line number Diff line change
Expand Up @@ -1707,6 +1707,11 @@ class Params {
*/
bool opt_rhas_again;

/**
whether MixtureFinder optimizes the Q-mixture model for morphological data
*/
bool morph_mix_finder;

/**
The method to optimize (and estimating the number of classes in) the Q-mixture model
Method 1 (Old method)
Expand Down Expand Up @@ -2453,6 +2458,9 @@ class Params {
/** true if ignoring the "finished" flag in checkpoint file */
bool force_unfinished;

/** true if forcing IQ-TREE to run MixtureFinder for amino acid data */
bool force_aa_mix_finder;

/** TRUE to print checkpoints to 1.ckp.gz, 2.ckp.gz,... */
bool print_all_checkpoints;

Expand Down
Loading