@fdchevalier

Hi MSG team,

First, thank you for developing/maintaining this pipeline.

We have generated F2 hybrids from a cross between two species. Unfortunately, our data are sparse, so I am using this pipeline to reconstruct genotypes. The genome of these species is ~380 Mb in size with 8 chromosomes (the largest is 80 Mb). In the current assembly, most of the sequence is assembled into complete chromosomes, with some additional tiny scaffolds.

When running the pipeline, I got an error while extracting the reference alleles after filtering for common reads. The error was _csv.Error: field larger than field limit (131072), which is raised when a single field exceeds the csv module's default limit of 131072 characters (128 KiB). I have not identified the exact origin of this error, but it could be due to a high number of variants on a long chromosome. The proposed fix sets the limit to the maximum possible value.

Let me know if you need more information.
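
For readers who hit the same error, here is a minimal sketch of the general technique, not the exact patch in this pull request. It assumes a tab-delimited input; the helper names raise_csv_field_limit and read_rows are hypothetical and do not come from the MSG code base. The loop is needed because csv.field_size_limit raises OverflowError on platforms where sys.maxsize does not fit in a C long.

```python
import csv
import io
import sys


def raise_csv_field_limit():
    """Set the csv field size limit to the largest value the platform accepts.

    Hypothetical helper: starts at sys.maxsize and shrinks the value until
    csv.field_size_limit() stops raising OverflowError.
    """
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # Some platforms store the limit in a C long smaller than
            # sys.maxsize, so retry with a smaller value.
            limit //= 10


def read_rows(text, delimiter="\t"):
    """Parse delimited text into a list of rows (assumed tab-delimited)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))
```

Calling raise_csv_field_limit() once, before any csv.reader is created, should be enough for the rest of the script. Note that a very large limit removes the guard against malformed input, so it is worth checking why a field grew that large in the first place.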

@dstern
Collaborator

dstern commented Feb 15, 2022

This script runs only if you set pepthresh in msg.cfg. We have rarely used this option, so it has not been fully debugged, and it is not required by the pipeline.
I recommend commenting out pepthresh and checking whether the pipeline completes.
Also, if you have many small scaffolds, it is probably not useful to analyze all of them. Running every scaffold imposes a huge I/O burden, and you will find the pipeline runs MUCH faster if you ignore the small ones.
In msg.cfg, change
chroms=all
to a list of the chromosomes you want to analyze. For example, in Drosophila:
chroms = 2L,2R,3L,3R,4,X

@fdchevalier
Author

Thank you @dstern for the quick answer.

I do use pepthresh because it is in the toy example. I don't know what it does, and it is not documented in the README. So if it is not critical or useful, it might be good to comment it out in the example.

Regarding the small scaffolds, I should indeed probably remove (some of) them. We still have some of decent size (N90 is 25 Mb). What size cutoff would you recommend?

Thank you.

@dstern
Collaborator

dstern commented Feb 15, 2022

Don't remove the sequences from your genome files. You want the full sequence available for read mapping. Just specify the chromosomes to analyze. Which sizes to include depends on your genome and question. You will probably just need to do some trial and error.
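
One way to run that trial and error is to build the chroms line from scaffold lengths instead of typing it by hand. The sketch below assumes a samtools-style FASTA index (.fai), in which each line is tab-separated and starts with the sequence name and its length. The function names and the cutoff are hypothetical and are not part of MSG.

```python
def scaffolds_at_least(fai_lines, min_length):
    """Return names of sequences whose length is >= min_length.

    fai_lines: iterable of lines from a FASTA index (.fai); only the first
    two tab-separated columns (name, length) are used.
    """
    names = []
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, length = fields[0], int(fields[1])
        if length >= min_length:
            names.append(name)
    return names


def chroms_setting(names):
    """Format the selected names as a chroms line for msg.cfg."""
    return "chroms = " + ",".join(names)
```

For example, passing the lines of an open .fai file with a cutoff of 1000000 keeps only sequences of at least 1 Mb. The genome FASTA itself stays untouched, so reads can still map to every scaffold.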

@fdchevalier
Author

Sorry for the bad wording. My plan was to remove the scaffolds from the analysis by following your advice, not from the genome. I will play with the pipeline settings then. Thank you.
