Solve CSV field error #45

fdchevalier · 2022-02-15T15:50:05Z

Hi MSG team,

First, thank you for developing/maintaining this pipeline.

We have generated F2 hybrids from a cross between two species. Unfortunately, our data are sparse, so I am using this pipeline to rebuild genotypes. The genome of these species is ~380 Mb in size, with 8 chromosomes (the largest is 80 Mb). The current assembly has most scaffolds assembled into complete chromosomes, plus additional tiny scaffolds.

When running the pipeline, I got an error while extracting the reference alleles after filtering for common reads. The error was "_csv.Error: field larger than field limit (131072)", which Python's csv module raises when a single field exceeds the default limit of 131072 characters (128 KB of text). While the exact origin of this error is unknown, it could be due to a high number of variants on a long chromosome. The proposed solution sets the limit to the maximum possible value.
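For readers who have not seen this fix before, here is a minimal sketch of how the limit is commonly raised in Python. It is not the diff from this pull request; the helper name `raise_csv_field_limit` and the test row are made up for illustration. It relies only on `csv.field_size_limit` and `sys.maxsize` from the standard library, and steps the value down if the platform rejects it with `OverflowError`.

```python
import csv
import io
import sys


def raise_csv_field_limit():
    """Set the csv field limit as high as the platform accepts.

    Hypothetical helper for illustration only. On platforms where a C long
    is 32 bits, csv.field_size_limit(sys.maxsize) raises OverflowError, so
    the value is reduced until it is accepted.
    """
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            limit //= 10


new_limit = raise_csv_field_limit()

# A tab-separated row whose second field is longer than the default
# 131072-character limit; without the call above, reading it fails with
# "_csv.Error: field larger than field limit (131072)".
big_field = "A" * 200000
rows = list(csv.reader(io.StringIO("chr1\t" + big_field + "\n"), delimiter="\t"))
```

The call has to run before the first `csv.reader` is iterated, since the limit is a module-wide setting.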

Let me know if you need more information.

dstern · 2022-02-15T16:59:00Z

This script runs only if you set pepthresh in msg.cfg. We have rarely used this option, so it has not been fully debugged. It is not necessary for the pipeline.
I recommend commenting out pepthresh and seeing whether the pipeline completes.
Also, if you have many small scaffolds, it is probably not useful to run all of them. Running every scaffold imposes a huge I/O burden. You will find the pipeline runs MUCH faster if you ignore small scaffolds.
In msg.cfg, change
chroms=all
to list the chromosomes you want to analyze. For example, in Drosophila,
chroms = 2L,2R,3L,3R,4,X

fdchevalier · 2022-02-15T17:29:47Z

Thank you @dstern for the quick answer.

I do use pepthresh, because it is set in the toy example. I don't know what it does, and it is not documented in the readme. So if it is not critical or useful, it might be good to comment it out in the example.

Regarding the small scaffolds, I should indeed probably remove (some of) them. We still have some of decent size (N90 is 25 Mb). What would you recommend as a size cutoff?

Thank you.

dstern · 2022-02-15T17:36:14Z

Don't remove the sequences from your genome files. You want the full sequence available for read mapping. Just specify the chromosomes to analyze. Which sizes to include depends on your genome and question. You probably just want to do some trial and error.
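One way to make that trial and error quicker is to generate the chroms line from scaffold lengths. The sketch below is only an illustration: it assumes a samtools-faidx-style index, where the first two tab-separated columns are the sequence name and its length, and the function name `chroms_setting`, the scaffold names, and the 1 Mb cutoff are all hypothetical.

```python
def chroms_setting(fai_lines, min_length):
    """Build a 'chroms = ...' line from faidx-style index lines.

    Hypothetical helper: keeps every sequence whose length (second
    tab-separated column) is at least min_length, in file order.
    """
    keep = []
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, length = fields[0], int(fields[1])
        if length >= min_length:
            keep.append(name)
    return "chroms = " + ",".join(keep)


# Made-up index lines: two chromosome-scale sequences and one tiny scaffold.
example_index = [
    "chr1\t80000000\t6\t60\t61",
    "chr2\t45000000\t81333345\t60\t61",
    "scaffold_0091\t12000\t127083352\t60\t61",
]

setting = chroms_setting(example_index, 1000000)
print(setting)  # chroms = chr1,chr2
```

The genome FASTA itself stays untouched; only the list passed to msg.cfg changes between trials.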

fdchevalier · 2022-02-15T17:48:19Z

Sorry for the bad wording. My plan was to remove the scaffolds from the analysis by following your advice, not from the genome. I will play with the pipeline settings then. Thank you.
