GENEscraping

The goal of GENEscraping is to use web scraping in R to retrieve the sequences of the genomic regions listed in a table of genomic coordinates, and to write them to FASTA files, either one file per sequence or all sequences combined in one file. There is also an option to write tags at the middle of the genomic coordinates in the FASTA files, which makes them suitable for rhAmpSeq panel primer design. The web scraping functions use RSelenium to connect to web drivers and retrieve information from websites. WARNING! The package will automatically open and use the chosen internet browser.

Installation

You can install the development version of GENEscraping from GitHub with:

# install.packages("devtools")
devtools::install_github("RaquelPinho/GENEscraping")

Example

This is a basic example of a genomic coordinate table that can be used to retrieve the fasta sequences:

library(GENEscraping)
coord_table <- tibble::tribble(
  ~Name,     ~Chr,    ~NCBI_ID,      ~start,   ~end,
  "target1", "chr1",  "NC_010443.5", 39157405, 39157425,
  "target2", "chr16", "NC_010458.4", 27277933, 27277956,
  "target3", "chr3",  "NC_010445.4", 46206427, 46206449
)
knitr::kable(coord_table)
|Name    |Chr   |NCBI_ID     |    start|      end|
|:-------|:-----|:-----------|--------:|--------:|
|target1 |chr1  |NC_010443.5 | 39157405| 39157425|
|target2 |chr16 |NC_010458.4 | 27277933| 27277956|
|target3 |chr3  |NC_010445.4 | 46206427| 46206449|

The example table contains regions of the porcine (Sus scrofa) genome. To get the list of NCBI URLs for the regions in the table, use the function get_coord_website. You can retrieve the regions exactly as stated in the table, or extend each region by n nucleotides upstream and downstream with the flank_n argument. In this example, 250 nucleotides were added on each side of every region.

weblist <- get_coord_website(coord_table = coord_table, flank_n = 250)
weblist
#> $target1
#> [1] "https://www.ncbi.nlm.nih.gov/nuccore/NC_010443.5?report=fasta&from=39157155&to=39157675"
#> 
#> $target2
#> [1] "https://www.ncbi.nlm.nih.gov/nuccore/NC_010458.4?report=fasta&from=27277683&to=27278206"
#> 
#> $target3
#> [1] "https://www.ncbi.nlm.nih.gov/nuccore/NC_010445.4?report=fasta&from=46206177&to=46206699"
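The URLs above follow a simple pattern: the flank is subtracted from the start and added to the end. A minimal base R sketch of that arithmetic is shown below. The helper build_ncbi_url is hypothetical and is not part of GENEscraping; it only illustrates how such a URL could be assembled.

```r
# Hypothetical helper (not a GENEscraping function): builds an NCBI nuccore
# FASTA URL for a region extended by flank_n nucleotides on each side.
build_ncbi_url <- function(ncbi_id, start, end, flank_n = 0) {
  sprintf(
    "https://www.ncbi.nlm.nih.gov/nuccore/%s?report=fasta&from=%.0f&to=%.0f",
    ncbi_id, start - flank_n, end + flank_n
  )
}

# Coordinates of target1 from coord_table, with the same 250 nt flank
url_target1 <- build_ncbi_url("NC_010443.5", 39157405, 39157425, flank_n = 250)
url_target1
```

With the target1 coordinates, the sketch reproduces the first URL of weblist.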

Once we have the URL for each set of coordinates, we can collect the FASTA sequences with get_fasta. This function uses RSelenium to open the NCBI URLs and extract the FASTA sequence from each page. WARNING! It will automatically open the chosen browser to retrieve the information. You can run binman::list_versions("chromedriver") to list the available chromedriver versions and set the chromever parameter of get_fasta accordingly.

fasta_list <- get_fasta(weblist = weblist, browser = "chrome", verbose = FALSE)
fasta_list
#> [[1]]
#> [[1]][[1]]
#> [1] ">NC_010443.5:39157155-39157675 Sus scrofa isolate TJ Tabasco breed Duroc chromosome 1, Sscrofa11.1, whole genome shotgun sequence target1"
#> 
#> [[1]][[2]]
#> [1] "TGCGTTCATGACTCCAAGAGAAACCTTTGATGAGCACCAAATGTTAGACCTGTCCTTCCTGGATGTTTATAAAAATTTATCCTAGAATGTATGCATAATCTTATTCACTGATAAGATGGTTCAATGGGAAAAACAATATTATCGGAAGCCATCTCTTAAAATGGCTTACCAAGTATGCTAAATTGTTCAGTTTGTCCTAAACATAACCCTGGAAAATCCATCTGAAATTTCACAGGTTATTATTTTTTTTAACCCTCCAGACCTTTTGAGGTGTGGCAAATGGATTTTATTGAGGTGCCATCATCTCAAGGTTGTAAATATTTATTGATAACAATTTGTATGTTCTCTCATTGGTTGAAGGATTTTCCTTGTTACACAGCCATGGCCACAGCGGTACATAAAGTCTTTTTGAGAAAAGTTTTTCCTACTTGAGGAATACCCTCTGAATAATGACAGAGGTTCCCATTTTAGTCAATAAGTAATTTCAATCTGTTTGTAAAATCAGGCTTACTTTATAACAT"
#> 
#> 
#> [[2]]
#> [[2]][[1]]
#> [1] ">NC_010458.4:27277683-27278206 Sus scrofa isolate TJ Tabasco breed Duroc chromosome 16, Sscrofa11.1, whole genome shotgun sequence target2"
#> 
#> [[2]][[2]]
#> [1] "AGGATAGTTTTGCTTCCTTGTCCCCCCTTTCTTCTGTCATTTGCCCAAATTCTACCTCTTGTCAAAGATAATGGTTCCTCCCTACCTCTGCTAACCTTTGGATTTTTAAAATGTTGCCAACAATGAAAGTGTATTTATAAAATTAATACATTTTAGAAATTTCCAATCACCAGATTAAAGGGACAGATGTGAACACAGCACTAACAGTGTAAATTATAAAGGACACTTAAACACGTATGCTCACCTTGCATCCTTCCAAGACCCTGTGAGGTGCCTGTGGTGGTCCTCTTTTGCACAGGAGGCATTGAGGTTAAGAGAATTTGAGAAGATTGTGGTATGTCACCCAACAAGGGGCAGAGCTAGAACTAGAACTCAGTCTGGCTGATGGAGAAGCCCCCATGTGTTCCTCTAACATGCTATGCTGCCTCCCAGCGATGGTGTATTCACTCCTTAGTAAGGCGTAAATAAAAACTTCCATAGTAAATAGTGTTCAGTTTTATGGTACTATAGACGTAGTAACATCA"
#> 
#> 
#> [[3]]
#> [[3]][[1]]
#> [1] ">NC_010445.4:46206177-46206699 Sus scrofa isolate TJ Tabasco breed Duroc chromosome 3, Sscrofa11.1, whole genome shotgun sequence target3"
#> 
#> [[3]][[2]]
#> [1] "TTGGGGAGAGGGGAAGGGATCAGAGTGGAGAAGGAGGTGTAGATGCAGGAAGCCCCAAAAGCTACCATGCCGTGGGTCAGACTCACAGCCCCAGGACTCGCTCCAGCTTCTTCGCTACCACCTGGCCCAGGCGCCAGCCAACTGGGTCGCTGCCCTTCTCCAGCTACCTGGAGCCCCAAGCTCCCAGCTGGGAGGCCCCCGCAGGTCTTGTGTGTCTTTGGCTCCCTGGCTGTAGGGAAATATTTGTGTTCGCACCTCAGAGATCTTGGAGGGCAAAGCCAGGGCGAGTCTGTGTCTGGGCCACAGTTGATGCTGCTGGGGCCTGGGGAGTGTCCCCCTCCTGCATCAGGGCAGTGGGTCTGCACTCAGCAAGCTAGAAGGAGTCTTTGTTCTTTTTCCAGGCAGCACCCCCCCACACACACACACCACAGTCTGAGCTGCTAGTGATTCAAGGGTTGTTCTCTGAGGATCTGGATGCCCCTCCTGACATGGTGCTTTTGGAGCTGGGAGGGAGCCCAAAGGC"
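The printed output suggests that fasta_list holds one element per region, each with the header line first and the sequence second. Assuming that structure, the sketch below shows one way to pull out the headers and sequence lengths with base R. The object fasta_toy is a made-up stand-in with shortened headers and sequences, so the example runs without a browser.

```r
# Toy stand-in for the get_fasta() output (assumed structure):
# one element per region, holding a header string and a sequence string.
fasta_toy <- list(
  list(">NC_010443.5:39157155-39157675 toy header target1", "TGCGTTCATG"),
  list(">NC_010458.4:27277683-27278206 toy header target2", "AGGATAGTTT")
)

# First entry of each element is the header, second is the sequence
headers <- vapply(fasta_toy, function(x) x[[1]], character(1))
seq_lengths <- vapply(fasta_toy, function(x) nchar(x[[2]]), integer(1))

headers
seq_lengths
```

The same two vapply calls should work on the real fasta_list if it has the structure shown in the output above.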

Now that we have the FASTA sequences, we can tag them at the target locations. To do this, a target sequence contained in each FASTA sequence needs to be provided. In this case we will tag the middle point of the target. The tag is a duplication of the 2 nucleotides at the site, written in brackets and separated by a slash: AAATGGTCT[TC/TC]GATTAAT. Here we use a data.frame that adds the target sequence to the genomic coordinates of coord_table:

target_table <- tibble::tribble(
                ~Name,~Chr,~NCBI_ID,~start,~end,~target,
                "target1","chr1","NC_010443.5",39157405,39157425,"ACACCTCAAAAGGTCTGGAGGGT",
                "target2","chr16","NC_010458.4",27277933,27277956,"GCACCTCACAGGGTCTTGGAAGG",
                "target3","chr3","NC_010445.4",46206427,46206449,"GCACCTC-AGAGATCTTGGAGGG"
)
target_table
#> # A tibble: 3 × 6
#>   Name    Chr   NCBI_ID        start      end target                 
#>   <chr>   <chr> <chr>          <dbl>    <dbl> <chr>                  
#> 1 target1 chr1  NC_010443.5 39157405 39157425 ACACCTCAAAAGGTCTGGAGGGT
#> 2 target2 chr16 NC_010458.4 27277933 27277956 GCACCTCACAGGGTCTTGGAAGG
#> 3 target3 chr3  NC_010445.4 46206427 46206449 GCACCTC-AGAGATCTTGGAGGG
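A target does not have to be on the same strand as the retrieved sequence. In the target1 region shown earlier, the target string occurs as its reverse complement. How tag_fasta locates targets is not described here, so the following is only a base R sketch of searching both strands. The helpers reverse_complement and find_target are hypothetical and not part of GENEscraping.

```r
# Hypothetical helpers (not GENEscraping functions).
# Reverse complement of a DNA string made of A, C, G and T.
reverse_complement <- function(dna) {
  bases <- rev(strsplit(dna, "")[[1]])
  chartr("ACGT", "TGCA", paste(bases, collapse = ""))
}

# Look for the target on the given strand first, then as reverse complement.
# Returns the strand and the 1-based start position, or NULL if not found.
find_target <- function(sequence, target) {
  fwd <- as.integer(regexpr(target, sequence, fixed = TRUE))
  if (fwd > 0) return(list(strand = "+", start = fwd))
  rc <- as.integer(regexpr(reverse_complement(target), sequence, fixed = TRUE))
  if (rc > 0) return(list(strand = "-", start = rc))
  NULL
}

# Short fragment copied from the target1 sequence, and the target1 string
fragment <- "AACCCTCCAGACCTTTTGAGGTGTGGC"
hit <- find_target(fragment, "ACACCTCAAAAGGTCTGGAGGGT")
hit
```

Note that this sketch uses exact matching, so a target containing a gap character such as "-" (as in target3) would need extra handling.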

We can use target_table to tag the sequences in fasta_list:

fasta_list_tagged <- tag_fasta(
                            fasta_list = fasta_list,
                            target = target_table,
                            tag_site = 4
                            )
fasta_list_tagged
#> [[1]]
#> [[1]][[1]]
#> [1] ">NC_010443.5:39157155-39157675 Sus scrofa isolate TJ Tabasco breed Duroc chromosome 1, Sscrofa11.1, whole genome shotgun sequence target1"
#> 
#> [[1]][[2]]
#> [1] "TGCGTTCATGACTCCAAGAGAAACCTTTGATGAGCACCAAATGTTAGACCTGTCCTTCCTGGATGTTTATAAAAATTTATCCTAGAATGTATGCATAATCTTATTCACTGATAAGATGGTTCAATGGGAAAAACAATATTATCGGAAGCCATCTCTTAAAATGGCTTACCAAGTATGCTAAATTGTTCAGTTTGTCCTAAACATAACCCTGGAAAATCCATCTGAAATTTCACAGGTTATTATTTTTTTTAACCCTCCAGACCTTTTGAGGTGTGGCAAATGGATTTTATTGAGGTGCCATCATCTCAAGGTTGTAAATATTTATTGATAACAATTTGTATGTTCTCTCATTGGTTGAAGGATTTTCCTTGTTACACAGCCATGGCCACAGCGGTACATAAAGTCTTTTTGAGAAAAGTTTTTCCTACTTGAGGAATACCCTCTGAATAATGACAGAGGTTCCCATTTTAGTCAATAAGTAATTTCAATCTGTTTGTAAAATCAGGCTTACTTTATAACAT"
#> 
#> [[1]][[3]]
#> [1] "TGCGTTCATGACTCCAAGAGAAACCTTTGATGAGCACCAAATGTTAGACCTGTCCTTCCTGGATGTTTATAAAAATTTATCCTAGAATGTATGCATAATCTTATTCACTGATAAGATGGTTCAATGGGAAAAACAATATTATCGGAAGCCATCTCTTAAAATGGCTTACCAAGTATGCTAAATTGTTCAGTTTGTCCTAAACATAACCCTGGAAAATCCATCTGAAATTTCACAGGTTATTATTTTTTTTAACCCTCCAGACCTTTTGAG[GT/GT]GTGGCAAATGGATTTTATTGAGGTGCCATCATCTCAAGGTTGTAAATATTTATTGATAACAATTTGTATGTTCTCTCATTGGTTGAAGGATTTTCCTTGTTACACAGCCATGGCCACAGCGGTACATAAAGTCTTTTTGAGAAAAGTTTTTCCTACTTGAGGAATACCCTCTGAATAATGACAGAGGTTCCCATTTTAGTCAATAAGTAATTTCAATCTGTTTGTAAAATCAGGCTTACTTTATAACAT"
#> 
#> 
#> [[2]]
#> [[2]][[1]]
#> [1] ">NC_010458.4:27277683-27278206 Sus scrofa isolate TJ Tabasco breed Duroc chromosome 16, Sscrofa11.1, whole genome shotgun sequence target2"
#> 
#> [[2]][[2]]
#> [1] "AGGATAGTTTTGCTTCCTTGTCCCCCCTTTCTTCTGTCATTTGCCCAAATTCTACCTCTTGTCAAAGATAATGGTTCCTCCCTACCTCTGCTAACCTTTGGATTTTTAAAATGTTGCCAACAATGAAAGTGTATTTATAAAATTAATACATTTTAGAAATTTCCAATCACCAGATTAAAGGGACAGATGTGAACACAGCACTAACAGTGTAAATTATAAAGGACACTTAAACACGTATGCTCACCTTGCATCCTTCCAAGACCCTGTGAGGTGCCTGTGGTGGTCCTCTTTTGCACAGGAGGCATTGAGGTTAAGAGAATTTGAGAAGATTGTGGTATGTCACCCAACAAGGGGCAGAGCTAGAACTAGAACTCAGTCTGGCTGATGGAGAAGCCCCCATGTGTTCCTCTAACATGCTATGCTGCCTCCCAGCGATGGTGTATTCACTCCTTAGTAAGGCGTAAATAAAAACTTCCATAGTAAATAGTGTTCAGTTTTATGGTACTATAGACGTAGTAACATCA"
#> 
#> [[2]][[3]]
#> [1] "AGGATAGTTTTGCTTCCTTGTCCCCCCTTTCTTCTGTCATTTGCCCAAATTCTACCTCTTGTCAAAGATAATGGTTCCTCCCTACCTCTGCTAACCTTTGGATTTTTAAAATGTTGCCAACAATGAAAGTGTATTTATAAAATTAATACATTTTAGAAATTTCCAATCACCAGATTAAAGGGACAGATGTGAACACAGCACTAACAGTGTAAATTATAAAGGACACTTAAACACGTATGCTCACCTTGCATCCTTCCAAGACCCTGTGAG[GT/GT]GCCTGTGGTGGTCCTCTTTTGCACAGGAGGCATTGAGGTTAAGAGAATTTGAGAAGATTGTGGTATGTCACCCAACAAGGGGCAGAGCTAGAACTAGAACTCAGTCTGGCTGATGGAGAAGCCCCCATGTGTTCCTCTAACATGCTATGCTGCCTCCCAGCGATGGTGTATTCACTCCTTAGTAAGGCGTAAATAAAAACTTCCATAGTAAATAGTGTTCAGTTTTATGGTACTATAGACGTAGTAACATCA"
#> 
#> 
#> [[3]]
#> [[3]][[1]]
#> [1] ">NC_010445.4:46206177-46206699 Sus scrofa isolate TJ Tabasco breed Duroc chromosome 3, Sscrofa11.1, whole genome shotgun sequence target3"
#> 
#> [[3]][[2]]
#> [1] "TTGGGGAGAGGGGAAGGGATCAGAGTGGAGAAGGAGGTGTAGATGCAGGAAGCCCCAAAAGCTACCATGCCGTGGGTCAGACTCACAGCCCCAGGACTCGCTCCAGCTTCTTCGCTACCACCTGGCCCAGGCGCCAGCCAACTGGGTCGCTGCCCTTCTCCAGCTACCTGGAGCCCCAAGCTCCCAGCTGGGAGGCCCCCGCAGGTCTTGTGTGTCTTTGGCTCCCTGGCTGTAGGGAAATATTTGTGTTCGCACCTCAGAGATCTTGGAGGGCAAAGCCAGGGCGAGTCTGTGTCTGGGCCACAGTTGATGCTGCTGGGGCCTGGGGAGTGTCCCCCTCCTGCATCAGGGCAGTGGGTCTGCACTCAGCAAGCTAGAAGGAGTCTTTGTTCTTTTTCCAGGCAGCACCCCCCCACACACACACACCACAGTCTGAGCTGCTAGTGATTCAAGGGTTGTTCTCTGAGGATCTGGATGCCCCTCCTGACATGGTGCTTTTGGAGCTGGGAGGGAGCCCAAAGGC"
#> 
#> [[3]][[3]]
#> [1] "TTGGGGAGAGGGGAAGGGATCAGAGTGGAGAAGGAGGTGTAGATGCAGGAAGCCCCAAAAGCTACCATGCCGTGGGTCAGACTCACAGCCCCAGGACTCGCTCCAGCTTCTTCGCTACCACCTGGCCCAGGCGCCAGCCAACTGGGTCGCTGCCCTTCTCCAGCTACCTGGAGCCCCAAGCTCCCAGCTGGGAGGCCCCCGCAGGTCTTGTGTGTCTTTGGCTCCCTGGCTGTAGGGAAATATTTGTGTTCGCAC[CT/CT]CAGAGATCTTGGAGGGCAAAGCCAGGGCGAGTCTGTGTCTGGGCCACAGTTGATGCTGCTGGGGCCTGGGGAGTGTCCCCCTCCTGCATCAGGGCAGTGGGTCTGCACTCAGCAAGCTAGAAGGAGTCTTTGTTCTTTTTCCAGGCAGCACCCCCCCACACACACACACCACAGTCTGAGCTGCTAGTGATTCAAGGGTTGTTCTCTGAGGATCTGGATGCCCCTCCTGACATGGTGCTTTTGGAGCTGGGAGGGAGCCCAAAGGC"
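To make the tag format concrete, here is a small base R sketch that duplicates the two nucleotides at a given position inside brackets. The helper tag_at is hypothetical and is not the implementation of tag_fasta; it only reproduces the notation on the short example sequence used in the text.

```r
# Hypothetical helper (not a GENEscraping function): replaces the two
# nucleotides starting at position pos with "[XY/XY]".
tag_at <- function(sequence, pos) {
  dinuc <- substr(sequence, pos, pos + 1)
  paste0(
    substr(sequence, 1, pos - 1),
    "[", dinuc, "/", dinuc, "]",
    substr(sequence, pos + 2, nchar(sequence))
  )
}

# Untagged form of the example from the text, tagged at position 10
tagged <- tag_at("AAATGGTCTTCGATTAAT", 10)
tagged
```

The result matches the tag notation shown earlier, AAATGGTCT[TC/TC]GATTAAT.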

Whether or not the fasta_list is tagged, you can use it to write FASTA files with the function write_fasta.

# path_to_file must be defined before running this call
# write_fasta(fasta_list = fasta_list, named = TRUE, contain_flag = TRUE,
#             tagged = FALSE, combined = TRUE, path_to_file = path_to_file,
#             width = 60, append = FALSE)
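For readers who want to see what writing a record with a fixed line width involves, here is a base R sketch. It assumes that width refers to the number of nucleotides per line, which is not confirmed here. The helper wrap_sequence and the toy record are hypothetical, and the sketch does not use write_fasta itself.

```r
# Hypothetical helper (not a GENEscraping function): splits a non-empty
# sequence into chunks of at most `width` characters.
wrap_sequence <- function(sequence, width = 60) {
  starts <- seq(1, nchar(sequence), by = width)
  substring(sequence, starts, pmin(starts + width - 1, nchar(sequence)))
}

# Toy record: a header and a 140 nt sequence
record <- list(">toy_record target1", strrep("ACGT", 35))

# Write the header followed by the wrapped sequence to a temporary file
out_file <- tempfile(fileext = ".fasta")
writeLines(c(record[[1]], wrap_sequence(record[[2]], width = 60)), out_file)

written <- readLines(out_file)
written
```

The file contains the header line followed by sequence lines of 60, 60 and 20 nucleotides.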

License

MIT (LICENSE.md). The repository also contains a LICENSE file whose license type was not detected.