Skip to content

Conversation

@prototaxites
Copy link
Collaborator

The primary function of this PR is to update MarkerScan to be compatible with v2 of the NCBI API, replacing the dependency on ncbi-datasets-pylib which is deprecated.

To do this, I have added a small class in scripts/NcbiApi.py which replicates the required functionality using the requests library. In implementing this into the various scripts, I've also made a few changes to improve efficiency, including determining the number of assemblies directly from the API rather than counting them in a loop.

In order to run, currently you must set the environment variable NCBI_API_KEY in order to access the API. I don't know enough Snakemake to modify the Snakefile, but I've also added command-line arguments (-k) to the affected Python scripts to allow this to be specified manually.

In trying to get the modified version of the pipeline to run, I have also made a number of small changes:

  • Update minimap2 from 2.17 to 2.30 - 2.17 does not contain the map-hifi option used in the pipeline (are there an alternate set of env files somewhere?)
  • Update Seqtk from 1.3 to 1.5
  • Update BUSCO from 5.2.2 to 6.0.0 (necessary as the BUSCO database files have been updated)
  • Change the BUSCO config file so it doesn't try and set up the paths to the required dependencies, as this was causing problems (these are available as part of the conda env anyway)
  • Fix tabulate=0.8.0 in the Dockerfile for compatibity with the pinned version of Snakemake

@muffato
Copy link
Member

muffato commented Nov 6, 2025

Hi Jim.

An issue I had with the pipeline is that it downloads everything in its own space, essentially duplicating downloads we already have in /data/tol/resources. Is there a way we could pass existing download paths as a parameters and force it to use the data there ? (without attempting to update / replace anything).
Here are the three databases we already have:

  • busco/ (would need to add the --offline mode when running it).
  • taxonomy/
  • silva/

@prototaxites
Copy link
Collaborator Author

I am sure that would be doable - but I don't know any Snakemake or really understand very well the codebase here, so it would take a bit of effort to resolve

Copy link
Member

@muffato muffato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Use subprocess.check_call rather than os.system to check the return code
  • Commands that don't need wildcards, redirections, etc, should be defined as an array rather than a string

Comment on lines +12 to +31
def get_request(self, command: str) -> requests.Response:
"""
Returns a requests response object for a given NCBI api query (command).

args:
command -> str: request to append to NCBI API URI
"""
if self.ncbi_api_key != "":
headers = {"Accept": "application/json", "api-key": f"{self.ncbi_api_key}"}
else:
headers = {"Accept": "application/json"}

response = requests.get(self.ncbi_api_uri + command, headers=headers)

if response.status_code != 200:
raise Exception(
f"Cannot connect to NCBI (status code '{str(response.status_code)}')'"
)

return response
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest you copy https://github.com/sanger-tol/blobtoolkit/blob/0.9.0/bin/generate_config.py#L34-L42 into here, so that you can recover from transient network errors. (maybe adapt it for the POST request in download_genomes)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants