newssurvey is a proof-of-concept Python 3.12 application to write a survey report about a question or concern using a single supported news site. The news site is used to conduct searches and read articles. Currently only two sites are supported. Numerous calls are made to OpenAI LLMs, namely gpt-4o-mini and gpt-4o, to formulate the response. A funded OpenAI API key is required.
The supported sources are:
Name | Type | Observed LLM cost range per report in USD |
---|---|---|
medicalxpress | medical | 1 to 6 |
physorg | science | 1 to 22 |
The LLM cost per report varies by the number of source articles and output sections for the submitted user query. The cost is approximately 1 USD per 100 source articles (or 50 cited articles) per 10 output sections. Strictly speaking, the cost is unbounded and must be monitored and restricted via the OpenAI usage dashboard. The generation time per report is expected to be under an hour, also depending on the number of source articles.
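The cost rule of thumb above can be expressed as a rough estimator. The function below is illustrative only and is not part of the package:

```python
def estimate_cost_usd(num_source_articles: int, num_sections: int) -> float:
    """Roughly estimate the LLM cost in USD using the stated rule of thumb:
    ~1 USD per 100 source articles per 10 output sections."""
    return (num_source_articles / 100) * (num_sections / 10)

# Example: 300 source articles and 20 output sections suggest roughly 6 USD.
print(estimate_cost_usd(300, 20))  # → 6.0
```

Actual costs vary with article lengths and prompt sizes, so this is a planning aid, not a guarantee; the OpenAI usage dashboard remains the authoritative source.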
Caption | Link |
---|---|
Repo | https://github.com/impredicative/newssurvey |
Changelog | https://github.com/impredicative/newssurvey/releases |
Package | https://pypi.org/project/newssurvey |
Each step in this workflow corresponds to an action taken by the LLM.
- Get search terms: Search terms for the given user query and site are listed by the LLM. The user query is a question or concern applicable to the user chosen news site. Additional search terms are also obtained until convergence.
- Get filtered search results: For each search term, a single page of search results is retrieved. More than one search type may be supported by the site, in which case all supported search types are used. Each result is composed of a title and possibly a blurb. The search results are filtered, one page at a time, for relevance by the LLM. This step is repeated for additional pages of search results until there are no relevant results for the page. After this, the full texts of all filtered search results are read.
- List section names: The list of article titles is presented to the LLM, ordered by distance to the user query. The LLM provides a coherent single-level list of section names. The list is then refined until convergence.
- Rate articles for sections: For each article, the LLM numerically rates on a scale of 0 to 100 how well the article can contribute to each section.
- Condense article by section: For each article and section pairing, limited to ones with nonzero ratings, the LLM condenses the article text.
- Filter articles by section: For each section, its available condensed articles are filtered for subsequent usage. This filtering is different from the prior rating step because it is done for a section at a time versus for an article at a time. If there are more articles than can fit in the input context length, batching is used.
- Get text by section: For each section, its condensed articles are concatenated together, ordered by their corresponding ratings, up to the maximum input context length of the LLM. The LLM formulates the text for each section. The section-specific citation numbers are replaced by globally consistent numbers.
- Get response title: The LLM provides the response title using the list of section names.
The workflow is intended to be as simple as necessary, and without cycles between steps.
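The steps above can be sketched as a runnable outline. Every helper below is a trivial stand-in for an LLM-backed step; the real application calls OpenAI models and the news site at each point, and these function names are hypothetical, not the package's API:

```python
def get_search_terms(query: str) -> list[str]:
    return [query]  # real step: LLM lists terms, iterating until convergence

def search(terms: list[str]) -> list[dict]:
    # real step: site search, filtered page by page by the LLM for relevance
    return [{"title": t, "text": f"Full text about {t}."} for t in terms]

def list_section_names(query: str, articles: list[dict]) -> list[str]:
    return ["Background", "Findings"]  # real step: LLM proposes and refines sections

def rate(article: dict, section: str) -> int:
    return 50  # real step: LLM rates 0 to 100 how well the article fits the section

def write_section(section: str, articles: list[dict]) -> str:
    # real step: LLM writes the section from condensed articles within the context limit
    return f"## {section}\n" + " ".join(a["text"] for a in articles)

def generate_report(query: str) -> str:
    articles = search(get_search_terms(query))
    sections = list_section_names(query, articles)
    rated = {s: [a for a in articles if rate(a, s) > 0] for s in sections}
    body = "\n".join(write_section(s, rated[s]) for s in sections)
    title = f"Report: {query}"  # real step: LLM titles the response from section names
    return f"# {title}\n{body}"

print(generate_report("daytime drowsiness"))
```

The linear shape of this sketch reflects the stated design goal: each step feeds the next, with no cycles between steps.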
Due to the LLM's context window limit of 128K tokens, only up to about 400 condensed articles can be used for writing a section. Efforts are made to use the most highly rated, filtered, section-specific articles that fit in this window.
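The selection of rating-ordered articles that fit the context window can be illustrated with a simple token-budget loop. The token approximation and limit below are illustrative only; the actual application presumably uses a real tokenizer:

```python
def select_articles(condensed: list[tuple[int, str]], max_tokens: int) -> list[str]:
    """Pick the highest-rated condensed articles whose combined length fits the
    token budget. Each item is a (rating, text) pair; the token count is
    approximated here as one token per ~4 characters (not a real tokenizer)."""
    chosen, used = [], 0
    for rating, text in sorted(condensed, key=lambda pair: pair[0], reverse=True):
        tokens = len(text) // 4 + 1
        if used + tokens > max_tokens:
            break  # remaining, lower-rated articles are dropped
        chosen.append(text)
        used += tokens
    return chosen

articles = [(90, "a" * 400), (70, "b" * 400), (40, "c" * 400)]
print(len(select_articles(articles, max_tokens=250)))  # → 2 (two highest-rated fit)
```

With roughly 101 approximate tokens per article here, only the two highest-rated articles fit the 250-token budget; the lowest-rated one is dropped, mirroring how lower-rated articles are excluded once the window is full.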
These generated samples are available in HTML format.
Source | User query (simplified) | Output title |
---|---|---|
medicalxpress | nutrition for anxiety | Nutritional Approaches and Supplements for Managing Anxiety in Adults: An Evidence-Based Review |
medicalxpress | daytime drowsiness | Addressing Daytime Drowsiness: Causes, Effects, and Solutions for Improved Nighttime Sleep |
medicalxpress | acid reflux treatments | Comprehensive Approaches to Managing GERD: Lifestyle, Diet, and Innovative Therapies |
physorg | dark matter theories | Comprehensive Exploration of Dark Matter Theories and Detection Approaches |
physorg | multiverse theories | Exploring Multiverse Theories: Concepts, Evidence, and Implications |
- In the working directory, create a file named `.env` containing the environment variable `OPENAI_API_KEY=<your OpenAI API key>`, or set the variable in a different way.
- Continue the setup via GitHub or PyPI as below.
- Continue from the common setup steps.
- Clone or download this repo.
- Build and provision the defined devcontainer.
- Continue from the common setup steps.
- Clone or download this repo.
- Ensure that `rye` is installed and available.
- In the repo directory, run `rye sync --no-lock`.
- Continue from the common setup steps.
- Create and activate a Python 3.12 devcontainer or virtual environment.
- Install via PyPI: `pip install -U newssurvey`.
Usage can be as a command-line application or as a Python library.
- Refining the query text over a few iterations is often essential for receiving a sufficiently tailored response.
- A two-person podcast style audio file can be freely created from the text output file using Google NotebookLM.
- Only a single instance of the application must be run at a time; otherwise, aggressive throttling can be imposed by the source website and by OpenAI. This is also enforced at the application level by the use of a lock file.
- Do not browse the source website from the same IP address when a search is running, as this will result in throttling errors.
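The single-instance enforcement mentioned above can be sketched with an exclusively created lock file. The class, path, and error handling below are an illustration of the mechanism, not the application's actual implementation:

```python
import os

class SingleInstanceLock:
    """Minimal single-instance guard: create the lock file exclusively on entry
    and remove it on exit. A second concurrent instance fails to create the
    file and is refused."""

    def __init__(self, path: str = "/tmp/newssurvey.lock"):  # illustrative path
        self.path = path

    def __enter__(self):
        try:
            # O_EXCL makes creation fail if the file already exists.
            self._fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            raise RuntimeError("Another instance appears to be running.")
        return self

    def __exit__(self, *exc):
        os.close(self._fd)
        os.remove(self.path)
```

A real implementation would also need to cope with stale lock files left behind by a crashed process, which this sketch does not handle.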
In the simplest case, run any one of these commands to interactively start the application. You will be prompted for the necessary information.
$ python -m newssurvey
$ rye run python -m newssurvey
$ rye run newssurvey
For non-interactive use, the usage help is copied below:
$ python -m newssurvey -h
Usage: python -m newssurvey [OPTIONS]
Generate and write a response to a question or concern using a supported news source.
The progress is printed to stdout. A nonzero exit code is returned if there is an error. A single instance is enforced.
Options:
-s, --source TEXT Name of supported news source. If not given, the user is prompted for it.
-q, --query TEXT Question or concern answerable by the news source. If a path to a file is given,
the file's content is read as the query text. If not given, the user is prompted for it.
-m, --max-sections INTEGER RANGE
Maximum number of sections to include in the response, between 5 and 100. Its
recommended value, also the default, is 100. [5<=x<=100]
-f, --output-format TEXT Output format of the response. It can be txt (for text), md (for markdown), gfm.md
(for GitHub Flavored markdown), html, pdf, or json. If not specified, but if an
output filename is specified via '--output-path', it is determined automatically
from the file extension. If not specified, and if an output filename is not
specified either, its default is txt.
-o, --output-path PATH Output directory path or file path. If intended as a directory path, it must exist,
and the file name is auto-determined. If intended as a file path, its extension can
be txt (for text), md (for markdown), gfm.md (for GitHub Flavored markdown), html,
pdf, or json. If not specified, the output file is written to the current working
directory with an auto-determined file name. The response is written to the file
except if there is an error.
-c, --confirm / -nc, --no-confirm
Confirm as the workflow progresses. If `--confirm`, a confirmation is interactively
sought as each step of the workflow progresses, and this is the default. If `--no-
confirm`, the workflow progresses without any confirmation.
-h, --help Show this message and exit.
Usage examples:
$ python -m newssurvey -s medicalxpress -q "safe strategies for weight loss" -f txt -o ~ -nc
$ python -m newssurvey -s medicalxpress -q ./my_detailed_medical_concern.txt -f html -o ~/output.html -c
$ python -m newssurvey -s physorg -q ./my_science_query.txt -f pdf -o ./work/ -m 10
>>> from newssurvey import generate_response
>>> import inspect
>>> print(inspect.signature(generate_response))
(source: str, query: str, max_sections: int = 100, output_format: Optional[str] = 'txt', confirm: bool = False) -> newssurvey.types.Response
>>> print(inspect.getdoc(generate_response))
Return a response for the given source and query.
The returned response contains the attributes: format, title, response.
The progress is printed to stdout.
Params:
* `source`: Name of supported news source.
* `query`: Question or concern answerable by the news source.
* `max_sections`: Maximum number of sections to include in the response, between 5 and 100. Its recommended value, also the default, is 100.
* `output_format`: Output format. It can be txt (for text), md (for markdown), gfm.md (for GitHub Flavored markdown), html, pdf, or json. Its default is txt.
* `confirm`: Confirm as the workflow progresses. If true, a confirmation is interactively sought as each step of the workflow progresses. Its default is false.
If failed, a subclass of the `newssurvey.exceptions.Error` exception is raised.
An extensive disk cache is stored locally to cache website and LLM outputs with a fixed expiration period. This is in the `[src]/newssurvey/.diskcache` directory. The expiration period is 1 week for website searches and 52 weeks for everything else, also subject to separate disk usage limits. To reuse the cache, rerun the same user query within this period. To bypass the cache, alter the user query, or delete the appropriate cache subdirectory. Updates to the LLM prompts also bypass the cache.
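Fixed-expiry caching of the kind described can be illustrated with the standard library alone. The class below is only a minimal in-memory sketch of the idea, not the on-disk cache the application actually uses:

```python
import time

class ExpiringCache:
    """Minimal in-memory illustration of fixed-expiry caching. Each entry
    records its deadline; a read past the deadline evicts the entry and
    reports a miss, forcing the caller to recompute (or re-fetch)."""

    def __init__(self):
        self._store = {}

    def set(self, key, value, expire_s: float):
        self._store[key] = (value, time.monotonic() + expire_s)

    def get(self, key, default=None):
        if key in self._store:
            value, deadline = self._store[key]
            if time.monotonic() < deadline:
                return value
            del self._store[key]  # expired: evict and miss
        return default

cache = ExpiringCache()
cache.set("search:dark matter", ["result1"], expire_s=7 * 24 * 3600)  # 1 week
print(cache.get("search:dark matter"))  # → ['result1']
```

The application's shorter expiry for website searches (1 week) versus everything else (52 weeks) corresponds to passing different `expire_s` values per category of entry.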
The LLM is prompted to always output in a basic text format. Following this, the text is structured into the user-requested output format without using the LLM. Rewriting the output into a new format is therefore possible offline until the earliest cache expiration, typically for 1 week.
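Structuring the LLM's basic text output into a requested format without further LLM calls can be illustrated as below. The intermediate (title, sections) representation is hypothetical, not the application's actual internal format:

```python
def to_markdown(title: str, sections: list[tuple[str, str]]) -> str:
    """Render a (title, sections) structure as markdown. The same structure
    could equally be rendered to HTML, plain text, or JSON, which is why a
    new output format can be produced offline from cached LLM output."""
    parts = [f"# {title}"]
    for name, text in sections:
        parts.append(f"## {name}\n\n{text}")
    return "\n\n".join(parts)

doc = to_markdown("Sample Report", [("Background", "Some text."), ("Findings", "More text.")])
print(doc)
```

Because the formatting step is deterministic, regenerating the report in a different format only needs the cached text, not a new (and costly) round of LLM calls.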
This software is provided as a proof-of-concept application and is distributed under the LGPL license. It is offered without any guarantees or warranties, either expressed or implied, including but not limited to the implied warranties of merchantability, fitness for a particular purpose, or non-infringement.
Users are responsible for ensuring that they have the necessary API keys, permissions, and access to third-party services such as the OpenAI API, which are required for full functionality. The costs associated with using the OpenAI API, including those outlined in this documentation, are subject to change and must be monitored independently by the user.
The software relies on third-party services and content from news sites. The availability, accuracy, or relevance of content from these external sources cannot be guaranteed, nor can the continued accessibility of these services be ensured in the future. The accuracy and reliability of reports generated by the software depend on the quality of input queries, availability of articles, and the performance of language models, all of which are subject to change and influenced by external factors beyond the control of the software.
While efforts have been made to optimize the performance and output of this software, users should independently verify any information generated, particularly if it is intended for use in professional, medical, scientific, technical, legal, or other high-stakes contexts. Use of this software is at your own risk. This software should not be used as the sole basis for any serious, life-impacting decisions. Always consult relevant professionals or authoritative sources directly for such purposes.
By using this software, you agree that its developers and contributors shall not be held liable for any damages, costs, or losses arising from its use, including but not limited to direct, indirect, incidental, consequential, or punitive damages. Users are encouraged to thoroughly review its source code to understand the workings of the application and assess its suitability for their intended use.
The authors do not claim ownership of any content generated using this software. Responsibility for the use of any and all generated content rests with the user. Users should exercise caution and due diligence to ensure that generated content does not infringe on the rights of third parties.
This disclaimer is subject to change without notice. It is your responsibility to review it periodically for updates.