Open news data for machine learning, NLP, and sentiment analysis — provided free by Newsdata.io.
A curated collection of free, downloadable news datasets covering business,
sports, entertainment, health, science and technology, world politics, COVID-19,
the Ukraine conflict, and more — aggregated from a wide range of reliable
publishers across many languages and countries. Each archive ships in both
CSV and JSON, with rich metadata and AI-extracted annotations
(ai_tag, ai_region, ai_org, sentiment) so you can skip the
preprocessing and get straight to your model.
Use them as ready-made news corpora for:
- Machine learning — text classification, clustering, embeddings, LLM fine-tuning
- Natural language processing (NLP) — named entity recognition, topic modelling, summarization
- Sentiment analysis — labelled sentiment + per-sentiment confidence stats out of the box
- Trend analysis — surface what publishers are covering across time
- News aggregation and search demos
- Geopolitical and regional analytics — `ai_region` / `ai_org` enrichments included
- Journalism and academic research
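As a rough illustration of the trend and sentiment use cases above, the sketch below counts articles per month and sentiment label using only the standard library. The rows are made-up placeholders, and the `"YYYY-MM-DD HH:MM:SS"` layout of `pubDate` is an assumption; check the sample files before relying on it.

```python
from collections import Counter

# Hypothetical rows for illustration only; real rows come from the CSV/JSON files.
# The "YYYY-MM-DD HH:MM:SS" pubDate layout is an assumption, not confirmed here.
ROWS = [
    {"pubDate": "2024-03-05 14:30:00", "category": "business", "sentiment": "positive"},
    {"pubDate": "2024-03-18 09:00:00", "category": "business", "sentiment": "negative"},
    {"pubDate": "2024-04-02 21:15:00", "category": "business", "sentiment": "positive"},
    {"pubDate": "2024-04-11 07:45:00", "category": "business", "sentiment": "positive"},
]


def sentiment_by_month(rows):
    """Count articles per (year-month, sentiment label) pair."""
    counts = Counter()
    for row in rows:
        month = row["pubDate"][:7]  # "YYYY-MM" prefix of the assumed layout
        counts[(month, row["sentiment"])] += 1
    return counts


counts = sentiment_by_month(ROWS)
for (month, label), n in sorted(counts.items()):
    print(month, label, n)
```

The same grouping can be done with `pandas` once a full archive is loaded; the plain-Python version is shown only to keep the example dependency-free.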
Every archive expands to a folder containing the same dataset in CSV and JSON form.
| Category | File | Size (compressed) |
|---|---|---|
| Business news | Business_News.zip | ~17 MB |
| Sports news | Sports_News.7z | ~19 MB |
| Entertainment news | Entertainment_News.zip | ~17 MB |
| Science and technology news | Science_Technology_News.zip | ~13 MB |
| Health news | Health_News.zip | ~4 MB |
| World politics news | World_Politics_News.zip | ~6 MB |
| COVID-19 news | Covid_News.zip | ~9 MB |
| COVID and vaccine news | Covid_and_Vaccine_News.zip | ~10 MB |
| Ukraine news | Ukraine_News.zip | ~5 MB |
Want a quick look at the schema before downloading? Tiny samples for the news
and crypto-news streams live in sample/ (CSV + JSON + XLSX,
under 500 KB each).
Datasets ship as CSV and JSON (samples also include XLSX). Every article
shares the same flat schema:
| Field | Type | Description |
|---|---|---|
| `article_id` | string | Stable unique identifier for the article |
| `title` | string | Article headline |
| `link` | string | URL to the original article |
| `description` | string | Short summary / lede |
| `content` | string | Full article text |
| `keywords` | string | Comma-separated keywords |
| `creator` | string | Author / byline |
| `pubDate` | datetime | Publication date (UTC) |
| `image_url` | string | Hero / lead image URL |
| `video_url` | string | Embedded video URL, when present |
| `source_id` | string | Publisher identifier |
| `source_priority` | int | Publisher priority rank |
| `source_url` | string | Publisher domain |
| `source_icon` | string | Publisher favicon URL |
| `language` | string | ISO 639-1 language code (en, es, …) |
| `country` | string | ISO 3166-1 alpha-2 country code (us, gb, …) |
| `category` | string | Top-level category |
| `ai_tag` | string | AI-extracted topic tag(s) |
| `ai_region` | string | AI-extracted geographic region |
| `ai_org` | string | AI-extracted organisation mention |
| `sentiment` | string | positive / neutral / negative |
| `sentiment_stats` | object | Per-sentiment confidence scores |
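To make the schema concrete, here is a minimal sketch of normalizing one record into typed Python values. The record is a made-up placeholder, and two details are assumptions rather than facts from this repository: the `"YYYY-MM-DD HH:MM:SS"` layout of `pubDate`, and `sentiment_stats` holding one score per label (possibly serialized as a JSON string in the CSV files). The helper name `normalize` is hypothetical.

```python
import json
from datetime import datetime, timezone

# Hypothetical record following the schema above; all values are placeholders.
RAW = {
    "article_id": "abc123",
    "title": "Example headline",
    "keywords": "markets, earnings, tech",
    "pubDate": "2024-03-05 14:30:00",  # assumed layout, documented as UTC
    "language": "en",
    "country": "us",
    "sentiment": "positive",
    # Assumed shape: one confidence score per sentiment label.
    "sentiment_stats": {"positive": 0.81, "neutral": 0.15, "negative": 0.04},
}


def normalize(record):
    """Return a copy of the record with typed pubDate, keywords and sentiment_stats."""
    out = dict(record)
    # Parse the timestamp and attach the UTC timezone stated in the schema.
    parsed = datetime.strptime(record["pubDate"], "%Y-%m-%d %H:%M:%S")
    out["pubDate"] = parsed.replace(tzinfo=timezone.utc)
    # Split the comma-separated keywords into a clean list.
    raw_keywords = record.get("keywords") or ""
    out["keywords"] = [k.strip() for k in raw_keywords.split(",") if k.strip()]
    # Accept either a dict (JSON files) or a JSON string (assumed for CSV files).
    stats = record.get("sentiment_stats") or {}
    if isinstance(stats, str):
        stats = json.loads(stats)
    out["sentiment_stats"] = stats
    return out


article = normalize(RAW)
print(article["pubDate"].isoformat(), article["keywords"])
```

If the real files use a different date layout or a different encoding for `sentiment_stats`, only the parsing lines inside `normalize` need to change.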
Clone the whole repository (or download a single archive from GitHub's file view) and unpack the category you need:
```shell
git clone https://github.com/newsdataapi/newsdata.io-free-datasets.git
cd newsdata.io-free-datasets
unzip Business_News.zip   # for .zip archives
7z x Sports_News.7z       # for .7z (install p7zip if needed)
```

Load the CSV with pandas:

```python
import pandas as pd

df = pd.read_csv("Business_News/Business_News.csv")
print(df.shape)
print(df[["title", "language", "country", "sentiment"]].head())
```

Load it with Hugging Face `datasets`:

```python
from datasets import load_dataset

ds = load_dataset("csv", data_files="Business_News/Business_News.csv")
print(ds["train"].features)
```

Read the JSON from Node.js:

```javascript
import fs from 'node:fs';

const data = JSON.parse(fs.readFileSync('Business_News/Business_News.json'));
console.log(data.length, 'articles');
```

- Free and openly licensed — released under CC BY 4.0, usable in commercial and non-commercial projects with simple attribution
- Multi-category — nine distinct verticals from finance to sport to geopolitics
- Multi-language and multi-country — sources from publishers around the world
- AI-enriched out of the box — every article ships with AI-generated tags, region/organisation mentions, and sentiment scores, so you can skip a lot of preprocessing
- Two formats included — pick CSV for tabular tools (`pandas`, `R`, BI), JSON for streaming and NoSQL workflows; samples also include XLSX
- Schema-matched to a live API — when you need fresh data, the same schema is available in real time via the Newsdata.io REST API
These archives are static snapshots. For real-time and historical news in the same schema, use the Newsdata.io API or one of the official SDKs:
- REST API documentation — https://newsdata.io/documentation
- Python SDK — `newsdataapi` on PyPI
- PHP SDK — `newsdataio/newsdataapi` on Packagist
- Node.js SDK — `newsdata-nodejs-client` on npm
- React SDK — `newsdataapi` on npm
Sign up for a free Newsdata.io API key to get started — no credit card required.
Newsdata.io provides high-quality real-time and historical news data through a developer-friendly REST API, sourced from thousands of reliable publishers across 200+ countries and 75+ languages. Our mission is to make news data easily accessible for analytics, research, journalism, and AI/ML applications.
These datasets are released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license — see LICENSE for the full text.
You are free to share (copy and redistribute) and adapt (remix, transform, build upon) the datasets for any purpose, including commercial use, provided you give appropriate attribution to Newsdata.io and indicate if changes were made.
A simple attribution line is enough, e.g.:
News data provided by Newsdata.io, licensed under CC BY 4.0.
Note on the underlying news content: The articles aggregated in these datasets were originally authored by their respective publishers and remain subject to those publishers' own copyrights. The CC BY 4.0 license here applies to the dataset compilation (a "sui generis" database right under CC BY 4.0 §4) — please respect the original publishers' rights when redistributing or republishing individual articles.
Found these datasets useful? Star the repo to help other researchers and developers discover it. Feedback and dataset requests are welcome via Issues.