newsdataapi/newsdata.io-free-datasets

Free News Datasets

Open news data for machine learning, NLP, and sentiment analysis — provided free by Newsdata.io.

A curated collection of free, downloadable news datasets covering business, sports, entertainment, health, science and technology, world politics, COVID-19, the Ukraine conflict, and more, aggregated from a wide range of publishers across many languages and countries. Each archive ships in both CSV and JSON, with rich metadata and AI-extracted annotations (ai_tag, ai_region, ai_org, sentiment), so you can skip much of the preprocessing and get to modelling sooner.

Use them as ready-made news corpora for:

  • Machine learning — text classification, clustering, embeddings, LLM fine-tuning
  • Natural language processing (NLP) — named entity recognition, topic modelling, summarization
  • Sentiment analysis — labelled sentiment + per-sentiment confidence stats out of the box
  • Trend analysis — surface what publishers are covering across time
  • News aggregation and search demos
  • Geopolitical and regional analytics: ai_region / ai_org enrichments included
  • Journalism and academic research

Available datasets

Every archive expands to a folder containing the same dataset in CSV and JSON form.

| Category | File | Size (compressed) |
| --- | --- | --- |
| Business news | Business_News.zip | ~17 MB |
| Sports news | Sports_News.7z | ~19 MB |
| Entertainment news | Entertainment_News.zip | ~17 MB |
| Science and technology news | Science_Technology_News.zip | ~13 MB |
| Health news | Health_News.zip | ~4 MB |
| World politics news | World_Politics_News.zip | ~6 MB |
| COVID-19 news | Covid_News.zip | ~9 MB |
| COVID and vaccine news | Covid_and_Vaccine_News.zip | ~10 MB |
| Ukraine news | Ukraine_News.zip | ~5 MB |

Want a quick look at the schema before downloading? Tiny samples for the news and crypto-news streams live in sample/ (CSV + JSON + XLSX, under 500 KB each).

File formats and schema

Datasets ship as CSV and JSON (samples also include XLSX). Every article shares the same flat schema:

| Field | Type | Description |
| --- | --- | --- |
| article_id | string | Stable unique identifier for the article |
| title | string | Article headline |
| link | string | URL to the original article |
| description | string | Short summary / lede |
| content | string | Full article text |
| keywords | string | Comma-separated keywords |
| creator | string | Author / byline |
| pubDate | datetime | Publication date (UTC) |
| image_url | string | Hero / lead image URL |
| video_url | string | Embedded video URL, when present |
| source_id | string | Publisher identifier |
| source_priority | int | Publisher priority rank |
| source_url | string | Publisher domain |
| source_icon | string | Publisher favicon URL |
| language | string | ISO 639-1 language code (en, es, …) |
| country | string | ISO 3166-1 alpha-2 country code (us, gb, …) |
| category | string | Top-level category |
| ai_tag | string | AI-extracted topic tag(s) |
| ai_region | string | AI-extracted geographic region |
| ai_org | string | AI-extracted organisation mention |
| sentiment | string | positive / neutral / negative |
| sentiment_stats | object | Per-sentiment confidence scores |
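
A few of these fields usually need light parsing before analysis. The sketch below shows one way to do that with pandas, using two made-up rows rather than real dataset content. It assumes that sentiment_stats is stored in the CSV as a JSON string keyed by positive / neutral / negative; check the sample files to confirm how it is actually serialized.

```python
import io
import json

import pandas as pd

# Two made-up rows that reuse the documented column names (not real data).
# Assumption: sentiment_stats is serialized in the CSV as a JSON string.
sample_csv = io.StringIO(
    "article_id,title,keywords,pubDate,language,sentiment,sentiment_stats\n"
    'a1,Markets rally,"stocks,earnings",2024-01-05 08:30:00,en,positive,'
    '"{""positive"": 0.9, ""neutral"": 0.08, ""negative"": 0.02}"\n'
    "a2,Flu season update,,2024-01-06 14:00:00,en,neutral,"
    '"{""positive"": 0.1, ""neutral"": 0.8, ""negative"": 0.1}"\n'
)
df = pd.read_csv(sample_csv)

# pubDate is documented as UTC, so parse it as timezone-aware.
df["pubDate"] = pd.to_datetime(df["pubDate"], utc=True)

# keywords is a comma-separated string; turn it into a list (empty if missing).
df["keywords"] = df["keywords"].fillna("").apply(
    lambda s: [k.strip() for k in s.split(",") if k.strip()]
)

# Expand the per-sentiment scores into one numeric column each.
stats = pd.json_normalize(df["sentiment_stats"].apply(json.loads).tolist())
df = pd.concat(
    [df.drop(columns="sentiment_stats"), stats.add_prefix("sentiment_")], axis=1
)

print(df.dtypes)
```

With the JSON files, sentiment_stats may already arrive as a nested object, in which case the json.loads step is unnecessary.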

Quick start

Clone the whole repository (or download a single archive from GitHub's file view) and unpack the category you need:

git clone https://github.com/newsdataapi/newsdata.io-free-datasets.git
cd newsdata.io-free-datasets

unzip Business_News.zip          # for .zip archives
7z x Sports_News.7z              # for .7z (install p7zip if needed)
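
If you would rather unpack from Python, for example inside a notebook, the standard library's zipfile module handles the .zip archives. This is a minimal sketch: unpack_zip is a hypothetical helper name, and the .7z archive still needs an external tool such as 7z because the standard library does not read that format.

```python
import zipfile
from pathlib import Path


def unpack_zip(archive, dest="."):
    """Extract a .zip archive into dest and return its member names."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        return zf.namelist()


# Only runs when the archive is present in the working directory.
if Path("Business_News.zip").exists():
    for name in unpack_zip("Business_News.zip"):
        print(name)
```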

Load with Python (pandas)

import pandas as pd

df = pd.read_csv("Business_News/Business_News.csv")
print(df.shape)
print(df[["title", "language", "country", "sentiment"]].head())
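
Once a DataFrame is loaded, the sentiment and trend use cases come down to a filter and a group-by. The sketch below uses four made-up rows in place of a real dataset so that it runs on its own; only the column names (pubDate, language, sentiment) come from the schema.

```python
import pandas as pd

# Made-up rows standing in for a loaded dataset (not real data).
df = pd.DataFrame(
    {
        "pubDate": [
            "2024-01-05 08:30:00",
            "2024-01-05 17:10:00",
            "2024-01-06 09:00:00",
            "2024-01-06 11:45:00",
        ],
        "language": ["en", "en", "es", "en"],
        "sentiment": ["positive", "negative", "neutral", "positive"],
    }
)
df["pubDate"] = pd.to_datetime(df["pubDate"], utc=True)

# Share of each sentiment label among English-language articles.
english = df[df["language"] == "en"]
sentiment_share = english["sentiment"].value_counts(normalize=True)

# Number of articles published per calendar day (UTC).
daily_counts = df.groupby(df["pubDate"].dt.date).size()

print(sentiment_share)
print(daily_counts)
```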

Load with Hugging Face Datasets

from datasets import load_dataset

ds = load_dataset("csv", data_files="Business_News/Business_News.csv")
print(ds["train"].features)

Load with Node.js

import fs from 'node:fs';

// Read as UTF-8 text so JSON.parse receives a string rather than a Buffer
const data = JSON.parse(fs.readFileSync('Business_News/Business_News.json', 'utf8'));
console.log(data.length, 'articles');
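
The same JSON file can be read in Python. This sketch assumes, as the Node.js snippet does, that the file holds a top-level array of article objects; load_articles is a hypothetical helper name.

```python
import json
from pathlib import Path

import pandas as pd


def load_articles(path):
    """Read a JSON file holding a list of article objects into a DataFrame."""
    with open(path, encoding="utf-8") as fh:
        records = json.load(fh)
    return pd.DataFrame.from_records(records)


# Only runs when the dataset has been unpacked next to this script.
json_path = Path("Business_News/Business_News.json")
if json_path.exists():
    articles = load_articles(json_path)
    print(len(articles), "articles")
```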

Why these datasets?

  • Free and openly licensed — released under CC BY 4.0, usable in commercial and non-commercial projects with simple attribution
  • Multi-category — nine distinct verticals from finance to sport to geopolitics
  • Multi-language and multi-country — sources from publishers around the world
  • AI-enriched out of the box — every article ships with AI-generated tags, region/organisation mentions, and sentiment scores, so you can skip a lot of preprocessing
  • Two formats included — pick CSV for tabular tools (pandas, R, BI), JSON for streaming and NoSQL workflows; samples also include XLSX
  • Schema-matched to a live API — when you need fresh data, the same schema is available in real time via the Newsdata.io REST API

Need real-time or custom news data?

These archives are static snapshots. For real-time and historical news in the same schema, use the Newsdata.io API or one of the official SDKs.

Sign up for a free Newsdata.io API key to get started — no credit card required.

About Newsdata.io

Newsdata.io provides high-quality real-time and historical news data through a developer-friendly REST API, sourced from thousands of reliable publishers across 200+ countries and 75+ languages. Our mission is to make news data easily accessible for analytics, research, journalism, and AI/ML applications.

License

These datasets are released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license — see LICENSE for the full text.

You are free to share (copy and redistribute) and adapt (remix, transform, build upon) the datasets for any purpose, including commercial use, provided you give appropriate attribution to Newsdata.io and indicate if changes were made.

A simple attribution line is enough, e.g.:

News data provided by Newsdata.io, licensed under CC BY 4.0.

Note on the underlying news content: The articles aggregated in these datasets were originally authored by their respective publishers and remain subject to those publishers' own copyrights. The CC BY 4.0 license here applies to the dataset compilation, including any sui generis database rights (addressed in Section 4 of CC BY 4.0). Please respect the original publishers' rights when redistributing or republishing individual articles.


Found these datasets useful? Star the repo to help other researchers and developers discover it. Feedback and dataset requests are welcome via Issues.
