Convert social/chat archives into normalized threads and export to Markdown, OAI JSONL, JSON (normalized items), and ShareGPT. Modular TypeScript CLI and library with extensible sources → transforms → outputs.
- Idiomatic CLI (clig.dev principles)
- Modular architecture:
- sources: Twitter/X archives and Bluesky repo CAR exports (text-first; blobs soon); ChatGPT and others next
- transforms: filtering, grouping into threads/conversations, text cleaning
- outputs: Markdown, OAI JSONL, JSONL (normalized items), ShareGPT
- Library API to compose your own pipeline or plug in proprietary adapters
- Copies referenced media into an images/ folder
- JSONL artifacts for easy inspection and future checkpointing
Turn your archives into:
- Readable Markdown
- OAI-compatible JSONL for training/eval
- A normalized JSONL dump for inspection and reuse (see the sketch below)
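For quick inspection, the normalized dump (`out/normalized_items.jsonl`, described under the output layout below) is plain JSONL and can be streamed with a few lines of Node. A minimal sketch; the `id` and `text` fields are illustrative, so check the ContentItem type in `src/core` for the real shape:

```ts
// inspect.ts – stream out/normalized_items.jsonl one record at a time.
// Run with: npx tsx inspect.ts
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

const rl = createInterface({
  input: createReadStream("./out/normalized_items.jsonl"),
  crlfDelay: Infinity,
});

for await (const line of rl) {
  if (!line.trim()) continue;
  const item = JSON.parse(line); // one normalized ContentItem per line
  console.log(item.id, String(item.text ?? "").slice(0, 80));
}
```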
Today it supports:
- Twitter/X: local archive exports (ZIP extracted)
- Bluesky: AT Protocol CAR file exports with optional API enrichment
- Glowfic: collaborative fiction threads, sections, or boards via URL
Next: ChatGPT, Reddit, Hugging Face datasets.
This library started life as a Python script; this TypeScript rewrite is where development will continue. It has powered projects like deeperfates.com, keltham.lol, and youaretheassistantnow.com.
More context: https://deepfates.com/convert-your-twitter-archive-into-training-data
Requirements:
- Node.js 18+ (tested with recent LTS)
- For direct execution: `tsx` (installed automatically with `npx`)
Run with tsx (no build needed):
npx tsx splice.ts --source /path/to/twitter-archive --out ./out
Run the published CLI (after install):
npx splice --source /path/to/twitter-archive --out ./out
Build then run with Node:
npm install
npm run build
node dist/cli/splice.js --source /path/to/twitter-archive --out ./out
Dev/watch mode:
npm run dev -- --source /path/to/twitter-archive --out ./out
Help (equivalent to --help):
splice – convert a Twitter archive to Markdown, OAI JSONL, and/or JSON
Usage:
splice --source <path> --out <dir> [--format markdown oai json sharegpt] [--system-message <text>]
[--since <iso>] [--until <iso>] [--min-length <n>] [--exclude-rt] [--only-threads] [--with-media]
[--enrich] [--dry-run] [--stats-json] [--log-level <level>] [--json-stdout] [--quiet|-q] [--verbose] [--version|-V]
splice --glowfic <url> --out <dir> --assistant <name> [--assistant-regex <pattern>]
splice --glowfic-board <url> --out <dir> --all-characters [--min-posts <n>]
Options:
--source <path> Path to Twitter archive directory or Bluesky .car file
--out <dir> Output directory
--format <fmt...> One or more formats: markdown, oai, json, sharegpt (default: markdown oai json)
--system-message <text> System message for OAI JSONL (default: "You have been uploaded to the internet")
Alias: --system
--since <iso> Include items on/after this ISO date
--until <iso> Include items on/before this ISO date
--min-length <n> Minimum text length
--exclude-rt Exclude retweets (RT ...)
--only-threads Output threads only
--with-media Only include items that have media
--enrich Fetch thread context from API (Bluesky only)
--dry-run, -n Plan only; don't write files
--stats-json Write a stats.json summary
--log-level <level> debug|info|warn|error (default: info)
--json-stdout Emit normalized items JSONL to stdout; logs to stderr
--quiet, -q Errors only
--verbose Debug logging
--version, -V Show version
--help, -h Show help
Glowfic Options:
--glowfic <url> Glowfic thread/section/board URL to ingest
--assistant <name> Character name for assistant role (case-insensitive)
--assistant-regex <pat> Regex pattern for assistant matching
--glowfic-board <url> Board URL for multi-character export
--all-characters Export datasets for all characters on board
--min-posts <n> Minimum posts for character inclusion (default: 10)
Environment:
SPLICE_SYSTEM_MESSAGE Alternative way to set the OAI system message
(flag value takes precedence)
Exit codes:
- 0: success
- 1: runtime error
- 2: invalid arguments or source detection failed
Stdout/Stderr:
- Primary logs go to stderr (so you can safely pipe stdout)
- Data files are written to the output directory
Convert to both Markdown and OAI JSONL:
npx tsx splice.ts --source ~/Downloads/my-twitter-archive --out ./out
Markdown only:
npx tsx splice.ts --source ~/Downloads/my-twitter-archive --out ./out --format markdown
OAI only with custom system message:
npx tsx splice.ts --source ~/Downloads/my-twitter-archive --out ./out --format oai --system-message "You are helpful."
JSON only (normalized items):
npx tsx splice.ts --source ~/Downloads/my-twitter-archive --out ./out --format json
All formats:
npx tsx splice.ts --source ~/Downloads/my-twitter-archive --out ./out --format markdown oai json sharegpt
Filters and selection:
npx tsx splice.ts --source ~/Downloads/my-twitter-archive --out ./out --format markdown --since 2024-01-01 --until 2024-12-31 --min-length 40 --exclude-rt --only-threads --with-media
Stats JSON summary:
npx tsx splice.ts --source ~/Downloads/my-twitter-archive --out ./out --format oai --stats-json
Stream normalized items to stdout:
npx tsx splice.ts --source ~/Downloads/my-twitter-archive --out ./out --json-stdout | head -n 5
Use environment variable for system message:
SPLICE_SYSTEM_MESSAGE="Be concise." npx tsx splice.ts --source ~/Downloads/my-twitter-archive --out ./out --format oai
Dry run with debug logs (no files written):
npx tsx splice.ts --source ~/Downloads/my-twitter-archive --out ./out --dry-run --log-level debug
Bluesky CAR export:
npx tsx splice.ts --source ~/Downloads/my-bsky-repo.car --out ./out
Bluesky with thread context enrichment (fetches parent posts from API):
npx tsx splice.ts --source ~/Downloads/my-bsky-repo.car --out ./out --enrich
Glowfic thread (single character as assistant):
npx tsx splice.ts --glowfic https://glowfic.com/posts/5506 --out ./out --assistant "Carissa"
Glowfic board (all characters, HuggingFace dataset format):
npx tsx splice.ts --glowfic-board https://glowfic.com/boards/215 --out ./out --all-characters --min-posts 20
Extract the archive ZIP to a directory containing:
- `data/manifest.js`
- `data/tweets_media/` (optional, for media assets)
- YTD `.js` files for `tweets` and `like` data

We ingest tweets, likes, and media files prefixed with `<tweetId>-*`.
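The YTD `.js` files are JavaScript assignments (`window.YTD.tweets.part0 = [...]`) rather than plain JSON, so they need a little unwrapping before parsing. A minimal sketch of that unwrapping, not necessarily splice's own ingest code:

```ts
// readYtd.ts – parse a Twitter/X archive data file such as data/tweets.js.
// The body is `window.YTD.<name>.part0 = [ ... ]`, so skip the assignment
// prefix and parse the remaining array literal as JSON.
import { readFile } from "node:fs/promises";

export async function readYtd(path: string): Promise<unknown[]> {
  const raw = await readFile(path, "utf8");
  const start = raw.indexOf("["); // the array begins right after the `=`
  if (start === -1) throw new Error(`no JSON array found in ${path}`);
  const body = raw.slice(start).trim();
  return JSON.parse(body.endsWith(";") ? body.slice(0, -1) : body);
}

// Usage: const tweets = await readYtd("archive/data/tweets.js");
```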
Export your repository from Settings → Advanced → Export Content. Pass `--source path/to/repo.car`.
- Use `--enrich` to fetch parent posts from the public API for full conversation context
- Media blobs are referenced but not downloaded yet
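A CAR export is a stream of content-addressed blocks whose repo records are DAG-CBOR. As a rough illustration of what ingesting one involves (using the `@ipld/car` and `@ipld/dag-cbor` packages, and not splice's actual code path), post records can be filtered out by their `$type`:

```ts
// listPosts.ts – print app.bsky.feed.post records from a Bluesky repo export.
import { readFile } from "node:fs/promises";
import { CarReader } from "@ipld/car";
import * as dagCbor from "@ipld/dag-cbor";

const car = await CarReader.fromBytes(await readFile("./my-bsky-repo.car"));

for await (const { bytes } of car.blocks()) {
  let record: any;
  try {
    record = dagCbor.decode(bytes); // repo blocks are DAG-CBOR
  } catch {
    continue; // skip anything that isn't a CBOR record
  }
  if (record?.$type === "app.bsky.feed.post") {
    console.log(record.createdAt, record.text);
  }
}
```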
Pass a thread, section, or board URL: `--glowfic https://glowfic.com/posts/5506`
- Requires `--assistant <name>` to specify which character is the assistant
- For multi-character datasets: `--glowfic-board <url> --all-characters`
On a successful run, you'll see:
- `out/threads/` – one Markdown file per detected thread, named like `YYYYMMDD-thread-<slug>.md`
- `out/tweets/` – one Markdown file per non-thread tweet, named like `YYYYMMDD-tweet-<slug>.md`
- `out/images/` – copied media files referenced by the Markdown
- `out/conversations_oai.jsonl` – OAI JSONL file with conversations built from threads and reply chains
- `out/normalized_items.jsonl` – JSONL dump of normalized ContentItem records (one item per line)
- `out/sharegpt.json` – ShareGPT export (array) for loaders that expect ShareGPT format
- `out/stats.json` – summary (counts, threads/conversations, date range)
Notes:
- Thread filenames are derived from the top post's first words (sanitized).
- The OAI JSONL file includes a top-level "system" message (configurable); see the sample shape below.
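For reference, each line of `conversations_oai.jsonl` follows the OpenAI chat-messages shape, which is what the top-level system message implies; the exact role assignment here is illustrative:

```ts
// Shape of one conversations_oai.jsonl line; the example content is invented.
type OAIMessage = { role: "system" | "user" | "assistant"; content: string };
type OAIConversation = { messages: OAIMessage[] };

const example: OAIConversation = {
  messages: [
    { role: "system", content: "You have been uploaded to the internet" },
    { role: "user", content: "first post in a reply chain" },
    { role: "assistant", content: "the reply that follows it" },
  ],
};
// Each conversation is serialized onto a single line with JSON.stringify.
```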
- `src/core` – shared types, arg parsing, logger, utilities
- `src/sources` – input adapters (twitter.ts)
- `src/transforms` – filters, grouping, conversation mapping
- `src/outputs` – writers for markdown/oai/json/sharegpt/stats
- `src/cli` – CLI entrypoint wiring sources → transforms → outputs
The code is structured so you can add new sources, transforms, or outputs without touching unrelated parts.
You can import and compose pieces in your own app:
import {
ingestTwitter,
applyFilters,
indexById,
groupThreadsAndConversations,
writeOAI,
} from "@deepfates/splice";
const items = await ingestTwitter("/path/to/archive", (l, m) => console.error(`[${l}] ${m}`));
const filtered = applyFilters(items, { minLength: 20, excludeRt: true, withMedia: false });
const all = indexById(filtered);
const { threads, conversations } = groupThreadsAndConversations(all);
await writeOAI(threads, conversations, "./out", "You have been uploaded to the internet", (l, m) => console.error(`[${l}] ${m}`), false);

Pluggable adapters (build proprietary ones privately and upstream later if you want; a sketch follows the summary below):
- SourceAdapter: `detect(pathOrUri)`, `ingest(pathOrUri, logger) → ContentItem[]`
- OutputAdapter: `write(args, ctx)`, where args may include `items`, `threads`, `conversations`, and `systemMessage`, and ctx provides `outDir`, `dryRun`, and `logger`
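A minimal sketch of those adapter shapes plus a toy source adapter; the boolean return of `detect`, the async signatures, and the ContentItem fields are assumptions to verify against `src/core`:

```ts
// mySource.ts – a hypothetical SourceAdapter for a one-post-per-line text dump.
// Interfaces are transcribed from the shapes above; the real definitions
// (including ContentItem's fields) live in src/core.
import { readFile } from "node:fs/promises";

type Logger = (level: string, message: string) => void;

interface ContentItem {
  id: string;
  text: string;
  createdAt?: string; // illustrative field names
}

interface SourceAdapter {
  detect(pathOrUri: string): boolean;
  ingest(pathOrUri: string, logger: Logger): Promise<ContentItem[]>;
}

interface OutputAdapter {
  write(
    args: { items?: ContentItem[]; systemMessage?: string },
    ctx: { outDir: string; dryRun: boolean; logger: Logger },
  ): Promise<void>;
}

export const textDumpSource: SourceAdapter = {
  detect: (pathOrUri) => pathOrUri.endsWith(".txt"),
  async ingest(pathOrUri, logger) {
    const lines = (await readFile(pathOrUri, "utf8")).split("\n").filter(Boolean);
    logger("info", `ingested ${lines.length} items from ${pathOrUri}`);
    return lines.map((text, i) => ({ id: String(i), text }));
  },
};
```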
Install deps:
npm install
Run with tsx:
npm run start -- --source /path/to/twitter-archive --out ./out
Watch mode:
npm run dev -- --source /path/to/twitter-archive --out ./out
Build (emits dist/cli/splice.js and sets up the splice bin; library API at dist/index.js):
npm run build
Run the built CLI:
node dist/cli/splice.js --source /path/to/twitter-archive --out ./out
Run the full test suite (includes integration tests for Markdown, OAI JSONL with system message, media copying, and normalized JSONL):
npm test
Watch tests:
npm run test:watch
- More inputs: Reddit, ChatGPT, HF datasets
- Checkpointing and resumable pipelines (JSONL-based manifests)
- More outputs: SQLite/Parquet/CSV
- Blob fetching for Bluesky media
- Better selection: persona/character filters, time ranges
- Improved role attribution and metadata preservation
MIT. See LICENSE.
See the blog post above for context. CLI UX follows clig.dev-style conventions.