🫚 splice

Convert social/chat archives into normalized threads and export to Markdown, OAI JSONL, JSON (normalized items), and ShareGPT. Modular TypeScript CLI and library with extensible sources → transforms → outputs.

  • Idiomatic CLI (clig.dev principles)
  • Modular architecture:
    • sources: Twitter/X archives, Bluesky repo CAR exports (text-first; blob support planned), and Glowfic; ChatGPT and more next
    • transforms: filtering, grouping into threads/conversations, text cleaning
    • outputs: Markdown, OAI JSONL, JSONL (normalized items), ShareGPT
  • Library API to compose your own pipeline or plug in proprietary adapters
  • Copies referenced media into an images/ folder
  • JSONL artifacts for easy inspection and future checkpointing

Why

Turn your archives into:

  • Readable Markdown
  • OAI-compatible JSONL for training/eval
  • A normalized JSONL dump for inspection and reuse

Today it supports:

  • Twitter/X - local archive exports (extracted ZIP)
  • Bluesky - AT Protocol CAR file exports with optional API enrichment
  • Glowfic - collaborative fiction threads, sections, or boards via URL

Next: ChatGPT, Reddit, Hugging Face datasets.

This library started life as a Python script; development continues in this TypeScript rewrite. It has powered projects like deeperfates.com, keltham.lol, and youaretheassistantnow.com.

More context: https://deepfates.com/convert-your-twitter-archive-into-training-data

Quick start (CLI)

Requirements:

  • Node.js 18+ (tested with recent LTS)
  • For direct execution: tsx (installed automatically with npx)

Run with tsx (no build needed):

npx tsx splice.ts --source /path/to/twitter-archive --out ./out

Run the published CLI (after install):

npx splice --source /path/to/twitter-archive --out ./out

Build then run with Node:

npm install
npm run build
node dist/cli/splice.js --source /path/to/twitter-archive --out ./out

Dev/watch mode:

npm run dev -- --source /path/to/twitter-archive --out ./out

Usage

Help (equivalent to --help):

splice - convert a Twitter archive to Markdown, OAI JSONL, and/or JSON

Usage:
  splice --source <path> --out <dir> [--format markdown oai json sharegpt] [--system-message <text>]
         [--since <iso>] [--until <iso>] [--min-length <n>] [--exclude-rt] [--only-threads] [--with-media]
         [--enrich] [--dry-run] [--stats-json] [--log-level <level>] [--json-stdout] [--quiet|-q] [--verbose] [--version|-V]

  splice --glowfic <url> --out <dir> --assistant <name> [--assistant-regex <pattern>]
  splice --glowfic-board <url> --out <dir> --all-characters [--min-posts <n>]

Options:
  --source <path>            Path to Twitter archive directory or Bluesky .car file
  --out <dir>                Output directory
  --format <fmt...>          One or more formats: markdown, oai, json, sharegpt (default: markdown oai json)
  --system-message <text>    System message for OAI JSONL (default: "You have been uploaded to the internet")
                             Alias: --system
  --since <iso>              Include items on/after this ISO date
  --until <iso>              Include items on/before this ISO date
  --min-length <n>           Minimum text length
  --exclude-rt               Exclude retweets (RT ...)
  --only-threads             Output threads only
  --with-media               Only include items that have media
  --enrich                   Fetch thread context from API (Bluesky only)
  --dry-run, -n              Plan only; don't write files
  --stats-json               Write a stats.json summary
  --log-level <level>        debug|info|warn|error (default: info)
  --json-stdout              Emit normalized items JSONL to stdout; logs to stderr
  --quiet, -q                Errors only
  --verbose                  Debug logging
  --version, -V              Show version
  --help, -h                 Show help

Glowfic Options:
  --glowfic <url>            Glowfic thread/section/board URL to ingest
  --assistant <name>         Character name for assistant role (case-insensitive)
  --assistant-regex <pat>    Regex pattern for assistant matching
  --glowfic-board <url>      Board URL for multi-character export
  --all-characters           Export datasets for all characters on board
  --min-posts <n>            Minimum posts for character inclusion (default: 10)

Environment:
  SPLICE_SYSTEM_MESSAGE      Alternative way to set the OAI system message
                             (flag value takes precedence)

Exit codes:

  • 0: success
  • 1: runtime error
  • 2: invalid arguments or source detection failed

Stdout/Stderr:

  • Primary logs go to stderr (so you can safely pipe stdout)
  • Data files are written to the output directory

Examples

Convert with the default formats (Markdown, OAI JSONL, and normalized JSON):

npx tsx splice.ts --source ~/Downloads/my-twitter-archive --out ./out

Markdown only:

npx tsx splice.ts --source ~/Downloads/my-twitter-archive --out ./out --format markdown

OAI only with custom system message:

npx tsx splice.ts --source ~/Downloads/my-twitter-archive --out ./out --format oai --system-message "You are helpful."

JSON only (normalized items):

npx tsx splice.ts --source ~/Downloads/my-twitter-archive --out ./out --format json

All formats:

npx tsx splice.ts --source ~/Downloads/my-twitter-archive --out ./out --format markdown oai json sharegpt

Filters and selection:

npx tsx splice.ts --source ~/Downloads/my-twitter-archive --out ./out --format markdown --since 2024-01-01 --until 2024-12-31 --min-length 40 --exclude-rt --only-threads --with-media

Stats JSON summary:

npx tsx splice.ts --source ~/Downloads/my-twitter-archive --out ./out --format oai --stats-json

Stream normalized items to stdout:

npx tsx splice.ts --source ~/Downloads/my-twitter-archive --out ./out --json-stdout | head -n 5

Use environment variable for system message:

SPLICE_SYSTEM_MESSAGE="Be concise." npx tsx splice.ts --source ~/Downloads/my-twitter-archive --out ./out --format oai

Dry run with debug logs (no files written):

npx tsx splice.ts --source ~/Downloads/my-twitter-archive --out ./out --dry-run --log-level debug

Bluesky CAR export:

npx tsx splice.ts --source ~/Downloads/my-bsky-repo.car --out ./out

Bluesky with thread context enrichment (fetches parent posts from API):

npx tsx splice.ts --source ~/Downloads/my-bsky-repo.car --out ./out --enrich

Glowfic thread (single character as assistant):

npx tsx splice.ts --glowfic https://glowfic.com/posts/5506 --out ./out --assistant "Carissa"

Glowfic board (all characters, Hugging Face dataset format):

npx tsx splice.ts --glowfic-board https://glowfic.com/boards/215 --out ./out --all-characters --min-posts 20

Sources

Twitter/X

Extract the archive ZIP to a directory containing:

  • data/manifest.js
  • data/tweets_media/ (optional, for media assets)
  • the YTD .js data files for tweets and likes (e.g., data/tweets.js, data/like.js)

We ingest tweets, likes, and media files prefixed with <tweetId>-*.
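
For reference, an extracted archive typically looks something like this (exact file names vary by archive version):

my-twitter-archive/
  data/
    manifest.js
    tweets.js
    like.js
    tweets_media/
      <tweetId>-<mediaId>.jpg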

Bluesky

Export your repository from Settings → Advanced → Export Content. Pass --source path/to/repo.car.

  • Use --enrich to fetch parent posts from the public API for full conversation context
  • Media blobs are referenced but not downloaded yet

Glowfic

Pass a thread, section, or board URL: --glowfic https://glowfic.com/posts/5506

  • Requires --assistant <name> to specify which character is the assistant
  • For multi-character datasets: --glowfic-board <url> --all-characters

Output layout

On a successful run, you'll see:

  • out/threads/ - one Markdown file per detected thread, named like YYYYMMDD-thread-<slug>.md
  • out/tweets/ - one Markdown file per non-thread tweet, named like YYYYMMDD-tweet-<slug>.md
  • out/images/ - copied media files referenced by the Markdown
  • out/conversations_oai.jsonl - OAI JSONL file with conversations built from threads and reply chains
  • out/normalized_items.jsonl - JSONL dump of normalized ContentItem records (one item per line)
  • out/sharegpt.json - ShareGPT export (array) for loaders that expect ShareGPT format
  • out/stats.json - summary (counts, threads/conversations, date range)

Notes:

  • Thread filenames are derived from the top post's first words (sanitized).
  • The OAI JSONL file includes a top-level "system" message (configurable).
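
For orientation, each line of conversations_oai.jsonl follows the OpenAI chat-messages shape; an illustrative line (content abbreviated; the exact fields splice emits may differ slightly):

{"messages": [{"role": "system", "content": "You have been uploaded to the internet"}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

Entries in out/sharegpt.json follow the common ShareGPT convention: an array of objects whose conversations field holds from/value pairs (e.g., "from": "human", "from": "gpt").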

Architecture (for contributors)

  • src/core - shared types, arg parsing, logger, utilities
  • src/sources - input adapters (e.g., twitter.ts)
  • src/transforms - filters, grouping, conversation mapping
  • src/outputs - writers for markdown/oai/json/sharegpt/stats
  • src/cli - CLI entrypoint wiring sources → transforms → outputs

The code is structured so you can add new sources, transforms, or outputs without touching unrelated parts.

Library usage

You can import and compose pieces in your own app:

import {
  ingestTwitter,
  applyFilters,
  indexById,
  groupThreadsAndConversations,
  writeOAI,
} from "@deepfates/splice";

// Ingest a local Twitter archive; the callback receives (level, message) log events.
const items = await ingestTwitter("/path/to/archive", (l, m) => console.error(`[${l}] ${m}`));
// Keep items at least 20 characters long, drop retweets, don't require media.
const filtered = applyFilters(items, { minLength: 20, excludeRt: true, withMedia: false });
// Index items by id so reply chains can be resolved.
const all = indexById(filtered);
// Group indexed items into self-threads and reply conversations.
const { threads, conversations } = groupThreadsAndConversations(all);
// Write conversations_oai.jsonl into ./out with the given system message
// (the trailing boolean presumably toggles dry-run mode).
await writeOAI(threads, conversations, "./out", "You have been uploaded to the internet", (l, m) => console.error(`[${l}] ${m}`), false);

Pluggable adapters (build proprietary ones privately and upstream later if you want; see the sketch after this list):

  • SourceAdapter: detect(pathOrUri), ingest(pathOrUri, logger) → ContentItem[]
  • OutputAdapter: write(args, ctx), where args may include items, threads, conversations, and systemMessage, and ctx provides outDir, dryRun, and logger
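
As a concrete illustration, a minimal custom source adapter might look like the sketch below. The local type declarations, the ContentItem fields shown (id, text, createdAt), and the boolean return of detect are assumptions made for illustration; consult src/core for the real definitions before wiring anything in.

// Minimal sketch of a custom SourceAdapter, assuming the shapes described above.
type Logger = (level: string, message: string) => void;

interface ContentItem {
  id: string;        // assumed field name
  text: string;      // assumed field name
  createdAt: string; // assumed field name (ISO timestamp)
}

interface SourceAdapter {
  detect(pathOrUri: string): boolean;
  ingest(pathOrUri: string, logger: Logger): Promise<ContentItem[]>;
}

const jsonlSource: SourceAdapter = {
  // Claim any *.jsonl path.
  detect: (pathOrUri) => pathOrUri.endsWith(".jsonl"),

  // Parse one JSON object per line into normalized items.
  async ingest(pathOrUri, logger) {
    const { readFile } = await import("node:fs/promises");
    const raw = await readFile(pathOrUri, "utf8");
    const items: ContentItem[] = [];
    for (const line of raw.split("\n")) {
      if (!line.trim()) continue;
      const record = JSON.parse(line);
      items.push({
        id: String(record.id),
        text: String(record.text),
        createdAt: String(record.created_at),
      });
    }
    logger("info", `ingested ${items.length} items from ${pathOrUri}`);
    return items;
  },
};

An OutputAdapter sketch would follow the same pattern: implement write(args, ctx) and honor ctx.dryRun before touching the filesystem.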

Development

Install deps:

npm install

Run with tsx:

npm run start -- --source /path/to/twitter-archive --out ./out

Watch mode:

npm run dev -- --source /path/to/twitter-archive --out ./out

Build (emits dist/cli/splice.js and sets up the splice bin; library API at dist/index.js):

npm run build

Run the built CLI:

node dist/cli/splice.js --source /path/to/twitter-archive --out ./out

Testing

Run the full test suite (includes integration tests for Markdown, OAI JSONL with system message, media copying, and normalized JSONL):

npm test

Watch tests:

npm run test:watch

Roadmap (short)

  • More inputs: Reddit, ChatGPT, HF datasets
  • Checkpointing and resumable pipelines (JSONL-based manifests)
  • More outputs: SQLite/Parquet/CSV
  • Blob fetching for Bluesky media
  • Better selection: persona/character filters, time ranges
  • Improved role attribution and metadata preservation

License

MIT. See LICENSE.

Acknowledgements

See the blog post above for context. CLI UX follows clig.dev-style conventions.
