cari

A Rust implementation of the popular HTML parsing utility pup.

Description

cari is a command-line utility for parsing and extracting data from HTML documents using CSS selectors. It serves as a drop-in replacement for pup, built on the actively maintained scraper/html5ever stack for modern CSS selection and DOM handling.

Features

  • Parse HTML from stdin or files
  • Extract data using CSS selectors
  • Multiple output formats:
    • Prettified HTML (default)
    • Plain text extraction
    • JSON representation
    • Attribute values
  • Support for pseudo-selectors:
    • :first-child, :last-child, :nth-child(n)
    • :first-of-type, :last-of-type, :nth-of-type(n)
    • :not(), :contains(), :parent-of()
    • :empty, :only-child, :only-of-type
  • Colorized output
  • Configurable indentation
  • Depth limiting for output
  • Character set support

Installation

From Source

To build cari from source:

git clone https://github.com/elliottophellia/cari.git
cd cari
cargo build --release

The compiled binary will be available at target/release/cari.

Using Cargo

cargo install cari

Usage

USAGE:
    cari [OPTIONS] [SELECTORS]... [DISPLAY]

OPTIONS:
    -h, --help              Show this help message
    -v, --version           Show version information
    -t, --text              Print text content
    -j, --json              Print as JSON
    --html                  Print as HTML (default)
    -a, --attr NAME         Print attribute value
    -i, --indent LEVEL      Indent output (spaces)
    -n, --number            Print the number of elements selected
    -p, --plain             Don't escape HTML
    --pre                   Preserve preformatted text
    -c, --color             Colorize output
    --no-color              Disable colorized output
    -f, --file FILE         Read input from file instead of stdin
    -l, --limit DEPTH       Restrict number of levels printed
    --charset CHARSET       Specify input charset (utf-8, latin-1, ascii)
    --escape-html           Escape HTML in text output (default)
    --no-escape-html        Do not escape HTML in text output

DISPLAY FUNCTIONS:
    text{}                  Print text content
    html{}                  Print as HTML
    json{}                  Print as JSON
    attr{NAME}              Print attribute value

Examples

Extract text from title elements

cat index.html | cari 'title text{}'

Extract href attributes from links

cat index.html | cari 'a attr{href}'

Output JSON representation of paragraphs in divs

cat index.html | cari 'div > p' json{}

Extract src attributes from images

# Flag style
cat index.html | cari 'img' --attr src

# pup style
cat index.html | cari 'img' attr{src}

Read from file instead of stdin

cari -f index.html 'title'

Limit output depth

cari -l 2 -f page.html 'body'

Use in pipelines

curl -s https://example.com | cari 'main > p text{}'
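End-to-end sketch

The examples above assume an existing index.html. As a rough, self-contained sketch, the script below writes a small sample page and runs a few of the documented options against it. The file name, the page contents, and the guard that skips the commands when cari is not on PATH are illustrative assumptions, not part of cari. Output is not shown because it may vary between versions.

```shell
#!/bin/sh
# Build a small sample page so the commands below have known input.
# The file name and its contents are illustrative, not part of cari.
sample="${TMPDIR:-/tmp}/cari-sample.html"
cat > "$sample" <<'EOF'
<!DOCTYPE html>
<html>
  <head><title>Sample page</title></head>
  <body>
    <ul>
      <li><a href="/one">One</a></li>
      <li><a href="/two">Two</a></li>
      <li><a href="/three">Three</a></li>
    </ul>
  </body>
</html>
EOF

if command -v cari >/dev/null 2>&1; then
    # -n prints the number of elements selected (here, the list items).
    cari -f "$sample" -n 'li'
    # Text of the link inside the first list item, using a pseudo-selector
    # from the feature list.
    cari -f "$sample" 'li:first-child a text{}'
    # Print the href attribute of every link.
    cari -f "$sample" 'a attr{href}'
else
    echo "cari not found on PATH; skipping" >&2
fi
```

Each command reads the same file with -f, so the script does not depend on stdin and can be rerun safely.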

Contributing

Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.

Changelog

See CHANGELOG.md for a detailed history of changes between versions.

License

                    GNU GENERAL PUBLIC LICENSE
                       Version 3, 29 June 2007

 Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
 Everyone is permitted to copy and distribute verbatim copies
 of this license document, but changing it is not allowed.
