A Rust implementation of the popular HTML parsing utility pup.
cari is a command-line utility for parsing and extracting data from HTML documents using CSS selectors. It serves as a drop-in replacement for pup, built on the actively maintained scraper/html5ever stack for modern CSS selection and DOM handling.
- Parse HTML from stdin or files
- Extract data using CSS selectors
- Multiple output formats:
  - Prettified HTML (default)
  - Plain text extraction
  - JSON representation
  - Attribute values
- Support for pseudo-selectors:
  - :first-child, :last-child, :nth-child(n)
  - :first-of-type, :last-of-type, :nth-of-type(n)
  - :not(), :contains(), :parent-of()
  - :empty, :only-child, :only-of-type
- Colorized output
- Configurable indentation
- Depth limiting for output
- Character set support
To build and install cari from source:
git clone https://github.com/elliottophellia/cari.git
cd cari
cargo build --release
The compiled binary will be available at target/release/cari.
To install with Cargo instead:
cargo install cari
USAGE:
cari [OPTIONS] [SELECTORS]... [DISPLAY]
OPTIONS:
    -h, --help             Show this help message
    -v, --version          Show version information
    -t, --text             Print text content
    -j, --json             Print as JSON
        --html             Print as HTML (default)
    -a, --attr NAME        Print attribute value
    -i, --indent LEVEL     Indent output (spaces)
    -n, --number           Print the number of elements selected
    -p, --plain            Don't escape HTML
        --pre              Preserve preformatted text
    -c, --color            Colorize output
        --no-color         Disable colorized output
    -f, --file FILE        Read input from file instead of stdin
    -l, --limit DEPTH      Restrict number of levels printed
        --charset CHARSET  Specify input charset (utf-8, latin-1, ascii)
        --escape-html      Escape HTML in text output (default)
        --no-escape-html   Do not escape HTML in text output
DISPLAY FUNCTIONS:
    text{}                 Print text content
    html{}                 Print as HTML
    json{}                 Print as JSON
    attr{NAME}             Print attribute value
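The display functions listed above can be recognized with very little code. The following is a minimal sketch, not cari's actual implementation: the `Display` enum and the `parse_display` and `split_display` functions are hypothetical names introduced here for illustration. It assumes the display function is the last whitespace-separated token of the arguments, whether it is passed as its own argument or inside the quoted selector, and it uses only the Rust standard library.

```rust
/// Output mode requested by a trailing pup-style display function.
/// (Hypothetical type, used only for this illustration.)
#[derive(Debug, PartialEq)]
enum Display {
    Text,
    Html,
    Json,
    Attr(String),
}

/// Try to interpret a single token as a display function.
/// Returns `None` when the token is not one of `text{}`, `html{}`,
/// `json{}`, or `attr{NAME}`, so a caller can treat it as selector text.
fn parse_display(token: &str) -> Option<Display> {
    let open = token.find('{')?;
    let body = token.strip_suffix('}')?; // token without the closing brace
    let name = &token[..open];
    let inner = &body[open + 1..]; // text between the braces
    match (name, inner) {
        ("text", "") => Some(Display::Text),
        ("html", "") => Some(Display::Html),
        ("json", "") => Some(Display::Json),
        ("attr", key) if !key.is_empty() => Some(Display::Attr(key.to_string())),
        _ => None,
    }
}

/// Join the positional arguments and split off a trailing display
/// function, if there is one. Returns the selector text and the display.
fn split_display(args: &[&str]) -> (String, Option<Display>) {
    let joined = args.join(" ");
    let trimmed = joined.trim();
    // Check whether the last whitespace-separated token is a display function.
    if let Some((rest, last)) = trimmed.rsplit_once(char::is_whitespace) {
        if let Some(display) = parse_display(last) {
            return (rest.trim_end().to_string(), Some(display));
        }
    }
    // Single token: either a lone display function or a plain selector.
    match parse_display(trimmed) {
        Some(display) => (String::new(), Some(display)),
        None => (trimmed.to_string(), None),
    }
}
```

Under these assumptions, `split_display(&["div > p", "json{}"])` and `split_display(&["div > p json{}"])` both yield the selector `div > p` with `Display::Json`, and a bare selector such as `title` yields no display function, which would leave the default HTML output in effect.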
cat index.html | cari 'title'
cat index.html | cari 'a attr{href}'
cat index.html | cari 'div > p' json{}
# Flag style
cat index.html | cari 'img' --attr src
# pup style
cat index.html | cari 'img' attr{src}
cari -f index.html 'title'
cari -l 2 -f page.html 'body'
curl -s https://example.com | cari 'main > p text{}'
Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.
See CHANGELOG.md for a detailed history of changes between versions.
GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007
Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.