HTML Content / Article Extractor in Go
GoOse is a powerful Go library and command-line tool for extracting article content and metadata from HTML pages. This is a Go port of the original "Goose" library, completely rewritten and modernized for contemporary Go development.
Key Features:
- Extract clean article text from web pages
- Extract article metadata (title, description, keywords, images)
- Advanced image extraction and top image detection
- Video content detection and extraction
- Multi-language support with stopwords
- Command-line interface for easy integration
- Clean library API for programmatic use
- High performance with concurrent processing support
Originally licensed to Gravity.com under the Apache License 2.0. Go port written by Antonio Linari.
go get github.com/advancedlogic/GoOse
# Install directly
go install github.com/advancedlogic/GoOse/cmd/goose@latest
# Or build from source
git clone https://github.com/advancedlogic/GoOse.git
cd GoOse
make build
# Binary will be available at ./bin/goose
# Extract article from URL (text output)
goose convert https://example.com/article
# Extract article with JSON output
goose convert https://example.com/article --format json
# Save output to file
goose convert https://example.com/article --output article.txt
# Show version
goose version
# Show help
goose help
package main

import (
	"fmt"
	"log"

	"github.com/advancedlogic/GoOse/pkg/goose"
)

func main() {
	// Create a new GoOse instance
	g := goose.New()

	// Extract from URL
	article, err := g.ExtractFromURL("https://edition.cnn.com/2012/07/08/opinion/banzi-ted-open-source/index.html")
	if err != nil {
		log.Fatal(err)
	}

	// Print extracted content
	fmt.Println("Title:", article.Title)
	fmt.Println("Description:", article.MetaDescription)
	fmt.Println("Keywords:", article.MetaKeywords)
	fmt.Println("Content:", article.CleanedText)
	fmt.Println("URL:", article.FinalURL)
	fmt.Println("Top Image:", article.TopImage)
	fmt.Println("Authors:", article.Authors)
	fmt.Println("Publish Date:", article.PublishDate)
}
package main

import (
	"fmt"
	"log"

	"github.com/advancedlogic/GoOse/pkg/goose"
)

func main() {
	// Create configuration
	config := goose.Configuration{
		Debug:          false,
		TargetLanguage: "en",
		UserAgent:      "MyApp/1.0",
		Timeout:        30, // seconds
	}

	// Create GoOse with custom configuration
	g := goose.NewWithConfig(config)

	// Extract from raw HTML
	html := "<html><body><article><h1>Title</h1><p>Content...</p></article></body></html>"
	article, err := g.ExtractFromRawHTML(html, "https://example.com")
	if err != nil {
		// Stop on failure instead of silently ignoring the error
		log.Fatal(err)
	}

	// Use the extracted article
	fmt.Println("Title:", article.Title)
}
GoOse follows the standard Go project layout:
├── cmd/goose/          # CLI application
├── pkg/goose/          # Public library API
├── internal/           # Private application code
│   ├── crawler/        # Web crawling logic
│   ├── extractor/      # Content extraction
│   ├── parser/         # HTML parsing utilities
│   ├── types/          # Shared data types
│   └── utils/          # Utility functions
├── docs/               # Documentation
├── sites/              # Test HTML files
└── Makefile            # Build automation
- Go 1.21 or later
- Make (for build automation)
- Clone the repository:
  git clone https://github.com/advancedlogic/GoOse.git
  cd GoOse
- Install dependencies:
  make deps
- Build the project:
  make build
- Run tests:
  make test
- Run all quality checks:
  make qa
make help # Show all available commands
make build # Build the CLI binary
make install # Install CLI to GOPATH/bin
make test # Run all tests
make test-race # Run tests with race detection
make coverage # Generate coverage report
make format # Format source code
make lint # Run linters
make qa # Run all quality checks
make clean # Clean build artifacts
make tidy # Clean up go.mod and go.sum
- Make changes to the code
- Run make format to format your code
- Run make qa to ensure quality
- Run make test to verify functionality
- Commit your changes
- goose.Goose - Main extractor instance
- goose.Article - Extracted article data
- goose.Configuration - Extractor configuration

- goose.New() - Create new extractor with default config
- goose.NewWithConfig(config) - Create extractor with custom config
- ExtractFromURL(url) - Extract article from URL
- ExtractFromRawHTML(html, url) - Extract from HTML string
For complete API documentation, run:
go doc github.com/advancedlogic/GoOse/pkg/goose
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Make your changes following the coding standards
- Run the full test suite (make qa)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Please ensure your code:
- ✅ Passes all tests (make test)
- ✅ Follows Go formatting standards (make format)
- ✅ Passes linting checks (make lint)
- ✅ Has appropriate test coverage
- ✅ Includes documentation for public APIs
- ✅ Modern Go modules support
- ✅ CLI interface with Cobra
- ✅ Comprehensive test coverage
- ✅ Standard Go project layout
- ✅ Build automation with Make
- Enhanced error handling and logging
- Plugin architecture for custom extractors
- Performance optimizations
- Additional output formats (XML, YAML)
- Docker containerization
- Advanced image processing
- Batch processing capabilities
Licensed under the Apache License, Version 2.0. See LICENSE for details.
- @Martin Angers for goquery
- @Fatih Arslan for set
- Go Team for the amazing language and net/html
- Original Goose contributors at Gravity.com
- Community contributors for ongoing improvements