advancedlogic/GoOse

GoOse

HTML Content / Article Extractor in Go

Description

GoOse is a Go library and command-line tool for extracting article content and metadata from HTML pages. It is a Go port of the original "Goose" library, rewritten for modern Go development (Go modules, a Cobra-based CLI, and the standard project layout).

Key Features:

  • 🚀 Extract clean article text from web pages
  • 📰 Extract article metadata (title, description, keywords, images)
  • 🖼️ Advanced image extraction and top image detection
  • 🎥 Video content detection and extraction
  • 🌐 Multi-language support with stopwords
  • 🔧 Command-line interface for easy integration
  • 📦 Clean library API for programmatic use
  • ⚡ High performance with concurrent processing support

Originally licensed to Gravity.com under the Apache License 2.0. Go port written by Antonio Linari.

Installation

As a Library

go get github.com/advancedlogic/GoOse

As a CLI Tool

# Install directly
go install github.com/advancedlogic/GoOse/cmd/goose@latest

# Or build from source
git clone https://github.com/advancedlogic/GoOse.git
cd GoOse
make build
# Binary will be available at ./bin/goose

Quick Start

Command Line Usage

# Extract article from URL (text output)
goose convert https://example.com/article

# Extract article with JSON output
goose convert https://example.com/article --format json

# Save output to file
goose convert https://example.com/article --output article.txt

# Show version
goose version

# Show help
goose help

Library Usage

package main

import (
	"fmt"
	"log"

	"github.com/advancedlogic/GoOse/pkg/goose"
)

func main() {
	// Create a new GoOse instance
	g := goose.New()
	
	// Extract from URL
	article, err := g.ExtractFromURL("https://edition.cnn.com/2012/07/08/opinion/banzi-ted-open-source/index.html")
	if err != nil {
		log.Fatal(err)
	}

	// Print extracted content
	fmt.Println("Title:", article.Title)
	fmt.Println("Description:", article.MetaDescription)
	fmt.Println("Keywords:", article.MetaKeywords)
	fmt.Println("Content:", article.CleanedText)
	fmt.Println("URL:", article.FinalURL)
	fmt.Println("Top Image:", article.TopImage)
	fmt.Println("Authors:", article.Authors)
	fmt.Println("Publish Date:", article.PublishDate)
}

Advanced Configuration

package main

import (
	"fmt"
	"log"

	"github.com/advancedlogic/GoOse/pkg/goose"
)

func main() {
	// Create configuration
	config := goose.Configuration{
		Debug:          false,
		TargetLanguage: "en",
		UserAgent:      "MyApp/1.0",
		Timeout:        30, // seconds
	}

	// Create GoOse with custom configuration
	g := goose.NewWithConfig(config)

	// Extract from raw HTML
	html := "<html><body><article><h1>Title</h1><p>Content...</p></article></body></html>"
	article, err := g.ExtractFromRawHTML(html, "https://example.com")
	if err != nil {
		// Stop here: article is not usable when extraction fails
		log.Fatal(err)
	}

	// Use the extracted article
	fmt.Println("Title:", article.Title)
}

Project Structure

GoOse follows standard Go project layout:

├── cmd/goose/          # CLI application
├── pkg/goose/          # Public library API
├── internal/           # Private application code
│   ├── crawler/        # Web crawling logic
│   ├── extractor/      # Content extraction
│   ├── parser/         # HTML parsing utilities
│   ├── types/          # Shared data types
│   └── utils/          # Utility functions
├── docs/               # Documentation
├── sites/              # Test HTML files
└── Makefile            # Build automation

Development

Prerequisites

  • Go 1.21 or later
  • Make (for build automation)

Getting Started

  1. Clone the repository:

    git clone https://github.com/advancedlogic/GoOse.git
    cd GoOse
  2. Install dependencies:

    make deps
  3. Build the project:

    make build
  4. Run tests:

    make test
  5. Run all quality checks:

    make qa

Available Make Commands

make help          # Show all available commands
make build         # Build the CLI binary
make install       # Install CLI to GOPATH/bin
make test          # Run all tests
make test-race     # Run tests with race detection
make coverage      # Generate coverage report
make format        # Format source code
make lint          # Run linters
make qa            # Run all quality checks
make clean         # Clean build artifacts
make tidy          # Clean up go.mod and go.sum

Development Workflow

  1. Make changes to the code
  2. Run make format to format your code
  3. Run make qa to ensure quality
  4. Run make test to verify functionality
  5. Commit your changes

API Reference

Main Types

  • goose.Goose - Main extractor instance
  • goose.Article - Extracted article data
  • goose.Configuration - Extractor configuration

Key Methods

  • goose.New() - Create new extractor with default config
  • goose.NewWithConfig(config) - Create extractor with custom config
  • ExtractFromURL(url) - Extract article from URL
  • ExtractFromRawHTML(html, url) - Extract from HTML string

For complete API documentation, run:

go doc github.com/advancedlogic/GoOse/pkg/goose

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes following the coding standards
  4. Run the full test suite (make qa)
  5. Commit your changes (git commit -m 'Add amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

Please ensure your code:

  • ✅ Passes all tests (make test)
  • ✅ Follows Go formatting standards (make format)
  • ✅ Passes linting checks (make lint)
  • ✅ Has appropriate test coverage
  • ✅ Includes documentation for public APIs

Roadmap

Current Status

  • ✅ Modern Go modules support
  • ✅ CLI interface with Cobra
  • ✅ Comprehensive test coverage
  • ✅ Standard Go project layout
  • ✅ Build automation with Make

Planned Improvements

  • Enhanced error handling and logging
  • Plugin architecture for custom extractors
  • Performance optimizations
  • Additional output formats (XML, YAML)
  • Docker containerization
  • Advanced image processing
  • Batch processing capabilities

License

Licensed under the Apache License, Version 2.0. See LICENSE for details.

Acknowledgments

  • @Martin Angers for goquery
  • @Fatih Arslan for set
  • Go Team for the amazing language and net/html
  • Original Goose contributors at Gravity.com
  • Community contributors for ongoing improvements
