Content Analysis

Rumen Damyanov edited this page Sep 22, 2025 · 1 revision

The PHP-SEO package includes a content analyzer that extracts structured information from your HTML (headings, images, links, keywords, and basic metrics) and uses it to generate optimized SEO elements.

Overview

The ContentAnalyzer class processes HTML content and extracts:

  • Headings (H1-H6) with hierarchy
  • Images with alt text and metadata
  • Internal and external links
  • Keywords and key phrases
  • Content metrics (word count, character count, detected language, content type)
  • Main content identification

How Content Analysis Works

1. HTML Parsing

The analyzer uses PHP's DOMDocument to parse HTML content safely:

use Rumenx\PhpSeo\Analyzers\ContentAnalyzer;

$analyzer = new ContentAnalyzer();
$content = '<h1>My Page</h1><p>Content here...</p>';

$analysis = $analyzer->analyze($content, [
    'title' => 'Page Title',
    'url' => 'https://example.com/page'
]);
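For readers unfamiliar with DOMDocument, the following is a minimal sketch of tolerant HTML parsing. It is not the package's internal code: the helper name sketchParseHtml is hypothetical, and the choice to silence parser warnings and block network access is an assumption about what "safely" means here.

```php
<?php
// Hypothetical helper for illustration; not part of the PHP-SEO API.
function sketchParseHtml(string $html): DOMDocument
{
    // Collect libxml warnings internally instead of emitting PHP warnings
    // for malformed or HTML5-only markup.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // The leading declaration hints UTF-8 to libxml; LIBXML_NONET blocks network access.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html, LIBXML_NONET);

    // Restore the previous error-handling state.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

$dom = sketchParseHtml('<h1>My Page</h1><p>Content here...</p>');
echo $dom->getElementsByTagName('h1')->item(0)->textContent, PHP_EOL; // prints "My Page"
```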

2. Content Extraction

Headings Analysis

// Extract all headings with levels
$headings = $analysis['headings'];
// Result: [
//     ['level' => 1, 'text' => 'Main Title'],
//     ['level' => 2, 'text' => 'Section Title'],
//     ...
// ]
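To make the result shape concrete, here is a rough sketch of how headings with levels could be collected using DOMXPath. The function name sketchExtractHeadings is hypothetical and the package's own extraction may differ; only the 'level' and 'text' keys come from the example above.

```php
<?php
// Hypothetical helper for illustration; not the package's implementation.
function sketchExtractHeadings(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html, LIBXML_NONET);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $headings = [];

    // The union query yields h1-h6 elements in document order.
    foreach ($xpath->query('//h1 | //h2 | //h3 | //h4 | //h5 | //h6') as $node) {
        $headings[] = [
            // "h2" becomes 2
            'level' => (int) substr($node->nodeName, 1),
            // Collapse runs of whitespace inside the heading text
            'text'  => trim(preg_replace('/\s+/u', ' ', $node->textContent)),
        ];
    }

    return $headings;
}
```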

Image Analysis

// Extract images with metadata
$images = $analysis['images'];
// Result: [
//     [
//         'src' => 'image.jpg',
//         'alt' => 'Image description',
//         'title' => 'Image title',
//         'width' => 800,
//         'height' => 600
//     ]
// ]
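A comparable sketch for images is shown below, assuming the same array keys as the result above. The helper name sketchExtractImages is hypothetical, and returning null for missing dimensions is an assumption rather than documented behavior.

```php
<?php
// Hypothetical helper for illustration; not the package's implementation.
function sketchExtractImages(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html, LIBXML_NONET);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $images = [];
    foreach ($dom->getElementsByTagName('img') as $img) {
        $width  = $img->getAttribute('width');
        $height = $img->getAttribute('height');

        $images[] = [
            'src'    => $img->getAttribute('src'),
            'alt'    => $img->getAttribute('alt'),   // empty string when the attribute is absent
            'title'  => $img->getAttribute('title'),
            // Assumption: missing or non-numeric dimensions become null rather than 0.
            'width'  => is_numeric($width) ? (int) $width : null,
            'height' => is_numeric($height) ? (int) $height : null,
        ];
    }

    return $images;
}
```

An empty 'alt' value in this shape is an easy way to spot images that still need descriptive alt text.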

Link Analysis

// Extract all links
$links = $analysis['links'];
// Result: [
//     [
//         'url' => 'https://example.com',
//         'text' => 'Link text',
//         'type' => 'external' // or 'internal'
//     ]
// ]
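The 'internal' versus 'external' distinction can be approximated by comparing hosts. The sketch below assumes that rule; the function name sketchClassifyLink is hypothetical, and the package may apply different criteria (for example subdomain handling).

```php
<?php
// Hypothetical helper for illustration; the package's classification rules may differ.
function sketchClassifyLink(string $href, string $pageUrl): string
{
    $linkHost = parse_url($href, PHP_URL_HOST);
    $pageHost = parse_url($pageUrl, PHP_URL_HOST);

    // Relative paths and fragments carry no host, so they stay on the current site.
    if (!is_string($linkHost) || $linkHost === '') {
        return 'internal';
    }

    // Host names are case-insensitive.
    return strcasecmp($linkHost, (string) $pageHost) === 0 ? 'internal' : 'external';
}

echo sketchClassifyLink('/about', 'https://example.com/page'), PHP_EOL;
echo sketchClassifyLink('https://other.org/', 'https://example.com/page'), PHP_EOL;
```

Note that this simple rule treats host-less schemes such as mailto: as internal and treats subdomains as external; a fuller version would need explicit handling for both.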

3. Content Metrics

// Content statistics
$metrics = [
    'word_count' => $analysis['word_count'],
    'character_count' => $analysis['character_count'],
    'language' => $analysis['language'],
    'content_type' => $analysis['content_type']
];
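As a rough illustration of how word and character counts can be derived from HTML, consider the sketch below. The function name sketchContentMetrics is hypothetical, and the counting rules (whitespace-separated words, characters counted after tags are removed) are assumptions, not the package's documented definitions.

```php
<?php
// Hypothetical helper for illustration; the package's counting rules may differ.
function sketchContentMetrics(string $html): array
{
    // Drop script/style bodies first; strip_tags would otherwise keep their text.
    $clean = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);

    // Put a space before each tag so adjacent blocks do not merge into one word.
    // This is an approximation: inline tags in the middle of a word would split it.
    $clean = str_replace('<', ' <', $clean);

    $text = html_entity_decode(strip_tags($clean), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $text = trim(preg_replace('/\s+/u', ' ', $text));

    $words = $text === '' ? [] : preg_split('/\s+/u', $text);

    return [
        'word_count' => count($words),
        // Prefer multibyte-aware length when the mbstring extension is available.
        'character_count' => function_exists('mb_strlen') ? mb_strlen($text, 'UTF-8') : strlen($text),
    ];
}

print_r(sketchContentMetrics('<h1>My Page</h1><p>Content here...</p>'));
```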

Main Content Detection

The analyzer identifies the main content using multiple strategies:

1. Semantic HTML Tags

  • <main> element (highest priority)
  • <article> element
  • Content within <section> tags

2. Content Density Analysis

  • Paragraphs with substantial text
  • Areas with high text-to-markup ratio
  • Content blocks with multiple sentences

3. Heuristic Analysis

  • Largest text blocks
  • Content outside navigation/sidebar areas
  • Text following heading structures

Example:

$mainContent = $analysis['main_content'];
// Contains the most relevant content for SEO analysis
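The strategies above can be sketched as a simple priority chain: semantic tags first, then paragraphs outside navigation-like containers. This is an illustrative approximation only; the helper names sketchNodeText and sketchFindMainContent are hypothetical, and the density and largest-block heuristics are not implemented here.

```php
<?php
// Hypothetical helpers for illustration; the package's real detection logic may differ.
function sketchNodeText(DOMXPath $xpath, DOMNode $node): string
{
    $parts = [];
    foreach ($xpath->query('.//text()', $node) as $textNode) {
        $parts[] = $textNode->nodeValue;
    }
    // Join text nodes with spaces so adjacent blocks do not run together.
    return trim(preg_replace('/\s+/u', ' ', implode(' ', $parts)));
}

function sketchFindMainContent(string $html): string
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html, LIBXML_NONET);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // 1. Semantic tags, in the priority order listed above.
    foreach (['//main', '//article', '//section'] as $query) {
        $nodes = $xpath->query($query);
        if ($nodes->length > 0) {
            return sketchNodeText($xpath, $nodes->item(0));
        }
    }

    // 2. Fallback: paragraphs outside navigation and sidebar-like containers.
    $query = '//p[not(ancestor::nav or ancestor::aside or ancestor::header or ancestor::footer)]';
    $paragraphs = [];
    foreach ($xpath->query($query) as $p) {
        $paragraphs[] = sketchNodeText($xpath, $p);
    }

    return implode(' ', $paragraphs);
}
```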

Keyword Extraction

The analyzer extracts keywords using:

1. Text Frequency Analysis

  • Word frequency counting
  • Stop word filtering
  • Stemming and normalization

2. Contextual Analysis

  • Words near headings
  • Bold/emphasized text
  • Link anchor text

3. Metadata Integration

  • Meta keywords (if present)
  • Alt text from images
  • Title attributes

Example:

$keywords = $analysis['keywords'];
// Result: ['seo', 'optimization', 'content', 'analysis']
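The frequency-based part of this can be sketched in a few lines. The version below covers only word counting and stop-word filtering; it does no stemming or contextual weighting, the stop-word list is a tiny placeholder, and the function name sketchExtractKeywords is hypothetical.

```php
<?php
// Hypothetical helper for illustration; not the package's keyword algorithm.
function sketchExtractKeywords(string $text, int $limit = 5): array
{
    // A deliberately small stop-word list; real lists are much longer.
    $stopWords = ['the', 'and', 'for', 'with', 'your', 'this', 'that', 'are', 'from', 'should'];

    // Lowercase and split into ASCII word tokens (non-ASCII text would need a Unicode-aware pattern).
    preg_match_all('/[a-z0-9]+/', strtolower($text), $matches);

    // Drop very short tokens and stop words.
    $words = array_filter(
        $matches[0],
        fn (string $w): bool => strlen($w) > 2 && !in_array($w, $stopWords, true)
    );

    // Count occurrences and sort by frequency, highest first.
    $counts = array_count_values($words);
    arsort($counts);

    // Numeric-looking words become integer keys, so cast back to strings.
    return array_map('strval', array_slice(array_keys($counts), 0, $limit));
}

print_r(sketchExtractKeywords('SEO optimization improves SEO content. Content analysis helps SEO.', 2));
```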

Content Type Detection

The analyzer reports the detected content type as a MIME-style string:

switch ($analysis['content_type']) {
    case 'text/html':
        // Standard HTML content
        break;
    case 'text/markdown':
        // Markdown content
        break;
    case 'text/plain':
        // Plain text content
        break;
}
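The page does not describe how the content type is determined. Purely as an assumption, a heuristic along these lines would produce the three values shown above; the function name sketchDetectContentType and both patterns are hypothetical.

```php
<?php
// Hypothetical heuristic for illustration; the package's detection rules are not documented here.
function sketchDetectContentType(string $content): string
{
    // Any tag-like token such as <p> or </div> suggests HTML.
    if (preg_match('/<\/?[a-z][a-z0-9]*\b[^>]*>/i', $content) === 1) {
        return 'text/html';
    }

    // "# Heading" lines or [text](url) links suggest Markdown.
    if (preg_match('/^#{1,6}\s+\S/m', $content) === 1
        || preg_match('/\[[^\]]+\]\([^)]+\)/', $content) === 1) {
        return 'text/markdown';
    }

    return 'text/plain';
}
```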

Language Detection

Basic language detection is based on:

  • HTML lang attribute
  • Content analysis patterns
  • Character encoding detection

$language = $analysis['language']; // 'en', 'es', 'fr', etc.
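The first signal, the HTML lang attribute, is simple to read with DOMDocument. The sketch below handles only that signal and returns null when no attribute is present; the function name sketchLanguageFromHtml is hypothetical, and the content-pattern and encoding checks are left out.

```php
<?php
// Hypothetical helper for illustration; covers only the lang-attribute signal.
function sketchLanguageFromHtml(string $html): ?string
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html, LIBXML_NONET);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $dom->documentElement; // the <html> element, created by the parser if absent
    $lang = $root !== null ? trim($root->getAttribute('lang')) : '';

    if ($lang === '') {
        return null; // no declaration; a fuller detector would fall back to content heuristics
    }

    // Keep only the primary subtag: "en-US" or "en_US" becomes "en".
    return strtolower(explode('-', str_replace('_', '-', $lang))[0]);
}
```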

Advanced Analysis Options

Custom Content Selectors

$analyzer = new ContentAnalyzer([
    'main_content_selector' => 'article.content',
    'exclude_selectors' => ['.sidebar', '.navigation']
]);

Analysis Depth Control

$analysis = $analyzer->analyze($content, $metadata, [
    'extract_images' => true,
    'extract_links' => true,
    'extract_keywords' => true,
    'analyze_readability' => false
]);

Integration with SEO Generation

The content analysis feeds directly into SEO generators:

use Rumenx\PhpSeo\SeoManager;

$seo = new SeoManager();
$analysis = $seo->analyze($content, $metadata);

// Analysis data is automatically used by generators
$title = $seo->generateTitle(); // Uses headings and keywords
$description = $seo->generateDescription(); // Uses main content
$metaTags = $seo->generateMetaTags(); // Uses all analysis data

Performance Considerations

Caching

  • Analysis results are automatically cached
  • Cache keys based on content hash
  • Configurable cache TTL
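To illustrate what content-hash cache keys with a TTL can look like, here is a small in-memory sketch. It does not reflect the package's actual cache backend or key format: the function names, the 'seo_analysis:' prefix, and the array-based store are all assumptions.

```php
<?php
// Hypothetical helpers for illustration; the package's cache backend and key format may differ.
function sketchAnalysisCacheKey(string $content, array $options = []): string
{
    ksort($options); // the same options in a different order should share a key
    return 'seo_analysis:' . hash('sha256', $content . "\0" . json_encode($options));
}

function sketchCachedAnalysis(array &$cache, string $content, callable $analyze, int $ttl = 3600): array
{
    $key = sketchAnalysisCacheKey($content);
    $now = time();

    // Serve the stored result while it has not expired.
    if (isset($cache[$key]) && $cache[$key]['expires'] > $now) {
        return $cache[$key]['value'];
    }

    // Otherwise run the analysis and store it with an expiry timestamp.
    $value = $analyze($content);
    $cache[$key] = ['value' => $value, 'expires' => $now + $ttl];

    return $value;
}
```

Because the key depends only on the content and options, editing the content produces a new key, so stale results are never served for changed pages.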

Memory Management

  • Large documents are processed in chunks
  • DOM memory is freed after processing
  • Configurable memory limits

Processing Limits

$analyzer = new ContentAnalyzer([
    'max_content_length' => 100000,  // 100KB limit
    'max_headings' => 50,           // Limit heading extraction
    'max_images' => 20,             // Limit image extraction
    'max_links' => 100              // Limit link extraction
]);

Error Handling

The analyzer tolerates malformed HTML. If analysis still fails, catch the exception and fall back to a basic analysis:

try {
    $analysis = $analyzer->analyze($content);
} catch (ContentAnalysisException $e) {
    // Handle analysis errors
    $fallback = $analyzer->getBasicAnalysis($content);
}

Best Practices

1. Content Quality

  • Use semantic HTML structure
  • Include proper heading hierarchy
  • Add meaningful alt text to images
  • Use descriptive link text

2. Performance

  • Cache analysis results for static content
  • Limit analysis scope for large documents
  • Use async processing for bulk analysis

3. SEO Optimization

  • Ensure main content is easily identifiable
  • Use structured heading hierarchy (H1 → H2 → H3)
  • Include relevant keywords naturally
  • Optimize image metadata

Examples

Basic Content Analysis

$content = '
<article>
    <h1>SEO Best Practices</h1>
    <p>Content optimization is crucial...</p>
    <h2>Title Optimization</h2>
    <p>Your page title should...</p>
    <img src="seo-guide.jpg" alt="SEO optimization guide">
</article>
';

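$analyzer = new ContentAnalyzer();
$analysis = $analyzer->analyze($content);

// End each line explicitly so the two messages do not run together
echo "Found {$analysis['word_count']} words", PHP_EOL;
echo "Main heading: {$analysis['headings'][0]['text']}", PHP_EOL;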

Advanced Analysis with Metadata

$analysis = $analyzer->analyze($content, [
    'title' => 'SEO Guide',
    'url' => 'https://example.com/seo-guide',
    'author' => 'SEO Expert',
    'published_at' => '2024-01-01'
]);

// Analysis includes metadata context
$enrichedKeywords = $analysis['keywords']; // Includes title words

The content analyzer supplies the headings, keywords, and main-content data that the SEO generators use to build optimized titles, descriptions, and meta tags.
