Content Analysis
Rumen Damyanov edited this page Sep 22, 2025 · 1 revision
The PHP-SEO package includes a content analyzer that extracts structure and metadata from your HTML content and uses it to generate optimized SEO elements.
The ContentAnalyzer class processes HTML content and extracts:
- Headings (H1-H6) with hierarchy
- Images with alt text and metadata
- Internal and external links
- Keywords and key phrases
- Content metrics (word count, language detection)
- Main content identification
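To make the list above more concrete, here is a minimal sketch of how headings and links could be pulled out of HTML with PHP's DOMDocument and DOMXPath. The function names (`sketchLoadDom`, `sketchExtractHeadings`, `sketchExtractLinks`) and the host-comparison rule for internal versus external links are assumptions made for illustration; they are not the ContentAnalyzer implementation.

```php
<?php
// Illustrative sketch only: these helpers are hypothetical and are not
// part of the PHP-SEO package.

function sketchLoadDom(string $html): DOMDocument
{
    // Collect libxml parse errors internally so malformed markup
    // does not emit PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // The encoding hint asks the parser to treat the input as UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $dom;
}

function sketchExtractHeadings(DOMDocument $dom): array
{
    $headings = [];
    $query = '//*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6]';
    foreach ((new DOMXPath($dom))->query($query) as $node) {
        $headings[] = [
            'level' => (int) substr($node->nodeName, 1), // "h2" becomes 2
            'text'  => trim($node->textContent),
        ];
    }
    return $headings;
}

function sketchExtractLinks(DOMDocument $dom, string $siteHost): array
{
    $links = [];
    foreach ((new DOMXPath($dom))->query('//a[@href]') as $a) {
        $href = $a->getAttribute('href');
        $host = parse_url($href, PHP_URL_HOST);
        // Assumption: a link is external only when it names a different host;
        // relative URLs (no host) are treated as internal.
        $isExternal = is_string($host) && strcasecmp($host, $siteHost) !== 0;
        $links[] = [
            'url'  => $href,
            'text' => trim($a->textContent),
            'type' => $isExternal ? 'external' : 'internal',
        ];
    }
    return $links;
}

$dom = sketchLoadDom('<h1>My Page</h1><p><a href="/about">About</a></p>');
$headings = sketchExtractHeadings($dom);
$links = sketchExtractLinks($dom, 'example.com');
```

The result arrays mirror the shapes documented on this page (`level`/`text` for headings, `url`/`text`/`type` for links), but the real analyzer may classify links differently.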
The analyzer uses PHP's DOMDocument to parse HTML content safely:
use Rumenx\PhpSeo\Analyzers\ContentAnalyzer;
$analyzer = new ContentAnalyzer();
$content = '<h1>My Page</h1><p>Content here...</p>';
$analysis = $analyzer->analyze($content, [
'title' => 'Page Title',
'url' => 'https://example.com/page'
]);
// Extract all headings with levels
$headings = $analysis['headings'];
// Result: [
// ['level' => 1, 'text' => 'Main Title'],
// ['level' => 2, 'text' => 'Section Title'],
// ...
// ]
// Extract images with metadata
$images = $analysis['images'];
// Result: [
// [
// 'src' => 'image.jpg',
// 'alt' => 'Image description',
// 'title' => 'Image title',
// 'width' => 800,
// 'height' => 600
// ]
// ]
// Extract all links
$links = $analysis['links'];
// Result: [
// [
// 'url' => 'https://example.com',
// 'text' => 'Link text',
// 'type' => 'external' // or 'internal'
// ]
// ]
// Content statistics
$metrics = [
'word_count' => $analysis['word_count'],
'character_count' => $analysis['character_count'],
'language' => $analysis['language'],
'content_type' => $analysis['content_type']
];
The analyzer identifies the main content using multiple strategies:
- <main> element (highest priority)
- <article> element
- Content within <section> tags
- Paragraphs with substantial text
- Areas with high text-to-markup ratio
- Content blocks with multiple sentences
- Largest text blocks
- Content outside navigation/sidebar areas
- Text following heading structures
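The semantic-element priority at the top of this list could be sketched roughly as follows. This covers only the `<main>`, `<article>`, `<section>` ordering with a `<body>` fallback; the text-density and position heuristics are left out, and the function name `sketchFindMainContent` is hypothetical.

```php
<?php
// Illustrative sketch only: not the ContentAnalyzer implementation.
// Tries semantic containers in priority order and returns the text
// of the first one that exists.

function sketchFindMainContent(string $html): string
{
    // Suppress warnings for malformed markup and for HTML5 tags that the
    // underlying parser does not recognise.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Assumption: <body> is used as a last-resort fallback.
    foreach (['//main', '//article', '//section', '//body'] as $query) {
        $nodes = $xpath->query($query);
        if ($nodes !== false && $nodes->length > 0) {
            $text = $nodes->item(0)->textContent;
            // Collapse runs of whitespace left over from the markup.
            return trim(preg_replace('/\s+/u', ' ', $text) ?? '');
        }
    }
    return '';
}

$main = sketchFindMainContent('<nav>Menu</nav><main><p>Real content.</p></main>');
```

A fuller version would score candidate blocks by text-to-markup ratio and exclude navigation or sidebar areas, as the remaining bullets describe.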
Example:
$mainContent = $analysis['main_content'];
// Contains the most relevant content for SEO analysis
The analyzer extracts keywords using:
- Word frequency counting
- Stop word filtering
- Stemming and normalization
- Words near headings
- Bold/emphasized text
- Link anchor text
- Meta keywords (if present)
- Alt text from images
- Title attributes
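As a rough illustration of the first two techniques (word frequency counting and stop word filtering), a minimal sketch might look like the one below. The stop word list, the three-character minimum, and the function name `sketchExtractKeywords` are assumptions; stemming and the weighting of headings, emphasized text, and anchor text are not shown.

```php
<?php
// Illustrative sketch only: frequency counting with stop word filtering.
// No stemming and no positional weighting is applied here.

function sketchExtractKeywords(string $text, int $limit = 5): array
{
    // Assumption: a tiny English stop word list for demonstration.
    $stopWords = ['the', 'and', 'for', 'with', 'this', 'that', 'are', 'was', 'from', 'your'];

    // strtolower() only folds ASCII letters; a production version would
    // likely use mb_strtolower() for multilingual text.
    $words = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        return [];
    }

    $counts = [];
    foreach ($words as $word) {
        if (strlen($word) < 3 || in_array($word, $stopWords, true)) {
            continue; // skip very short words and stop words
        }
        $counts[$word] = ($counts[$word] ?? 0) + 1;
    }

    arsort($counts); // most frequent first
    // Numeric words become integer array keys, so cast back to strings.
    return array_map('strval', array_slice(array_keys($counts), 0, $limit));
}

$keywords = sketchExtractKeywords(
    'SEO content analysis improves SEO. Content optimization helps SEO.',
    2
);
```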
Example:
$keywords = $analysis['keywords'];
// Result: ['seo', 'optimization', 'content', 'analysis']
The analyzer detects content types:
switch ($analysis['content_type']) {
case 'text/html':
// Standard HTML content
break;
case 'text/markdown':
// Markdown content
break;
case 'text/plain':
// Plain text content
break;
}
Basic language detection is based on:
- HTML lang attribute
- Content analysis patterns
- Character encoding detection
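The first of these signals, the `lang` attribute on the `<html>` element, is simple to read with DOMDocument. The sketch below assumes a fallback of `'en'` and reduces tags such as `en-US` to their primary subtag; both choices, and the name `sketchDetectLanguage`, are illustrative rather than documented behavior of the analyzer. Pattern-based and encoding-based detection are not covered.

```php
<?php
// Illustrative sketch only: reads the lang attribute of <html> and
// falls back to a default when it is missing.

function sketchDetectLanguage(string $html, string $default = 'en'): string
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $attrs = (new DOMXPath($dom))->query('//html/@lang');
    if ($attrs === false || $attrs->length === 0) {
        return $default;
    }

    $value = trim($attrs->item(0)->nodeValue ?? '');
    if ($value === '') {
        return $default;
    }

    // Assumption: keep only the primary subtag, so "en-US" becomes "en".
    return strtolower(explode('-', $value)[0]);
}

$language = sketchDetectLanguage('<html lang="es"><body><p>Hola</p></body></html>');
```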
$language = $analysis['language']; // 'en', 'es', 'fr', etc.
$analyzer = new ContentAnalyzer([
'main_content_selector' => 'article.content',
'exclude_selectors' => ['.sidebar', '.navigation']
]);
$analysis = $analyzer->analyze($content, $metadata, [
'extract_images' => true,
'extract_links' => true,
'extract_keywords' => true,
'analyze_readability' => false
]);
The content analysis feeds directly into SEO generators:
use Rumenx\PhpSeo\SeoManager;
$seo = new SeoManager();
$analysis = $seo->analyze($content, $metadata);
// Analysis data is automatically used by generators
$title = $seo->generateTitle(); // Uses headings and keywords
$description = $seo->generateDescription(); // Uses main content
$metaTags = $seo->generateMetaTags(); // Uses all analysis data
- Analysis results are automatically cached
- Cache keys based on content hash
- Configurable cache TTL
- Large documents are processed in chunks
- DOM memory is freed after processing
- Configurable memory limits
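The caching points above (keys derived from a content hash, with a configurable TTL) could be sketched as a small in-memory cache like the one below. The class `SketchAnalysisCache`, its `remember` method, the `analysis_` key prefix, and the choice of SHA-256 are all assumptions for illustration; the package's actual cache layer may be structured quite differently.

```php
<?php
// Illustrative sketch only: an in-memory cache keyed on a content hash.
// Requires PHP 8.0+ for constructor property promotion.

final class SketchAnalysisCache
{
    /** @var array<string, array{value: array, expires: int}> */
    private array $entries = [];

    public function __construct(private int $ttlSeconds = 3600)
    {
    }

    // Identical content always maps to the same key.
    public static function keyFor(string $content): string
    {
        return 'analysis_' . hash('sha256', $content);
    }

    // Returns the cached analysis, or runs $analyze and stores the result.
    // $now is injectable so expiry can be tested without waiting.
    public function remember(string $content, callable $analyze, ?int $now = null): array
    {
        $now ??= time();
        $key = self::keyFor($content);

        if (isset($this->entries[$key]) && $this->entries[$key]['expires'] > $now) {
            return $this->entries[$key]['value'];
        }

        $value = $analyze($content);
        $this->entries[$key] = ['value' => $value, 'expires' => $now + $this->ttlSeconds];
        return $value;
    }
}

$cache = new SketchAnalysisCache(600); // 10 minute TTL
$result = $cache->remember(
    '<h1>My Page</h1>',
    fn (string $html): array => ['character_count' => strlen($html)]
);
```

Hashing the content rather than the URL means an edited page is re-analyzed automatically, while unchanged content is served from the cache until the TTL expires.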
$analyzer = new ContentAnalyzer([
'max_content_length' => 100000, // 100KB limit
'max_headings' => 50, // Limit heading extraction
'max_images' => 20, // Limit image extraction
'max_links' => 100 // Limit link extraction
]);
The analyzer handles malformed HTML gracefully:
try {
$analysis = $analyzer->analyze($content);
} catch (ContentAnalysisException $e) {
// Handle analysis errors
$fallback = $analyzer->getBasicAnalysis($content);
}
- Use semantic HTML structure
- Include proper heading hierarchy
- Add meaningful alt text to images
- Use descriptive link text
- Cache analysis results for static content
- Limit analysis scope for large documents
- Use async processing for bulk analysis
- Ensure main content is easily identifiable
- Use structured heading hierarchy (H1 → H2 → H3)
- Include relevant keywords naturally
- Optimize image metadata
$content = '
<article>
<h1>SEO Best Practices</h1>
<p>Content optimization is crucial...</p>
<h2>Title Optimization</h2>
<p>Your page title should...</p>
<img src="seo-guide.jpg" alt="SEO optimization guide">
</article>
';
$analysis = $analyzer->analyze($content);
echo "Found {$analysis['word_count']} words" . PHP_EOL;
echo "Main heading: {$analysis['headings'][0]['text']}" . PHP_EOL;
$analysis = $analyzer->analyze($content, [
'title' => 'SEO Guide',
'url' => 'https://example.com/seo-guide',
'author' => 'SEO Expert',
'published_at' => '2024-01-01'
]);
// Analysis includes metadata context
$enrichedKeywords = $analysis['keywords']; // Includes title words
The content analyzer is the foundation of intelligent SEO generation, providing rich context for creating optimized titles, descriptions, and meta tags.