[Bug]: Incorrect rendering of inline code inside of links #583

Open
@dmurat

Description

crawl4ai version

0.4.248b3

Expected Behavior

Correct rendering of links containing inline code. For example:

<a href="https://docs.spring.io/spring-framework/docs/6.2.x/javadoc-api/org/springframework/context/annotation/Configuration.html" class="apiref"><code>@Configuration</code></a>

should be rendered as

[`@Configuration`](https://docs.spring.io/spring-framework/docs/6.2.x/javadoc-api/org/springframework/context/annotation/Configuration.html)

Current Behavior

Currently, a link that contains inline code is rendered as the inline code first, followed by a link with the correct URL but empty text, as in:

`@Configuration`[](https://docs.spring.io/spring-framework/docs/6.2.x/javadoc-api/org/springframework/context/annotation/Configuration.html)
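The symptom above has a recognizable shape (inline code immediately followed by an empty-text link), so it can be detected in generated Markdown. The following is a minimal sketch of such a check plus a post-processing workaround, not the fix itself; the function names and the regular expression are illustrative assumptions and are not part of crawl4ai.

```python
import re

# Inline code immediately followed by a link whose text is empty: `x`[](url)
# (illustrative pattern; assumes the URL contains no spaces or parentheses)
EMPTY_LINK_AFTER_CODE = re.compile(r"`([^`\n]+)`\[\]\(([^)\s]+)\)")


def find_detached_code_links(markdown: str) -> list[tuple[str, str]]:
    """Return (code_text, url) pairs where the code ended up outside the link."""
    return EMPTY_LINK_AFTER_CODE.findall(markdown)


def repair_detached_code_links(markdown: str) -> str:
    """Move the inline code back inside the brackets of the empty link."""
    return EMPTY_LINK_AFTER_CODE.sub(r"[`\1`](\2)", markdown)
```

Running `find_detached_code_links` over `crawl_result` Markdown is a quick way to confirm whether a given page is affected; the repair function only masks the problem after conversion.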

Is this reproducible?

Yes

Inputs Causing the Bug

- URL: https://docs.spring.io/spring-boot/how-to/security.html
- css_selector: "article.doc > *:not(.breadcrumbs-container):not(aside):not(nav)"
- excluded_selector: ".source-toolbox, .ulist.tablist, .tab:not(.is-selected), .tabpanel.is-hidden"

Steps to Reproduce

Code snippets

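from crawl4ai import CacheMode, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# CustomWebScrapingStrategy, crawler, url, css_selector and excluded_css_selector
# are defined elsewhere and are not shown in this report; the selectors are
# listed under "Inputs Causing the Bug".
crawler_run_config = CrawlerRunConfig(
    scraping_strategy=CustomWebScrapingStrategy(),
    css_selector=css_selector,
    excluded_selector=excluded_css_selector or "",
    exclude_external_links=True,
    exclude_external_images=True,
    markdown_generator=DefaultMarkdownGenerator(
        options={
            "skip_internal_links": True,
            "single_line_break": False,
            "protect_links": False,
            "pad_tables": True
        }
    ),
    process_iframes=False,
    magic=True,
    cache_mode=CacheMode.BYPASS,
    verbose=True,
)

# Must run inside an async function
crawl_result = await crawler.arun(
    url=url,
    config=crawler_run_config,
)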

OS

macOS

Python version

3.12.7

Browser

Chrome

Browser version

Version 132.0.6834.160 (Official Build) (arm64)

Error logs & Screenshots (if applicable)

No response

Activity

dmurat (Author) commented on Jan 29, 2025

Here is the outline of the fix I'm currently using, and it works ok as far as I can see:

class CustomHTML2Text(HTML2Text):
    def __init__(self, *args, handle_code_in_pre=False, **kwargs):
        super().__init__(*args, **kwargs)
        ...
        self.inside_link = False  # Add this to track if we're inside a link
        ...

    # fmt: off
    def handle_tag(self, tag, attrs, start):
        # Handle links
        if tag == "a":
            if start:
                self.inside_link = True
            else:
                self.inside_link = False

            super().handle_tag(tag, attrs, start)
            return

        ...
        # Handle pre tags
        if tag == 'pre':
            ...

        elif tag == 'code':
            if self.inside_pre and not self.handle_code_in_pre:
                return

            if start:
                if not self.inside_link:
                    self.o("`")  # Only output backtick if not inside a link
                self.inside_code = True
            else:
                if not self.inside_link:
                    self.o("`")  # Only output backtick if not inside a link
                self.inside_code = False

            # If inside a link, let the parent class handle the content
            if self.inside_link:
                super().handle_tag(tag, attrs, start) 

        else:
            super().handle_tag(tag, attrs, start)

    ...

HTH
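To see the underlying idea in isolation, here is a self-contained sketch built on the standard library's `html.parser` rather than on crawl4ai's `CustomHTML2Text`. It uses a slightly different technique from the outline above: instead of suppressing backticks inside a link, it buffers everything emitted between `<a>` and `</a>` and writes the whole link at the closing tag. The class and function names are hypothetical, and only `<a>` and `<code>` are handled.

```python
from html.parser import HTMLParser


class LinkAwareConverter(HTMLParser):
    """Toy HTML-to-Markdown converter that handles only <a> and <code>."""

    def __init__(self):
        super().__init__()
        self.out = []          # finished Markdown fragments
        self.link_text = None  # buffer for link text; None when outside a link
        self.link_href = ""

    def _emit(self, text):
        # Inside a link, collect text in the buffer instead of the output.
        if self.link_text is not None:
            self.link_text.append(text)
        else:
            self.out.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_href = dict(attrs).get("href") or ""
            self.link_text = []
        elif tag == "code":
            self._emit("`")

    def handle_endtag(self, tag):
        if tag == "a" and self.link_text is not None:
            text = "".join(self.link_text)
            self.link_text = None
            self.out.append(f"[{text}]({self.link_href})")
        elif tag == "code":
            self._emit("`")

    def handle_data(self, data):
        self._emit(data)


def to_markdown(html: str) -> str:
    parser = LinkAwareConverter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because the backticks go through the same `_emit` path as the text, they land inside the brackets whenever a link is open, which is the output the issue asks for.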

Label added: 📌 Root caused (identified the root cause of bug); another label removed, on Jan 31, 2025
aravindkarnam (Collaborator) commented on Jan 31, 2025

@dmurat Thanks for pointing this out and for your suggestion. Looks like you already fixed it. Could you raise a PR for this?

dmurat (Author) commented on Jan 31, 2025

@aravindkarnam Sure, I can try. One question, though: are there any existing tests I can use as examples?

aravindkarnam (Collaborator) commented on Jan 31, 2025

@dmurat There are several examples in the /tests folder.
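For a regression test, one possible shape is a small table of HTML inputs with their expected Markdown, run against whatever converter is under test. The sketch below assumes nothing about the layout of the /tests folder; `CASES` and `failing_cases` are hypothetical names, the first case is taken from this report, and the second is an illustrative guard so that code outside links keeps its backticks.

```python
from typing import Callable

# (html, expected_markdown) pairs
CASES = [
    (
        '<a href="https://docs.spring.io/spring-framework/docs/6.2.x/javadoc-api/'
        'org/springframework/context/annotation/Configuration.html" class="apiref">'
        "<code>@Configuration</code></a>",
        "[`@Configuration`](https://docs.spring.io/spring-framework/docs/6.2.x/"
        "javadoc-api/org/springframework/context/annotation/Configuration.html)",
    ),
    ("<code>@Bean</code>", "`@Bean`"),
]


def failing_cases(convert: Callable[[str], str]) -> list[tuple[str, str, str]]:
    """Return (html, expected, actual) for every case the converter gets wrong."""
    failures = []
    for html, expected in CASES:
        actual = convert(html).strip()
        if actual != expected:
            failures.append((html, expected, actual))
    return failures
```

In a test, `convert` would wrap the real HTML-to-Markdown call and the assertion would be that `failing_cases(convert)` is empty.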

aravindkarnam (Collaborator) commented on Feb 10, 2025

@dmurat Were you able to make any progress on this?

dmurat (Author) commented on Feb 10, 2025

@aravindkarnam Sorry, I haven't found the time yet. Maybe this week or next, if you can wait.


Metadata

Labels

- ☕ Low: Priority - Low
- ⚙️ Under Test: Bug fix / Feature request that's under testing
- 🐞 Bug: Something isn't working
- 💪 - Intermediate: Difficulty level - Intermediate
- 📌 Root caused: identified the root cause of bug


    [Bug]: Incorrect rendering of inline code inside of links · Issue #583 · unclecode/crawl4ai