feat: better parsing of tables #291

RohitR311 · 2024-12-23T21:03:48Z

Enhancing Scrape List to include table scraping.

Summary by CodeRabbit

New Features
- Enhanced scraping functionality with improved handling of complex HTML structures, particularly tables.
- Improved selector generation for table cells, ensuring uniqueness based on their position.
Bug Fixes
- Enhanced error handling for selector generation functions with clearer logging.
Documentation
- Updated function signatures for clarity on usage.

coderabbitai · 2024-12-23T21:03:56Z

Walkthrough

The pull request introduces significant modifications to the web scraping functionality across two files: maxun-core/src/browserSide/scraper.js and server/src/workflow-management/selector.ts. Key updates include enhancements to the scrapeList function for improved handling of table data, the addition of helper functions for value extraction and ancestor identification, and modifications to selector generation for <td> and <th> elements, ensuring unique selectors based on their position within tables.

Changes

File	Change Summary
`maxun-core/src/browserSide/scraper.js`	- Enhanced `scrapeList` function with new helper functions `extractValue` and `findTableAncestor` - Categorized scraped fields into `tableFields` and `nonTableFields` - Improved data extraction logic for tables and non-table elements - Updated function signature for `scrapeList`
`server/src/workflow-management/selector.ts`	- Modified `getNonUniqueSelectors` and `getChildSelectors` to specifically handle `<td>` and `<th>` elements - Implemented unique selectors using `:nth-child()` for table cells - Enhanced error handling with `console.error` logging - Function signatures updated without changes in their signatures

Sequence Diagram

sequenceDiagram
    participant Scraper as Scraper
    participant Browser as Browser
    participant Selector as Selector

    Scraper->>Browser: Initiate scraping
    Browser->>Selector: Generate precise selectors
    Selector-->>Browser: Return table cell selectors
    Browser->>Browser: Categorize fields (table/non-table)
    Browser->>Browser: Process each row/element
    Browser-->>Scraper: Return scraped data

Poem

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2f83029 and 5ac88c6.

📒 Files selected for processing (2)

maxun-core/src/browserSide/scraper.js (1 hunks)
server/src/workflow-management/selector.ts (3 hunks)

🔇 Additional comments (4)

maxun-core/src/browserSide/scraper.js (2)

291-307: Validate attribute presence
When scraping table fields, if the designated attribute does not exist, “value” may be undefined, resulting in null entries. Consider robust error handling or default values for missing attributes.

265-275: Avoid overshadowing for 'TD', 'TH', 'TR' detection
Consider whether the tag check should also account for 'tbody' or 'table' context. Nested tables, row/column spans, or varying table structures may cause logic mismatches.

Run this script to search for additional usage of table-specific tags to ensure no overshadowing:

✅ Verification successful

Let me analyze the search results and generate additional verification scripts to check for table-related context.

The first search gave us broad context about table tags usage. Now let's look specifically at the scraper implementation and any related table handling code to make a more informed decision about whether we need to account for tbody/table tags.

Let me check the specific implementation around the tableFields usage to understand the context better.

Let me fix the regex and try again to get the context around tableFields usage.

Based on the code context, I can now provide a final response:

Table tag detection is appropriately scoped
The current implementation correctly focuses on TD, TH, and TR tags which are the essential table cell and row elements. Including TBODY or TABLE tags is unnecessary because:

The code's purpose is to identify individual field containers within a table structure
TD/TH/TR tags are sufficient for capturing table-structured data as they represent the actual data containers
The parent table context is already handled through the listSelector parameter which identifies the containing element

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# This searches for potential references to table structures beyond TD, TH, and TR
rg -A 3 -i '(table|tbody|td|th|tr)'

Length of output: 65585

Script:

#!/bin/bash
# Search for table-related code in the scraper file
ast-grep --pattern 'const tableFields = {$$$}'

# Search for any table-related functions or methods
ast-grep --pattern 'function $_($$table$$) {
  $$$
}'

# Search for table-related imports
rg -l "import.*table" --type js --type ts

Length of output: 273

Script:

#!/bin/bash
# Get the context around tableFields usage
rg -B 10 -A 10 "const tableFields = {}" maxun-core/src/browserSide/scraper.js

# Check if there are any table-related methods or functions in the same file
ast-grep --pattern 'function $name($params) {
  $$$
  table
  $$$
}'

Length of output: 276

Script:

#!/bin/bash
# Get the context around tableFields usage with fixed regex
rg -B 10 -A 10 "const tableFields = \{\}" maxun-core/src/browserSide/scraper.js

# Also check for any table-related field processing
rg -B 5 -A 10 "tableFields\[" maxun-core/src/browserSide/scraper.js

Length of output: 1696

server/src/workflow-management/selector.ts (2)

1007-1012: Ensure robust scanning of table descendants
When generating selectors for child “td” elements, watch out for multiple nested tables. The current approach might overindex if a parent’s siblings contain more child nodes.

872-878: Edge cases for table cell indexing
The approach using “:nth-child(position)” effectively targets the column but may not handle rowspans or colspans that shift child indices in nested tables. Consider verifying your approach with advanced table structures.

coderabbitai · 2024-12-23T21:05:48Z

maxun-core/src/browserSide/scraper.js

+            for (const [label, { selector, attribute }] of Object.entries(nonTableFields)) {
+                const element = parent.querySelector(selector);
+
+                if (element) {
+                    let value;
                    if (attribute === 'innerText') {
-                        record[label] = fieldElement.innerText.trim();
+                        value = element.innerText.trim();
                    } else if (attribute === 'innerHTML') {
-                        record[label] = fieldElement.innerHTML.trim();
-                    } else if (attribute === 'src') {
-                        // Handle relative 'src' URLs
-                        const src = fieldElement.getAttribute('src');
-                        record[label] = src ? new URL(src, window.location.origin).href : null;
-                    } else if (attribute === 'href') {
-                        // Handle relative 'href' URLs
-                        const href = fieldElement.getAttribute('href');
-                        record[label] = href ? new URL(href, window.location.origin).href : null;
+                        value = element.innerHTML.trim();
+                    } else if (attribute === 'src' || attribute === 'href') {
+                        const attrValue = element.getAttribute(attribute);
+                        value = attrValue ? new URL(attrValue, window.location.origin).href : null;
                    } else {
-                        record[label] = fieldElement.getAttribute(attribute);
+                        value = element.getAttribute(attribute);
                    }
+                    record[label] = value;
                }
            }
-            scrapedData.push(record);
-        }
+


🛠️ Refactor suggestion

Enhance non-table field queries
Note that using parent.querySelector may only fetch the first matching element. If multiple elements match the same selector, subsequent ones will be ignored. Consider using querySelectorAll or clarifying that only the first match is necessary.

-for (const [label, { selector, attribute }] of Object.entries(nonTableFields)) { - const element = parent.querySelector(selector); +for (const [label, { selector, attribute }] of Object.entries(nonTableFields)) { + const elements = Array.from(parent.querySelectorAll(selector)); + if (elements.length === 0) { + continue; + } + const element = elements[0]; // or handle all elements

Committable suggestion skipped: line range outside the PR's diff.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5ac88c6 and b411faf.

📒 Files selected for processing (1)

server/src/workflow-management/selector.ts (9 hunks)

🔇 Additional comments (1)

server/src/workflow-management/selector.ts (1)

1043-1048: Reuse the shared table cell handling utility.

This is another occurrence of the duplicated table cell handling logic. Use the previously suggested getTableCellSelector utility function here as well.

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (4)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b411faf and 3176fa2.

📒 Files selected for processing (1)

maxun-core/src/browserSide/scraper.js (1 hunks)

🔇 Additional comments (1)

maxun-core/src/browserSide/scraper.js (1)

265-342: Verify scraping behavior with different table structures

Please ensure testing covers various table structures and edge cases:

Tables with merged cells
Nested tables
Tables with missing cells
Tables with mixed content types (text, links, images)

coderabbitai · 2024-12-24T12:02:38Z

maxun-core/src/browserSide/scraper.js

+    const parentElements = Array.from(document.querySelectorAll(listSelector));
    const scrapedData = [];

-    while (scrapedData.length < limit) {
-        let parentElements = Array.from(document.querySelectorAll(listSelector));
+    for (const parent of parentElements) {
+        // Get the first field's elements to determine how many items we have
+        const firstField = Object.values(fields)[0];
+        const baseElements = Array.from(parent.querySelectorAll(firstField.selector));


⚠️ Potential issue

Add validation for field configuration

The code assumes the first field and its selector are always valid. Add validation to handle edge cases:

const parentElements = Array.from(document.querySelectorAll(listSelector)); const scrapedData = []; +if (Object.keys(fields).length === 0) { + return scrapedData; +} + const firstField = Object.values(fields)[0]; +if (!firstField?.selector) { + throw new Error('Invalid field configuration: missing selector'); +} + const baseElements = Array.from(parent.querySelectorAll(firstField.selector));

Committable suggestion skipped: line range outside the PR's diff.

coderabbitai · 2024-12-24T12:02:39Z

maxun-core/src/browserSide/scraper.js

+            // Process non-table fields
+            for (const [label, { selector, attribute }] of Object.entries(nonTableFields)) {
+                const elements = Array.from(parent.querySelectorAll(selector));
+                // Use the same index to maintain correspondence between fields
+                const element = elements[i];
+
+                if (element) {
+                    let value;
                    if (attribute === 'innerText') {
-                        record[label] = fieldElement.innerText.trim();
+                        value = element.innerText.trim();
                    } else if (attribute === 'innerHTML') {
-                        record[label] = fieldElement.innerHTML.trim();
-                    } else if (attribute === 'src') {
-                        // Handle relative 'src' URLs
-                        const src = fieldElement.getAttribute('src');
-                        record[label] = src ? new URL(src, window.location.origin).href : null;
-                    } else if (attribute === 'href') {
-                        // Handle relative 'href' URLs
-                        const href = fieldElement.getAttribute('href');
-                        record[label] = href ? new URL(href, window.location.origin).href : null;
+                        value = element.innerHTML.trim();
+                    } else if (attribute === 'src' || attribute === 'href') {
+                        const attrValue = element.getAttribute(attribute);
+                        value = attrValue ? new URL(attrValue, window.location.origin).href : null;
                    } else {
-                        record[label] = fieldElement.getAttribute(attribute);
+                        value = element.getAttribute(attribute);
                    }
+                    record[label] = value;
                }
            }


🛠️ Refactor suggestion

Apply extracted field processing logic

Apply the same helper function here to maintain consistency and reduce code duplication.

// Process non-table fields for (const [label, { selector, attribute }] of Object.entries(nonTableFields)) { const elements = Array.from(parent.querySelectorAll(selector)); const element = elements[i]; - if (element) { - let value; - if (attribute === 'innerText') { - value = element.innerText.trim(); - } else if (attribute === 'innerHTML') { - value = element.innerHTML.trim(); - } else if (attribute === 'src' || attribute === 'href') { - const attrValue = element.getAttribute(attribute); - value = attrValue ? new URL(attrValue, window.location.origin).href : null; - } else { - value = element.getAttribute(attribute); - } - record[label] = value; - } + const value = processFieldValue(element, attribute); + if (value !== null) { + record[label] = value; + } }

Committable suggestion skipped: line range outside the PR's diff.

coderabbitai · 2024-12-24T12:02:39Z

maxun-core/src/browserSide/scraper.js

+            // Process table fields
+            for (const [label, { selector, attribute }] of Object.entries(tableFields)) {
+                const elements = Array.from(parent.querySelectorAll(selector));
+                // Use the same index to maintain correspondence between fields
+                const element = elements[i];

-                const firstMatch = document.querySelector(listSelector);
-                if (firstMatch) {
-                    // Get classes from the first matching element
-                    const firstMatchClasses = Array.from(firstMatch.classList);
-
-                    // Find similar elements by matching most of their classes
-                    parentElements = allChildren.filter(element => {
-                        const elementClasses = Array.from(element.classList);
-
-                        // Element should share at least 70% of classes with the first match
-                        const commonClasses = firstMatchClasses.filter(cls => 
-                            elementClasses.includes(cls));
-                        return commonClasses.length >= Math.floor(firstMatchClasses.length * 0.7);
-                    });
+                if (element) {
+                    let value;
+                    if (attribute === 'innerText') {
+                        value = element.innerText.trim();
+                    } else if (attribute === 'innerHTML') {
+                        value = element.innerHTML.trim();
+                    } else if (attribute === 'src' || attribute === 'href') {
+                        const attrValue = element.getAttribute(attribute);
+                        value = attrValue ? new URL(attrValue, window.location.origin).href : null;
+                    } else {
+                        value = element.getAttribute(attribute);
+                    }
+                    record[label] = value;
                }
            }


🛠️ Refactor suggestion

Extract common field processing logic

The attribute processing logic is duplicated between table and non-table fields. Extract it into a helper function:

+function processFieldValue(element, attribute) { + if (!element) return null; + + switch (attribute) { + case 'innerText': + return element.innerText.trim(); + case 'innerHTML': + return element.innerHTML.trim(); + case 'src': + case 'href': + const attrValue = element.getAttribute(attribute); + try { + return attrValue ? new URL(attrValue, window.location.origin).href : null; + } catch (error) { + console.warn(`Invalid URL: ${attrValue}`); + return null; + } + default: + return element.getAttribute(attribute); + } +} + // Process table fields for (const [label, { selector, attribute }] of Object.entries(tableFields)) { const elements = Array.from(parent.querySelectorAll(selector)); // Use the same index to maintain correspondence between fields const element = elements[i]; - if (element) { - let value; - if (attribute === 'innerText') { - value = element.innerText.trim(); - } else if (attribute === 'innerHTML') { - value = element.innerHTML.trim(); - } else if (attribute === 'src' || attribute === 'href') { - const attrValue = element.getAttribute(attribute); - value = attrValue ? new URL(attrValue, window.location.origin).href : null; - } else { - value = element.getAttribute(attribute); - } - record[label] = value; - } + const value = processFieldValue(element, attribute); + if (value !== null) { + record[label] = value; + } }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// Process table fields

for (const [label, { selector, attribute }] of Object.entries(tableFields)) {

const elements = Array.from(parent.querySelectorAll(selector));

// Use the same index to maintain correspondence between fields

const element = elements[i];

const firstMatch = document.querySelector(listSelector);

if (firstMatch) {

// Get classes from the first matching element

const firstMatchClasses = Array.from(firstMatch.classList);

// Find similar elements by matching most of their classes

parentElements = allChildren.filter(element => {

const elementClasses = Array.from(element.classList);

// Element should share at least 70% of classes with the first match

const commonClasses = firstMatchClasses.filter(cls =>

elementClasses.includes(cls));

return commonClasses.length >= Math.floor(firstMatchClasses.length * 0.7);

});

if (element) {

let value;

if (attribute === 'innerText') {

value = element.innerText.trim();

} else if (attribute === 'innerHTML') {

value = element.innerHTML.trim();

} else if (attribute === 'src' || attribute === 'href') {

const attrValue = element.getAttribute(attribute);

value = attrValue ? new URL(attrValue, window.location.origin).href : null;

} else {

value = element.getAttribute(attribute);

}

record[label] = value;

}

}

function processFieldValue(element, attribute) {

if (!element) return null;

switch (attribute) {

case 'innerText':

return element.innerText.trim();

case 'innerHTML':

return element.innerHTML.trim();

case 'src':

case 'href':

const attrValue = element.getAttribute(attribute);

try {

return attrValue ? new URL(attrValue, window.location.origin).href : null;

} catch (error) {

console.warn(`Invalid URL: ${attrValue}`);

return null;

}

default:

return element.getAttribute(attribute);

}

}

// Process table fields

for (const [label, { selector, attribute }] of Object.entries(tableFields)) {

const elements = Array.from(parent.querySelectorAll(selector));

// Use the same index to maintain correspondence between fields

const element = elements[i];

const value = processFieldValue(element, attribute);

if (value !== null) {

record[label] = value;

}

}

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (4)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3176fa2 and 50bcd4b.

📒 Files selected for processing (1)

maxun-core/src/browserSide/scraper.js (1 hunks)

🧰 Additional context used

🪛 Biome (1.9.4)

maxun-core/src/browserSide/scraper.js

[error] 301-301: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)

coderabbitai · 2024-12-25T16:51:04Z

maxun-core/src/browserSide/scraper.js

+              const record = {};
+
+              for (const [label, { selector, attribute }] of Object.entries(nonTableFields)) {
+                const elements = Array.from(parent.querySelectorAll(selector));


⚠️ Potential issue

Fix undefined variable reference

The code uses an undefined parent variable instead of the available container variable.

-const elements = Array.from(parent.querySelectorAll(selector)); +const elements = Array.from(container.querySelectorAll(selector));

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

const elements = Array.from(parent.querySelectorAll(selector));

const elements = Array.from(container.querySelectorAll(selector));

coderabbitai · 2024-12-25T16:51:04Z

maxun-core/src/browserSide/scraper.js

+    function getCellIndex(td) {
+        let index = 0;
+        let sibling = td;
+        while (sibling = sibling.previousElementSibling) {
+            index++;
+        }
+        return index;
+    }


🛠️ Refactor suggestion

Refactor the while loop condition

The assignment in the while condition was flagged by static analysis. Consider restructuring for better readability.

function getCellIndex(td) { let index = 0; let sibling = td; - while (sibling = sibling.previousElementSibling) { + while (sibling.previousElementSibling) { + sibling = sibling.previousElementSibling; index++; } return index; }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

function getCellIndex(td) {

let index = 0;

let sibling = td;

while (sibling = sibling.previousElementSibling) {

index++;

}

return index;

}

function getCellIndex(td) {

let index = 0;

let sibling = td;

while (sibling.previousElementSibling) {

sibling = sibling.previousElementSibling;

index++;

}

return index;

}

🧰 Tools

🪛 Biome (1.9.4)

[error] 301-301: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)

coderabbitai · 2024-12-25T16:51:04Z

maxun-core/src/browserSide/scraper.js

+    // Helper function to extract values from elements
+    function extractValue(element, attribute) {
+        if (!element) return null;
+
+        if (attribute === 'innerText') {
+            return element.innerText.trim();
+        } else if (attribute === 'innerHTML') {
+            return element.innerHTML.trim();
+        } else if (attribute === 'src' || attribute === 'href') {
+            const attrValue = element.getAttribute(attribute);
+            return attrValue ? new URL(attrValue, window.location.origin).href : null;
+        }
+        return element.getAttribute(attribute);
+    }


🛠️ Refactor suggestion

Add error handling for URL construction

The URL construction could throw errors for invalid URLs. Consider adding try-catch block to handle malformed URLs gracefully.

} else if (attribute === 'src' || attribute === 'href') { const attrValue = element.getAttribute(attribute); - return attrValue ? new URL(attrValue, window.location.origin).href : null; + if (!attrValue) return null; + try { + return new URL(attrValue, window.location.origin).href; + } catch (error) { + console.warn(`Invalid URL: ${attrValue}`); + return null; + } }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// Helper function to extract values from elements

function extractValue(element, attribute) {

if (!element) return null;

if (attribute === 'innerText') {

return element.innerText.trim();

} else if (attribute === 'innerHTML') {

return element.innerHTML.trim();

} else if (attribute === 'src' || attribute === 'href') {

const attrValue = element.getAttribute(attribute);

return attrValue ? new URL(attrValue, window.location.origin).href : null;

}

return element.getAttribute(attribute);

}

// Helper function to extract values from elements

function extractValue(element, attribute) {

if (!element) return null;

if (attribute === 'innerText') {

return element.innerText.trim();

} else if (attribute === 'innerHTML') {

return element.innerHTML.trim();

} else if (attribute === 'src' || attribute === 'href') {

const attrValue = element.getAttribute(attribute);

if (!attrValue) return null;

try {

return new URL(attrValue, window.location.origin).href;

} catch (error) {

console.warn(`Invalid URL: ${attrValue}`);

return null;

}

}

return element.getAttribute(attribute);

}

coderabbitai · 2024-12-25T16:51:05Z

maxun-core/src/browserSide/scraper.js

+                                  let currentElement = td;
+                                  while (currentElement && currentElement.children.length > 0) {
+                                      let foundContentChild = false;
+                                      for (const child of currentElement.children) {
+                                          if (extractValue(child, attribute)) {
+                                              currentElement = child;
+                                              foundContentChild = true;
+                                              break;
+                                          }
+                                      }
+                                      if (!foundContentChild) break;
+                                  }
+                                  element = currentElement;
+                                }


🛠️ Refactor suggestion

Add depth limit to content child traversal

The deep traversal for finding content children lacks a depth limit, which could lead to performance issues with deeply nested tables.

+const MAX_CONTENT_SEARCH_DEPTH = 3; +let depth = 0; let currentElement = td; -while (currentElement && currentElement.children.length > 0) { +while (currentElement && currentElement.children.length > 0 && depth < MAX_CONTENT_SEARCH_DEPTH) { let foundContentChild = false; for (const child of currentElement.children) { if (extractValue(child, attribute)) { currentElement = child; foundContentChild = true; break; } } if (!foundContentChild) break; + depth++; }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

let currentElement = td;

while (currentElement && currentElement.children.length > 0) {

let foundContentChild = false;

for (const child of currentElement.children) {

if (extractValue(child, attribute)) {

currentElement = child;

foundContentChild = true;

break;

}

}

if (!foundContentChild) break;

}

element = currentElement;

}

const MAX_CONTENT_SEARCH_DEPTH = 3;

let depth = 0;

let currentElement = td;

while (currentElement && currentElement.children.length > 0 && depth < MAX_CONTENT_SEARCH_DEPTH) {

let foundContentChild = false;

for (const child of currentElement.children) {

if (extractValue(child, attribute)) {

currentElement = child;

foundContentChild = true;

break;

}

}

if (!foundContentChild) break;

depth++;

}

element = currentElement;

}

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (3)

🧹 Nitpick comments (2)

maxun-core/src/browserSide/scraper.js (2)

265-274: Consider using Set for tag lookup

The tag checking logic can be optimized by using a Set for constant-time lookups.

-    const tableFields = {};
-    const nonTableFields = {};
-    
-    for (const [label, field] of Object.entries(fields)) {
-        if (['TD', 'TH', 'TR'].includes(field.tag)) {
-            tableFields[label] = field;
-        } else {
-            nonTableFields[label] = field;
-        }
-    }
+    const TABLE_TAGS = new Set(['TD', 'TH', 'TR']);
+    const tableFields = {};
+    const nonTableFields = {};
+    
+    for (const [label, field] of Object.entries(fields)) {
+        if (TABLE_TAGS.has(field.tag)) {
+            tableFields[label] = field;
+        } else {
+            nonTableFields[label] = field;
+        }
+    }

332-342: Optimize limit enforcement

The current implementation continues processing even after reaching the limit. Consider breaking out of the inner loop when the limit is reached.

     if (Object.keys(record).length > 0) {
         scrapedData.push(record);
+        if (scrapedData.length >= limit) {
+            break;
+        }
     }
 }

-if (scrapedData.length >= limit) {
-    scrapedData.length = limit;
-    break;
-}

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 50bcd4b and 641b655.

📒 Files selected for processing (1)

maxun-core/src/browserSide/scraper.js (1 hunks)

… table-scrape

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

maxun-core/src/browserSide/scraper.js (1)

307-330: Optimize table header detection performance

The current implementation traverses the DOM multiple times. Consider caching the results of header detection.

Apply this diff to improve performance:

+function memoizeHeaderCheck(row) {
+    const cache = new WeakMap();
+    return (selector) => {
+        if (cache.has(row)) return cache.get(row);
+        const result = hasThElement(row, { test: { selector } });
+        cache.set(row, result);
+        return result;
+    };
+}

 function hasThElement(row, tableFields) {
+    const checkHeader = memoizeHeaderCheck(row);
     for (const [label, { selector }] of Object.entries(tableFields)) {
-        const element = row.querySelector(selector);
-        if (element) {
-            let current = element;
-            while (current && current !== row) {
-                if (current.tagName === 'TH') {
-                    return true;
-                }
-                current = current.parentElement;
-            }
-        }
+        if (checkHeader(selector)) return true;
     }
     return false;
 }

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 641b655 and 33de0f1.

📒 Files selected for processing (1)

maxun-core/src/browserSide/scraper.js (1 hunks)

🧰 Additional context used

🪛 Biome (1.9.4)

maxun-core/src/browserSide/scraper.js

[error] 301-301: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)

🔇 Additional comments (5)

maxun-core/src/browserSide/scraper.js (5)

265-278: Add error handling for URL construction

The URL construction could throw errors for invalid URLs. This issue was previously identified and remains unaddressed.

Apply this diff to handle malformed URLs gracefully:

 } else if (attribute === 'src' || attribute === 'href') {
     const attrValue = element.getAttribute(attribute);
-    return attrValue ? new URL(attrValue, window.location.origin).href : null;
+    if (!attrValue) return null;
+    try {
+        return new URL(attrValue, window.location.origin).href;
+    } catch (error) {
+        console.warn(`Invalid URL: ${attrValue}`);
+        return null;
+    }
 }

280-296: LGTM! Well-structured ancestor search

The function efficiently finds table ancestors with a reasonable depth limit and early returns.

298-305: Improve readability of the while loop

The assignment in the while condition was flagged by static analysis and makes the code less readable.

Apply this diff to restructure the loop:

 function getCellIndex(td) {
     let index = 0;
     let sibling = td;
-    while (sibling = sibling.previousElementSibling) {
+    while (sibling.previousElementSibling) {
+        sibling = sibling.previousElementSibling;
         index++;
     }
     return index;
 }

🧰 Tools

🪛 Biome (1.9.4)

[error] 301-301: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)

408-420: Add depth limit to content child traversal

The deep traversal for finding content children lacks a depth limit, which could lead to performance issues with deeply nested tables.

Apply this diff to add a depth limit:

+const MAX_CONTENT_SEARCH_DEPTH = 3;
+let depth = 0;
 let currentElement = td;
-while (currentElement && currentElement.children.length > 0) {
+while (currentElement && currentElement.children.length > 0 && depth < MAX_CONTENT_SEARCH_DEPTH) {
     let foundContentChild = false;
     for (const child of currentElement.children) {
         if (extractValue(child, attribute)) {
             currentElement = child;
             foundContentChild = true;
             break;
         }
     }
     if (!foundContentChild) break;
+    depth++;
 }

372-375: Add validation for field configuration

The code assumes the first field and its selector are always valid.

Apply this diff to handle edge cases:

 if (Object.keys(tableFields).length > 0) {
     const firstField = Object.values(tableFields)[0];
+    if (!firstField?.selector) {
+        throw new Error('Invalid field configuration: missing selector');
+    }
     const firstElement = container.querySelector(firstField.selector);

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (2)

maxun-core/src/browserSide/scraper.js (2)

298-305: 🛠️ Refactor suggestion

Improve readability of the while loop

The assignment in the while condition was flagged by static analysis and makes the code harder to read.

Apply this diff to improve the implementation:

 function getCellIndex(td) {
     let index = 0;
     let sibling = td;
-    while (sibling = sibling.previousElementSibling) {
+    while (sibling.previousElementSibling) {
+        sibling = sibling.previousElementSibling;
         index++;
     }
     return index;
 }

🧰 Tools

🪛 Biome (1.9.4)

[error] 301-301: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)

432-445: 🛠️ Refactor suggestion

Add depth limit to content traversal

The deep traversal for finding content children could lead to performance issues with deeply nested tables.

Apply this diff to improve the implementation:

+                      const MAX_CONTENT_SEARCH_DEPTH = 3;
+                      let depth = 0;
                       let currentElement = td;
-                      while (currentElement && currentElement.children.length > 0) {
+                      while (currentElement && currentElement.children.length > 0 && depth < MAX_CONTENT_SEARCH_DEPTH) {
                           let foundContentChild = false;
                           for (const child of currentElement.children) {
                               if (extractValue(child, attribute)) {
                                   currentElement = child;
                                   foundContentChild = true;
                                   break;
                               }
                           }
                           if (!foundContentChild) break;
+                          depth++;
                       }

🧹 Nitpick comments (2)

maxun-core/src/browserSide/scraper.js (2)

280-296: Optimize table ancestor lookup with early returns

The function can be simplified and made more efficient with early returns.

Apply this diff to improve the implementation:

 function findTableAncestor(element) {
     let currentElement = element;
     const MAX_DEPTH = 5;
     let depth = 0;
     
     while (currentElement && depth < MAX_DEPTH) {
-        if (currentElement.tagName === 'TD') {
-            return { type: 'TD', element: currentElement };
-        } else if (currentElement.tagName === 'TR') {
-            return { type: 'TR', element: currentElement };
-        }
+        switch (currentElement.tagName) {
+            case 'TD': return { type: 'TD', element: currentElement };
+            case 'TR': return { type: 'TR', element: currentElement };
+        }
         currentElement = currentElement.parentElement;
         depth++;
     }
     return null;
 }

336-359: Improve container matching logic

The container matching logic can be improved for better maintainability and readability.

Apply this diff to improve the implementation:

+    const CLASS_MATCH_THRESHOLD = 0.7;
     if (limit > 1 && containers.length <= 1) {
-      const [containerSelector, _] = listSelector.split('>').map(s => s.trim());
+      const [containerSelector] = listSelector.split('>').map(s => s.trim());
       const container = document.querySelector(containerSelector);
       
       if (container) {
         const allChildren = Array.from(container.children);
         
         const firstMatch = document.querySelector(listSelector);
         if (firstMatch) {
             const firstMatchClasses = Array.from(firstMatch.classList);
             
             containers = allChildren.filter(element => {
                 const elementClasses = Array.from(element.classList);
                 const commonClasses = firstMatchClasses.filter(cls => 
                     elementClasses.includes(cls));
-                return commonClasses.length >= Math.floor(firstMatchClasses.length * 0.7);
+                return commonClasses.length >= Math.floor(firstMatchClasses.length * CLASS_MATCH_THRESHOLD);
             });
         }
       }
     }

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 33de0f1 and 31d9730.

📒 Files selected for processing (1)

maxun-core/src/browserSide/scraper.js (1 hunks)

🧰 Additional context used

🪛 Biome (1.9.4)

maxun-core/src/browserSide/scraper.js

[error] 301-301: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)

🔇 Additional comments (4)

maxun-core/src/browserSide/scraper.js (4)

265-278: Enhance error handling and simplify attribute handling logic

The function could benefit from better error handling and a more maintainable structure.

Apply this diff to improve the implementation:

 function extractValue(element, attribute) {
     if (!element) return null;
     
-    if (attribute === 'innerText') {
-        return element.innerText.trim();
-    } else if (attribute === 'innerHTML') {
-        return element.innerHTML.trim();
-    } else if (attribute === 'src' || attribute === 'href') {
-        const attrValue = element.getAttribute(attribute);
-        return attrValue ? new URL(attrValue, window.location.origin).href : null;
-    }
-    return element.getAttribute(attribute);
+    switch (attribute) {
+        case 'innerText':
+            return element.innerText.trim();
+        case 'innerHTML':
+            return element.innerHTML.trim();
+        case 'src':
+        case 'href':
+            const attrValue = element.getAttribute(attribute);
+            if (!attrValue) return null;
+            try {
+                return new URL(attrValue, window.location.origin).href;
+            } catch (error) {
+                console.warn(`Invalid URL: ${attrValue}`);
+                return null;
+            }
+        default:
+            return element.getAttribute(attribute);
+    }
 }

361-387: LGTM! Field classification logic is well-structured

The field classification logic properly handles both table and non-table fields, with good error handling for missing elements.

464-485: LGTM! Non-table data processing is efficient

The non-table data processing logic is well-implemented, with proper limit handling and efficient element selection.

488-490: LGTM! Results merging is straightforward

The merging of table and non-table data is implemented correctly.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (3)

maxun-core/src/browserSide/scraper.js (3)

307-321: Optimize table header detection

The current implementation traverses the DOM tree multiple times for each field. Consider caching the TH check results.

 function hasThElement(row, tableFields) {
+    // Cache TH elements check
+    const hasThChildren = row.getElementsByTagName('TH').length > 0;
+    if (hasThChildren) return true;
+
     for (const [label, { selector }] of Object.entries(tableFields)) {
         const element = row.querySelector(selector);
         if (element) {
             let current = element;
             while (current && current !== row) {
                 if (current.tagName === 'TH') {
                     return true;
                 }
                 current = current.parentElement;
             }
         }
     }
     return false;
 }

336-359: Enhance container matching robustness

The current implementation has two potential issues:

The class matching threshold (0.7) is hardcoded
Empty class lists aren't handled explicitly

+    const CLASS_MATCH_THRESHOLD = 0.7; // Make configurable
+
     if (limit > 1 && containers.length <= 1) {
       const [containerSelector, _] = listSelector.split('>').map(s => s.trim());
       const container = document.querySelector(containerSelector);
       
       if (container) {
         const allChildren = Array.from(container.children);
         
         const firstMatch = document.querySelector(listSelector);
         if (firstMatch) {
             const firstMatchClasses = Array.from(firstMatch.classList);
+            
+            // Skip class matching if no classes present
+            if (firstMatchClasses.length === 0) {
+                return [firstMatch];
+            }
             
             containers = allChildren.filter(element => {
                 const elementClasses = Array.from(element.classList);
 
                 const commonClasses = firstMatchClasses.filter(cls => 
                     elementClasses.includes(cls));
-                return commonClasses.length >= Math.floor(firstMatchClasses.length * 0.7);
+                return commonClasses.length >= Math.floor(firstMatchClasses.length * CLASS_MATCH_THRESHOLD);
             });
         }
       }
     }

488-490: Add final limit validation

Ensure the combined results don't exceed the specified limit.

     // Merge and limit the results
     const scrapedData = [...tableData, ...nonTableData];
-    return scrapedData;
+    return scrapedData.slice(0, limit);

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 31d9730 and fcc71e0.

📒 Files selected for processing (1)

maxun-core/src/browserSide/scraper.js (1 hunks)

🧰 Additional context used

🪛 Biome (1.9.4)

maxun-core/src/browserSide/scraper.js

[error] 301-301: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)

🔇 Additional comments (3)

maxun-core/src/browserSide/scraper.js (3)

464-485: LGTM! Non-table data processing looks good

The implementation correctly handles limits and field extraction.

298-305: 🛠️ Refactor suggestion

Refactor the while loop for better readability

The assignment within the while condition makes the code harder to read and maintain. This was flagged by static analysis.

 function getCellIndex(td) {
     let index = 0;
     let sibling = td;
-    while (sibling = sibling.previousElementSibling) {
+    while (sibling.previousElementSibling) {
+        sibling = sibling.previousElementSibling;
         index++;
     }
     return index;
 }

🧰 Tools

🪛 Biome (1.9.4)

[error] 301-301: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)

433-443: 🛠️ Refactor suggestion

Add depth limit to content child traversal

The deep traversal for finding content children lacks a depth limit, which could lead to performance issues with deeply nested tables.

+                      const MAX_CONTENT_DEPTH = 3;
+                      let depth = 0;
                       while (currentElement && currentElement.children.length > 0) {
                           let foundContentChild = false;
                           for (const child of currentElement.children) {
                               if (extractValue(child, attribute)) {
                                   currentElement = child;
                                   foundContentChild = true;
                                   break;
                               }
                           }
                           if (!foundContentChild) break;
+                          if (++depth >= MAX_CONTENT_DEPTH) break;
                       }

coderabbitai · 2024-12-25T22:10:38Z

maxun-core/src/browserSide/scraper.js

+              if (cellIndex >= 0) {
+                const td = currentRow.children[cellIndex];
+                if (td) {
+                  element = td.querySelector(selector);
+
+                  if (!element && selector.split(">").pop().includes('td:nth-child')) {
+                      element = td;
+                  }
+


🛠️ Refactor suggestion

Add validation for cellIndex access

The code accesses currentRow.children[cellIndex] without validating the index, which could lead to undefined behavior.

if (cellIndex >= 0) { + const cells = currentRow.children; + if (cellIndex >= cells.length) { + console.warn(`Invalid cell index ${cellIndex} for row with ${cells.length} cells`); + continue; + } const td = currentRow.children[cellIndex];

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if (cellIndex >= 0) {

const td = currentRow.children[cellIndex];

if (td) {

element = td.querySelector(selector);

if (!element && selector.split(">").pop().includes('td:nth-child')) {

element = td;

}

if (cellIndex >= 0) {

const cells = currentRow.children;

if (cellIndex >= cells.length) {

console.warn(`Invalid cell index ${cellIndex} for row with ${cells.length} cells`);

continue;

}

const td = currentRow.children[cellIndex];

if (td) {

element = td.querySelector(selector);

if (!element && selector.split(">").pop().includes('td:nth-child')) {

element = td;

}

coderabbitai · 2024-12-25T22:10:38Z

maxun-core/src/browserSide/scraper.js

+    containers.forEach((container, containerIndex) => {
+        for (const [label, field] of Object.entries(fields)) {
+            const sampleElement = container.querySelector(field.selector);
+
+            if (sampleElement) {
+                const ancestor = findTableAncestor(sampleElement);
+                if (ancestor) {
+                    containerFields[containerIndex].tableFields[label] = {
+                        ...field,
+                        tableContext: ancestor.type,
+                        cellIndex: ancestor.type === 'TD' ? getCellIndex(ancestor.element) : -1
+                    };
+                } else {
+                    containerFields[containerIndex].nonTableFields[label] = field;
                }
+            } else {
+                containerFields[containerIndex].nonTableFields[label] = field;
            }
        }
+    });


🛠️ Refactor suggestion

Add error handling for invalid selectors

The field classification process should handle invalid selectors gracefully to prevent runtime errors.

containers.forEach((container, containerIndex) => { for (const [label, field] of Object.entries(fields)) { + try { const sampleElement = container.querySelector(field.selector); if (sampleElement) { const ancestor = findTableAncestor(sampleElement); if (ancestor) { containerFields[containerIndex].tableFields[label] = { ...field, tableContext: ancestor.type, cellIndex: ancestor.type === 'TD' ? getCellIndex(ancestor.element) : -1 }; } else { containerFields[containerIndex].nonTableFields[label] = field; } } else { containerFields[containerIndex].nonTableFields[label] = field; } + } catch (error) { + console.warn(`Invalid selector for field "${label}": ${field.selector}`); + containerFields[containerIndex].nonTableFields[label] = field; + } } });

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

containers.forEach((container, containerIndex) => {

for (const [label, field] of Object.entries(fields)) {

const sampleElement = container.querySelector(field.selector);

if (sampleElement) {

const ancestor = findTableAncestor(sampleElement);

if (ancestor) {

containerFields[containerIndex].tableFields[label] = {

...field,

tableContext: ancestor.type,

cellIndex: ancestor.type === 'TD' ? getCellIndex(ancestor.element) : -1

};

} else {

containerFields[containerIndex].nonTableFields[label] = field;

}

} else {

containerFields[containerIndex].nonTableFields[label] = field;

}

}

});

containers.forEach((container, containerIndex) => {

for (const [label, field] of Object.entries(fields)) {

try {

const sampleElement = container.querySelector(field.selector);

if (sampleElement) {

const ancestor = findTableAncestor(sampleElement);

if (ancestor) {

containerFields[containerIndex].tableFields[label] = {

...field,

tableContext: ancestor.type,

cellIndex: ancestor.type === 'TD' ? getCellIndex(ancestor.element) : -1

};

} else {

containerFields[containerIndex].nonTableFields[label] = field;

}

} else {

containerFields[containerIndex].nonTableFields[label] = field;

}

} catch (error) {

console.warn(`Invalid selector for field "${label}": ${field.selector}`);

containerFields[containerIndex].nonTableFields[label] = field;

}

}

});

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

server/src/workflow-management/selector.ts (2)

94-124: Improve readability of parent element traversal logic.

The parent element traversal logic is complex and could benefit from:

Better variable names (e.g., fullyContained -> isParentFullyContainingChild).
Documentation explaining the geometric calculations.
Extraction of the containment check into a separate function.

+const isParentContainingChild = (parent: HTMLElement, child: HTMLElement): boolean => {
+  const parentRect = parent.getBoundingClientRect();
+  const childRect = child.getBoundingClientRect();
+
+  const isFullyContained =
+    parentRect.left <= childRect.left &&
+    parentRect.right >= childRect.right &&
+    parentRect.top <= childRect.top &&
+    parentRect.bottom >= childRect.bottom;
+
+  const containmentRatio = (childRect.width * childRect.height) /
+    (parentRect.width * parentRect.height);
+  const hasSignificantOverlap = containmentRatio > 0.5;
+
+  return isFullyContained && hasSignificantOverlap;
+};

 // Replace the inline logic with:
-const parentRect = element.parentElement.getBoundingClientRect();
-const childRect = element.getBoundingClientRect();
-
-const fullyContained =
-  parentRect.left <= childRect.left &&
-  parentRect.right >= childRect.right &&
-  parentRect.top <= childRect.top &&
-  parentRect.bottom >= childRect.bottom;
-
-const significantOverlap =
-  (childRect.width * childRect.height) /
-  (parentRect.width * parentRect.height) > 0.5;
-
-if (fullyContained && significantOverlap) {
+if (isParentContainingChild(element.parentElement, element)) {

912-918: Improve table cell selector generation.

While the nth-child selector generation is correct, consider:

Validating that the parent is a valid table row (<tr>).
Adding error handling for malformed table structures.

 if (selector === 'td' && element.parentElement) {
+  if (element.parentElement.tagName !== 'TR') {
+    console.warn('Table cell found outside of a table row');
+    return selector;
+  }
   const siblings = Array.from(element.parentElement.children);
   const position = siblings.indexOf(element) + 1;
   return `${selector}:nth-child(${position})`;
 }

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fcc71e0 and 97efd15.

📒 Files selected for processing (1)

server/src/workflow-management/selector.ts (6 hunks)

🔇 Additional comments (1)

server/src/workflow-management/selector.ts (1)

87-92: 🛠️ Refactor suggestion

Extract table cell handling into a shared utility function.

This table cell handling logic is duplicated across multiple functions. Consider extracting it into a shared utility function to improve maintainability and reduce code duplication.

+const findTableAncestor = (element: HTMLElement): HTMLElement => {
+  if ((element.tagName === 'TD' || element.tagName === 'TH') && element.closest('table')) {
+    return element.closest('table')!;
+  }
+  return element;
+};

 // Replace duplicated code with:
-if (element.tagName === 'TD' || element.tagName === 'TH') {
-  const tableParent = element.closest('table');
-  if (tableParent) {
-    element = tableParent;
-  }
-}
+element = findTableAncestor(element);

coderabbitai · 2024-12-26T13:48:26Z

server/src/workflow-management/selector.ts

+            if (element.tagName === 'TD' || element.tagName === 'TH') {
+              const tableParent = element.closest('table');
+              if (tableParent) {
+                element = tableParent;
+              }
+            }

-              if (fullyContained && significantOverlap) {
-                element = element.parentElement;
-              } else {
-                break;
+            if (element.tagName !== 'TABLE') {
+              while (element.parentElement) {
+                if (element.tagName.toLowerCase() === 'body' || 
+                    element.tagName.toLowerCase() === 'html') {
+                  break;
+                }
+
+                const parentRect = element.parentElement.getBoundingClientRect();
+                const childRect = element.getBoundingClientRect();
+
+                const fullyContained =
+                  parentRect.left <= childRect.left &&
+                  parentRect.right >= childRect.right &&
+                  parentRect.top <= childRect.top &&
+                  parentRect.bottom >= childRect.bottom;
+
+                const significantOverlap =
+                  (childRect.width * childRect.height) /
+                  (parentRect.width * parentRect.height) > 0.5;
+
+                if (fullyContained && significantOverlap) {
+                  const nextParent = element.parentElement;
+                  if (nextParent.tagName.toLowerCase() !== 'body' && 
+                      nextParent.tagName.toLowerCase() !== 'html') {
+                    element = nextParent;
+                  } else {
+                    break;
+                  }
+                } else {
+                  break;
+                }


🛠️ Refactor suggestion

Eliminate code duplication in getRect function.

The table cell handling and parent traversal logic is duplicated from the getElementInformation function. This violates the DRY principle and makes maintenance harder.

Consider:

Using the proposed findTableAncestor utility function.

Extracting the parent traversal logic into a shared function.

Creating a shared module for these utility functions.

coderabbitai · 2024-12-26T13:48:27Z

server/src/workflow-management/selector.ts

+          if (selector === 'td' && element.parentElement) {
+            const siblings = Array.from(element.parentElement.children);
+            const position = siblings.indexOf(element) + 1;
+            return `${selector}:nth-child(${position})`;
+          }
+


🛠️ Refactor suggestion

Consolidate duplicate table cell selector logic.

The table cell selector generation logic is duplicated in three different places. This makes the code harder to maintain and increases the risk of inconsistencies.

Extract this logic into a shared utility function:

+const getTableCellSelector = (element: HTMLElement): string | null => { + if (element.tagName.toLowerCase() !== 'td' || !element.parentElement) { + return null; + } + if (element.parentElement.tagName !== 'TR') { + console.warn('Table cell found outside of a table row'); + return null; + } + const siblings = Array.from(element.parentElement.children); + const position = siblings.indexOf(element) + 1; + return `td:nth-child(${position})`; +}; // Replace all occurrences with: -if (selector === 'td' && element.parentElement) { - const siblings = Array.from(element.parentElement.children); - const position = siblings.indexOf(element) + 1; - return `${selector}:nth-child(${position})`; -} +const cellSelector = getTableCellSelector(element); +if (cellSelector) { + return cellSelector; +}

Also applies to: 1067-1072

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

maxun-core/src/browserSide/scraper.js (2)

280-296: Optimize table ancestor search with early returns

The function can be simplified using early returns for better readability:

 function findTableAncestor(element) {
     let currentElement = element;
     const MAX_DEPTH = 5;
     let depth = 0;
     
     while (currentElement && depth < MAX_DEPTH) {
-        if (currentElement.tagName === 'TD') {
-            return { type: 'TD', element: currentElement };
-        } else if (currentElement.tagName === 'TR') {
-            return { type: 'TR', element: currentElement };
-        }
+        switch (currentElement.tagName) {
+            case 'TD': return { type: 'TD', element: currentElement };
+            case 'TR': return { type: 'TR', element: currentElement };
+        }
         currentElement = currentElement.parentElement;
         depth++;
     }
     return null;
 }

452-466: Add depth limit to content child traversal

The deep traversal for finding content children lacks a depth limit:

+                      const MAX_CONTENT_SEARCH_DEPTH = 3;
+                      let depth = 0;
                       let currentElement = td;
-                      while (currentElement && currentElement.children.length > 0) {
+                      while (currentElement && currentElement.children.length > 0 && depth < MAX_CONTENT_SEARCH_DEPTH) {
                           let foundContentChild = false;
                           for (const child of currentElement.children) {
                               if (extractValue(child, attribute)) {
                                   currentElement = child;
                                   foundContentChild = true;
                                   break;
                               }
                           }
                           if (!foundContentChild) break;
+                          depth++;
                       }

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 97efd15 and fe38f5a.

📒 Files selected for processing (1)

maxun-core/src/browserSide/scraper.js (1 hunks)

🧰 Additional context used

🪛 Biome (1.9.4)

maxun-core/src/browserSide/scraper.js

[error] 301-301: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)

🔇 Additional comments (6)

maxun-core/src/browserSide/scraper.js (6)

485-507: LGTM! Clean implementation of non-table data processing

The non-table data processing implementation is clean, efficient, and properly handles limits.

509-511: LGTM! Effective data merging

The data merging implementation is concise and effective.

265-278: 🛠️ Refactor suggestion

Add error handling for URL construction

The URL construction could throw errors for invalid URLs. Consider adding try-catch block:

 } else if (attribute === 'src' || attribute === 'href') {
     const attrValue = element.getAttribute(attribute);
-    return attrValue ? new URL(attrValue, window.location.origin).href : null;
+    if (!attrValue) return null;
+    try {
+        return new URL(attrValue, window.location.origin).href;
+    } catch (error) {
+        console.warn(`Invalid URL: ${attrValue}`);
+        return null;
+    }
 }

298-305: 🛠️ Refactor suggestion

Refactor the while loop condition

The assignment in the while condition was flagged by static analysis. Consider restructuring for better readability:

 function getCellIndex(td) {
     let index = 0;
     let sibling = td;
-    while (sibling = sibling.previousElementSibling) {
+    while (sibling.previousElementSibling) {
+        sibling = sibling.previousElementSibling;
         index++;
     }
     return index;
 }

🧰 Tools

🪛 Biome (1.9.4)

[error] 301-301: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)

438-446: 🛠️ Refactor suggestion

Add validation for cellIndex access

The code accesses currentRow.children[cellIndex] without validating the index:

               if (cellIndex >= 0) {
+                const cells = currentRow.children;
+                if (cellIndex >= cells.length) {
+                    console.warn(`Invalid cell index ${cellIndex} for row with ${cells.length} cells`);
+                    continue;
+                }
                 const td = currentRow.children[cellIndex];

389-407: 🛠️ Refactor suggestion

Add error handling for invalid selectors

The field classification process should handle invalid selectors gracefully:

     containers.forEach((container, containerIndex) => {
         for (const [label, field] of Object.entries(fields)) {
+            try {
                 const sampleElement = container.querySelector(field.selector);
                 
                 if (sampleElement) {
                     const ancestor = findTableAncestor(sampleElement);
                     if (ancestor) {
                         containerFields[containerIndex].tableFields[label] = {
                             ...field,
                             tableContext: ancestor.type,
                             cellIndex: ancestor.type === 'TD' ? getCellIndex(ancestor.element) : -1
                         };
                     } else {
                         containerFields[containerIndex].nonTableFields[label] = field;
                     }
                 } else {
                     containerFields[containerIndex].nonTableFields[label] = field;
                 }
+            } catch (error) {
+                console.warn(`Invalid selector for field "${label}": ${field.selector}`);
+                containerFields[containerIndex].nonTableFields[label] = field;
+            }
         }
     });

RohitR311 added 2 commits December 24, 2024 02:30

feat: add nth-child selectors for td tag

feb30b9

feat: add scraping logic for tabular data in scrapeList

5ac88c6

amhsirak changed the title ~~feat: add table scraping functionality~~ feat: better parsing of tables Dec 23, 2024

amhsirak added Type: Enhancement Improvements to existing features Status: Work In Progess This issue/PR is actively being worked on labels Dec 23, 2024

coderabbitai bot reviewed Dec 23, 2024

View reviewed changes

feat: add condition to ignore html and body tags on highlighting

b411faf

coderabbitai bot reviewed Dec 24, 2024

View reviewed changes

feat: add loop to handle nested scraping

3176fa2

coderabbitai bot reviewed Dec 24, 2024

View reviewed changes

feat: add logic to seperate table fields

50bcd4b

coderabbitai bot reviewed Dec 25, 2024

View reviewed changes

fix: revert

641b655

coderabbitai bot reviewed Dec 25, 2024

View reviewed changes

RohitR311 added 3 commits December 25, 2024 23:09

feat: revert non-table scraping logic

991b554

Merge branch 'table-scrape' of https://github.com/getmaxun/maxun into…

25a0d1a

… table-scrape

feat: separate table and non-table data logic

33de0f1

coderabbitai bot reviewed Dec 25, 2024

View reviewed changes

feat: add logic to match 70% classes

31d9730

coderabbitai bot reviewed Dec 25, 2024

View reviewed changes

feat: add limit to table scraping

fcc71e0

coderabbitai bot reviewed Dec 25, 2024

View reviewed changes

feat: add check for table highlighting

97efd15

coderabbitai bot reviewed Dec 26, 2024

View reviewed changes

feat: improve 70% class matching logic for scraping

fe38f5a

coderabbitai bot reviewed Dec 26, 2024

View reviewed changes

amhsirak approved these changes Dec 26, 2024

View reviewed changes

amhsirak merged commit 8e2357d into develop Dec 26, 2024
1 check passed

amhsirak removed the Status: Work In Progess This issue/PR is actively being worked on label Dec 26, 2024

coderabbitai bot mentioned this pull request Jan 2, 2025

feat: shadow dom selection #296

Merged

amhsirak deleted the table-scrape branch January 2, 2025 17:28

This was referenced Jan 26, 2025

feat: optimize robots and runs table ui #382

Merged

feat: better selection #396

Merged

coderabbitai bot mentioned this pull request Mar 4, 2025

feat: improve highlighting and data extraction #462

Merged

	const elements = Array.from(parent.querySelectorAll(selector));
	const elements = Array.from(container.querySelectorAll(selector));

feat: better parsing of tables #291

feat: better parsing of tables #291

Uh oh!

Conversation

RohitR311 commented Dec 23, 2024 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Dec 23, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Poem

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 23, 2024

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 24, 2024

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 24, 2024

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 24, 2024

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 25, 2024

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 25, 2024

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 25, 2024

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 25, 2024

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 25, 2024

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 25, 2024

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 26, 2024

Choose a reason for hiding this comment

Uh oh!

RohitR311 commented Dec 23, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 23, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)