Skip to content

feat: better parsing of tables #291

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Dec 26, 2024
Merged

feat: better parsing of tables #291

merged 13 commits into from
Dec 26, 2024

Conversation

RohitR311
Copy link
Collaborator

@RohitR311 RohitR311 commented Dec 23, 2024

Enhancing Scrape List to include table scraping.

Summary by CodeRabbit

  • New Features

    • Enhanced scraping functionality with improved handling of complex HTML structures, particularly tables.
    • Improved selector generation for table cells, ensuring uniqueness based on their position.
  • Bug Fixes

    • Enhanced error handling for selector generation functions with clearer logging.
  • Documentation

    • Updated function signatures for clarity on usage.

Copy link

coderabbitai bot commented Dec 23, 2024

Walkthrough

The pull request introduces significant modifications to the web scraping functionality across two files: maxun-core/src/browserSide/scraper.js and server/src/workflow-management/selector.ts. Key updates include enhancements to the scrapeList function for improved handling of table data, the addition of helper functions for value extraction and ancestor identification, and modifications to selector generation for <td> and <th> elements, ensuring unique selectors based on their position within tables.

Changes

File Change Summary
maxun-core/src/browserSide/scraper.js - Enhanced scrapeList function with new helper functions extractValue and findTableAncestor
- Categorized scraped fields into tableFields and nonTableFields
- Improved data extraction logic for tables and non-table elements
- Updated function signature for scrapeList
server/src/workflow-management/selector.ts - Modified getNonUniqueSelectors and getChildSelectors to specifically handle <td> and <th> elements
- Implemented unique selectors using :nth-child() for table cells
- Enhanced error handling with console.error logging
- Function signatures updated without changes in their signatures

Sequence Diagram

sequenceDiagram
    participant Scraper as Scraper
    participant Browser as Browser
    participant Selector as Selector

    Scraper->>Browser: Initiate scraping
    Browser->>Selector: Generate precise selectors
    Selector-->>Browser: Return table cell selectors
    Browser->>Browser: Categorize fields (table/non-table)
    Browser->>Browser: Process each row/element
    Browser-->>Scraper: Return scraped data
Loading

Poem

🐰 Hop, hop, through the web's wild maze,
Selectors sharp as rabbit's gaze,
Tables dance, cells align just right,
Scraping magic takes its flight!
Code leaps forward with playful might 🌟


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@amhsirak amhsirak changed the title feat: add table scraping functionality feat: better parsing of tables Dec 23, 2024
@amhsirak amhsirak added Type: Enhancement Improvements to existing features Status: Work In Progess This issue/PR is actively being worked on labels Dec 23, 2024
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
maxun-core/src/browserSide/scraper.js (2)

280-288: Double-check row retrieval for partial or irregular tables
When retrieving rows from the first table field, if the source HTML table has merged columns or irregular row spans, limiting row count by the first field’s result might skip relevant data from sibling columns. Consider validating each table field’s row length to handle mismatched row counts.


336-341: Potential data truncation
If multiple table rows are processed before checking the limit, you may lose consistency in records. Consider verifying partial row sets if the limit is reached mid-scrape.

server/src/workflow-management/selector.ts (1)

947-952: Maintain consistency for “td” indexing logic
These lines replicate the logic from lines 872-878. Ensure they remain consistent in both code blocks. In addition, consider systematically extracting this logic into a helper function if used across multiple methods.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2f83029 and 5ac88c6.

📒 Files selected for processing (2)
  • maxun-core/src/browserSide/scraper.js (1 hunks)
  • server/src/workflow-management/selector.ts (3 hunks)
🔇 Additional comments (4)
maxun-core/src/browserSide/scraper.js (2)

291-307: Validate attribute presence
When scraping table fields, if the designated attribute does not exist, “value” may be undefined, resulting in null entries. Consider robust error handling or default values for missing attributes.


265-275: Avoid overshadowing for 'TD', 'TH', 'TR' detection
Consider whether the tag check should also account for 'tbody' or 'table' context. Nested tables, row/column spans, or varying table structures may cause logic mismatches.

Run this script to search for additional usage of table-specific tags to ensure no overshadowing:

✅ Verification successful

Let me analyze the search results and generate additional verification scripts to check for table-related context.

The first search gave us broad context about table tags usage. Now let's look specifically at the scraper implementation and any related table handling code to make a more informed decision about whether we need to account for tbody/table tags.


Let me check the specific implementation around the tableFields usage to understand the context better.


Let me fix the regex and try again to get the context around tableFields usage.


Based on the code context, I can now provide a final response:

Table tag detection is appropriately scoped
The current implementation correctly focuses on TD, TH, and TR tags which are the essential table cell and row elements. Including TBODY or TABLE tags is unnecessary because:

  1. The code's purpose is to identify individual field containers within a table structure
  2. TD/TH/TR tags are sufficient for capturing table-structured data as they represent the actual data containers
  3. The parent table context is already handled through the listSelector parameter which identifies the containing element
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# This searches for potential references to table structures beyond TD, TH, and TR
rg -A 3 -i '(table|tbody|td|th|tr)'

Length of output: 65585


Script:

#!/bin/bash
# Search for table-related code in the scraper file
ast-grep --pattern 'const tableFields = {$$$}'

# Search for any table-related functions or methods
ast-grep --pattern 'function $_($$table$$) {
  $$$
}'

# Search for table-related imports
rg -l "import.*table" --type js --type ts

Length of output: 273


Script:

#!/bin/bash
# Get the context around tableFields usage
rg -B 10 -A 10 "const tableFields = {}" maxun-core/src/browserSide/scraper.js

# Check if there are any table-related methods or functions in the same file
ast-grep --pattern 'function $name($params) {
  $$$
  table
  $$$
}'

Length of output: 276


Script:

#!/bin/bash
# Get the context around tableFields usage with fixed regex
rg -B 10 -A 10 "const tableFields = \{\}" maxun-core/src/browserSide/scraper.js

# Also check for any table-related field processing
rg -B 5 -A 10 "tableFields\[" maxun-core/src/browserSide/scraper.js

Length of output: 1696

server/src/workflow-management/selector.ts (2)

1007-1012: Ensure robust scanning of table descendants
When generating selectors for child “td” elements, watch out for multiple nested tables. The current approach might overindex if a parent’s siblings contain more child nodes.


872-878: Edge cases for table cell indexing
The approach using “:nth-child(position)” effectively targets the column but may not handle rowspans or colspans that shift child indices in nested tables. Consider verifying your approach with advanced table structures.

Comment on lines 312 to 330
for (const [label, { selector, attribute }] of Object.entries(nonTableFields)) {
const element = parent.querySelector(selector);

if (element) {
let value;
if (attribute === 'innerText') {
record[label] = fieldElement.innerText.trim();
value = element.innerText.trim();
} else if (attribute === 'innerHTML') {
record[label] = fieldElement.innerHTML.trim();
} else if (attribute === 'src') {
// Handle relative 'src' URLs
const src = fieldElement.getAttribute('src');
record[label] = src ? new URL(src, window.location.origin).href : null;
} else if (attribute === 'href') {
// Handle relative 'href' URLs
const href = fieldElement.getAttribute('href');
record[label] = href ? new URL(href, window.location.origin).href : null;
value = element.innerHTML.trim();
} else if (attribute === 'src' || attribute === 'href') {
const attrValue = element.getAttribute(attribute);
value = attrValue ? new URL(attrValue, window.location.origin).href : null;
} else {
record[label] = fieldElement.getAttribute(attribute);
value = element.getAttribute(attribute);
}
record[label] = value;
}
}
scrapedData.push(record);
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Enhance non-table field queries
Note that using parent.querySelector may only fetch the first matching element. If multiple elements match the same selector, subsequent ones will be ignored. Consider using querySelectorAll or clarifying that only the first match is necessary.

-for (const [label, { selector, attribute }] of Object.entries(nonTableFields)) {
-    const element = parent.querySelector(selector);
+for (const [label, { selector, attribute }] of Object.entries(nonTableFields)) {
+    const elements = Array.from(parent.querySelectorAll(selector));
+    if (elements.length === 0) {
+        continue;
+    }
+    const element = elements[0]; // or handle all elements

Committable suggestion skipped: line range outside the PR's diff.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
server/src/workflow-management/selector.ts (2)

896-902: Add validation for proper table structure.

While the position-based selector for table cells is good, consider adding validation for proper table structure to ensure reliable selectors.

 if (selector === 'td' && element.parentElement) {
+  // Validate proper table structure
+  if (!element.closest('table')) {
+    return selector;
+  }
   // Find position among td siblings
   const siblings = Array.from(element.parentElement.children);
   const position = siblings.indexOf(element) + 1;
   return `${selector}:nth-child(${position})`;
 }

Line range hint 938-964: Extract repeated DOM boundary check into a utility function.

The body/html tag check is duplicated across multiple places. Consider extracting it into a reusable utility function.

+const isBodyOrHtmlTag = (element: HTMLElement): boolean => {
+  const tagName = element.tagName.toLowerCase();
+  return tagName === 'body' || tagName === 'html';
+};

-if (element.tagName.toLowerCase() === 'body' || 
-    element.tagName.toLowerCase() === 'html') {
+if (isBodyOrHtmlTag(element)) {
   break;
 }

 // ... later in the code ...

-if (nextParent.tagName.toLowerCase() !== 'body' && 
-    nextParent.tagName.toLowerCase() !== 'html') {
+if (!isBodyOrHtmlTag(nextParent)) {
   element = nextParent;
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5ac88c6 and b411faf.

📒 Files selected for processing (1)
  • server/src/workflow-management/selector.ts (9 hunks)
🔇 Additional comments (1)
server/src/workflow-management/selector.ts (1)

1043-1048: Reuse the shared table cell handling utility.

This is another occurrence of the duplicated table cell handling logic. Use the previously suggested getTableCellSelector utility function here as well.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (4)
maxun-core/src/browserSide/scraper.js (4)

265-274: Enhance field categorization robustness

The field categorization logic could be improved in the following ways:

  1. Make the tag check case-insensitive
  2. Define table tags as a constant
+const TABLE_TAGS = ['TD', 'TH', 'TR'];
+
 const tableFields = {};
 const nonTableFields = {};
 
 for (const [label, field] of Object.entries(fields)) {
-    if (['TD', 'TH', 'TR'].includes(field.tag)) {
+    if (field.tag && TABLE_TAGS.includes(field.tag.toUpperCase())) {
         tableFields[label] = field;
     } else {
         nonTableFields[label] = field;
     }
 }

284-285: Optimize processing with early exit

Add an early exit when no items are found to avoid unnecessary loop setup.

+if (baseElements.length === 0) {
+    return scrapedData;
+}
+
 // Process each item up to the limit
 for (let i = 0; i < Math.min(baseElements.length, limit); i++) {

332-334: Clarify empty record handling

Make the empty record check more explicit by checking for meaningful values.

-if (Object.keys(record).length > 0) {
+// Only add records that have at least one non-null value
+if (Object.values(record).some(value => value !== null && value !== '')) {
     scrapedData.push(record);
 }

337-340: Optimize limit handling

The current implementation modifies the array length after collection. Consider checking the limit during collection:

-if (scrapedData.length >= limit) {
-    scrapedData.length = limit;
-    break;
-}
+// Break early if we've reached the limit
+if (scrapedData.length >= limit) break;
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b411faf and 3176fa2.

📒 Files selected for processing (1)
  • maxun-core/src/browserSide/scraper.js (1 hunks)
🔇 Additional comments (1)
maxun-core/src/browserSide/scraper.js (1)

265-342: Verify scraping behavior with different table structures

Please ensure testing covers various table structures and edge cases:

  1. Tables with merged cells
  2. Nested tables
  3. Tables with missing cells
  4. Tables with mixed content types (text, links, images)

Comment on lines 276 to 282
const parentElements = Array.from(document.querySelectorAll(listSelector));
const scrapedData = [];

while (scrapedData.length < limit) {
let parentElements = Array.from(document.querySelectorAll(listSelector));
for (const parent of parentElements) {
// Get the first field's elements to determine how many items we have
const firstField = Object.values(fields)[0];
const baseElements = Array.from(parent.querySelectorAll(firstField.selector));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Add validation for field configuration

The code assumes the first field and its selector are always valid. Add validation to handle edge cases:

 const parentElements = Array.from(document.querySelectorAll(listSelector));
 const scrapedData = [];

+if (Object.keys(fields).length === 0) {
+    return scrapedData;
+}
+
 const firstField = Object.values(fields)[0];
+if (!firstField?.selector) {
+    throw new Error('Invalid field configuration: missing selector');
+}
+
 const baseElements = Array.from(parent.querySelectorAll(firstField.selector));

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines 310 to 330
// Process non-table fields
for (const [label, { selector, attribute }] of Object.entries(nonTableFields)) {
const elements = Array.from(parent.querySelectorAll(selector));
// Use the same index to maintain correspondence between fields
const element = elements[i];

if (element) {
let value;
if (attribute === 'innerText') {
record[label] = fieldElement.innerText.trim();
value = element.innerText.trim();
} else if (attribute === 'innerHTML') {
record[label] = fieldElement.innerHTML.trim();
} else if (attribute === 'src') {
// Handle relative 'src' URLs
const src = fieldElement.getAttribute('src');
record[label] = src ? new URL(src, window.location.origin).href : null;
} else if (attribute === 'href') {
// Handle relative 'href' URLs
const href = fieldElement.getAttribute('href');
record[label] = href ? new URL(href, window.location.origin).href : null;
value = element.innerHTML.trim();
} else if (attribute === 'src' || attribute === 'href') {
const attrValue = element.getAttribute(attribute);
value = attrValue ? new URL(attrValue, window.location.origin).href : null;
} else {
record[label] = fieldElement.getAttribute(attribute);
value = element.getAttribute(attribute);
}
record[label] = value;
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Apply extracted field processing logic

Apply the same helper function here to maintain consistency and reduce code duplication.

 // Process non-table fields
 for (const [label, { selector, attribute }] of Object.entries(nonTableFields)) {
     const elements = Array.from(parent.querySelectorAll(selector));
     const element = elements[i];
-    if (element) {
-        let value;
-        if (attribute === 'innerText') {
-            value = element.innerText.trim();
-        } else if (attribute === 'innerHTML') {
-            value = element.innerHTML.trim();
-        } else if (attribute === 'src' || attribute === 'href') {
-            const attrValue = element.getAttribute(attribute);
-            value = attrValue ? new URL(attrValue, window.location.origin).href : null;
-        } else {
-            value = element.getAttribute(attribute);
-        }
-        record[label] = value;
-    }
+    const value = processFieldValue(element, attribute);
+    if (value !== null) {
+        record[label] = value;
+    }
 }

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines 288 to 308
// Process table fields
for (const [label, { selector, attribute }] of Object.entries(tableFields)) {
const elements = Array.from(parent.querySelectorAll(selector));
// Use the same index to maintain correspondence between fields
const element = elements[i];

const firstMatch = document.querySelector(listSelector);
if (firstMatch) {
// Get classes from the first matching element
const firstMatchClasses = Array.from(firstMatch.classList);

// Find similar elements by matching most of their classes
parentElements = allChildren.filter(element => {
const elementClasses = Array.from(element.classList);

// Element should share at least 70% of classes with the first match
const commonClasses = firstMatchClasses.filter(cls =>
elementClasses.includes(cls));
return commonClasses.length >= Math.floor(firstMatchClasses.length * 0.7);
});
if (element) {
let value;
if (attribute === 'innerText') {
value = element.innerText.trim();
} else if (attribute === 'innerHTML') {
value = element.innerHTML.trim();
} else if (attribute === 'src' || attribute === 'href') {
const attrValue = element.getAttribute(attribute);
value = attrValue ? new URL(attrValue, window.location.origin).href : null;
} else {
value = element.getAttribute(attribute);
}
record[label] = value;
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Extract common field processing logic

The attribute processing logic is duplicated between table and non-table fields. Extract it into a helper function:

+function processFieldValue(element, attribute) {
+    if (!element) return null;
+    
+    switch (attribute) {
+        case 'innerText':
+            return element.innerText.trim();
+        case 'innerHTML':
+            return element.innerHTML.trim();
+        case 'src':
+        case 'href':
+            const attrValue = element.getAttribute(attribute);
+            try {
+                return attrValue ? new URL(attrValue, window.location.origin).href : null;
+            } catch (error) {
+                console.warn(`Invalid URL: ${attrValue}`);
+                return null;
+            }
+        default:
+            return element.getAttribute(attribute);
+    }
+}
+
 // Process table fields
 for (const [label, { selector, attribute }] of Object.entries(tableFields)) {
     const elements = Array.from(parent.querySelectorAll(selector));
     // Use the same index to maintain correspondence between fields
     const element = elements[i];
-    if (element) {
-        let value;
-        if (attribute === 'innerText') {
-            value = element.innerText.trim();
-        } else if (attribute === 'innerHTML') {
-            value = element.innerHTML.trim();
-        } else if (attribute === 'src' || attribute === 'href') {
-            const attrValue = element.getAttribute(attribute);
-            value = attrValue ? new URL(attrValue, window.location.origin).href : null;
-        } else {
-            value = element.getAttribute(attribute);
-        }
-        record[label] = value;
-    }
+    const value = processFieldValue(element, attribute);
+    if (value !== null) {
+        record[label] = value;
+    }
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Process table fields
for (const [label, { selector, attribute }] of Object.entries(tableFields)) {
const elements = Array.from(parent.querySelectorAll(selector));
// Use the same index to maintain correspondence between fields
const element = elements[i];
const firstMatch = document.querySelector(listSelector);
if (firstMatch) {
// Get classes from the first matching element
const firstMatchClasses = Array.from(firstMatch.classList);
// Find similar elements by matching most of their classes
parentElements = allChildren.filter(element => {
const elementClasses = Array.from(element.classList);
// Element should share at least 70% of classes with the first match
const commonClasses = firstMatchClasses.filter(cls =>
elementClasses.includes(cls));
return commonClasses.length >= Math.floor(firstMatchClasses.length * 0.7);
});
if (element) {
let value;
if (attribute === 'innerText') {
value = element.innerText.trim();
} else if (attribute === 'innerHTML') {
value = element.innerHTML.trim();
} else if (attribute === 'src' || attribute === 'href') {
const attrValue = element.getAttribute(attribute);
value = attrValue ? new URL(attrValue, window.location.origin).href : null;
} else {
value = element.getAttribute(attribute);
}
record[label] = value;
}
}
function processFieldValue(element, attribute) {
if (!element) return null;
switch (attribute) {
case 'innerText':
return element.innerText.trim();
case 'innerHTML':
return element.innerHTML.trim();
case 'src':
case 'href':
const attrValue = element.getAttribute(attribute);
try {
return attrValue ? new URL(attrValue, window.location.origin).href : null;
} catch (error) {
console.warn(`Invalid URL: ${attrValue}`);
return null;
}
default:
return element.getAttribute(attribute);
}
}
// Process table fields
for (const [label, { selector, attribute }] of Object.entries(tableFields)) {
const elements = Array.from(parent.querySelectorAll(selector));
// Use the same index to maintain correspondence between fields
const element = elements[i];
const value = processFieldValue(element, attribute);
if (value !== null) {
record[label] = value;
}
}

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (4)
maxun-core/src/browserSide/scraper.js (4)

281-296: Extract MAX_DEPTH as a named constant

Consider extracting the magic number 5 as a named constant at the module level for better maintainability.

+const TABLE_ANCESTOR_MAX_DEPTH = 5;
+
 function findTableAncestor(element) {
     let currentElement = element;
-    const MAX_DEPTH = 5;
     let depth = 0;
     
-    while (currentElement && depth < MAX_DEPTH) {
+    while (currentElement && depth < TABLE_ANCESTOR_MAX_DEPTH) {

307-321: Optimize TH element detection

The current implementation traverses up the DOM tree for each field. Consider caching the result or using getElementsByTagName('TH') for better performance.

 function hasThElement(row, tableFields) {
+    // Quick check first
+    if (row.getElementsByTagName('TH').length > 0) return true;
+
     for (const [label, { selector }] of Object.entries(tableFields)) {
         const element = row.querySelector(selector);
         if (element) {

385-429: Consider extracting element finding logic

The nested logic for finding elements is complex and could benefit from being extracted into a separate function for better maintainability.

+function findElementInTableCell(td, selector, attribute) {
+    // Direct selector match
+    let element = td.querySelector(selector);
+    
+    // Try selector without nth-child
+    if (!element && selector.split(">").pop().includes('td:nth-child')) {
+        element = td;
+    }
+
+    // Try tag-only selector
+    if (!element) {
+        const tagOnlySelector = selector.split('.')[0];
+        element = td.querySelector(tagOnlySelector);
+    }
+    
+    // Deep content search
+    if (!element) {
+        element = findContentChild(td, attribute);
+    }
+    
+    return element;
+}

 if (cellIndex >= 0) {
     const td = currentRow.children[cellIndex];
     if (td) {
-        element = td.querySelector(selector);
-        // ... existing fallback logic
+        element = findElementInTableCell(td, selector, attribute);
     }
 }

332-461: Consider implementing a strategy pattern for field extraction

The current implementation mixes table and non-table field handling logic in a single function. Consider implementing a strategy pattern to separate these concerns:

  1. Create separate strategies for table and non-table fields
  2. Move the field classification logic into a factory
  3. Implement a common interface for field extraction

This would improve maintainability and make it easier to add new field types in the future.

Would you like me to provide a detailed example of how to implement this pattern?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3176fa2 and 50bcd4b.

📒 Files selected for processing (1)
  • maxun-core/src/browserSide/scraper.js (1 hunks)
🧰 Additional context used
🪛 Biome (1.9.4)
maxun-core/src/browserSide/scraper.js

[error] 301-301: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)

const record = {};

for (const [label, { selector, attribute }] of Object.entries(nonTableFields)) {
const elements = Array.from(parent.querySelectorAll(selector));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix undefined variable reference

The code uses an undefined parent variable instead of the available container variable.

-const elements = Array.from(parent.querySelectorAll(selector));
+const elements = Array.from(container.querySelectorAll(selector));
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const elements = Array.from(parent.querySelectorAll(selector));
const elements = Array.from(container.querySelectorAll(selector));

Comment on lines +298 to +305
function getCellIndex(td) {
let index = 0;
let sibling = td;
while (sibling = sibling.previousElementSibling) {
index++;
}
return index;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Refactor the while loop condition

The assignment in the while condition was flagged by static analysis. Consider restructuring for better readability.

 function getCellIndex(td) {
     let index = 0;
     let sibling = td;
-    while (sibling = sibling.previousElementSibling) {
+    while (sibling.previousElementSibling) {
+        sibling = sibling.previousElementSibling;
         index++;
     }
     return index;
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
function getCellIndex(td) {
let index = 0;
let sibling = td;
while (sibling = sibling.previousElementSibling) {
index++;
}
return index;
}
function getCellIndex(td) {
let index = 0;
let sibling = td;
while (sibling.previousElementSibling) {
sibling = sibling.previousElementSibling;
index++;
}
return index;
}
🧰 Tools
🪛 Biome (1.9.4)

[error] 301-301: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)

Comment on lines +265 to +278
// Helper function to extract values from elements
function extractValue(element, attribute) {
if (!element) return null;

if (attribute === 'innerText') {
return element.innerText.trim();
} else if (attribute === 'innerHTML') {
return element.innerHTML.trim();
} else if (attribute === 'src' || attribute === 'href') {
const attrValue = element.getAttribute(attribute);
return attrValue ? new URL(attrValue, window.location.origin).href : null;
}
return element.getAttribute(attribute);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add error handling for URL construction

The URL construction could throw errors for invalid URLs. Consider adding try-catch block to handle malformed URLs gracefully.

 } else if (attribute === 'src' || attribute === 'href') {
     const attrValue = element.getAttribute(attribute);
-    return attrValue ? new URL(attrValue, window.location.origin).href : null;
+    if (!attrValue) return null;
+    try {
+        return new URL(attrValue, window.location.origin).href;
+    } catch (error) {
+        console.warn(`Invalid URL: ${attrValue}`);
+        return null;
+    }
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Helper function to extract values from elements
function extractValue(element, attribute) {
if (!element) return null;
if (attribute === 'innerText') {
return element.innerText.trim();
} else if (attribute === 'innerHTML') {
return element.innerHTML.trim();
} else if (attribute === 'src' || attribute === 'href') {
const attrValue = element.getAttribute(attribute);
return attrValue ? new URL(attrValue, window.location.origin).href : null;
}
return element.getAttribute(attribute);
}
// Helper function to extract values from elements
function extractValue(element, attribute) {
if (!element) return null;
if (attribute === 'innerText') {
return element.innerText.trim();
} else if (attribute === 'innerHTML') {
return element.innerHTML.trim();
} else if (attribute === 'src' || attribute === 'href') {
const attrValue = element.getAttribute(attribute);
if (!attrValue) return null;
try {
return new URL(attrValue, window.location.origin).href;
} catch (error) {
console.warn(`Invalid URL: ${attrValue}`);
return null;
}
}
return element.getAttribute(attribute);
}

Comment on lines 407 to 420
let currentElement = td;
while (currentElement && currentElement.children.length > 0) {
let foundContentChild = false;
for (const child of currentElement.children) {
if (extractValue(child, attribute)) {
currentElement = child;
foundContentChild = true;
break;
}
}
if (!foundContentChild) break;
}
element = currentElement;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add depth limit to content child traversal

The deep traversal for finding content children lacks a depth limit, which could lead to performance issues with deeply nested tables.

+const MAX_CONTENT_SEARCH_DEPTH = 3;
+let depth = 0;
 let currentElement = td;
-while (currentElement && currentElement.children.length > 0) {
+while (currentElement && currentElement.children.length > 0 && depth < MAX_CONTENT_SEARCH_DEPTH) {
     let foundContentChild = false;
     for (const child of currentElement.children) {
         if (extractValue(child, attribute)) {
             currentElement = child;
             foundContentChild = true;
             break;
         }
     }
     if (!foundContentChild) break;
+    depth++;
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let currentElement = td;
while (currentElement && currentElement.children.length > 0) {
let foundContentChild = false;
for (const child of currentElement.children) {
if (extractValue(child, attribute)) {
currentElement = child;
foundContentChild = true;
break;
}
}
if (!foundContentChild) break;
}
element = currentElement;
}
const MAX_CONTENT_SEARCH_DEPTH = 3;
let depth = 0;
let currentElement = td;
while (currentElement && currentElement.children.length > 0 && depth < MAX_CONTENT_SEARCH_DEPTH) {
let foundContentChild = false;
for (const child of currentElement.children) {
if (extractValue(child, attribute)) {
currentElement = child;
foundContentChild = true;
break;
}
}
if (!foundContentChild) break;
depth++;
}
element = currentElement;
}

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (3)
maxun-core/src/browserSide/scraper.js (3)

276-282: ⚠️ Potential issue

Add validation for fields object and selector

The code assumes the fields object is not empty and the first field has a valid selector. This could lead to runtime errors.

This was previously identified in an earlier review. The same validation is still needed:

+    if (Object.keys(fields).length === 0) {
+        return [];
+    }
+
     const firstField = Object.values(fields)[0];
+    if (!firstField?.selector) {
+        throw new Error('Invalid field configuration: missing selector');
+    }

288-308: 🛠️ Refactor suggestion

Extract common field processing logic and add URL error handling

The attribute processing logic is duplicated and URL construction lacks error handling.

These issues were previously identified. The same solutions are still needed:

  1. Extract the common processing logic into a helper function
  2. Add error handling for URL construction

See the previous review comments for the detailed implementation suggestions.


310-330: ⚠️ Potential issue

Fix undefined variable reference

The code uses an undefined parent variable in the querySelector call.

This issue was previously identified. The same fix is still needed:

-const elements = Array.from(parent.querySelectorAll(selector));
+const elements = Array.from(container.querySelectorAll(selector));
🧹 Nitpick comments (2)
maxun-core/src/browserSide/scraper.js (2)

265-274: Consider using Set for tag lookup

The tag checking logic can be optimized by using a Set for constant-time lookups.

-    const tableFields = {};
-    const nonTableFields = {};
-    
-    for (const [label, field] of Object.entries(fields)) {
-        if (['TD', 'TH', 'TR'].includes(field.tag)) {
-            tableFields[label] = field;
-        } else {
-            nonTableFields[label] = field;
-        }
-    }
+    const TABLE_TAGS = new Set(['TD', 'TH', 'TR']);
+    const tableFields = {};
+    const nonTableFields = {};
+    
+    for (const [label, field] of Object.entries(fields)) {
+        if (TABLE_TAGS.has(field.tag)) {
+            tableFields[label] = field;
+        } else {
+            nonTableFields[label] = field;
+        }
+    }

332-342: Optimize limit enforcement

The current implementation continues processing even after reaching the limit. Consider breaking out of the inner loop when the limit is reached.

     if (Object.keys(record).length > 0) {
         scrapedData.push(record);
+        if (scrapedData.length >= limit) {
+            break;
+        }
     }
 }

-if (scrapedData.length >= limit) {
-    scrapedData.length = limit;
-    break;
-}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 50bcd4b and 641b655.

📒 Files selected for processing (1)
  • maxun-core/src/browserSide/scraper.js (1 hunks)

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
maxun-core/src/browserSide/scraper.js (1)

307-330: Optimize table header detection performance

The current implementation traverses the DOM multiple times. Consider caching the results of header detection.

Apply this diff to improve performance:

+function memoizeHeaderCheck(row) {
+    const cache = new WeakMap();
+    return (selector) => {
+        if (cache.has(row)) return cache.get(row);
+        const result = hasThElement(row, { test: { selector } });
+        cache.set(row, result);
+        return result;
+    };
+}

 function hasThElement(row, tableFields) {
+    const checkHeader = memoizeHeaderCheck(row);
     for (const [label, { selector }] of Object.entries(tableFields)) {
-        const element = row.querySelector(selector);
-        if (element) {
-            let current = element;
-            while (current && current !== row) {
-                if (current.tagName === 'TH') {
-                    return true;
-                }
-                current = current.parentElement;
-            }
-        }
+        if (checkHeader(selector)) return true;
     }
     return false;
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 641b655 and 33de0f1.

📒 Files selected for processing (1)
  • maxun-core/src/browserSide/scraper.js (1 hunks)
🧰 Additional context used
🪛 Biome (1.9.4)
maxun-core/src/browserSide/scraper.js

[error] 301-301: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)

🔇 Additional comments (5)
maxun-core/src/browserSide/scraper.js (5)

265-278: Add error handling for URL construction

The URL construction could throw errors for invalid URLs. This issue was previously identified and remains unaddressed.

Apply this diff to handle malformed URLs gracefully:

 } else if (attribute === 'src' || attribute === 'href') {
     const attrValue = element.getAttribute(attribute);
-    return attrValue ? new URL(attrValue, window.location.origin).href : null;
+    if (!attrValue) return null;
+    try {
+        return new URL(attrValue, window.location.origin).href;
+    } catch (error) {
+        console.warn(`Invalid URL: ${attrValue}`);
+        return null;
+    }
 }

280-296: LGTM! Well-structured ancestor search

The function efficiently finds table ancestors with a reasonable depth limit and early returns.


298-305: Improve readability of the while loop

The assignment in the while condition was flagged by static analysis and makes the code less readable.

Apply this diff to restructure the loop:

 function getCellIndex(td) {
     let index = 0;
     let sibling = td;
-    while (sibling = sibling.previousElementSibling) {
+    while (sibling.previousElementSibling) {
+        sibling = sibling.previousElementSibling;
         index++;
     }
     return index;
 }
🧰 Tools
🪛 Biome (1.9.4)

[error] 301-301: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)


408-420: Add depth limit to content child traversal

The deep traversal for finding content children lacks a depth limit, which could lead to performance issues with deeply nested tables.

Apply this diff to add a depth limit:

+const MAX_CONTENT_SEARCH_DEPTH = 3;
+let depth = 0;
 let currentElement = td;
-while (currentElement && currentElement.children.length > 0) {
+while (currentElement && currentElement.children.length > 0 && depth < MAX_CONTENT_SEARCH_DEPTH) {
     let foundContentChild = false;
     for (const child of currentElement.children) {
         if (extractValue(child, attribute)) {
             currentElement = child;
             foundContentChild = true;
             break;
         }
     }
     if (!foundContentChild) break;
+    depth++;
 }

372-375: Add validation for field configuration

The code assumes the first field and its selector are always valid.

Apply this diff to handle edge cases:

 if (Object.keys(tableFields).length > 0) {
     const firstField = Object.values(tableFields)[0];
+    if (!firstField?.selector) {
+        throw new Error('Invalid field configuration: missing selector');
+    }
     const firstElement = container.querySelector(firstField.selector);

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (2)
maxun-core/src/browserSide/scraper.js (2)

298-305: 🛠️ Refactor suggestion

Improve readability of the while loop

The assignment in the while condition was flagged by static analysis and makes the code harder to read.

Apply this diff to improve the implementation:

 function getCellIndex(td) {
     let index = 0;
     let sibling = td;
-    while (sibling = sibling.previousElementSibling) {
+    while (sibling.previousElementSibling) {
+        sibling = sibling.previousElementSibling;
         index++;
     }
     return index;
 }
🧰 Tools
🪛 Biome (1.9.4)

[error] 301-301: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)


432-445: 🛠️ Refactor suggestion

Add depth limit to content traversal

The deep traversal for finding content children could lead to performance issues with deeply nested tables.

Apply this diff to improve the implementation:

+                      const MAX_CONTENT_SEARCH_DEPTH = 3;
+                      let depth = 0;
                       let currentElement = td;
-                      while (currentElement && currentElement.children.length > 0) {
+                      while (currentElement && currentElement.children.length > 0 && depth < MAX_CONTENT_SEARCH_DEPTH) {
                           let foundContentChild = false;
                           for (const child of currentElement.children) {
                               if (extractValue(child, attribute)) {
                                   currentElement = child;
                                   foundContentChild = true;
                                   break;
                               }
                           }
                           if (!foundContentChild) break;
+                          depth++;
                       }
🧹 Nitpick comments (2)
maxun-core/src/browserSide/scraper.js (2)

280-296: Optimize table ancestor lookup with early returns

The function can be simplified and made more efficient with early returns.

Apply this diff to improve the implementation:

 function findTableAncestor(element) {
     let currentElement = element;
     const MAX_DEPTH = 5;
     let depth = 0;
     
     while (currentElement && depth < MAX_DEPTH) {
-        if (currentElement.tagName === 'TD') {
-            return { type: 'TD', element: currentElement };
-        } else if (currentElement.tagName === 'TR') {
-            return { type: 'TR', element: currentElement };
-        }
+        switch (currentElement.tagName) {
+            case 'TD': return { type: 'TD', element: currentElement };
+            case 'TR': return { type: 'TR', element: currentElement };
+        }
         currentElement = currentElement.parentElement;
         depth++;
     }
     return null;
 }

336-359: Improve container matching logic

The container matching logic can be improved for better maintainability and readability.

Apply this diff to improve the implementation:

+    const CLASS_MATCH_THRESHOLD = 0.7;
     if (limit > 1 && containers.length <= 1) {
-      const [containerSelector, _] = listSelector.split('>').map(s => s.trim());
+      const [containerSelector] = listSelector.split('>').map(s => s.trim());
       const container = document.querySelector(containerSelector);
       
       if (container) {
         const allChildren = Array.from(container.children);
         
         const firstMatch = document.querySelector(listSelector);
         if (firstMatch) {
             const firstMatchClasses = Array.from(firstMatch.classList);
             
             containers = allChildren.filter(element => {
                 const elementClasses = Array.from(element.classList);
                 const commonClasses = firstMatchClasses.filter(cls => 
                     elementClasses.includes(cls));
-                return commonClasses.length >= Math.floor(firstMatchClasses.length * 0.7);
+                return commonClasses.length >= Math.floor(firstMatchClasses.length * CLASS_MATCH_THRESHOLD);
             });
         }
       }
     }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 33de0f1 and 31d9730.

📒 Files selected for processing (1)
  • maxun-core/src/browserSide/scraper.js (1 hunks)
🧰 Additional context used
🪛 Biome (1.9.4)
maxun-core/src/browserSide/scraper.js

[error] 301-301: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)

🔇 Additional comments (4)
maxun-core/src/browserSide/scraper.js (4)

265-278: Enhance error handling and simplify attribute handling logic

The function could benefit from better error handling and a more maintainable structure.

Apply this diff to improve the implementation:

 function extractValue(element, attribute) {
     if (!element) return null;
     
-    if (attribute === 'innerText') {
-        return element.innerText.trim();
-    } else if (attribute === 'innerHTML') {
-        return element.innerHTML.trim();
-    } else if (attribute === 'src' || attribute === 'href') {
-        const attrValue = element.getAttribute(attribute);
-        return attrValue ? new URL(attrValue, window.location.origin).href : null;
-    }
-    return element.getAttribute(attribute);
+    switch (attribute) {
+        case 'innerText':
+            return element.innerText.trim();
+        case 'innerHTML':
+            return element.innerHTML.trim();
+        case 'src':
+        case 'href':
+            const attrValue = element.getAttribute(attribute);
+            if (!attrValue) return null;
+            try {
+                return new URL(attrValue, window.location.origin).href;
+            } catch (error) {
+                console.warn(`Invalid URL: ${attrValue}`);
+                return null;
+            }
+        default:
+            return element.getAttribute(attribute);
+    }
 }

361-387: LGTM! Field classification logic is well-structured

The field classification logic properly handles both table and non-table fields, with good error handling for missing elements.


464-485: LGTM! Non-table data processing is efficient

The non-table data processing logic is well-implemented, with proper limit handling and efficient element selection.


488-490: LGTM! Results merging is straightforward

The merging of table and non-table data is implemented correctly.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
maxun-core/src/browserSide/scraper.js (3)

307-321: Optimize table header detection

The current implementation traverses the DOM tree multiple times for each field. Consider caching the TH check results.

 function hasThElement(row, tableFields) {
+    // Cache TH elements check
+    const hasThChildren = row.getElementsByTagName('TH').length > 0;
+    if (hasThChildren) return true;
+
     for (const [label, { selector }] of Object.entries(tableFields)) {
         const element = row.querySelector(selector);
         if (element) {
             let current = element;
             while (current && current !== row) {
                 if (current.tagName === 'TH') {
                     return true;
                 }
                 current = current.parentElement;
             }
         }
     }
     return false;
 }

336-359: Enhance container matching robustness

The current implementation has two potential issues:

  1. The class matching threshold (0.7) is hardcoded
  2. Empty class lists aren't handled explicitly
+    const CLASS_MATCH_THRESHOLD = 0.7; // Make configurable
+
     if (limit > 1 && containers.length <= 1) {
       const [containerSelector, _] = listSelector.split('>').map(s => s.trim());
       const container = document.querySelector(containerSelector);
       
       if (container) {
         const allChildren = Array.from(container.children);
         
         const firstMatch = document.querySelector(listSelector);
         if (firstMatch) {
             const firstMatchClasses = Array.from(firstMatch.classList);
+            
+            // Skip class matching if no classes present
+            if (firstMatchClasses.length === 0) {
+                return [firstMatch];
+            }
             
             containers = allChildren.filter(element => {
                 const elementClasses = Array.from(element.classList);
 
                 const commonClasses = firstMatchClasses.filter(cls => 
                     elementClasses.includes(cls));
-                return commonClasses.length >= Math.floor(firstMatchClasses.length * 0.7);
+                return commonClasses.length >= Math.floor(firstMatchClasses.length * CLASS_MATCH_THRESHOLD);
             });
         }
       }
     }

488-490: Add final limit validation

Ensure the combined results don't exceed the specified limit.

     // Merge and limit the results
     const scrapedData = [...tableData, ...nonTableData];
-    return scrapedData;
+    return scrapedData.slice(0, limit);
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 31d9730 and fcc71e0.

📒 Files selected for processing (1)
  • maxun-core/src/browserSide/scraper.js (1 hunks)
🧰 Additional context used
🪛 Biome (1.9.4)
maxun-core/src/browserSide/scraper.js

[error] 301-301: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)

🔇 Additional comments (3)
maxun-core/src/browserSide/scraper.js (3)

464-485: LGTM! Non-table data processing looks good

The implementation correctly handles limits and field extraction.


298-305: 🛠️ Refactor suggestion

Refactor the while loop for better readability

The assignment within the while condition makes the code harder to read and maintain. This was flagged by static analysis.

 function getCellIndex(td) {
     let index = 0;
     let sibling = td;
-    while (sibling = sibling.previousElementSibling) {
+    while (sibling.previousElementSibling) {
+        sibling = sibling.previousElementSibling;
         index++;
     }
     return index;
 }

Likely invalid or redundant comment.

🧰 Tools
🪛 Biome (1.9.4)

[error] 301-301: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)


433-443: 🛠️ Refactor suggestion

Add depth limit to content child traversal

The deep traversal for finding content children lacks a depth limit, which could lead to performance issues with deeply nested tables.

+                      const MAX_CONTENT_DEPTH = 3;
+                      let depth = 0;
                       while (currentElement && currentElement.children.length > 0) {
                           let foundContentChild = false;
                           for (const child of currentElement.children) {
                               if (extractValue(child, attribute)) {
                                   currentElement = child;
                                   foundContentChild = true;
                                   break;
                               }
                           }
                           if (!foundContentChild) break;
+                          if (++depth >= MAX_CONTENT_DEPTH) break;
                       }

Likely invalid or redundant comment.

Comment on lines +417 to +425
if (cellIndex >= 0) {
const td = currentRow.children[cellIndex];
if (td) {
element = td.querySelector(selector);

if (!element && selector.split(">").pop().includes('td:nth-child')) {
element = td;
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add validation for cellIndex access

The code accesses currentRow.children[cellIndex] without validating the index, which could lead to undefined behavior.

               if (cellIndex >= 0) {
+                const cells = currentRow.children;
+                if (cellIndex >= cells.length) {
+                    console.warn(`Invalid cell index ${cellIndex} for row with ${cells.length} cells`);
+                    continue;
+                }
                 const td = currentRow.children[cellIndex];
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (cellIndex >= 0) {
const td = currentRow.children[cellIndex];
if (td) {
element = td.querySelector(selector);
if (!element && selector.split(">").pop().includes('td:nth-child')) {
element = td;
}
if (cellIndex >= 0) {
const cells = currentRow.children;
if (cellIndex >= cells.length) {
console.warn(`Invalid cell index ${cellIndex} for row with ${cells.length} cells`);
continue;
}
const td = currentRow.children[cellIndex];
if (td) {
element = td.querySelector(selector);
if (!element && selector.split(">").pop().includes('td:nth-child')) {
element = td;
}

Comment on lines 368 to 387
containers.forEach((container, containerIndex) => {
for (const [label, field] of Object.entries(fields)) {
const sampleElement = container.querySelector(field.selector);

if (sampleElement) {
const ancestor = findTableAncestor(sampleElement);
if (ancestor) {
containerFields[containerIndex].tableFields[label] = {
...field,
tableContext: ancestor.type,
cellIndex: ancestor.type === 'TD' ? getCellIndex(ancestor.element) : -1
};
} else {
containerFields[containerIndex].nonTableFields[label] = field;
}
} else {
containerFields[containerIndex].nonTableFields[label] = field;
}
}
});
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add error handling for invalid selectors

The field classification process should handle invalid selectors gracefully to prevent runtime errors.

     containers.forEach((container, containerIndex) => {
         for (const [label, field] of Object.entries(fields)) {
+            try {
                 const sampleElement = container.querySelector(field.selector);
                 
                 if (sampleElement) {
                     const ancestor = findTableAncestor(sampleElement);
                     if (ancestor) {
                         containerFields[containerIndex].tableFields[label] = {
                             ...field,
                             tableContext: ancestor.type,
                             cellIndex: ancestor.type === 'TD' ? getCellIndex(ancestor.element) : -1
                         };
                     } else {
                         containerFields[containerIndex].nonTableFields[label] = field;
                     }
                 } else {
                     containerFields[containerIndex].nonTableFields[label] = field;
                 }
+            } catch (error) {
+                console.warn(`Invalid selector for field "${label}": ${field.selector}`);
+                containerFields[containerIndex].nonTableFields[label] = field;
+            }
         }
     });
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
containers.forEach((container, containerIndex) => {
for (const [label, field] of Object.entries(fields)) {
const sampleElement = container.querySelector(field.selector);
if (sampleElement) {
const ancestor = findTableAncestor(sampleElement);
if (ancestor) {
containerFields[containerIndex].tableFields[label] = {
...field,
tableContext: ancestor.type,
cellIndex: ancestor.type === 'TD' ? getCellIndex(ancestor.element) : -1
};
} else {
containerFields[containerIndex].nonTableFields[label] = field;
}
} else {
containerFields[containerIndex].nonTableFields[label] = field;
}
}
});
containers.forEach((container, containerIndex) => {
for (const [label, field] of Object.entries(fields)) {
try {
const sampleElement = container.querySelector(field.selector);
if (sampleElement) {
const ancestor = findTableAncestor(sampleElement);
if (ancestor) {
containerFields[containerIndex].tableFields[label] = {
...field,
tableContext: ancestor.type,
cellIndex: ancestor.type === 'TD' ? getCellIndex(ancestor.element) : -1
};
} else {
containerFields[containerIndex].nonTableFields[label] = field;
}
} else {
containerFields[containerIndex].nonTableFields[label] = field;
}
} catch (error) {
console.warn(`Invalid selector for field "${label}": ${field.selector}`);
containerFields[containerIndex].nonTableFields[label] = field;
}
}
});

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
server/src/workflow-management/selector.ts (2)

94-124: Improve readability of parent element traversal logic.

The parent element traversal logic is complex and could benefit from:

  1. Better variable names (e.g., fullyContained -> isParentFullyContainingChild).
  2. Documentation explaining the geometric calculations.
  3. Extraction of the containment check into a separate function.
+const isParentContainingChild = (parent: HTMLElement, child: HTMLElement): boolean => {
+  const parentRect = parent.getBoundingClientRect();
+  const childRect = child.getBoundingClientRect();
+
+  const isFullyContained =
+    parentRect.left <= childRect.left &&
+    parentRect.right >= childRect.right &&
+    parentRect.top <= childRect.top &&
+    parentRect.bottom >= childRect.bottom;
+
+  const containmentRatio = (childRect.width * childRect.height) /
+    (parentRect.width * parentRect.height);
+  const hasSignificantOverlap = containmentRatio > 0.5;
+
+  return isFullyContained && hasSignificantOverlap;
+};

 // Replace the inline logic with:
-const parentRect = element.parentElement.getBoundingClientRect();
-const childRect = element.getBoundingClientRect();
-
-const fullyContained =
-  parentRect.left <= childRect.left &&
-  parentRect.right >= childRect.right &&
-  parentRect.top <= childRect.top &&
-  parentRect.bottom >= childRect.bottom;
-
-const significantOverlap =
-  (childRect.width * childRect.height) /
-  (parentRect.width * parentRect.height) > 0.5;
-
-if (fullyContained && significantOverlap) {
+if (isParentContainingChild(element.parentElement, element)) {

912-918: Improve table cell selector generation.

While the nth-child selector generation is correct, consider:

  1. Validating that the parent is a valid table row (<tr>).
  2. Adding error handling for malformed table structures.
 if (selector === 'td' && element.parentElement) {
+  if (element.parentElement.tagName !== 'TR') {
+    console.warn('Table cell found outside of a table row');
+    return selector;
+  }
   const siblings = Array.from(element.parentElement.children);
   const position = siblings.indexOf(element) + 1;
   return `${selector}:nth-child(${position})`;
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fcc71e0 and 97efd15.

📒 Files selected for processing (1)
  • server/src/workflow-management/selector.ts (6 hunks)
🔇 Additional comments (1)
server/src/workflow-management/selector.ts (1)

87-92: 🛠️ Refactor suggestion

Extract table cell handling into a shared utility function.

This table cell handling logic is duplicated across multiple functions. Consider extracting it into a shared utility function to improve maintainability and reduce code duplication.

+const findTableAncestor = (element: HTMLElement): HTMLElement => {
+  if ((element.tagName === 'TD' || element.tagName === 'TH') && element.closest('table')) {
+    return element.closest('table')!;
+  }
+  return element;
+};

 // Replace duplicated code with:
-if (element.tagName === 'TD' || element.tagName === 'TH') {
-  const tableParent = element.closest('table');
-  if (tableParent) {
-    element = tableParent;
-  }
-}
+element = findTableAncestor(element);

Likely invalid or redundant comment.

Comment on lines +223 to +260
if (element.tagName === 'TD' || element.tagName === 'TH') {
const tableParent = element.closest('table');
if (tableParent) {
element = tableParent;
}
}

if (fullyContained && significantOverlap) {
element = element.parentElement;
} else {
break;
if (element.tagName !== 'TABLE') {
while (element.parentElement) {
if (element.tagName.toLowerCase() === 'body' ||
element.tagName.toLowerCase() === 'html') {
break;
}

const parentRect = element.parentElement.getBoundingClientRect();
const childRect = element.getBoundingClientRect();

const fullyContained =
parentRect.left <= childRect.left &&
parentRect.right >= childRect.right &&
parentRect.top <= childRect.top &&
parentRect.bottom >= childRect.bottom;

const significantOverlap =
(childRect.width * childRect.height) /
(parentRect.width * parentRect.height) > 0.5;

if (fullyContained && significantOverlap) {
const nextParent = element.parentElement;
if (nextParent.tagName.toLowerCase() !== 'body' &&
nextParent.tagName.toLowerCase() !== 'html') {
element = nextParent;
} else {
break;
}
} else {
break;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Eliminate code duplication in getRect function.

The table cell handling and parent traversal logic is duplicated from the getElementInformation function. This violates the DRY principle and makes maintenance harder.

Consider:

  1. Using the proposed findTableAncestor utility function.
  2. Extracting the parent traversal logic into a shared function.
  3. Creating a shared module for these utility functions.

Comment on lines +1007 to +1012
if (selector === 'td' && element.parentElement) {
const siblings = Array.from(element.parentElement.children);
const position = siblings.indexOf(element) + 1;
return `${selector}:nth-child(${position})`;
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Consolidate duplicate table cell selector logic.

The table cell selector generation logic is duplicated in three different places. This makes the code harder to maintain and increases the risk of inconsistencies.

Extract this logic into a shared utility function:

+const getTableCellSelector = (element: HTMLElement): string | null => {
+  if (element.tagName.toLowerCase() !== 'td' || !element.parentElement) {
+    return null;
+  }
+  if (element.parentElement.tagName !== 'TR') {
+    console.warn('Table cell found outside of a table row');
+    return null;
+  }
+  const siblings = Array.from(element.parentElement.children);
+  const position = siblings.indexOf(element) + 1;
+  return `td:nth-child(${position})`;
+};

 // Replace all occurrences with:
-if (selector === 'td' && element.parentElement) {
-  const siblings = Array.from(element.parentElement.children);
-  const position = siblings.indexOf(element) + 1;
-  return `${selector}:nth-child(${position})`;
-}
+const cellSelector = getTableCellSelector(element);
+if (cellSelector) {
+  return cellSelector;
+}

Also applies to: 1067-1072

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
maxun-core/src/browserSide/scraper.js (2)

280-296: Optimize table ancestor search with early returns

The function can be simplified using early returns for better readability:

 function findTableAncestor(element) {
     let currentElement = element;
     const MAX_DEPTH = 5;
     let depth = 0;
     
     while (currentElement && depth < MAX_DEPTH) {
-        if (currentElement.tagName === 'TD') {
-            return { type: 'TD', element: currentElement };
-        } else if (currentElement.tagName === 'TR') {
-            return { type: 'TR', element: currentElement };
-        }
+        switch (currentElement.tagName) {
+            case 'TD': return { type: 'TD', element: currentElement };
+            case 'TR': return { type: 'TR', element: currentElement };
+        }
         currentElement = currentElement.parentElement;
         depth++;
     }
     return null;
 }

452-466: Add depth limit to content child traversal

The deep traversal for finding content children lacks a depth limit:

+                      const MAX_CONTENT_SEARCH_DEPTH = 3;
+                      let depth = 0;
                       let currentElement = td;
-                      while (currentElement && currentElement.children.length > 0) {
+                      while (currentElement && currentElement.children.length > 0 && depth < MAX_CONTENT_SEARCH_DEPTH) {
                           let foundContentChild = false;
                           for (const child of currentElement.children) {
                               if (extractValue(child, attribute)) {
                                   currentElement = child;
                                   foundContentChild = true;
                                   break;
                               }
                           }
                           if (!foundContentChild) break;
+                          depth++;
                       }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 97efd15 and fe38f5a.

📒 Files selected for processing (1)
  • maxun-core/src/browserSide/scraper.js (1 hunks)
🧰 Additional context used
🪛 Biome (1.9.4)
maxun-core/src/browserSide/scraper.js

[error] 301-301: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)

🔇 Additional comments (6)
maxun-core/src/browserSide/scraper.js (6)

485-507: LGTM! Clean implementation of non-table data processing

The non-table data processing implementation is clean, efficient, and properly handles limits.


509-511: LGTM! Effective data merging

The data merging implementation is concise and effective.


265-278: 🛠️ Refactor suggestion

Add error handling for URL construction

The URL construction could throw errors for invalid URLs. Consider adding try-catch block:

 } else if (attribute === 'src' || attribute === 'href') {
     const attrValue = element.getAttribute(attribute);
-    return attrValue ? new URL(attrValue, window.location.origin).href : null;
+    if (!attrValue) return null;
+    try {
+        return new URL(attrValue, window.location.origin).href;
+    } catch (error) {
+        console.warn(`Invalid URL: ${attrValue}`);
+        return null;
+    }
 }

Likely invalid or redundant comment.


298-305: 🛠️ Refactor suggestion

Refactor the while loop condition

The assignment in the while condition was flagged by static analysis. Consider restructuring for better readability:

 function getCellIndex(td) {
     let index = 0;
     let sibling = td;
-    while (sibling = sibling.previousElementSibling) {
+    while (sibling.previousElementSibling) {
+        sibling = sibling.previousElementSibling;
         index++;
     }
     return index;
 }

Likely invalid or redundant comment.

🧰 Tools
🪛 Biome (1.9.4)

[error] 301-301: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)


438-446: 🛠️ Refactor suggestion

Add validation for cellIndex access

The code accesses currentRow.children[cellIndex] without validating the index:

               if (cellIndex >= 0) {
+                const cells = currentRow.children;
+                if (cellIndex >= cells.length) {
+                    console.warn(`Invalid cell index ${cellIndex} for row with ${cells.length} cells`);
+                    continue;
+                }
                 const td = currentRow.children[cellIndex];

Likely invalid or redundant comment.


389-407: 🛠️ Refactor suggestion

Add error handling for invalid selectors

The field classification process should handle invalid selectors gracefully:

     containers.forEach((container, containerIndex) => {
         for (const [label, field] of Object.entries(fields)) {
+            try {
                 const sampleElement = container.querySelector(field.selector);
                 
                 if (sampleElement) {
                     const ancestor = findTableAncestor(sampleElement);
                     if (ancestor) {
                         containerFields[containerIndex].tableFields[label] = {
                             ...field,
                             tableContext: ancestor.type,
                             cellIndex: ancestor.type === 'TD' ? getCellIndex(ancestor.element) : -1
                         };
                     } else {
                         containerFields[containerIndex].nonTableFields[label] = field;
                     }
                 } else {
                     containerFields[containerIndex].nonTableFields[label] = field;
                 }
+            } catch (error) {
+                console.warn(`Invalid selector for field "${label}": ${field.selector}`);
+                containerFields[containerIndex].nonTableFields[label] = field;
+            }
         }
     });

Likely invalid or redundant comment.

@amhsirak amhsirak merged commit 8e2357d into develop Dec 26, 2024
1 check passed
@amhsirak amhsirak removed the Status: Work In Progess This issue/PR is actively being worked on label Dec 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Enhancement Improvements to existing features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants