Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document.pageContent should not allow undefined #5884

Closed
5 tasks done
glorat opened this issue Jun 25, 2024 · 4 comments · Fixed by #5885, abdulrahman305/LocalAI#3, prathik2401/sentilog-deploy#4 or abdulrahman305/langchain-chatbot-demo#6 · May be fixed by gmickel/memorybot#12
Labels
auto:bug Related to a bug, vulnerability, unexpected error with an existing feature

Comments

@glorat
Copy link
Contributor

glorat commented Jun 25, 2024

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain.js documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain.js rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

import { describe, it, expect } from 'vitest'
import { Document } from '@langchain/core/documents'

describe('langchain Type Compatibility Test', () => {
  it('Document should handle empty pageContent', () => {
    const doc = new Document({ pageContent: '' })
    expect(doc.pageContent).toStrictEqual('')
  })
})

Error Message and Stack Trace (if applicable)

AssertionError: expected undefined to strictly equal ''
Expected :
Actual :undefined

Description

pageContent should be set to '', as mandated by its type contract string (and not string|undefined)

The bug is visible in the code
https://github.com/langchain-ai/langchainjs/blob/main/langchain-core/src/documents/document.ts

export class Document<
  // eslint-disable-next-line @typescript-eslint/no-explicit-any
  Metadata extends Record<string, any> = Record<string, any>
> implements DocumentInput, DocumentInterface
{
  pageContent: string;

  metadata: Metadata;

  constructor(fields: DocumentInput<Metadata>) {
    this.pageContent = fields.pageContent
      ? fields.pageContent.toString()
      : this.pageContent;
    this.metadata = fields.metadata ?? ({} as Metadata);
  }
}

Line 32 should check field.pageContent is undefined, not just truthy, otherwise it uses this.pageContent which is undefined and breaks the contract that pageContent must be a string. Thus in any event, pageContent needs to be initialised to something to fulfill the type contract

System Info

This will happen on any environment

@dosubot dosubot bot added the auto:bug Related to a bug, vulnerability, unexpected error with an existing feature label Jun 25, 2024
Copy link

dosubot bot commented Jun 25, 2024

To ensure that the pageContent property is always set to a string, even when it is undefined, you can modify the constructor of the Document class to default pageContent to an empty string if it is not provided. Here is the corrected code:

export class Document<
  // eslint-disable-next-line @typescript-eslint/no-explicit-any
  Metadata extends Record<string, any> = Record<string, any>
> implements DocumentInput, DocumentInterface
{
  pageContent: string;

  metadata: Metadata;

  constructor(fields: DocumentInput<Metadata>) {
    this.pageContent = fields.pageContent !== undefined
      ? fields.pageContent.toString()
      : "";
    this.metadata = fields.metadata ?? ({} as Metadata);
  }
}

This change ensures that pageContent will always be a string, defaulting to an empty string if it is not provided [1].

To continue talking to Dosu, mention @dosu.

@glorat
Copy link
Contributor Author

glorat commented Jun 25, 2024

@dosu
I agree with you and the proposed fix. Let's also have a unit test for it. How about this:

import { test, expect } from "@jest/globals";
import {Document} from "../documents/document.js";

test("Document", async () => {
    const doc = new Document({pageContent: ''})
    expect (doc.pageContent).toEqual('')
});

Copy link

dosubot bot commented Jun 25, 2024

Your proposed unit test looks good. It effectively checks that the pageContent is set to an empty string when an empty string is provided. You might also want to add a test case to ensure that pageContent defaults to an empty string when it is not provided. Here is an extended version of your test:

import { test, expect } from "@jest/globals";
import { Document } from "../documents/document.js";

test("Document should handle empty pageContent", () => {
    const doc = new Document({ pageContent: '' });
    expect(doc.pageContent).toEqual('');
});

test("Document should default pageContent to an empty string when not provided", () => {
    const doc = new Document({});
    expect(doc.pageContent).toEqual('');
});

This ensures that both scenarios are covered.

To continue talking to Dosu, mention @dosu.

@glorat
Copy link
Contributor Author

glorat commented Jun 25, 2024

Dosu's second test not applicable since DocumentInput does not allow {}

Taken the liberty of making a PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment