feat(providers/github): add GitHub provider #6

Merged: 2 commits into mendableai:main on Jan 29, 2024

Conversation

Member

@mogery mogery commented Jan 29, 2024

Fixes #3
/claim #3

This PR adds a GitHub provider for retrieving files from public GitHub repositories, as described in #3.

Remarks:

  • Since GitHub only allows roughly 50 requests per 5 minutes for unauthenticated API callers, I had to add authentication; otherwise the tests would take about 30 minutes just waiting on the rate limit. Nango is supported, as well as manual authorization by providing an Octokit auth strategy and parameters.
    • This means that, if set up correctly (with the right scopes/permissions specified), this is not only a public GitHub provider, but a private one too.
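
To make the rate-limit remark concrete, here is a minimal sketch of the arithmetic, taking the quota figures as quoted in the remark. The function name and the request count of 350 are hypothetical (the PR does not say how many requests the tests make), and the model assumes the first window's quota is available immediately and every further window costs one full wait.

```typescript
// Hypothetical helper: rough lower bound on time spent waiting for quota resets.
function estimateWaitMinutes(
  totalRequests: number,
  limitPerWindow: number,
  windowMinutes: number,
): number {
  if (totalRequests <= 0) return 0;
  // Number of quota windows needed to fit all requests.
  const windowsNeeded = Math.ceil(totalRequests / limitPerWindow);
  // The first window is usable right away; each additional one means a full wait.
  return (windowsNeeded - 1) * windowMinutes;
}

// Assumed test-suite size of 350 requests, with the quota figures from the remark.
const waitForSuite = estimateWaitMinutes(350, 50, 5); // 30
```

Under these assumptions, a suite of a few hundred unauthenticated requests already spends most of its run time idle, which is the motivation given for adding authentication.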

provider: "github",
type: this.docOnly
    ? "document" // don't run iterating computation if we only retrieved documents anyways
    : isDoc(file.path) ? "document" : "code",
Member Author

What's the schema for Document.type? I know "document" is valid, but is "code" allowed?

Member

That's fine!
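
For readers following the thread, the type selection in the snippet above can be sketched in isolation. The PR's actual `isDoc` is not shown in this conversation, so `isDocSketch`, its extension list, and `classifyFile` below are hypothetical stand-ins, not the merged implementation.

```typescript
// Assumed set of "document" extensions; the real list in the PR may differ.
const DOC_EXTENSIONS = [".md", ".mdx", ".txt", ".rst"];

// Hypothetical stand-in for isDoc: decides by file extension only.
function isDocSketch(path: string): boolean {
  const lower = path.toLowerCase();
  const dot = lower.lastIndexOf(".");
  if (dot === -1) return false; // no extension at all
  return DOC_EXTENSIONS.indexOf(lower.slice(dot)) !== -1;
}

// Mirrors the ternary in the diff: when only documents were fetched,
// skip the per-file check and label everything "document".
function classifyFile(path: string, docOnly: boolean): "document" | "code" {
  if (docOnly) return "document";
  return isDocSketch(path) ? "document" : "code";
}
```

With this sketch, `classifyFile("README.md", false)` yields `"document"` and `classifyFile("src/index.ts", false)` yields `"code"`.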

Comment on lines +186 to +198
// Construct pretty source URL.
sourceURL: `https://github.com/${
    encodeURIComponent(this.owner)
}/${
    encodeURIComponent(this.repo)
}/blob/${
    encodeURIComponent(branchName)
}/${
    file.path
        .split("/") // Don't escape slashes, they're a part of the path.
        .map(part => encodeURIComponent(part))
        .join("/")
}`,
Member Author

This is pretty dodgy, but I couldn't find a better way to construct a URL. file.url points to an api.github.com link, which won't bring up the GitHub UI. This "pretty URL" does.

Member

Sounds good, this should be fine.
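
The URL construction discussed in this thread can be pulled out into a standalone function for illustration. This is a sketch of the same approach, not the merged code: `buildBlobUrl` and the example owner, repository, and path are hypothetical names.

```typescript
// Hypothetical standalone version of the "pretty URL" construction.
function buildBlobUrl(
  owner: string,
  repo: string,
  branch: string,
  filePath: string,
): string {
  // Encode each path segment separately so the slashes between them survive.
  const encodedPath = filePath
    .split("/")
    .map((part) => encodeURIComponent(part))
    .join("/");
  return (
    `https://github.com/${encodeURIComponent(owner)}` +
    `/${encodeURIComponent(repo)}` +
    `/blob/${encodeURIComponent(branch)}` +
    `/${encodedPath}`
  );
}

// Example with made-up values:
const exampleUrl = buildBlobUrl("octocat", "hello-world", "main", "docs/getting started.md");
// "https://github.com/octocat/hello-world/blob/main/docs/getting%20started.md"
```

As in the reviewed snippet, the branch name is encoded as a single component, so a branch such as `mog/github` becomes `mog%2Fgithub`. Whether GitHub resolves that form as intended is not established in this thread and would need checking.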

@nickscamara
Member

Wow! This is awesome @mogery! This all looks really good. Thank you!

Comment on lines +175 to +177
// Decode the content blob as it is encoded
const decodedContent = Buffer.from(blob.data.content, 'base64').toString('utf8');

Member Author

@mogery mogery Jan 29, 2024

This may output large amounts of non-text data if the repo contains binaries. Is that okay? Should we add UTF-8 detection?
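
One possible shape for the UTF-8 detection raised here, offered as a sketch rather than what the PR does: validate the decoded bytes before converting them to a string, and treat any NUL byte as a sign of binary content. The function name `isLikelyUtf8Text` is hypothetical, and the NUL rule is a heuristic assumption, not a requirement of UTF-8.

```typescript
// Hypothetical check: true if the bytes are well-formed UTF-8 with no NUL bytes.
function isLikelyUtf8Text(bytes: Uint8Array): boolean {
  // Smallest code point each sequence length may encode (rejects overlong forms).
  const minCodePoint = [0, 0x80, 0x800, 0x10000];
  let i = 0;
  while (i < bytes.length) {
    const b = bytes[i];
    if (b === 0x00) return false; // heuristic: NUL suggests binary data
    if (b < 0x80) { i += 1; continue; } // plain ASCII

    // Number of continuation bytes implied by the lead byte.
    let extra: number;
    if ((b & 0xe0) === 0xc0) extra = 1;
    else if ((b & 0xf0) === 0xe0) extra = 2;
    else if ((b & 0xf8) === 0xf0) extra = 3;
    else return false; // stray continuation byte or invalid lead byte

    if (i + extra >= bytes.length) return false; // truncated sequence

    let codePoint = b & (0xff >> (extra + 2));
    for (let k = 1; k <= extra; k++) {
      const c = bytes[i + k];
      if ((c & 0xc0) !== 0x80) return false; // not a continuation byte
      codePoint = (codePoint << 6) | (c & 0x3f);
    }
    if (codePoint < minCodePoint[extra]) return false; // overlong encoding
    if (codePoint > 0x10ffff) return false; // beyond Unicode range
    if (codePoint >= 0xd800 && codePoint <= 0xdfff) return false; // surrogate
    i += extra + 1;
  }
  return true;
}
```

Since a Node.js `Buffer` is a `Uint8Array`, the result of `Buffer.from(blob.data.content, 'base64')` could be passed to this check directly, and files that fail it could be skipped instead of decoded with `toString('utf8')`.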

@nickscamara
Member

/tip $20

algora-pbc bot commented Jan 29, 2024

@mogery
Member Author

mogery commented Jan 29, 2024

Thank you!

algora-pbc bot commented Jan 29, 2024

🎉🎈 @mogery has been awarded $20! 🎈🎊

@nickscamara nickscamara merged commit 1ae2da8 into mendableai:main Jan 29, 2024
@mogery mogery deleted the mog/github branch January 29, 2024 19:58
Development

Successfully merging this pull request may close these issues.

Add Public GitHub Connector
2 participants