Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build and API: Backend changes to handle multiple PDFs #10424

Open
5 tasks
benjaoming opened this issue Jun 12, 2023 · 8 comments · May be fixed by #10438
Open
5 tasks

Build and API: Backend changes to handle multiple PDFs #10424

benjaoming opened this issue Jun 12, 2023 · 8 comments · May be fixed by #10438
Labels
Accepted Accepted issue on our roadmap Feature New feature

Comments

@benjaoming
Copy link
Contributor

benjaoming commented Jun 12, 2023

Referencing high-level issue #2045 and especially #2045 (comment)

API changes:

  • Add new API endpoint to return multiple files w/ urls + verbose labels

El Proxito changes

  • New El Proxito URLs

I'll add more details once the implementation is open. LMK if there's anything missed here.

WRT @humitos question:

either by doing a listdir on S3 for that version, or by saving this data into the db

I'm leaning towards saving this data in the DB during build time. That's when we know about which files exist and can save it correctly.

Open question:

  • How do we store data about PDF files? DB? Or just pure S3 listdirs?
  • Do we want to generalize this for ePub?
@benjaoming benjaoming added Feature New feature Accepted Accepted issue on our roadmap labels Jun 12, 2023
@benjaoming benjaoming self-assigned this Jun 12, 2023
@benjaoming
Copy link
Contributor Author

benjaoming commented Jun 12, 2023

Legacy frontend support

Currently, builds fail if there are multiple PDFs in the output. Once we start supporting it in the backend (through a feature flag?), we should make sure to note that we have ONLY implemented backend support, meaning that on the API can be used to list these files.

While we are adding backend support, the frontend will still only work for 1 PDF file.

I think that's okay since we can put a direct reference in that message to a GitHub issue covering frontend support.

Meanwhile, we might continue to publish the "first" file of the output (chosen for instance by its alphanumeric order).

@humitos
Copy link
Member

humitos commented Jun 13, 2023

Currently, builds fail if there are multiple PDFs in the output. Once we start supporting it in the backend (through a feature flag?),

I think implementing a feature flag for internal usage for now is fine. Once we have the frontend UI exposed we can enable to users that have this requirement as a small rollout and then globally.

we might continue to publish the "first" file of the output

I'd avoid dealing with both implementation at the same time. That would be pretty confusing and hard to explain to users.

@benjaoming
Copy link
Contributor Author

@humitos @stsewd would be great to have your inputs with regards to whether or not to store information about collected and copied files during the build, maybe in the ImportedFile model?

Is there also an aspect of performing search indexing?

From the description:

@humitos:

either by doing a listdir on S3 for that version, or by saving this data into the db

@benjaoming:
I'm leaning towards saving this data in the DB during build time. That's when we know about which files exist and can save it correctly.

@stsewd
Copy link
Member

stsewd commented Jun 15, 2023

I'm +1 on having that in the DB, and yeah I could see a way of reusing the ImportedFile model for that, maybe add a new field to identify the file as downloadable PDF.

+1 on extending this to more file types.

There is also the question about how these files will be stored, I think it's fine to let users define the name, maybe don't allow nested paths.

@humitos
Copy link
Member

humitos commented Jun 15, 2023

I think the correct way is to store the filenames in the database, yeah 👍🏼

I agree we should not allow nested path. We should only upload and serve $READTHEDOCS_OUTPUT/pdf/*.pdf and nothing else.

I'm not familiarized with ImportedFile model but I thought we wanted to delete it because we weren't using it? Maybe I'm wrong, tho 🤷🏼 . So, probably whatever Santos says about that model is the best path to follow 😄

@benjaoming
Copy link
Contributor Author

rclone has some really nice filtering options, so we can tell it to only sync $READTHEDOCS_OUTPUT/pdf/*.pdf 💯

@benjaoming
Copy link
Contributor Author

Thanks for the inputs ❤️ This should be great to get started 🏃‍♂️

@stsewd
Copy link
Member

stsewd commented Jun 15, 2023

The ImportedFile model is useful for re-indexing (this is since it has the rank and ignored field in the model, but we can have a way to rely on the config file metadata from the build for this), what we can remove are the SphinxDomain models.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Accepted Accepted issue on our roadmap Feature New feature
Projects
Status: Planned
3 participants