Skip to content

Fix quickstart doc with docker compose #1610

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
May 23, 2025

Conversation

MonkeyCanCode
Copy link
Contributor

@MonkeyCanCode MonkeyCanCode commented May 19, 2025

This PR fixes couple issues found while trying to go through quickstart guide:

  1. Doc site/content/in-dev/unreleased/getting-started/quickstart.md has an invalid ref to getting-started/eclipselink/docker-compose-postgres.yml. This is updated to getting-started/assets/postgres/docker-compose-postgres.yml (this is correct in getting-started/eclipselink/README.md but not in the public page)
  2. Due to now we have 3 docker compose and 1 in a diff path, running this command can cause failure as they 2 of them have binds from local to container but from diff paths (it will create empty dirs on the wrong path). To fix this, I introduced ASSETS_PATH to point to the absolute path of assets directory which is being used for binding.
  3. Doc getting-started/eclipselink/docker-compose-bootstrap-db.yml has a hard-coded value of default username/password which is no longer valid after recent change in Use env var in spark container #1522
  4. quickstart page (https://polaris.apache.org/in-dev/unreleased/getting-started/quickstart/#docker-image) ask user to build the image then run docker compose and asks user to do expose CLIENT_ID and CLIENT_SECRET. Due to change in item 3, that is no longer valid as default value are not longer available. This will cause failure as well. To fix this, I add .env file to contain the original default value. This will continue to allow user to set them to diff by remove env loading for docker compose.

pingtimeout
pingtimeout previously approved these changes May 19, 2025
Copy link
Contributor

@pingtimeout pingtimeout left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

This PR raises a question about how we structure the docker-compose commands. I understand the desire to reduce the amount of duplicated yaml across docker-compose files. However, for the sake of being beginner-friendly, I would rather have everything contained in a single docker-compose file, even if it slightly increases the maintenance effort (think upgrading the Postgres version). Wdyt?

@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board May 19, 2025
@dimas-b
Copy link
Contributor

dimas-b commented May 19, 2025

I would rather have everything contained in a single docker-compose file

I support that. I even suggested that in #1470 (comment) :)

@adnanhemani
Copy link
Collaborator

While there are some good call-outs in this PR of things we should change, there are also some regressions. Please do not merge this PR - give me until EOD to get comments in.

Copy link
Collaborator

@adnanhemani adnanhemani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MonkeyCanCode - lots of great catches here. #1522 introduced a major new change in Quickstart where we are no longer relying on default credential values and must define them up front - and we did not change the website to fully reflect that. For allowing this change and not recognizing the regression to the documentation, I apologize.

We had discussed an ".env" file in this thread but ultimately decided against it: #1522 (comment). I still agree with the comments posted there. I think that if we move this section into the top of the Quickstart page and then remove the later instructions to export the same environment variables, it would be best. We should do the same for the Cloud Provider Deployment instructions and the READMEs in the Getting Started folder.

100% agreed on the ASSETS_PATH change and the correcting of the YAML file locations.

@@ -82,7 +82,7 @@ services:
--conf, "spark.sql.catalog.quickstart_catalog.type=rest",
--conf, "spark.sql.catalog.quickstart_catalog.warehouse=quickstart_catalog",
--conf, "spark.sql.catalog.quickstart_catalog.uri=http://polaris:8181/api/catalog",
--conf, "spark.sql.catalog.quickstart_catalog.credential=${USER_CLIENT_ID}:${USER_CLIENT_SECRET}",
--conf, "spark.sql.catalog.quickstart_catalog.credential=${CLIENT_ID}:${CLIENT_SECRET}",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a regression, it should remain USER_CLIENT_ID/SECRET - these are user credentials rather than the admin credentials in CLIENT_ID/SECRET

Copy link
Contributor Author

@MonkeyCanCode MonkeyCanCode May 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did noticed this one last night and assuming there is intensional. However, going through the quickstart guide, this is supposed to be the credential created at the end of the polaris CLI. But USER_CLIENT_ID and USER_CLIENT_SECRET is never defined in this context I believed. As an end-user tries to bring up setup via docker compose, there is no relation to the polaris CLI.

Now assuming we want to keep it this way, an end-user can't really define this one until polaris CLI returns the value back (which needed to bring up docker compose first, run polaris CLI for entities creation, then restart spark-sql to have this be effective). This restart workflow is really odd imo and we should replace it with other more automated workflow.

If this is causing regression, I can change it back. But I don't think we ask user to follow the restart workflow as it is really tedious imo.

Let me know what you think.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take a look at this: https://polaris.apache.org/in-dev/unreleased/getting-started/using-polaris/#connecting-with-spark

We do ask users to refresh their Docker containers once they export the required variables.

You've correctly pointed out the chicken-and-the-egg problem we have here, but unfortunately this is how we have solved it so far - asking the user for one line ran is not completely unreasonable and we have no better known solution without breaking the Docker Compose file down even further :(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But there are couple problem with both existed route as well as the above, let me share my observations:

On terminal 1, we will run the following:

export ASSETS_PATH=$(pwd)/getting-started/assets/
docker compose --env-file getting-started/assets/getting-started.env \
  -f getting-started/assets/postgres/docker-compose-postgres.yml \
  -f getting-started/eclipselink/docker-compose-bootstrap-db.yml \
  -f getting-started/eclipselink/docker-compose.yml up

Due to USER_CLIENT_ID and USER_CLIENT_SECRET are not yet defined at this stage, the spark-sql will fail due to fail auth (as both will resolve to empty string). So the container is in stop state.

Now on terminal 2, we do the needed work to bootstrap polaris entities and have a different principal created and run the following to set values for environment variables:

export USER_CLIENT_ID=xxxxx
export USER_CLIENT_SECRET=xxxx

and try to start the failed containers here with current command in the doc:

docker compose -f getting-started/eclipselink/docker-compose.yml up -d

This will actually crashed the first setup as two docker compose command on diff terminal/session will create two diff projects and they are trying to bind to the same ports.

Now to workaround the problem, I added -p xxx to specify project so two commands from diff sessions will be dealing with same resources. However, this doesn't solve the problem of created container doesn't reload (which back to my original point where this is not really a regression as it never works before). To have containers reload the env, we will need to stop/remove the container then recreate it. By doing so, it will then try to load envs from the client's terminal.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, missed this comment. But I see what's the issue here - the original Docker Compose command should've been run with the "-d" detached option so that everything runs in one terminal only. Not sure how I missed that, sorry about that. We shouldn't need to force users to run multiple terminals for any of these commands.

@@ -36,7 +36,8 @@ cd ~/polaris
:polaris-quarkus-admin:assemble --rerun \
-Dquarkus.container-image.tag=postgres-latest \
-Dquarkus.container-image.build=true
docker compose -f getting-started/eclipselink/docker-compose-postgres.yml -f getting-started/eclipselink/docker-compose-bootstrap-db.yml -f getting-started/eclipselink/docker-compose.yml up
export ASSETS_PATH=$(pwd)/getting-started/assets/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this line to the top of the file with the exporting of the CLIENT_ID/SECRET, so that it only needs to be run once?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I followed, this is the beginning of the quickstart page (https://polaris.apache.org/in-dev/unreleased/getting-started/quickstart/#docker-image). This is only needed for the context of docker compose.

The export of CLIENT_ID/CLIENT_SECRET is invalid I think as the current state (without the env) file, this won't even be able to start.

If I understand correctly, we should consider move export of CLIENT_ID/CLIENT_SECRET to this section (as the current docker compose file has no credential, so it will try to set empty string for root credential (as well as username, which is in-valid).

The export is only needed if user doesn't want to use env file (as env file will load the credential in the updated command). Let me know what you think.

Copy link
Collaborator

@adnanhemani adnanhemani May 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this comment is in conjunction with the suggestion from the overall review's comment. We should do the following:

  1. Move the export of CLIENT_ID/CLIENT_SECRET to the top of the file - the Docker Compose files will be able to intake the environment variables set in bash. (i.e. keep one reference at the top of the Quickstart page and one at the top of each of the cloud deployment pages)
  2. Remove all references to setting the CLIENT_ID/CLIENT_SECRET elsewhere.
  3. Add the export ASSETS_PATH to these references to setting CLIENT_ID/CLIENT_SECRET

Does this make sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I got what you mean. I had made some changes to this doc for refactor. Please review.

@github-project-automation github-project-automation bot moved this from Ready to merge to PRs In Progress in Basic Kanban Board May 20, 2025
@adnanhemani
Copy link
Collaborator

This PR raises a question about how we structure the docker-compose commands. I understand the desire to reduce the amount of duplicated yaml across docker-compose files. However, for the sake of being beginner-friendly, I would rather have everything contained in a single docker-compose file, even if it slightly increases the maintenance effort (think upgrading the Postgres version). Wdyt?

I disagree on this - if you read my reply to the comment that @dimas-b linked (#1470 (comment)), there's the reason why the Dockerfiles need to stay piecewise (tldr - it doesn't have to do with duplicated YAML but with the ability to reload services when required). If you have a better way to solve that issue, I'm open to it!

@MonkeyCanCode
Copy link
Contributor Author

I would rather have everything contained in a single docker-compose file

I support that. I even suggested that in #1470 (comment) :)

If we all want to do it in a single docker compose file, I can take care this in the next PR. Both route works and there are ways around to make them work. More toward to a preference things (e.g. a single giant compose file may be hard to read and maybe we will have other service such as minio which we don't want to start out of box...but an end-user can decide if minio should be spin up by adding additional compose file to the command).

Copy link
Collaborator

@adnanhemani adnanhemani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really close - last few comments

export CLIENT_ID=root
export CLIENT_SECRET=s3cr3t
```
- **For Docker**: These variables are configured in the `getting-started.env` file. To use custom values, export them as shown above and remove the `--env-file` option from the `docker compose` command.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is introducing additional mental complexity - why don't we just require everyone to export these variables and then remove the .env file itself?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, we need to do the same thing to the Cloud Providers pages - should be a simple copy-paste change to all 3 pages.

Copy link
Contributor Author

@MonkeyCanCode MonkeyCanCode May 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think env file is helping. Currently we have auth hard-coded in couple place such as auth info for Postgres. Later on if we want to integrate with additional services (e.g. minio/keycloak), we will need to do the same with current route. Introducing a env file is pretty helpful for having a centralized location for config/auth info imo and keep the docker compose file clean. It is not a must for this PR, but I do think it will avoid many of those issues in the long run.

Then for cloud providers piece, I think those are working on https://github.com/apache/polaris/pull/1646/files?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally think that we shouldn't introduce these further services directly into the Quickstart flow that we publish on the website - and so the environment file wouldn't be helpful even in the future. Those integrations will need to write their own Docker Compose files and all. But I can see your point as well. I'm not going to hold a hard opinion on this as a result.

Cloud Providers weren't covered in #1646. I do think that piece is a requirement for this PR though, we should ensure changes are being made together for both local and cloud providers, in general, or we risk getting to the point where things are broken again as they are now. Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine with me. The only problem is I don't have a way to test the cloud deployment (so I am assuming it should work...PR is updated). There are more issues with those inconsistencies. Lets get this merge first then we can fix other issues as the existed one is not even use-able. WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, the variables need to be exported prior to the bash script being run in the cloud deployments. I can re-test later tomorrow! Or worst-case, I can make the change myself tomorrow as well - not a big issue.

Comment on lines -86 to -90
You can also set these credentials as environment variables for use with the Polaris CLI:
```shell
export CLIENT_ID=root
export CLIENT_SECRET=secret
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need this to enable polaris cli.

@@ -83,11 +89,6 @@ When using a Gradle-launched Polaris instance in this tutorial, we'll launch an
For more information on how to configure Polaris for production usage, see the [docs]({{% relref "../configuring-polaris-for-production" %}}).

When Polaris is run using the `./gradlew run` command, the root principal credentials are `root` and `secret` for the `CLIENT_ID` and `CLIENT_SECRET`, respectively.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker: We could actually remove this line.

Copy link
Contributor

@flyrain flyrain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @MonkeyCanCode for working on it. LGTM. Left minor comments.

Comment on lines +245 to +247
docker compose -p polaris -f getting-started/eclipselink/docker-compose.yml stop trino
docker compose -p polaris -f getting-started/eclipselink/docker-compose.yml rm -f trino
docker compose -p polaris -f getting-started/eclipselink/docker-compose.yml up -d --no-deps trino
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to provide jdbc examples instead of eclipseLink ones here, we could do that in a follow-up PR though.

Comment on lines +182 to +184
docker compose -p polaris -f getting-started/eclipselink/docker-compose.yml stop spark-sql
docker compose -p polaris -f getting-started/eclipselink/docker-compose.yml rm -f spark-sql
docker compose -p polaris -f getting-started/eclipselink/docker-compose.yml up -d --no-deps spark-sql
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, prefer to have a JDBC example.

@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board May 23, 2025
@flyrain flyrain merged commit 0eafcaa into apache:main May 23, 2025
6 checks passed
@github-project-automation github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board May 23, 2025
@flyrain
Copy link
Contributor

flyrain commented May 23, 2025

Thanks @MonkeyCanCode for the fix! Thanks @pingtimeout , @dimas-b , and @adnanhemani for the review. Let's keep fixing what's left.

snazy added a commit to snazy/polaris that referenced this pull request Jun 13, 2025
* fix(nightly-CI): Do not publish snapshots from forks (apache#1635)

Adopt the `Nightly Build` workflow to not (try to) publish every night from forks.

* main: Update dependency io.smallrye.config:smallrye-config-core to v3.13.0 (apache#1637)

* Use echo to print script errors (apache#1648)

* [HOTFIX] QUICKSTART (apache#1646)

The change adds the following to fix Quick start experience : 
[1] ENV variables required by common assets after apache#1522 
[2] New configs required to enable FILE based sources apache#1649

Co-authored-by: singhpk234 <singhpk234@users.noreply.github.com>
Co-authored-by: pjanuario <pjanuario@users.noreply.github.com>

* main: Update dependency gradle to v8.14.1 (apache#1652)

* main: Update dependency gradle to v8.14.1

* Re-adopt PR to the project's needs

---------

Co-authored-by: Robert Stupp <snazy@snazy.de>

* [Policy Store] Add policyTypeCode to Slice/Index for Future Filtering Support and Update Policy Persistence Method (apache#1628)

This PR adds policyTypeCode to the in-memory tree map store's slice and the SQL index on policy_mapping_records (already done in JDBC in apache#1468). This prepares for future features that need to filter efficiently by policy type, like listing all entities with a data compaction policy.

It also updates the loadAllTargetsOnPolicy method to accept policyTypeCode, enabling it to use the new index for better performance.

* fix(test): Do not let some more tests spam `/tmp` (apache#1651)

* fix(test): Do not let some more tests not spam `/tmp`

* `PolarisRestCatalogViewFileIntegrationTest`
* `FileIOExceptionsTest`
* `PolarisRestCatalogViewFileIntegrationTest`

Changes the tests to leverage JUnit's `@TempDir`.

Simplifies `PolarisEclipseLinkMetaStoreManagerTest`

* review: rename the (now) abstract class

* fix(testing): Do not let PolarisOverlappingTableTest spam `/tmp` (apache#1641)

Changes the test to leverage JUnit's `@TempDir`.

* Add CATALOG_MANAGE_METADATA to super privilege set of policy attachment privileges (apache#1643)

* Fix quickstart doc with docker compose (apache#1610)

* main: Update dependency boto3 to v1.38.22 (apache#1657)

* Refactor IcebergCatalog to isolate internal state (apache#1659)

Following up on apache#1694

* Restore `private` scope on internal fields in `IcebergCatalog`

* Use a test-only setter instead of sub-classing to manage injecting
  test FileIO implementations

* Refactor: Use per-request STS credentials (apache#1629)

* Refactor: Use per-request STS credentials

No functional changes.

This is mostly to allow more storage integration
flexibility in downstream build.

This might also be useful for non-AWS storage.

* fix and enforce more errorprone checks (apache#1663)

enforces the following checks:
https://errorprone.info/bugpattern/ObjectsHashCodePrimitive
https://errorprone.info/bugpattern/OptionalMapToOptional
https://errorprone.info/bugpattern/StringCharset
https://errorprone.info/bugpattern/VariableNameSameAsType

* Create a wrapper script to generate python client; regenerate the python client (apache#1347)

As noted in apache#755 and elsewhere, the generated types in client/python are currently out of date. This introduces a script to regenerate them and a gradle task to run that script.

I've also run the script, which necessitated several things to get tests passing:

1. There were small nonfunctional spec changes needed in order to keep the Python client working
2. The CLI and its tests required a few fixes to work with the updated Python client
3. Many of the regtests required fixes to work with the updated Python client

* [Python Client] CI for Python client (Continue PR#1096) (apache#1639)

Adds CI for python client. It does not include caching poetry step for now since we do not have poetry.lock (it is in .gitignore), see relevant discussion in: apache#1102 (comment), apache#1096 (comment), we can add that later

* main: Update actions/setup-python action to v5 (apache#1671)

* main: Update actions/checkout action to v4 (apache#1670)

* main: Update python Docker tag to v3.13 (apache#1669)

* main: Update dependency pytest to ~=7.4.4 (apache#1668)

* main: Update dependency software.amazon.awssdk:bom to v2.31.50 (apache#1677)

* main: Update dependency boto3 to v1.38.23 (apache#1667)

* feat(build): make archive builds reproducible (apache#1664)

See https://docs.gradle.org/current/userguide/working_with_files.html#sec:reproducible_archives

* main: Update dependency io.prometheus:prometheus-metrics-exporter-servlet-jakarta to v1.3.8 (apache#1679)

* NoSQL: adapt to change on oss/main

* INFO: Last merged commit: 6ef8b3e

---------

Co-authored-by: Mend Renovate <bot@renovateapp.com>
Co-authored-by: ModEtchFill <50123102+ModEtchFill@users.noreply.github.com>
Co-authored-by: Prashant Singh <35593236+singhpk234@users.noreply.github.com>
Co-authored-by: singhpk234 <singhpk234@users.noreply.github.com>
Co-authored-by: pjanuario <pjanuario@users.noreply.github.com>
Co-authored-by: Honah (Jonas) J. <honahx@apache.org>
Co-authored-by: MonkeyCanCode <yongzheng0809@gmail.com>
Co-authored-by: Dmitri Bourlatchkov <dmitri.bourlatchkov@dremio.com>
Co-authored-by: Dmitri Bourlatchkov <dmitri.bourlatchkov@gmail.com>
Co-authored-by: Christopher Lambert <xn137@gmx.de>
Co-authored-by: Eric Maynard <eric.maynard+oss@snowflake.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants