
[DOCS] Best practices v2 #4158

Closed · wants to merge 9 commits

Conversation

@ddorian ddorian (Contributor) commented Apr 7, 2020

Added ysql + docdb.
As few changes as possible so we can get something in. Then I'll create a new pull request.

@ddorian ddorian added the area/documentation Documentation needed label Apr 7, 2020
@ddorian ddorian requested review from rkarthik007 and ndeodhar April 7, 2020 10:07
@rkarthik007 rkarthik007 (Collaborator) left a comment


Partial review

Comment on lines 35 to 92
### Hash vs range sharding
Choosing the type of partitioning for primary keys and indexes can be hard.
Hash and range sharding spread writes and reads differently across a distributed database.
Here are things to keep in mind when designing schemas:

#### Consistent hash sharding
As a team with experience building and running multiple database systems, we have found hash sharding to be ideal
for workloads requiring scale. For example, all the massively scalable services we ran at Facebook on Apache HBase
implemented, without exception, hash sharding at the application layer and pre-split tables.
Without hash sharding, these applications would have hit severe bottlenecks.

With consistent hash sharding, data is evenly and randomly distributed across shards using a partitioning algorithm.
Each row of the table is placed into a shard determined by computing a consistent hash on the partition column values of that row.

**Pros:** This sharding strategy is ideal for massively scalable workloads because it distributes data evenly across all the nodes in the cluster,
while retaining ease of adding nodes into the cluster. Recall from earlier that algorithmic hash sharding is also very effective at
distributing data across nodes, but its distribution strategy depends on the number of nodes. With consistent hash sharding,
there are many more shards than nodes, and an explicit mapping table is maintained to track the assignment of
shards to nodes. When adding new nodes, a subset of shards from existing nodes can be efficiently moved to
the new nodes without requiring a massive data reassignment.

**Cons:** Performing range queries could be inefficient. Examples of range queries are finding rows greater than a lower bound
or less than an upper bound (as opposed to point lookups).
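In YSQL, hash sharding is expressed with the `HASH` modifier on primary-key columns (it is also the default for the first primary-key column). A minimal sketch with a hypothetical `users` table:

```sql
-- user_id is hash sharded: rows are spread across tablets by a
-- hash of user_id, so writes distribute evenly across nodes.
CREATE TABLE users (
    user_id bigint,
    email   text,
    PRIMARY KEY (user_id HASH)
);
```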

#### Range sharding
Range sharding is essential for efficiently supporting queries that look up a range of rows based on column values that are less than,
greater than, or between user-specified values. Running such queries against a scheme that is not range sharded can be prohibitively expensive,
since the database might have to perform a full table scan. This is what makes the strategy essential.

When range sharding is picked in scenarios that do not require the primary keys to be ordered, applications run into scalability
bottlenecks as mentioned in the cons section below. An often recommended workaround is to implement hash sharding on top
of range sharding. But in practice, users do not always remember to implement hash sharding on top of range sharding.

**Pros**: This type of sharding allows efficiently querying a range of rows by the primary key values.
An example of such a query is looking up all keys that lie between a lower bound and an upper bound.

**Cons**: Range sharding leads to a number of issues in practice at scale, some of which are similar to those of linear hash sharding.
Firstly, starting out with a single shard implies that only a single node is taking all the user queries.
This often results in a database “warming” problem, where all queries are handled by a single node even if there are
multiple nodes in the cluster. The user would have to wait for enough splits to happen, and for those shards to get
redistributed, before all nodes in the cluster are utilized. This can be a big issue in production workloads.
It can be mitigated in cases where the distribution of keys is known ahead of time by pre-splitting the
table into multiple shards; however, this is hard in practice.

Secondly, globally ordering keys across all the shards often generates hot spots: some shards will get much more activity than others,
and the node hosting those will be overloaded relative to others. While these can be mitigated to some extent with active load balancing,
this does not always work well in practice because by the time hot shards are redistributed across nodes,
the workload could change and introduce new hot spots.
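In YSQL, range sharding is expressed with `ASC` or `DESC` on primary-key columns. A minimal sketch, assuming a hypothetical `events` table where time-range scans matter:

```sql
-- ts is range sharded (ASC): rows are stored in sorted order,
-- so range scans on ts avoid a full table scan.
CREATE TABLE events (
    ts      timestamp,
    payload text,
    PRIMARY KEY (ts ASC)
);

-- An efficient range query over the sorted key:
SELECT * FROM events
WHERE ts >= '2020-01-01' AND ts < '2020-02-01';
```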

#### Consistent hash as the default sharding strategy
Defaults are meant to serve the target use case. Users turn to YugabyteDB primarily for scalability reasons,
and most use cases requiring scalability do not need range lookups on the primary key, hence we picked consistent
hash sharding as the default in YugabyteDB. Common examples include user identity (user IDs do not need ordering),
product catalogs (product IDs are not related to one another), and stock ticker data (one stock symbol is independent of all other stock symbols).

In cases where range queries become important after the data is already loaded into a hash sharded table,
a range sharded secondary index can be created on that column. Once the secondary index is built, range queries become efficient.
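A sketch of that workaround, assuming a hypothetical hash sharded `orders` table with a `created_at` column:

```sql
-- orders is hash sharded on order_id, so range queries on
-- created_at would otherwise require a full table scan.
-- A range sharded (ASC) secondary index makes them efficient.
CREATE INDEX orders_created_at_idx ON orders (created_at ASC);
```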
@rkarthik007 (Collaborator)

This is not appropriate here, it belongs in the sharding section on architecture. So lets remove this completely.

@ddorian (Contributor, Author)

Trimmed down with only a link to the right docs.

Comment on lines 95 to 110
### Co-location
Co-location is a [new feature in beta](../explore/colocated-tables/linux.md) that places all the tables of
a database inside a single tablet. This improves performance and lowers latency for write transactions, joins, and aggregations,
since they all happen within a single tablet/node.

If you have a small dataset (<500GB) that requires high availability, or a database with only a few scale-out tables, colocation is a good option.
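A minimal sketch of opting in; the exact property name is an assumption and may vary by version (this reflects the beta-era `colocated` syntax):

```sql
-- All tables created in this database share a single tablet by default.
CREATE DATABASE northwind WITH colocated = true;

-- A large, scale-out table can opt out and keep its own tablets:
CREATE TABLE big_events (
    id      bigint,
    payload text,
    PRIMARY KEY (id HASH)
) WITH (colocated = false);
```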


### Partial indexes
[A partial index](../api/ysql/commands/ddl_create_index.md#where-clause) is an index that is built on a subset of a table and includes only rows that satisfy the condition
specified in the `WHERE` clause. It can be used to exclude `NULL` or common values from the index.
This will speed up any writes to the table since rows containing the common column values don't need to be indexed.
It will also reduce the size of the index, thereby improving the speed for read queries that use the index.
It can also be used to index just the rows of interest.
For example, an application maintaining a shipments table may have a column for shipment status.
If the application mostly accesses the in-flight shipments, then it can use a partial index to exclude rows whose shipment status is delivered.
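The shipments example as a sketch (table and column names are assumptions):

```sql
CREATE TABLE shipments (
    id     bigint,
    status text,
    PRIMARY KEY (id HASH)
);

-- Index only in-flight shipments: rows whose status is
-- 'delivered' are excluded, keeping the index small and
-- making writes of delivered rows cheaper.
CREATE INDEX shipments_inflight_idx
    ON shipments (status)
    WHERE status <> 'delivered';
```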
@rkarthik007 (Collaborator)

These are also not the correct items here.

@ddorian (Contributor, Author)

Trimmed both and moved them to performance with links.

docs/content/latest/develop/best-practices-ysql.md (outdated, resolved)
docs/content/latest/develop/best-practices-ysql.md (outdated, resolved)
@stevebang stevebang requested a review from rkarthik007 May 14, 2020 17:56
@ddorian ddorian requested review from m-iancu and removed request for ndeodhar November 12, 2020 17:56
@ddorian ddorian (Contributor, Author) commented Sep 26, 2023

Closing as I've created smaller PRs for each section.

@ddorian ddorian closed this Sep 26, 2023
@ddorian ddorian deleted the docs/best-practices-v2 branch September 27, 2023 12:34