
[DOCS] Best practices v2 #4158

Closed · wants to merge 9 commits

Conversation

@ddorian ddorian (Contributor) commented Apr 7, 2020

Added ysql + docdb.
As few changes as possible so we can get something in. Then I'll create a new pull request.

@ddorian ddorian added the area/documentation Documentation needed label Apr 7, 2020
@ddorian ddorian requested review from rkarthik007 and ndeodhar April 7, 2020 10:07
@rkarthik007 rkarthik007 (Collaborator) left a comment


Partial review

Comment on lines 35 to 92
### Hash vs range sharding
Choosing the type of partitioning for primary keys and indexes can be hard.
Hash and range sharding spread writes and reads differently across a distributed database.
Here are things to keep in mind when designing schemas:

#### Consistent hash sharding
As a team with experience building and running multiple database systems, we have found hash sharding to be ideal
for workloads requiring scale. For example, all the massively scalable services we ran at Facebook on Apache HBase
implemented, without exception, hash sharding at the application layer and pre-split tables.
Without hash sharding, these applications would have hit severe bottlenecks.

With consistent hash sharding, data is evenly and randomly distributed across shards using a partitioning algorithm.
Each row of the table is placed into a shard determined by computing a consistent hash on the partition column values of that row.

**Pros:** This sharding strategy is ideal for massively scalable workloads because it distributes data evenly across all the nodes in the cluster,
while retaining ease of adding nodes into the cluster. Recall from earlier that algorithmic hash sharding is also very effective at
distributing data across nodes, but its distribution strategy depends on the number of nodes. With consistent hash sharding,
there are many more shards than nodes, and an explicit mapping table is maintained to track the assignment of
shards to nodes. When adding new nodes, a subset of shards from existing nodes can be efficiently moved to
the new nodes without requiring a massive data reassignment.

**Cons:** Performing range queries could be inefficient. Examples of range queries are finding rows greater than a lower bound
or less than an upper bound (as opposed to point lookups).
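In YSQL, hash sharding is expressed with the `HASH` modifier on primary-key columns (it is also the default for the first primary-key column). A minimal sketch with a hypothetical `users` table:

```sql
-- user_id is hash sharded: rows are spread across tablets by a
-- hash of user_id, so writes distribute evenly across nodes.
CREATE TABLE users (
    user_id bigint,
    email   text,
    PRIMARY KEY (user_id HASH)
);
```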

#### Range sharding
Range sharding is essential for efficiently supporting queries that look up a range of rows based on column values that are less than,
greater than, or between user-specified values. Running such queries against a scheme that is not range sharded can be prohibitively expensive,
since the database might have to perform a full table scan. This is what makes the strategy essential.

When range sharding is picked in scenarios that do not require the primary keys to be ordered, applications run into scalability
bottlenecks as mentioned in the cons section below. An often recommended workaround is to implement hash sharding on top
of range sharding. But in practice, users do not always remember to implement hash sharding on top of range sharding.

**Pros**: This type of sharding allows efficiently querying a range of rows by the primary key values.
An example of such a query is looking up all keys that lie between a lower bound and an upper bound.

**Cons**: Range sharding leads to a number of issues in practice at scale, some of which are similar to those of linear hash sharding.
Firstly, starting out with a single shard implies that only a single node is taking all the user queries.
This often results in a database “warming” problem, where all queries are handled by a single node even if there are
multiple nodes in the cluster. The user would have to wait for enough splits to happen, and for those shards to get
redistributed, before all nodes in the cluster are utilized. This can be a big issue in production workloads.
It can be mitigated in cases where the distribution of keys is known ahead of time by pre-splitting the
table into multiple shards; however, this is hard in practice.

Secondly, globally ordering keys across all the shards often generates hot spots: some shards will get much more activity than others,
and the node hosting those will be overloaded relative to others. While these can be mitigated to some extent with active load balancing,
this does not always work well in practice because by the time hot shards are redistributed across nodes,
the workload could change and introduce new hot spots.
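In YSQL, range sharding is expressed with `ASC` or `DESC` on primary-key columns. A minimal sketch, assuming a hypothetical `events` table where time-range scans matter:

```sql
-- ts is range sharded (ASC): rows are stored in sorted order,
-- so range scans on ts avoid a full table scan.
CREATE TABLE events (
    ts      timestamp,
    payload text,
    PRIMARY KEY (ts ASC)
);

-- An efficient range query over the sorted key:
SELECT * FROM events
WHERE ts >= '2020-01-01' AND ts < '2020-02-01';
```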

#### Consistent hash as the default sharding strategy
Defaults are meant to serve the target use case. Users turn to YugabyteDB primarily for scalability reasons,
and most use cases requiring scalability do not need range lookups on the primary key, hence we picked consistent
hash sharding as the default in YugabyteDB. Common examples include user identity (user IDs do not need ordering),
product catalogs (product IDs are not related to one another), and stock ticker data (one stock symbol is independent of all other stock symbols).

In cases where range queries become important after the data is already loaded into a hash sharded table,
a range sharded secondary index can be created on that column. Once the secondary index is built, range queries become efficient.
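A sketch of that workaround, assuming a hypothetical hash sharded `orders` table with a `created_at` column:

```sql
-- orders is hash sharded on order_id, so range queries on
-- created_at would otherwise require a full table scan.
-- A range sharded (ASC) secondary index makes them efficient.
CREATE INDEX orders_created_at_idx ON orders (created_at ASC);
```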
@rkarthik007 (Collaborator)

This is not appropriate here, it belongs in the sharding section on architecture. So lets remove this completely.

@ddorian (Contributor, Author)

Trimmed down with only a link to the right docs.

Comment on lines 95 to 110
### Co-location
Co-location is a [new feature in beta](../explore/colocated-tables/linux.md) that places all the tables of
a database inside a single tablet. This improves performance and lowers latency for write transactions, joins, and aggregations,
since they all happen within a single tablet/node.

If you have a small dataset (<500GB) that requires high availability, or a database with only a few scale-out tables, colocation is a good option.
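A minimal sketch of opting in; the exact property name is an assumption and may vary by version (this reflects the beta-era `colocated` syntax):

```sql
-- All tables created in this database share a single tablet by default.
CREATE DATABASE northwind WITH colocated = true;

-- A large, scale-out table can opt out and keep its own tablets:
CREATE TABLE big_events (
    id      bigint,
    payload text,
    PRIMARY KEY (id HASH)
) WITH (colocated = false);
```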


### Partial indexes
[A partial index](../api/ysql/commands/ddl_create_index.md#where-clause) is an index that is built on a subset of a table and includes only rows that satisfy the condition
specified in the `WHERE` clause. It can be used to exclude `NULL` or common values from the index.
This will speed up any writes to the table since rows containing the common column values don't need to be indexed.
It will also reduce the size of the index, thereby improving the speed for read queries that use the index.
It can also be used to index just the rows of interest.
For example, an application maintaining a shipments table may have a column for shipment status.
If the application mostly accesses the in-flight shipments, then it can use a partial index to exclude rows whose shipment status is delivered.
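The shipments example as a sketch (table and column names are assumptions):

```sql
CREATE TABLE shipments (
    id     bigint,
    status text,
    PRIMARY KEY (id HASH)
);

-- Index only in-flight shipments: rows whose status is
-- 'delivered' are excluded, keeping the index small and
-- making writes of delivered rows cheaper.
CREATE INDEX shipments_inflight_idx
    ON shipments (status)
    WHERE status <> 'delivered';
```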
@rkarthik007 (Collaborator)

These are also not the correct items here.

@ddorian (Contributor, Author)

Trimmed both and moved them to performance with links.

docs/content/latest/develop/best-practices-ysql.md (outdated, resolved)
docs/content/latest/develop/best-practices-ysql.md (outdated, resolved)
@stevebang stevebang requested a review from rkarthik007 May 14, 2020 17:56
@ddorian ddorian requested review from m-iancu and removed request for ndeodhar November 12, 2020 17:56
@ddorian ddorian (Contributor, Author) commented Sep 26, 2023

Closing as I've created smaller PRs for each section.

@ddorian ddorian closed this Sep 26, 2023
@ddorian ddorian deleted the docs/best-practices-v2 branch September 27, 2023 12:34