500 Error using _lastUpdated parameter to search count

**Describe the bug**
FHIR sever with large size database has poor performance and returned 500 Error when searching count with _lastUpdated parameter. 500 Error can stably be reproduced on FHIR server with 1tb database and sometimes happened on 500GB. 
<img width="700" alt="image" src="https://user-images.githubusercontent.com/57981365/207597406-e034b763-addc-4a82-938d-ed9d1b700029.png">
**FHIR Version?**
Stu3/R4/R5

**Data provider?**
SQL Server

**To Reproduce**
Steps to reproduce the behavior:
```
Get https://<name>.fhir.azurehealthcareapis.com?_summary=count&_lastUpdated=lt2022-12-05T02:50:36&_lastUpdated=ge2022-01-01T11:02:05
```
on FHIR server with large sql database.

**Expected behavior**
Return resource count.

**Actual behavior**
500 internal error.

**My Debug Process**

After debugging locally, I find the original request will be translated and rewrite to a SQL query:
```
SELECT COUNT_BIG(*)
      FROM dbo.Resource r
      WHERE ResourceSurrogateId < @p0
          AND ResourceSurrogateId >= @p1
          AND ResourceTypeId IN (@p2,@p3,@p4,@p5,@p6,@p7,@p8,@p9,@p10,@p11,@p12,@p13,@p14,@p15,@p16,@p17,@p18,@p19,@p20,@p21,@p22,@p23,@p24,@p25,@p26,@p27,@p28,@p29,@p30,@p31,@p32,@p33,@p34,@p35,@p36,@p37,@p38,@p39,@p40,@p41,@p42,@p43,@p44,@p45,@p46,@p47,@p48,@p49,@p50,@p51,@p52,@p53,@p54,@p55,@p56,@p57,@p58,@p59,@p60,@p61,@p62,@p63,@p64,@p65,@p66,@p67,@p68,@p69,@p70,@p71,@p72,@p73,@p74,@p75,@p76,@p77,@p78,@p79,@p80,@p81,@p82,@p83,@p84,@p85,@p86,@p87,@p88,@p89,@p90,@p91,@p92,@p93,@p94,@p95,@p96,@p97,@p98,@p99,@p100,@p101,@p102,@p103,@p104,@p105,@p106,@p107,@p108,@p109,@p110,@p111,@p112,@p113,@p114,@p115,@p116,@p117,@p118,@p119,@p120,@p121,@p122,@p123,@p124,@p125,@p126,@p127,@p128,@p129,@p130,@p131,@p132,@p133,@p134,@p135,@p136,@p137,@p138,@p139,@p140,@p141,@p142,@p143,@p144,@p145,@p146,@p147)

          AND IsHistory = 0
          AND IsDeleted = 0
```
The query will add all resource types and there is an inline comment in code :
[fhir-server/PartitionEliminationRewriter.cs at main · microsoft/fhir-server (github.com)](https://github.com/microsoft/fhir-server/blob/main/src/Microsoft.Health.Fhir.SqlServer/Features/Search/Expressions/Visitors/PartitionEliminationRewriter.cs#L101)
<img width="600" alt="image" src="https://user-images.githubusercontent.com/57981365/207598523-d2a3b534-4429-42e8-95c8-ab3948ac779d.png">
It seems make sense since resourceTypeId is partition key in SQL database. 
So I did some experiments in SQL server and explore the reason why this "better query plan" still has poor performance which led to 500 Error.

**My Experiment**
I created a 4T FHIR SQL database using OSS project and compared different queries:


1. Filter by ResourceSurrogateId and all ResourceTypeId (This is the query plan used by FHIR server. )
Search result: 2427786740
Time: 6min49s
<img width="600" alt="image" src="https://user-images.githubusercontent.com/57981365/207599321-6b87bf08-b359-4771-a255-922c9e3762f9.png">
2.    Only filter by ResourceSurrogateId .
Search result:2427786740
Time:1min10s
<img width="600" alt="image" src="https://user-images.githubusercontent.com/57981365/207602187-5c97b1b5-638c-4bd8-b5cd-f410948d8d80.png">

3.  Filter by ResourceSurrogateId with one ResourceTypeID
Search result: 1166112668(half of the previous)
Time: 9.3s
<img width="600" alt="image" src="https://user-images.githubusercontent.com/57981365/207602268-acb62730-59e2-4701-af9d-8020596c1d9b.png">

From this set of experiments, we find that:
1. Query with one resource type has good performance since resource type is partition key.
2. Enumerate all resource type has unacceptable performance, although query on each resource type is quick, the overall time is much longer than query without resource type.  I guess much overhead is used for parallelization (inner join and aggregate).

**According to the experiment, it verified that it's not a good idea to simply enumerate all resource types in the sql query.**

I also try to analysis it in theory:

For Resource table, it creates three un-clustered indexes to improve query performance:
	IX_Resource_ResourceTypeId_ResourceId
	IX_Resource_ResourceTypeId_ResourceId_Version
	IX_Resource_ResourceTypeId_ResourceSurrgateId
	
All of them use ResourceTypeId as the first index key and these indexes will only be used when query parameters include resource type.  To use these indexes, it's reasonable to add resource type in SQL query. But when all resource types are added, all of the improvement of indexes will be offset by poor performance of inner join and aggregate. 

At last, I try to build a new index on ResourceSurrogateId, and using this index, the request search count will improve 5-6 times than original search method. 

Adding a new index will change the schema of SQL database and may not an only way. I hope my experiment and attempts results can help the team solve this problem and improve the search performance.  

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

500 Error using _lastUpdated parameter to search count #3021

sowu880
openedon Dec 14, 2022

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

500 Error using _lastUpdated parameter to search count #3021

Description

sowu880openedon Dec 14, 2022

Metadata