roachtest: import-cancellation failed #90510

cockroach-teamcity · 2022-10-22T15:57:14Z

roachtest.import-cancellation failed with artifacts on master @ a3b578466bcdcc5d2bbea628d33c337d87b9f4b8:

test artifacts and logs in: /artifacts/import-cancellation/run_1
(import_cancellation.go:170).func2: output in run_132432.856474893_n1_workload_run_querybench: ./workload run querybench --db=csv --concurrency=1 --query-file=tpch --num-runs=3 --max-ops=66 {pgurl:1-6} --histograms=perf/stats.json --histograms-max-latency=8m20s returned: SSH_PROBLEM: exit status 255
(monitor.go:127).Wait: monitor failure: monitor task failed: t.Fatal() was called
(test_runner.go:1062).teardownTest: test timed out (4h0m0s)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=32 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

roachtest: import-cancellation failed #90434 roachtest: import-cancellation failed [C-test-failure O-roachtest O-robot T-storage branch-release-22.2]

/cc @cockroachdb/storage _{This test on roachdash | Improve this report!

Jira issue: CRDB-20809
Epic CRDB-16237}

The text was updated successfully, but these errors were encountered:

nicktrav · 2022-10-22T16:02:04Z

(test_runner.go:1062).teardownTest: test timed out (4h0m0s)

Slightly different to the failure on the 22.2 release branch.

nicktrav · 2022-10-22T16:42:20Z

Interestingly, three of the nodes have poor LSM health during the import phase. I wonder if this extended the import phase long enough such that the query phase ran out of time to complete. That might be worth digging into separately.

Node 6 at 13:17:45:

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1   2.6 M       -   5.4 G       -       -       -       -   5.4 G       -       -       -     1.0
      0       638   1.9 G    1.74   5.4 G    47 G    11 K     0 B       0   1.3 G   1.8 K     0 B       8     0.2
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3       103   294 M    1.71    46 G   5.6 G   1.1 K    13 M       2    50 G    12 K    49 G       1     1.1
      4       348   1.6 G    1.72    24 G   9.1 G   1.3 K    29 G   7.1 K    44 G   6.9 K    44 G       1     1.9
      5       898   8.9 G    1.72    43 G    16 G   2.0 K    16 G   3.6 K   113 G   9.5 K   114 G       1     2.6
      6      1976    51 G       -    64 G   1.1 G     189   107 M      28   196 G   8.0 K   205 G       1     3.0
  total      3963    64 G       -    84 G    79 G    15 K    45 G    11 K   488 G    38 K   412 G      12     5.8
  flush       418
compact     24552    49 G   101 M       3          (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
  ctype     13751      88      50   10663       0       0       0  (default, delete, elision, move, read, rewrite, multi-level)
 memtbl         1    64 M
zmemtbl         0     0 B
   ztbl        12    99 M
 bcache     338 K   6.6 G   99.5%  (score == hit-rate)
 tcache     2.3 K   1.5 M  100.0%  (score == hit-rate)
  snaps         0       -       0  (score == earliest seq num)
 titers       210
 filter         -       -   98.8%  (score == utility)

nicktrav · 2022-10-22T17:15:28Z

Did a manual run, which completed in ~10 mins.

https://teamcity.cockroachdb.com/viewLog.html?buildId=7081957&buildTypeId=Cockroach_Nightlies_RoachtestStress&tab=buildLog

jbowens · 2022-10-22T18:24:01Z

I suspect this roachtest's runtime varies significantly, since it uses a prng to determine cancellation, etc. I'm guessing the timed out run involved more or larger import rollbacks. Large import rollbacks can easily cause high r-amp during the import retry, since all of the writes are duplicate writes. All of the ingests into the rolled back keyspace must be ingested into higher levels of the LSM. The import's CheckSSTConflicts calls also become more expensive and no longer hit the fast-paths.

For now, I think we should probably edit the test to be deterministic and excercise a fair amount of import rollback. Then we can see what we can do to make this case less painful. I would really like to do cockroachdb/pebble#1170 as soon as we can,, because that would really make a significant difference in this case.

Currently, due to the randomness present in the test, there may be significant variance in test runs. In some situations, the nature of the random import cancellations may result in the test timing out. This makes the test flaky. Use a fixed seed for the test's PRNG to make the test more deterministic. Fix cockroachdb#90510. Release note: None.

nicktrav · 2022-10-22T22:28:40Z

Another thing I'm noticing about the tpc-h workload - the queries are sloooow.

For example:

SELECT l_returnflag, l_linestatus, sum(l_quantity) AS sum_qty, sum(l_extendedprice) AS sum_base_price, sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price, sum((l_extendedprice * (1 - l_discount)) * (1 + l_tax)) AS sum_charge, avg(l_quantity) AS avg_qty, avg(l_extendedprice) AS avg_price, avg(l_discount) AS avg_disc, count(*) AS count_order
FROM lineitem
WHERE l_shipdate <= ('1998-12-01' - '90 days'::INTERVAL DAY)
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus

Has been running 3200+ seconds. Here's the explain:

There are no stats on that table. I assume with stats we'd be using the index on l_shipdate. In my manual run, I attempted to create stats on the table, but that takes a loooong time - I assume because it's 330 GiB.

edit: I misread the explain plan. It is using the right index. However, that particular query has to scan 98% of the table, based on stats I populated on that column manually, and the table is 600M+ rows.

For completeness, here's the plan with stats:

  distribution: full
  vectorized: true

  • sort
  │ estimated row count: 6
  │ order: +l_returnflag,+l_linestatus
  │
  └── • group (hash)
      │ estimated row count: 6
      │ group by: l_returnflag, l_linestatus
      │
      └── • render
          │
          └── • filter
              │ estimated row count: 1,183,444,976
              │ filter: l_shipdate <= '1998-09-02'
              │
              └── • scan
                    estimated row count: 1,199,396,967 (100% of the table; stats collected 14 minutes ago; using stats forecast for 2 hours in the future)
                    table: lineitem@lineitem_pkey
                    spans: FULL SCAN

With stats, the query finished in a fraction of the time:

Time: 252.179s total (execution 252.178s / network 0.001s)

cockroach-teamcity · 2022-10-23T11:40:32Z

roachtest.import-cancellation failed with artifacts on master @ cdf4b18048a9e29019125c64bea75c8d50fa3774:

test artifacts and logs in: /artifacts/import-cancellation/run_1
(import_cancellation.go:295).runImportSequence: Job 807582638859517958 failed.
(import_cancellation.go:282).runImportSequence: pq: query execution canceled
(import_cancellation.go:150).runImportCancellation: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=32 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

roachtest: import-cancellation failed #90434 roachtest: import-cancellation failed [C-test-failure O-roachtest O-robot T-storage branch-release-22.2]

_{This test on roachdash | Improve this report!}

cockroach-teamcity · 2022-10-24T15:29:49Z

roachtest.import-cancellation failed with artifacts on master @ 11e299bb1bec2f5666658393f14c12ebe11cc4a4:

test artifacts and logs in: /artifacts/import-cancellation/run_1
(import_cancellation.go:170).func2: output in run_125305.402841274_n1_workload_run_querybench: ./workload run querybench --db=csv --concurrency=1 --query-file=tpch --num-runs=3 --max-ops=66 {pgurl:1-6} --histograms=perf/stats.json --histograms-max-latency=8m20s returned: SSH_PROBLEM: exit status 255
(monitor.go:127).Wait: monitor failure: monitor task failed: t.Fatal() was called
(test_runner.go:1062).teardownTest: test timed out (4h0m0s)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=32 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

roachtest: import-cancellation failed #90434 roachtest: import-cancellation failed [C-test-failure O-roachtest O-robot T-storage branch-release-22.2]

_{This test on roachdash | Improve this report!}

Currently, due to the randomness present in the test, there may be significant variance in test runs. In some situations, the nature of the random import cancellations may result in the test timing out. This makes the test flaky. Use a fixed seed for the test's PRNG to make the test more deterministic. Additionally, reduce the size of the overall import by lowering the number of files used to generate the `lineitem` table. This should also help stabilize the runtime. Fix cockroachdb#90510. Release note: None.

nicktrav · 2022-10-24T18:13:29Z

Last night's run was also a timeout. The breakdown of the time spent is roughly as follows:

Build / spin-up < 1m
Import: 11:29 - 12:53 (1h24m)
Query: 12:53 - 15:29 (2h36m)

The failure prior to that was due to some context cancellation - still digging into that one.

celiala · 2022-10-24T19:34:36Z

Marking as GA blocker, as per thread.

cockroach-teamcity · 2022-10-25T16:08:45Z

roachtest.import-cancellation failed with artifacts on master @ c5decaeb69155399647909163451f3d8ca7858b6:

test artifacts and logs in: /artifacts/import-cancellation/run_1
(import_cancellation.go:170).func2: output in run_133601.467720815_n1_workload_run_querybench: ./workload run querybench --db=csv --concurrency=1 --query-file=tpch --num-runs=3 --max-ops=66 {pgurl:1-6} --histograms=perf/stats.json --histograms-max-latency=8m20s returned: SSH_PROBLEM: exit status 255
(monitor.go:127).Wait: monitor failure: monitor task failed: t.Fatal() was called
(test_runner.go:1062).teardownTest: test timed out (4h0m0s)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=32 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

roachtest: import-cancellation failed #90434 roachtest: import-cancellation failed [C-test-failure O-roachtest O-robot T-storage branch-release-22.2]

_{This test on roachdash | Improve this report!}

nicktrav · 2022-10-25T16:57:31Z

I'm going to test another patch to this test that forces table stats collection, waits for the jobs to complete, and then see what difference that makes to the workload runtime.

I suspect the stats collection job will be useful, as it will need to scan the whole keyspace.

Currently, due to the randomness present in the test, there may be significant variance in test runs. In some situations, the nature of the random import cancellations may result in the test timing out. This makes the test flaky. Use a fixed seed for the test's PRNG to make the test more deterministic. Reduce the size of the overall import by lowering the number of files used to generate the `lineitem` table. This should also help stabilize the runtime. Force the collection of table statistics as soon as the table has finished importing. Fix cockroachdb#90510. Release note: None.

cockroach-teamcity · 2022-10-26T12:23:54Z

roachtest.import-cancellation failed with artifacts on master @ 1b1c8da55be48c174b7b370b305f42622546209f:

test artifacts and logs in: /artifacts/import-cancellation/run_1
(test_impl.go:291).Fatal: dial tcp 35.229.90.8:26257: connect: connection refused

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=32 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

roachtest: import-cancellation failed #90434 roachtest: import-cancellation failed [C-test-failure O-roachtest O-robot T-storage branch-release-22.2]

_{This test on roachdash | Improve this report!}

90547: roachtest: import-cancellation: use a fixed seed; smaller import r=jbowens a=nicktrav Currently, due to the randomness present in the test, there may be significant variance in test runs. In some situations, the nature of the random import cancellations may result in the test timing out. This makes the test flaky. Use a fixed seed for the test's PRNG to make the test more deterministic. Additionally, reduce the size of the overall import by lowering the number of files used to generate the `lineitem` table. This should also help stabilize the runtime. Fix #90510. Release note: None. 90720: Revert fmtsafe.go output r=knz a=smg260 TestLint/TestRoachVet is a unit test which runs roachvet and greps various messages from its output - the reverted message being one of them. Release note: none. Epic: none 90736: sqlinstance: don't pretend to release the instance ID on shutdown r=andreimatei a=andreimatei Before this patch, the code looked like tenant SQL servers release their instance IDs on shutdown, but in fact they didn't. Instead of releasing on server shutdown, releasing the instance ID was done in a callback that was running under very specific circumstances - i.e. when the session heartbeat loop found the session record to have been deleted. Additionally, that callback also ran unreliably - it short-circuited if the session is not expired according to the local clock. This patch does away with releasing the instance ID, on two arguments: 1. The instance ID was rarely and inconsistently released to begin with, as explained above. 2. The instance ID release code is buggy - it blindly deletes the instance ID, even if another server had stolen it. In fact, given when we were releasing, I think we were particularly likely to release IDs that had been stolen. This is very dangerous. Better to never do it than run this risk. I've considered properly releasing the instance ID on shutdown, but I had trouble working out exactly when and where this can be done: you can't do it after stopper quiescing, and you don't want to do it too soon before that while the instance ID is still used by the server. I think properly releasing the ID will have to be coupled with a way of rejecting further uses of the ID. In the meantime I've left a TODO. Fixes #90576 Release note: None Co-authored-by: Nick Travers <travers@cockroachlabs.com> Co-authored-by: Miral Gadani <miral@cockroachlabs.com> Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>

Currently, due to the randomness present in the test, there may be significant variance in test runs. In some situations, the nature of the random import cancellations may result in the test timing out. This makes the test flaky. Use a fixed seed for the test's PRNG to make the test more deterministic. Reduce the size of the overall import by lowering the number of files used to generate the `lineitem` table. This should also help stabilize the runtime. Force the collection of table statistics as soon as the table has finished importing. Reduce the number of time each TPC-H query is repeated, from three to two. Fix #90510. Release note: None.

Currently, due to the randomness present in the test, there may be significant variance in test runs. In some situations, the nature of the random import cancellations may result in the test timing out. This makes the test flaky. Use a fixed seed for the test's PRNG to make the test more deterministic. Reduce the size of the overall import by lowering the number of files used to generate the `lineitem` table. This should also help stabilize the runtime. Force the collection of table statistics as soon as the table has finished importing. Reduce the number of time each TPC-H query is repeated, from three to two. Fix cockroachdb#90510. Release note: None.

cockroach-teamcity added this to the 22.2 milestone Oct 22, 2022

blathers-crl bot added the T-storage Storage Team label Oct 22, 2022

cockroach-teamcity mentioned this issue Oct 23, 2022

roachtest: import-cancellation failed #90434

Closed

nicktrav mentioned this issue Oct 24, 2022

roachtest: import-cancellation: use a fixed seed; smaller import #90547

Merged

nicktrav self-assigned this Oct 24, 2022

celiala added GA-blocker blocks-22.2.0-rc.1 and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Oct 24, 2022

craig bot closed this as completed in a37fc05 Oct 26, 2022

blathers-crl bot mentioned this issue Oct 26, 2022

release-22.2: roachtest: import-cancellation: use a fixed seed; smaller import #90741

Merged

nicktrav mentioned this issue Oct 27, 2022

release-22.2.0: add import-cancellation roachtest #90775

Merged

nicktrav added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure labels Dec 12, 2022

exalate-issue-sync bot added the blocks-22.2.0-rc.1 label Jul 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachtest: import-cancellation failed #90510

roachtest: import-cancellation failed #90510

cockroach-teamcity commented Oct 22, 2022 •

edited by exalate-issue-sync bot

Loading

nicktrav commented Oct 22, 2022

nicktrav commented Oct 22, 2022

nicktrav commented Oct 22, 2022

jbowens commented Oct 22, 2022 •

edited

Loading

nicktrav commented Oct 22, 2022 •

edited

Loading

cockroach-teamcity commented Oct 23, 2022

cockroach-teamcity commented Oct 24, 2022

nicktrav commented Oct 24, 2022

celiala commented Oct 24, 2022 •

edited

Loading

cockroach-teamcity commented Oct 25, 2022

nicktrav commented Oct 25, 2022

cockroach-teamcity commented Oct 26, 2022

roachtest: import-cancellation failed #90510

roachtest: import-cancellation failed #90510

Comments

cockroach-teamcity commented Oct 22, 2022 • edited by exalate-issue-sync bot Loading

nicktrav commented Oct 22, 2022

nicktrav commented Oct 22, 2022

nicktrav commented Oct 22, 2022

jbowens commented Oct 22, 2022 • edited Loading

nicktrav commented Oct 22, 2022 • edited Loading

cockroach-teamcity commented Oct 23, 2022

cockroach-teamcity commented Oct 24, 2022

nicktrav commented Oct 24, 2022

celiala commented Oct 24, 2022 • edited Loading

cockroach-teamcity commented Oct 25, 2022

nicktrav commented Oct 25, 2022

cockroach-teamcity commented Oct 26, 2022

cockroach-teamcity commented Oct 22, 2022 •

edited by exalate-issue-sync bot

Loading

jbowens commented Oct 22, 2022 •

edited

Loading

nicktrav commented Oct 22, 2022 •

edited

Loading

celiala commented Oct 24, 2022 •

edited

Loading