Bug when changing from platform tools v1.37 to v1.39 #252

LucasSte · 2024-03-14T19:58:39Z

Problem

The following steps work when OpenBook-V2 is built with platform tools v1.37, but fail when it is built with platform tools v1.39.

Steps to reproduce:

git clone https://github.com/openbook-dex/openbook-v2
cargo build-sbf --features enable-gpl
solana program deploy ./target/deploy/openbook_v2.so
yarn run ts-node test.ts

import { Connection, Keypair, LAMPORTS_PER_SOL, PublicKey, Transaction } from '@solana/web3.js';
import * as openbook from '@openbook-dex/openbook-v2';
import { Program, BN, AnchorProvider, Wallet } from '@coral-xyz/anchor';
import * as splToken from "@solana/spl-token";

import * as fs from 'fs';

async function createMint(connection : Connection,  authority: Keypair, nb_decimals = 6) : Promise<PublicKey> {
    const kp = Keypair.generate();
    return await splToken.createMint(connection, 
        authority, 
        authority.publicKey, 
        authority.publicKey, 
        nb_decimals,
        kp)
}

export async function main() {

    const secretKey = JSON.parse(fs.readFileSync("/home/galactus/.config/solana/id.json", "utf-8"));
    const keypair = Keypair.fromSecretKey(new Uint8Array(secretKey));
    const authority = keypair;
    const payer = authority;
    const connection = new Connection("https://api.testnet.rpcpool.com/dfeb84a5-7fe8-4783-baf9-60cca0babbc7", "processed");


    let airdrop_sig = await connection.requestAirdrop(authority.publicKey, 2 * LAMPORTS_PER_SOL);
    await connection.confirmTransaction(airdrop_sig);

    let baseMint = await createMint(connection, authority, 6);
    let quoteMint = await createMint(connection, authority, 6);

    const quoteLotSize = new BN(1000000);
    const baseLotSize = new BN(1000000000);
    const makerFee = new BN(0);
    const takerFee = new BN(0);
    const timeExpiry = new BN(0);

    const wallet = new Wallet(authority);

    const programId = new PublicKey("AiqQtnazKRRUkn9enZ9SLUy35FS5aT38QrBET3qCiPqF");
    const provider = new AnchorProvider(connection, wallet, {});
    let client = new openbook.OpenBookV2Client( provider, programId);

    // Add your test here.
    const [[bidIx, askIx, eventHeapIx, ix], [market, bidsKeypair, askKeypair, eventHeapKeypair]] = await client.createMarketIx(
      authority.publicKey,
      "Market Name",
      quoteMint,
      baseMint,
      quoteLotSize,
      baseLotSize,
      makerFee,
      takerFee,
      timeExpiry,
      null, // oracleA
      null, // oracleB
      null, // openOrdersAdmin
      null, // consumeEventsAdmin
      null, // closeMarketAdmin
    );
    console.log("sending tx");

    let tx = new Transaction();
    tx.add(bidIx);
    tx.add(askIx);
    tx.add(eventHeapIx);
    tx.add(ix);
    tx.recentBlockhash = (await connection.getLatestBlockhash()).blockhash;
    // Send transaction
    let sig = await connection.sendTransaction(tx, [authority, market, bidsKeypair, askKeypair, eventHeapKeypair], {
        skipPreflight: false
    });
    console.log('Your transaction signature', sig);
    await connection.confirmTransaction(sig);
    
    console.log("Market initialized successfully");
    console.log("Market account:", market.publicKey.toBase58());
    console.log("Bids account:", bidsKeypair.publicKey.toBase58());
    console.log("Asks account:", askKeypair.publicKey.toBase58());
    console.log("Event heap account:", eventHeapKeypair.publicKey.toBase58());
    // console.log("Market authority:", market.authority.toBase58());
    console.log("Quote mint:", quoteMint.toBase58());
    console.log("Base mint:", baseMint.toBase58());
    console.log("Quote lot size:", quoteLotSize.toString());
    console.log("Base lot size:", baseLotSize.toString());
}

main().then(x => {
    console.log('finished successfully')
}).catch(e => {
    console.log('caught an error : ' + e)
})

Error in v1.39:

Program 9QJrVWzEaZBjao31iqBNaGqmXUNim7tmdb9kgczqGQXD failed: Access violation in unknown section at address 0x0 of size 8

Proposed Solution

Investigating the problem


acheroncrypto · 2024-03-15T10:19:36Z

Anchor has a bunch of tests that fail after upgrading to 1.18 CLI, with the main difference coming from platform-tools v1.37 vs v1.39 (coral-xyz/anchor#2795 (comment)).

The tests work as long as the program is built using an earlier version than v1.39, independent of solana-cli, test-validator or solana-program version used.

LucasSte · 2024-03-18T18:22:08Z

As an update for this issue, the pyth test failures mentioned in coral-xyz/anchor#2795 (comment) by @acheroncrypto are caused by the change in the minimum size for enums in Rust. I've fixed this bug in anza-xyz/rust#90. I ran the test with the fix and everything turned out green.
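
The thread does not say which enums were affected by the size change. As a minimal sketch, assuming the concern is code that depends on an enum's in-memory size, an explicit `repr` attribute pins the size and a compile-time assertion makes any layout change fail the build. The `Side` and `WideSide` types and the `enum_sizes` helper are hypothetical names used only for illustration.

```rust
use std::mem::size_of;

/// Hypothetical order side; `repr(u8)` fixes the in-memory size at one byte.
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Side {
    Bid = 0,
    Ask = 1,
}

/// Same variants with an explicit four-byte representation.
#[repr(u32)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum WideSide {
    Bid = 0,
    Ask = 1,
}

// Compile-time guards: the build fails if these sizes are ever different.
const _: () = assert!(size_of::<Side>() == 1);
const _: () = assert!(size_of::<WideSide>() == 4);

/// Returns both sizes so they can be logged or compared at run time.
pub fn enum_sizes() -> (usize, usize) {
    (size_of::<Side>(), size_of::<WideSide>())
}
```

Guards like these do not fix a compiler bug, but they can turn a silent layout change into a build error.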

The OpenBook-V2 failure has been consuming more time, as it is a large contract that also depends on Anchor's code generation. I discovered the anchor expand command, which generates a single file with the Rust code passed to the compiler, and I've been stripping out the code portions that do not influence the error.

I haven't yet pinpointed the problem, but I suspect something has changed in Rust's data structures that interferes with function calls and stack variables.

godmodegalactus · 2024-03-18T20:26:36Z

I guess we can create a simpler example then if we could pinpoint where the issue comes from.
Or we can try to test an anchor example.

acheroncrypto · 2024-03-19T06:56:06Z

> As an update for this issue, the pyth test failures mentioned in coral-xyz/anchor#2795 (comment) by @acheroncrypto are caused by the change in the minimum size for enums in Rust. I've fixed this bug in anza-xyz/rust#90. I ran the test with the fix and everything turned out green.

Nice! Have you checked any of the other failures too?

> The OpenBook-V2 failure has been consuming more time, as it is a large contract that also depends on Anchor's code generation. I discovered the anchor expand command, which generates a single file with the Rust code passed to the compiler, and I've been stripping out the code portions that do not influence the error.

Here is a much shorter example that is likely related: https://beta.solpg.io/65cbb30bcffcf4b13384cf5b (run locally)

> I haven't yet pinpointed the problem, but I suspect something has changed in Rust's data structures that interferes with function calls and stack variables.

I think we might be using more memory somehow. The behavior on the example I've shared is very weird too.

LucasSte · 2024-03-20T12:33:05Z

I've found the cause of the problem in OpenBook. Your example, @acheroncrypto, is likely hitting the same problem, as I simplified the OpenBook contract so much that it ended up looking like the code you showed.

What is the problem?

SBFv1 functions have a frame size limited to 4096 bytes (4 KiB), so a function that uses too many stack variables risks overwriting the frame of its caller. In the OpenBook example, the Anchor-generated function try_accounts deserializes instructions and accounts and performs all account checks, with heavy stack use. Such a function can get quite big when an instruction uses many accounts, as is the case for OpenBook.

In the example, try_accounts writes a value into the frame of create_market, which had stored a pointer on its stack. When create_market later reads that pointer back, it gets the overwritten value and tries to access it, which leads to a memory access violation because the stack slot now contains gibberish.
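
The overwrite described above can be pictured with a toy model. This is only a sketch under stated assumptions: two adjacent 4096-byte frames in a plain byte vector, with the callee's locals growing downward from the top of its frame. `ToyStack`, its methods, and the pointer value are hypothetical, and the layout is not meant to mirror the real VM memory map.

```rust
/// Frame size in the toy model; the thread gives 4096 bytes for SBFv1.
const FRAME_SIZE: usize = 4096;

/// Toy stack with two adjacent fixed-size frames:
/// bytes [0, FRAME_SIZE) belong to the caller,
/// bytes [FRAME_SIZE, 2 * FRAME_SIZE) belong to the callee.
struct ToyStack {
    bytes: Vec<u8>,
}

impl ToyStack {
    fn new() -> Self {
        ToyStack { bytes: vec![0; 2 * FRAME_SIZE] }
    }

    /// The caller saves a pointer in the top 8 bytes of its own frame.
    fn caller_store_pointer(&mut self, ptr: u64) {
        let slot = FRAME_SIZE - 8;
        self.bytes[slot..slot + 8].copy_from_slice(&ptr.to_le_bytes());
    }

    /// The caller reads the saved pointer back after the callee returns.
    fn caller_load_pointer(&self) -> u64 {
        let slot = FRAME_SIZE - 8;
        let mut raw = [0u8; 8];
        raw.copy_from_slice(&self.bytes[slot..slot + 8]);
        u64::from_le_bytes(raw)
    }

    /// The callee fills `locals_size` bytes of locals, growing downward from
    /// the top of its frame. Anything beyond FRAME_SIZE lands in the caller's
    /// frame. `locals_size` must not exceed 2 * FRAME_SIZE in this model.
    fn callee_write_locals(&mut self, locals_size: usize, fill: u8) {
        let top = 2 * FRAME_SIZE;
        for byte in &mut self.bytes[top - locals_size..top] {
            *byte = fill;
        }
    }
}

/// Runs the scenario and returns the pointer the caller sees afterwards.
fn pointer_after_call(locals_size: usize) -> u64 {
    let mut stack = ToyStack::new();
    stack.caller_store_pointer(0x2000_0000_0000);
    stack.callee_write_locals(locals_size, 0xAA);
    stack.caller_load_pointer()
}
```

In this model, `pointer_after_call(4096)` returns the saved pointer intact, while `pointer_after_call(4104)` returns a value made entirely of the callee's fill bytes: the callee needed only 8 bytes more than its frame to clobber the caller's slot.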

SBFv2 introduces dynamic stack frames, so this problem won't exist anymore once we migrate to the new runtime.

Why wasn't this a problem in v1.37?

Alongside Rust updates, the LLVM backend is also updated. Although the SBF code generation hasn't been modified, LLVM's target-independent code generation is constantly updated. This time, the SROA (Scalar Replacement of Aggregates) pass, an optimization that breaks structs down into their individual values, was updated and now breaks down structs in different places in the code, using more stack space than before.

In v1.37, try_accounts uses exactly 4096 bytes of the stack, so it only took a couple more allocations to break the code. These extra allocations come from the new SROA pass.
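
For readers unfamiliar with SROA, the following is a hand-written picture of what scalar replacement means at the source level. It is not compiler output, and it does not reproduce the stack growth seen in v1.39; the `Fees` struct and both functions are hypothetical. The pass itself runs on LLVM IR, and where each resulting scalar ends up (a register or a stack slot) depends on later stages.

```rust
/// A small aggregate, standing in for the much larger structs in a real program.
#[derive(Clone, Copy)]
struct Fees {
    maker: u64,
    taker: u64,
}

/// Before: the two values travel together as one struct-typed local.
fn total_with_struct(maker: u64, taker: u64) -> u64 {
    let fees = Fees { maker, taker };
    fees.maker + fees.taker
}

/// After (conceptually): the struct is replaced by one scalar per field.
/// Each scalar is then placed independently, in a register or on the stack.
fn total_scalarized(maker: u64, taker: u64) -> u64 {
    let fees_maker = maker;
    let fees_taker = taker;
    fees_maker + fees_taker
}
```

Both functions compute the same result; only the way the values are stored differs, which is why a change in where the pass applies can shift a function's stack usage without any change in the source code.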

Any solution?

Although we can disable the SROA pass, doing so won't make try_accounts impervious to future optimization changes, nor keep it from overflowing its frame when a contract uses too many accounts. One suggestion is to break that method down into smaller ones, decreasing stack usage.
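
A minimal sketch of that suggestion, assuming each account check needs its own scratch buffer. All names here are hypothetical and this is not Anchor's generated code. `#[inline(never)]` is only a hint that asks the compiler to keep the helper as a separate function, so that its buffer lives in the helper's frame rather than the caller's; actual stack usage still depends on the optimizer.

```rust
/// Size of the scratch buffer each hypothetical account check needs.
const SCRATCH_BYTES: usize = 1024;

/// Adds up the bytes of a buffer; stands in for real validation work.
fn sum(bytes: &[u8]) -> u64 {
    bytes.iter().map(|&byte| u64::from(byte)).sum()
}

/// Monolithic shape: all three buffers are locals of one function,
/// so they may all count toward that single function's frame.
fn check_monolithic(a: u8, b: u8, c: u8) -> u64 {
    let buf_a = [a; SCRATCH_BYTES];
    let buf_b = [b; SCRATCH_BYTES];
    let buf_c = [c; SCRATCH_BYTES];
    sum(&buf_a) + sum(&buf_b) + sum(&buf_c)
}

/// Split shape: each check owns its buffer only for the duration of the call.
#[inline(never)]
fn check_one(seed: u8) -> u64 {
    let buf = [seed; SCRATCH_BYTES];
    sum(&buf)
}

/// The caller's own frame now only holds the three small results.
fn check_split(a: u8, b: u8, c: u8) -> u64 {
    check_one(a) + check_one(b) + check_one(c)
}
```

Both shapes return the same value; the difference is that the split version needs at most one scratch buffer alive at a time.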

LucasSte · 2024-03-20T18:31:09Z

> > As an update for this issue, the pyth test failures mentioned in coral-xyz/anchor#2795 (comment) by @acheroncrypto are caused by the change in the minimum size for enums in Rust. I've fixed this bug in anza-xyz/rust#90. I ran the test with the fix and everything turned out green.

> Nice! Have you checked any of the other failures too?

I had a look at ido-pool, but the problem I found there is the same one as in OpenBook. We'll back-port the enum size bug fix to v1.18. Please, @acheroncrypto, let us know if you need anything else to get your PR merged.

acheroncrypto · 2024-03-21T03:24:53Z

> Alongside Rust updates, the LLVM backend is also updated. Although the SBF code generation hasn't been modified, LLVM's target-independent code generation is constantly updated. This time, the SROA (Scalar Replacement of Aggregates) pass, an optimization that breaks structs down into their individual values, was updated and now breaks down structs in different places in the code, using more stack space than before.

Thanks for the explanation! I'll note that this looks like a major regression for our case because not only does the 10-account example I've given work on v1.37, but you can also add many more accounts to the instruction until you hit the transaction size limit (1232 bytes). It doesn't run into any stack issues even with many more accounts used compared to v1.39.

> Although we can disable the SROA pass, doing so won't make try_accounts impervious to future optimization changes, nor keep it from overflowing its frame when a contract uses too many accounts. One suggestion is to break that method down into smaller ones, decreasing stack usage.

The issue is that we can fix these problems in our tests, but it's likely that many of the production programs will also hit this problem once they start using solana-cli 1.18.

> I had a look at ido-pool, but the problem I found there is the same one as in OpenBook. We'll back-port the enum size bug fix to v1.18. Please, @acheroncrypto, let us know if you need anything else to get your PR merged.

Thanks, we'll first need a new release that has the fixes to get the PR merged.

We also have some token 2022 tests failing, which I haven't yet debugged, but they are most likely not related to platform-tools.

LucasSte · 2024-03-21T14:36:45Z

> > Alongside Rust updates, the LLVM backend is also updated. Although the SBF code generation hasn't been modified, LLVM's target-independent code generation is constantly updated. This time, the SROA (Scalar Replacement of Aggregates) pass, an optimization that breaks structs down into their individual values, was updated and now breaks down structs in different places in the code, using more stack space than before.

> Thanks for the explanation! I'll note that this looks like a major regression for our case because not only does the 10-account example I've given work on v1.37, but you can also add many more accounts to the instruction until you hit the transaction size limit (1232 bytes). It doesn't run into any stack issues even with many more accounts used compared to v1.39.

Those regressions are a concern, but I have good news. I've tested platform-tools v1.41 against this OpenBook-V2 issue and against Anchor's tests/pyth and tests/ido-pool tests, and the new LLVM version made everything work again. We'll back-port v1.41 to Solana v1.18.

acheroncrypto · 2024-03-27T11:34:37Z

Thanks @LucasSte! The memory issues we had are fixed in the 1.18.8 release.

LucasSte · 2024-03-27T13:02:02Z

> Thanks @LucasSte! The memory issues we had are fixed in the 1.18.8 release.

Thanks for the feedback. Can we close this issue?

acheroncrypto · 2024-03-27T13:13:38Z

I think so, yes.

LucasSte self-assigned this Mar 14, 2024

acheroncrypto mentioned this issue Mar 15, 2024

Large Stack Variables Error in Anchor 0.29.0 coral-xyz/anchor#2835

Open

acheroncrypto mentioned this issue Mar 20, 2024

[feat] Add Token22 Extensions coral-xyz/anchor#2789

Merged

LucasSte mentioned this issue Mar 21, 2024

Back-port Bump platform tools version (from solana-labs/solana #35330) #362

Merged

acheroncrypto mentioned this issue Mar 25, 2024

Upgrade Solana to 1.18.8 coral-xyz/anchor#2867

Merged

LucasSte closed this as completed Mar 28, 2024

anthontaylor mentioned this issue May 2, 2024

Access violation in stack frame when compiling program with any solana version greater than 1.18.0 #1158

Closed

This was referenced May 4, 2024

Stack Size increased with 1.18.x (platform-tools) solana-labs/solana#35543

Closed

Stack Size increased with 1.18.x (platform-tools) #1186

Open
