
[SYCL] Fix address space in casts from derived to base class #512

Merged — 2 commits into intel:sycl on Aug 16, 2019

Conversation

Fznamznon (Contributor) commented:

In the case of multiple inheritance of non-empty classes, clang emits two
bitcasts and a GEP for the conversion from a derived class to a base class.
The first bitcast converts the pointer to the derived class into an i8
pointer, the GEP then applies the base class offset to that i8 pointer, and
the second bitcast converts the offset i8 pointer into a pointer to the base
class. Under the new SYCL address space rules, the pointer to the derived
class may be in the generic address space; in that case two address space
casts should be emitted instead of bitcasts.
This problem caused an assertion failure inside CodeGen and generation of an
invalid module.

Signed-off-by: Mariya Podchishchaeva <mariya.podchishchaeva@intel.com>
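
For illustration, a minimal reproducer of the pattern described above (this snippet is not from the PR; the class names are made up and the IR in the comments is schematic):

   struct A { int a; };
   struct B { int b; };
   struct C : A, B { int c; };

   // Converting C* to B* needs a non-zero offset adjustment.
   B *to_base(C *p) { return p; }

   // clang lowers the conversion roughly as (address spaces elided):
   //   %1 = bitcast %struct.C* %p to i8*
   //   %2 = getelementptr inbounds i8, i8* %1, i64 4  ; offset of B in C
   //                                                  ; on a typical target
   //   %3 = bitcast i8* %2 to %struct.B*
   // Under the SYCL rules discussed here, %p may carry the generic address
   // space, so both casts must respect that address space.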

@@ -246,7 +246,7 @@ ApplyNonVirtualAndVirtualOffset(CodeGenFunction &CGF, Address addr,

   // Apply the base offset.
   llvm::Value *ptr = addr.getPointer();
-  ptr = CGF.Builder.CreateBitCast(ptr, CGF.Int8PtrTy);
+  ptr = CGF.Builder.CreatePointerBitCastOrAddrSpaceCast(ptr, CGF.Int8PtrTy);
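
For background on the builder call above: CreatePointerBitCastOrAddrSpaceCast chooses the cast opcode from the address spaces involved. A simplified sketch of that behavior (not LLVM's actual implementation):

   llvm::Value *bitCastOrASCast(llvm::IRBuilder<> &B, llvm::Value *V,
                                llvm::PointerType *DestTy) {
     auto *SrcTy = llvm::cast<llvm::PointerType>(V->getType());
     if (SrcTy->getAddressSpace() == DestTy->getAddressSpace())
       return B.CreateBitCast(V, DestTy);      // same address space: bitcast
     return B.CreateAddrSpaceCast(V, DestTy);  // otherwise: addrspacecast
   }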
A reviewer (Contributor) commented:

Shouldn't you cast the pointer to an i8* in the same address space? Casting from generic (or in fact from any address space other than private) to private is illegal, unless you know the actual address space of the pointer.

Fznamznon (Contributor, author) replied:

Could you please provide a link to the SPIR-V/OpenCL specification that says a cast from generic to private is illegal?

The reviewer (Contributor) replied:

There is no spec that defines OpenCL/SYCL address spaces in LLVM IR (unless you count the old SPIR spec).

Generally, if you have a generic pointer and you want to cast it to global/local/private, you should do this if and only if you know for sure that the generic pointer is in fact global, local, or private, and can never be in any other address space. In other words, you have to preserve the original address space of a pointer or leave it generic. Otherwise you are basically lying to the compiler, and this can result in miscompilation, or an inference pass can go wrong when it discovers the original address space and finds a mismatch between the source and destination address spaces of an addrspacecast instruction.
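
To illustrate that rule, an invented example using real IRBuilder APIs (the address space numbers assume the SPIR convention where 4 is generic and 0 is private; neither the function nor the numbers come from this thread):

   void demo(llvm::IRBuilder<> &B, llvm::Value *PrivPtr /* i8* in AS 0 */) {
     llvm::LLVMContext &C = B.getContext();
     // Widening private -> generic is always sound.
     llvm::Value *Gen = B.CreateAddrSpaceCast(
         PrivPtr, llvm::Type::getInt8PtrTy(C, /*AddrSpace=*/4));
     // Narrowing generic -> private is sound here only because Gen provably
     // originated from PrivPtr. For an arbitrary generic pointer this would
     // assert a fact the optimizer may later disprove (miscompilation risk).
     llvm::Value *Priv = B.CreateAddrSpaceCast(
         Gen, llvm::Type::getInt8PtrTy(C, /*AddrSpace=*/0));
     (void)Priv;
   }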

Fznamznon (Contributor, author) replied:

So it seems this is not actually illegal, but it could lead to address space inference failures.
I was a bit confused, because OpenCL allows casts from the generic address space to named address spaces, and it is easy to end up with such casts in IR generated from OpenCL.
Okay, I will change the patch to emit an i8* pointer in the same address space, to be safe.

In the case of multiple inheritance of non-empty classes, clang emits two
bitcasts and a GEP for the conversion from a derived class to a base class.
The first bitcast converts the pointer to the derived class into an i8
pointer, the GEP then applies the base class offset, and the second bitcast
converts the offset i8 pointer into a pointer to the base class. Under the
new SYCL address space rules, the pointer to the derived class may be in the
generic address space; in that case the two bitcasts should produce an i8
pointer in the generic address space.
This problem caused an assertion failure inside CodeGen and generation of an
invalid module.

Signed-off-by: Mariya Podchishchaeva <mariya.podchishchaeva@intel.com>
Fznamznon force-pushed the private/mpodchis/fixasforderived branch from aefb169 to 0f7903f on Aug 15, 2019
Signed-off-by: Mariya Podchishchaeva <mariya.podchishchaeva@intel.com>
Fznamznon (Contributor, author) commented:

@asavonic, are all your concerns addressed?

asavonic (Contributor) previously approved these changes on Aug 16, 2019 and left a comment:

LGTM. Please fix the name in CGClass.cpp

@@ -246,7 +246,9 @@ ApplyNonVirtualAndVirtualOffset(CodeGenFunction &CGF, Address addr,

   // Apply the base offset.
   llvm::Value *ptr = addr.getPointer();
-  ptr = CGF.Builder.CreateBitCast(ptr, CGF.Int8PtrTy);
+  llvm::Type *ResTy = llvm::PointerType::getInt8PtrTy(
The reviewer (Contributor) commented:

The name ResTy is not consistent with the rest of the names in this function.

Fznamznon (Contributor, author) replied:

Fixed.
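
For context, a minimal sketch of the shape of the fix (identifier names are illustrative, and per the review above ResTy was renamed; this is not the patch's verbatim code):

   // Build an i8* in the pointer's own address space, so the cast below
   // remains a plain bitcast and no address space is asserted away.
   unsigned AS = ptr->getType()->getPointerAddressSpace();
   llvm::Type *CharPtrTy = llvm::Type::getInt8PtrTy(CGF.getLLVMContext(), AS);
   ptr = CGF.Builder.CreateBitCast(ptr, CharPtrTy);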

bader merged commit 76e223c into intel:sycl on Aug 16, 2019
bb-sycl pushed a commit that referenced this pull request on Nov 8, 2019:
Summary:
The greedy register allocator occasionally decides to insert a large number of
unnecessary copies; see below for an example. The -consider-local-interval-cost
option (which X86 already enables by default) fixes this. We enable this option
for AArch64 only, after receiving feedback that this change is not beneficial
for PowerPC.

We evaluated the impact of this change on compile time, code size and
performance benchmarks.

This option has a small impact on compile time, measured on CTMark: a 0.1%
geomean regression at -O1 and -O2, and 0.2% at -O3, with at most 0.5% on
individual benchmarks.

The effect on both code size and performance on AArch64 for the LLVM test suite
is nil on the geomean with individual outliers (ignoring short exec_times)
between:

                 best     worst
  size..text     -3.3%    +0.0%
  exec_time      -5.8%    +2.3%

On SPEC CPU® 2017 (compiled for AArch64) there is a minor reduction (-0.2% at
most) in code size on some benchmarks, with a tiny movement (-0.01%) on the
geomean.  Neither intrate nor fprate show any change in performance.

This patch makes the following changes:

- For the AArch64 target, enableAdvancedRASplitCost() now returns true (a
  sketch of this override follows after this list).

- Ensures that -consider-local-interval-cost=false can disable the new
  behaviour if necessary.
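
A sketch of the override the first bullet describes (the placement in AArch64Subtarget.h follows the existing X86 precedent and is an assumption; the actual patch may differ in detail):

   // AArch64Subtarget.h (sketch): opt in to the more expensive split-cost
   // analysis (-consider-local-interval-cost) in the greedy register
   // allocator.
   bool enableAdvancedRASplitCost() const override { return true; }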

This matrix multiply example:

   $ cat test.c
   long A[8][8];
   long B[8][8];
   long C[8][8];

   void run_test() {
     for (int k = 0; k < 8; k++) {
       for (int i = 0; i < 8; i++) {
         for (int j = 0; j < 8; j++) {
           C[i][j] += A[i][k] * B[k][j];
         }
       }
     }
   }

results in the following generated code on AArch64:

  $ clang --target=aarch64-arm-none-eabi -O3 -S test.c -o -
  [...]
                                        // %for.cond1.preheader
                                        // =>This Inner Loop Header: Depth=1
        add     x14, x11, x9
        str     q0, [sp, #16]           // 16-byte Folded Spill
        ldr     q0, [x14]
        mov     v2.16b, v15.16b
        mov     v15.16b, v14.16b
        mov     v14.16b, v13.16b
        mov     v13.16b, v12.16b
        mov     v12.16b, v11.16b
        mov     v11.16b, v10.16b
        mov     v10.16b, v9.16b
        mov     v9.16b, v8.16b
        mov     v8.16b, v31.16b
        mov     v31.16b, v30.16b
        mov     v30.16b, v29.16b
        mov     v29.16b, v28.16b
        mov     v28.16b, v27.16b
        mov     v27.16b, v26.16b
        mov     v26.16b, v25.16b
        mov     v25.16b, v24.16b
        mov     v24.16b, v23.16b
        mov     v23.16b, v22.16b
        mov     v22.16b, v21.16b
        mov     v21.16b, v20.16b
        mov     v20.16b, v19.16b
        mov     v19.16b, v18.16b
        mov     v18.16b, v17.16b
        mov     v17.16b, v16.16b
        mov     v16.16b, v7.16b
        mov     v7.16b, v6.16b
        mov     v6.16b, v5.16b
        mov     v5.16b, v4.16b
        mov     v4.16b, v3.16b
        mov     v3.16b, v1.16b
        mov     x12, v0.d[1]
        fmov    x15, d0
        ldp     q1, q0, [x14, #16]
        ldur    x1, [x10, #-256]
        ldur    x2, [x10, #-192]
        add     x9, x9, #64             // =64
        mov     x13, v1.d[1]
        fmov    x16, d1
        ldr     q1, [x14, #48]
        mul     x3, x15, x1
        mov     x14, v0.d[1]
        fmov    x17, d0
        mov     x18, v1.d[1]
        fmov    x0, d1
        mov     v1.16b, v3.16b
        mov     v3.16b, v4.16b
        mov     v4.16b, v5.16b
        mov     v5.16b, v6.16b
        mov     v6.16b, v7.16b
        mov     v7.16b, v16.16b
        mov     v16.16b, v17.16b
        mov     v17.16b, v18.16b
        mov     v18.16b, v19.16b
        mov     v19.16b, v20.16b
        mov     v20.16b, v21.16b
        mov     v21.16b, v22.16b
        mov     v22.16b, v23.16b
        mov     v23.16b, v24.16b
        mov     v24.16b, v25.16b
        mov     v25.16b, v26.16b
        mov     v26.16b, v27.16b
        mov     v27.16b, v28.16b
        mov     v28.16b, v29.16b
        mov     v29.16b, v30.16b
        mov     v30.16b, v31.16b
        mov     v31.16b, v8.16b
        mov     v8.16b, v9.16b
        mov     v9.16b, v10.16b
        mov     v10.16b, v11.16b
        mov     v11.16b, v12.16b
        mov     v12.16b, v13.16b
        mov     v13.16b, v14.16b
        mov     v14.16b, v15.16b
        mov     v15.16b, v2.16b
        ldr     q2, [sp]                // 16-byte Folded Reload
        fmov    d0, x3
        mul     x3, x12, x1
  [...]

With -consider-local-interval-cost the same section of code results in the
following:

  $ clang --target=aarch64-arm-none-eabi -mllvm -consider-local-interval-cost -O3 -S test.c -o -
  [...]
  .LBB0_1:                              // %for.cond1.preheader
                                        // =>This Inner Loop Header: Depth=1
        add     x14, x11, x9
        ldp     q0, q1, [x14]
        ldur    x1, [x10, #-256]
        ldur    x2, [x10, #-192]
        add     x9, x9, #64             // =64
        mov     x12, v0.d[1]
        fmov    x15, d0
        mov     x13, v1.d[1]
        fmov    x16, d1
        ldp     q0, q1, [x14, #32]
        mul     x3, x15, x1
        cmp     x9, #512                // =512
        mov     x14, v0.d[1]
        fmov    x17, d0
        fmov    d0, x3
        mul     x3, x12, x1
  [...]

Reviewers: SjoerdMeijer, samparker, dmgreen, qcolombet

Reviewed By: dmgreen

Subscribers: ZhangKang, jsji, wuzish, ppc-slack, lkail, steven.zhang, MatzeB, qcolombet, kristof.beyls, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D69437