update post about cbr
update Posts about DrawCall
Tomicyo committed Aug 21, 2017
1 parent a2f392e commit 7bc4155
Showing 88 changed files with 189 additions and 4 deletions.
7 changes: 7 additions & 0 deletions 1.ue4_insights/ReadMe.md
@@ -0,0 +1,7 @@
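# Unreal Insights Series

This series covers the internals of the UE4 engine and records some of the pitfalls encountered while implementing custom requirements.

* [Rendering Realistic Characters](shading_models/paragon_character_tech.md)
* [Dynamic Global Illumination with LPV](global_illumination/lpv.md)
* [The UE4 Rendering Framework](renderer_architect/renderer.md)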
File renamed without changes.
File renamed without changes.
File renamed without changes.
6 changes: 6 additions & 0 deletions 3.build_next_gen_gfx_lib/ReadMe.md
@@ -0,0 +1,6 @@
# Kaleido3D Development Diary

* [Inspiration for the NGFX Library](ngfx/ngfx_impl.md)
* [The Beginning of Kaleido3D](ngfx/initial.md)
* [Reworking NGFX Shader Compilation](ngfx/compiler.md)
* [Cross-Platform Implementation](posts/cross_platform.md)
File renamed without changes.
File renamed without changes.
1 change: 1 addition & 0 deletions 3.build_next_gen_gfx_lib/posts/cross_platform.md
@@ -0,0 +1 @@
# Details of the Cross-Platform Implementation
Empty file.
Empty file.
File renamed without changes
File renamed without changes.
63 changes: 63 additions & 0 deletions 5.checkboard_rendering/decima.md
@@ -0,0 +1,63 @@
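# Checkerboard Rendering in the Decima Engine on PS4 Pro

* Render 50% of the pixels each frame
* Selectively choose the sample positions each frame
* The following are still rendered at native resolution:
  * Depth buffer
  * Triangle index buffer
  * Alpha-tested coverage

## Checkerboard Rotation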
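
The "50% of the pixels each frame" idea can be illustrated with a small sketch. It assumes a plain checkerboard mask that flips with frame parity; the function names are hypothetical and are not taken from Decima.

```cpp
#include <cassert>

// Returns true when pixel (x, y) is shaded in the given frame.
// Assumption: a plain checkerboard whose phase flips every frame.
inline bool IsShadedThisFrame(int x, int y, int frame_index)
{
    assert(x >= 0 && y >= 0 && frame_index >= 0);
    return ((x + y + frame_index) & 1) == 0;
}

// Counts the pixels shaded in one frame of a width x height target.
inline int CountShadedPixels(int width, int height, int frame_index)
{
    int count = 0;
    for (int y = 0; y < height; ++y)
    {
        for (int x = 0; x < width; ++x)
        {
            count += IsShadedThisFrame(x, y, frame_index) ? 1 : 0;
        }
    }
    return count;
}
```

With this mask, any two consecutive frames together touch every pixel exactly once, which is what lets a resolve pass rebuild a full-resolution image from two half-resolution frames.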

We can transform this rotated buffer into what we call a ‘tangram’. We call it a tangram because it’s sort of like that so-called puzzle game.

We can cut the rotated buffer into parts and shuffle them around like so.
The nice thing about that is that it’s completely lossless, and allows the 2160p checkerboard data to be packed into a compact 2160x2160 texture again. And it also still supports bilinear sampling.
And because of the exact way we placed these parts, we can use the built-in texture-wrap hardware to do the unwrapping for us, without any additional logic or shader instructions during sampling.

The only thing required during sampling is rotating the native-res UV by 45 degrees, and offsetting this by an offset that’s constant per frame.

```cpp
// Vec2, Vec3, and ASSERT are assumed to come from the engine's math and debug headers.
struct Vertex
{
    Vec3 mPos;
    Vec2 mUV;
    Vertex(const Vec3& pos, const Vec2& uv) : mPos(pos), mUV(uv) { }
};

// UV rotation: writes 12 vertices (three quads of four vertices each) to out_vertices.
void GetVerticesForTangramRendering(int native_width, int native_height, bool is_even_frame, Vertex* out_vertices)
{
    // Only 16:9 targets are supported.
    ASSERT(native_width == (native_height * 16) / 9);
    float half_width = 0.5f * (float)native_width;
    float half_height = 0.5f * (float)native_height;

    // Prepare three 45-degree rotated quads, placed to cover each checkerboard pixel exactly once.
    for (int i = 0; i < 3; ++i)
    {
        // The half-pixel offset alternates between axes depending on frame parity.
        float x = (float)native_height * (i == 2 ? 1.0f : 0.0f) + (is_even_frame ? -0.5f : 0.0f);
        float y = (float)native_height * (i == 1 ? -1.0f : 0.0f) + (is_even_frame ? 0.0f : 0.5f);
        out_vertices[4 * i + 0] = Vertex(Vec3(x, y, 1.0f), Vec2(0.0f, 0.0f));
        out_vertices[4 * i + 1] = Vertex(Vec3(half_width + x, half_width + y, 1.0f), Vec2(1.0f, 0.0f));
        out_vertices[4 * i + 2] = Vertex(Vec3(half_width - half_height + x, half_width + half_height + y, 1.0f), Vec2(1.0f, 1.0f));
        out_vertices[4 * i + 3] = Vertex(Vec3(-half_height + x, half_height + y, 1.0f), Vec2(0.0f, 1.0f));
    }
}
```
## Tangram Assembly and Sampling
``` c
// Get the uv for the native-res output pixel, repeating the outer most pixels to prevent blending with different tangram parts/the padding areas.
// The border distance was chosen to allow for a bit of safe neighborhood sampling, but this detail is implementation specific.
int2 native_pos = (int2)(uv * float2(native_width, native_height));
native_pos.x = clamp(native_pos.x, 1.0, native_width - 3.0);
native_pos.y = min(native_pos.y, native_height - 3.0);
float is_odd_frame = ... // 1 for odd frames, 0 for even frames
// Get the tangram uv, pointing exactly to halfway the nearest two corner samples in the tangram.
float2 tangram_uv = float2(-1.0 + is_odd_frame + native_pos.x - native_pos.y, 2.0 + is_odd_frame + native_pos.x + native_pos.y) * (0.5 / native_height);
// Do a simple resolve
float4 tangram_color = tex2Dlod(tangram_texture, bilinear_sampler, tangram_uv, 0.0);
```
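
To make the mapping easier to experiment with, here is a CPU-side port of the `tangram_uv` expression from the shader snippet above. `Float2`, `TangramUV`, and `WrapRepeat` are hypothetical names introduced for this sketch, and `WrapRepeat` only emulates what repeat addressing in the texture unit would do.

```cpp
#include <cassert>
#include <cmath>

struct Float2
{
    float x;
    float y;
};

// CPU port of the tangram_uv expression: rotate the native pixel position
// by 45 degrees (x - y, x + y) and apply the per-frame offset.
inline Float2 TangramUV(int native_x, int native_y, int native_height, bool is_odd_frame)
{
    assert(native_height > 0);
    const float odd = is_odd_frame ? 1.0f : 0.0f;
    const float scale = 0.5f / static_cast<float>(native_height);
    const float u = (-1.0f + odd + static_cast<float>(native_x - native_y)) * scale;
    const float v = (2.0f + odd + static_cast<float>(native_x + native_y)) * scale;
    return Float2{u, v};
}

// Emulates repeat (wrap) addressing by keeping only the fractional part.
inline float WrapRepeat(float value)
{
    return value - std::floor(value);
}
```

Pixels below the diagonal produce a negative `u`; wrapping brings them back into the texture, which is the role the text assigns to the built-in texture-wrap hardware.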
Empty file.
102 changes: 102 additions & 0 deletions 7.about_drawcall/draw_call.md
@@ -0,0 +1,102 @@
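# Some Thoughts on Draw Call Optimization

When profiling shows that a game's frame rate is bottlenecked on the CPU, check how much of the frame is spent in graphics API calls; they may well be the cause of the low frame rate.
There are two directions for optimizing draw calls:

* Reworking the renderer at the engine level
* Reworking the game assets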

## AZDO (Approaching Zero Driver Overhead)

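Why does the driver incur overhead?

* Traditional DX11/OpenGL drivers **validate the call parameters and resources** on every API call, and this validation takes time.
* To make API calls more fault-tolerant, the timing of **GPU/CPU memory allocation** is also nondeterministic.
* **Binding and synchronizing** graphics objects also costs CPU time.

To reduce this overhead, graphics APIs provide AZDO-style interfaces for developers.

With traditional DX11 and OpenGL, AZDO-style submission can be achieved through the indirect drawing interfaces they provide: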

* DrawIndexedInstancedIndirect
* glMultiDrawElementsIndirect

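> DX11 call
``` c
// Pseudocode: writeUniformData, writeDrawCommand, updateCommands, and objects
// stand for application-side helpers and data.
DrawElementsIndirectCommand* commands = ...;
for (UINT i = 0; i < commandCount; ++i)
{
    writeUniformData(objects[i], &uniformData[i]);
    writeDrawCommand(objects[i], &commands[i]);
}
updateCommands(drawArgsBuffer, commands, commandCount);
// DrawIndexedInstancedIndirect consumes one argument record per call,
// so each command is issued at its own byte offset in the argument buffer.
for (UINT i = 0; i < commandCount; ++i)
{
    context->DrawIndexedInstancedIndirect(drawArgsBuffer, i * sizeof(DrawElementsIndirectCommand));
}
```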
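> OpenGL call
``` c
// Pseudocode: writeUniformData, writeDrawCommand, and objects stand for
// application-side helpers and data.
DrawElementsIndirectCommand* commands = ...;
for (GLsizei i = 0; i < commandCount; ++i)
{
    writeUniformData(objects[i], &uniformData[i]);
    writeDrawCommand(objects[i], &commands[i]);
}
// A single call submits every command in the array.
glMultiDrawElementsIndirect(
    GL_TRIANGLES,
    GL_UNSIGNED_SHORT,
    commands,
    commandCount,
    0 // stride 0: commands are tightly packed
);
```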

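* Drawing batched models with indirect draws reduces CPU draw time, provided that every model in a batch uses the same render state and the same resource binding layout.
* At the resource binding stage, texture arrays (TextureArrayObject) can reduce the number of texture binds; buffers can simply be copied directly.
* Resources can also be bound through the bindless interfaces provided by driver vendors to minimize overhead, at the cost of more complex code.

> Render state includes the shader, RasterState, DepthStencil, VertexLayout, PrimitiveTopology, and so on.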
### Shader Changes

After switching to indirect draws, the resource binding code can be reworked around an index that maps each draw ID to its resource IDs.
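
A minimal CPU-side sketch of such an index follows. The record layout and the names `DrawResourceIndex` and `DrawIndexTable` are assumptions made for illustration; in practice the table would be uploaded as a structured buffer and the shader would index it with its draw ID.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-draw record; the fields are illustrative.
struct DrawResourceIndex
{
    std::uint32_t transform_index; // slot in a shared transform buffer
    std::uint32_t material_index;  // slot in a shared material buffer
    std::uint32_t texture_layer;   // layer in a shared texture array
};

class DrawIndexTable
{
public:
    // Registers a draw and returns its draw ID (its position in the table).
    std::uint32_t Add(const DrawResourceIndex& entry)
    {
        entries_.push_back(entry);
        return static_cast<std::uint32_t>(entries_.size() - 1);
    }

    // CPU-side equivalent of the lookup a shader would do with its draw ID.
    const DrawResourceIndex& Lookup(std::uint32_t draw_id) const
    {
        assert(draw_id < entries_.size());
        return entries_[draw_id];
    }

    // Contiguous data that could be uploaded as one buffer.
    const std::vector<DrawResourceIndex>& Data() const { return entries_; }

private:
    std::vector<DrawResourceIndex> entries_;
};
```

Because every draw reads its resources through this table, the bindings themselves stay identical for the whole batch.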

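## Traditional Mesh and Texture Merging

* In UE4, static objects in a scene can be merged through the HLOD system to reduce the number of draw calls.
* Android's font/UI rendering library likewise uses atlases and batch rendering to merge draw calls.

### Draw Call Optimization in Rainbow Six

* A material-based draw call dispatch system (essentially batched rendering)
* Unified buffer definitions (to simplify resource binding)
  * VertexBuffer
  * IndexBuffer
  * ConstantBuffer
  * StructBuffer holding the draw call parameters
* Automatic shader generation, which allows new models to be validated quickly
* Draw call collection
  * Each batch corresponds to one IndirectDraw command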
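
The following sketch shows how meshes stored in shared vertex and index buffers might be turned into indirect draw commands. It is an illustration of the idea, not the Rainbow Six implementation: `MeshRange` and `BuildCommands` are hypothetical, the command struct follows the five 32-bit fields documented for `glMultiDrawElementsIndirect`, and using `baseInstance` to carry the draw ID is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Five 32-bit fields, in the order documented for glMultiDrawElementsIndirect.
struct DrawElementsIndirectCommand
{
    std::uint32_t count;         // number of indices to draw
    std::uint32_t instanceCount; // number of instances
    std::uint32_t firstIndex;    // offset into the shared index buffer
    std::int32_t  baseVertex;    // offset into the shared vertex buffer
    std::uint32_t baseInstance;  // used here to carry the draw ID (assumption)
};

// Hypothetical description of one mesh inside the shared buffers.
struct MeshRange
{
    std::uint32_t index_count;
    std::uint32_t first_index;
    std::int32_t  base_vertex;
};

// Builds one command per mesh; the whole array can then be submitted as one batch.
inline std::vector<DrawElementsIndirectCommand>
BuildCommands(const std::vector<MeshRange>& meshes)
{
    std::vector<DrawElementsIndirectCommand> commands;
    commands.reserve(meshes.size());
    for (std::size_t i = 0; i < meshes.size(); ++i)
    {
        const MeshRange& mesh = meshes[i];
        assert(mesh.index_count % 3 == 0); // this sketch handles triangle lists only
        commands.push_back({mesh.index_count, 1u, mesh.first_index, mesh.base_vertex,
                            static_cast<std::uint32_t>(i)});
    }
    return commands;
}
```
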

Results:

|Draw calls without batching|Draw calls with batching (VIS + GBuffer + decals)|Draw calls with batching (shadows)|Culling efficiency improvement|
|:--:|:--:|:--:|:--:|
|10537|412|64|73%|

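## Multithreaded Command Buffer Recording and Submission

If the renderer is migrated to DX12-class APIs (Vulkan/Metal), driver-level optimizations can make draw call submission up to ten times more efficient, and building and submitting GPU commands in parallel squeezes the most out of the GPU.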

![](images/3d_mark.png)

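As shown above, in the 3DMark test **Vulkan and DX12** can issue up to **13 times as many draw calls as DX11** in the same amount of time, so the gain from the driver is significant.

* Even with DX12-class APIs, traditional draw call optimization techniques still have their place.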
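
The parallel recording pattern can be sketched without any graphics API. Here `CommandList` is a stand-in that only stores strings, and `RecordInParallel` is a hypothetical helper; a real renderer would record into API command lists or command buffers and hand them to a queue in the same fixed order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Stand-in for an API command list; it only records draw names.
struct CommandList
{
    std::vector<std::string> commands;
};

// Records the draws into worker_count lists in parallel and returns the lists
// in worker order, so the later submission order stays deterministic.
inline std::vector<CommandList>
RecordInParallel(const std::vector<std::string>& draws, std::size_t worker_count)
{
    assert(worker_count > 0);
    std::vector<CommandList> lists(worker_count);
    std::vector<std::thread> workers;
    const std::size_t chunk = (draws.size() + worker_count - 1) / worker_count;

    for (std::size_t w = 0; w < worker_count; ++w)
    {
        // Each worker writes only to its own list, so no locking is needed.
        workers.emplace_back([&draws, &lists, w, chunk]
        {
            const std::size_t begin = w * chunk;
            const std::size_t end = std::min(begin + chunk, draws.size());
            for (std::size_t i = begin; i < end; ++i)
            {
                lists[w].commands.push_back("draw " + draws[i]);
            }
        });
    }
    for (std::thread& worker : workers)
    {
        worker.join();
    }
    return lists;
}
```

Giving each worker its own list avoids contention during recording; only the final submission has to be serialized.
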

# References

1. [Approaching Zero Driver Overhead](https://www.slideshare.net/CassEveritt/approaching-zero-driver-overhead)
2. [Rendering Rainbow Six](http://twvideo01.ubm-us.net/o1/vault/gdc2016/Presentations/El_Mansouri_Jalal_Rendering_Rainbow_Six.pdf)
3. [A Brief Analysis of the Android HWUI Hardware Acceleration Module](https://github.com/TsinStudio/AndroidDev)
Binary file added 7.about_drawcall/images/3d_mark.png
14 changes: 10 additions & 4 deletions README.md
@@ -1,7 +1,13 @@
# Game Development Notes

- [Character Technology Analysis of *Paragon*](1.character_tech_in_paragon/paragon_character_tech.md)
- [Unreal Insights Series](1.ue4_insights/ue4_insights.md)
- [Character Technology Analysis of *Paragon*](1.ue4_insights/shading_models/paragon_character_tech.md)
- [LPV Dynamic Global Illumination](1.ue4_insights/global_illumination/lpv.md)
- [Rendering Framework](1.ue4_insights/renderer_architect/renderer.md)
- [Kaleido3D Development Diary](3.build_next_gen_gfx_lib/ReadMe.md)
- [Building a C++ Reflection Framework with Clang](2.reflect_cpp_with_clang/reflect_cpp_with_clang.md)
- [SIGGRAPH 2017 Game Rendering Techniques: Ocean Rendering](5.ocean_rendering/ocean_rendering.md)
- [SIGGRAPH 2017 Game Rendering Techniques: Checkerboard Rendering in Decima](7.checkboard_rendering/decima.md)
- [Reprojection Optimization in Oculus VR](6.oculus_vr_reprojection/oculus_reprojection.md)
- SIGGRAPH 2017 Advanced Game Rendering Techniques
- [Ocean Rendering](4.siggraph2017_game/ocean_rendering.md)
- [Checkerboard Rendering in Decima](5.checkboard_rendering/decima.md)
- [Reprojection Optimization in Oculus VR](6.oculus_vr_reprojection/oculus_reprojection.md)
- [Some Thoughts on Draw Call Optimization](7.about_drawcall/draw_call.md)
