cuBLAS: non-contiguous tensor support #1215
Conversation
OK, now I have to merge with the CL code; technically the same kind of trick could work there with
Perplexity run done on the 2080 Ti: [655]6.2838,
(The output is weird because of the Slurm setup on the machine I have access to.) Without this change: 3.16 seconds per pass, 6.01 ms per token.
These ifdefs are kicking my ass
Should now build with OpenBLAS, CLBlast, cuBLAS, and nothingBLAS.
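For context, a rough sketch of what the backend selection looks like at the preprocessor level — the macro names follow ggml's GGML_USE_* convention, but the exact guards in this PR may differ:

```c
/* Rough sketch: at most one GGML_USE_* macro is defined by the build,
 * and the plain ggml path is used when none of them is ("nothingBLAS"). */
#if defined(GGML_USE_CUBLAS)
#  include <cublas_v2.h>   /* NVIDIA cuBLAS */
#elif defined(GGML_USE_CLBLAST)
#  include <clblast_c.h>   /* CLBlast C API */
#elif defined(GGML_USE_OPENBLAS)
#  include <cblas.h>       /* OpenBLAS / CBLAS */
#endif
```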
This will allow cuBLAS to multiply tensors that are not contiguous at the row or column level (I don't think llama has that situation) by using cudaMemcpy2DAsync.
Testing perplexity right now on a 2080 Ti.
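A minimal sketch of the cudaMemcpy2DAsync trick described above, assuming ggml-style naming (ne0/ne1 for dimensions, nb1 for the byte stride between rows); the helper name and signature are made up for illustration and are not the actual code in this PR:

```c
#include <stdint.h>
#include <cuda_runtime.h>

/* Hypothetical helper: pack a 2D float tensor whose rows are not contiguous
 * (row stride nb1 > ne0 * sizeof(float)) into a tightly packed device buffer.
 * cuBLAS can then treat the packed copy as an ordinary matrix with a single
 * leading dimension. */
static cudaError_t copy_strided_2d_to_device(
        float       * dst_dev,   /* packed destination on the device        */
        const float * src_host,  /* first element of the strided source     */
        size_t        nb1,       /* byte stride between rows (ggml's nb[1]) */
        int64_t       ne0,       /* elements per row                        */
        int64_t       ne1,       /* number of rows                          */
        cudaStream_t  stream) {
    const size_t width = ne0 * sizeof(float); /* bytes actually used per row */
    /* cudaMemcpy2DAsync copies ne1 rows of `width` bytes each, reading with
     * pitch nb1 and writing with pitch `width`, i.e. it packs the rows. */
    return cudaMemcpy2DAsync(dst_dev, width,
                             src_host, nb1,
                             width, ne1,
                             cudaMemcpyHostToDevice, stream);
}
```

The GEMM (e.g. cublasSetStream() followed by cublasSgemm()) can be queued on the same stream, so the packing copy and the multiply are ordered without extra synchronization.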