/* #source_intro
Source code introduction
----
The cmpr source is organized into blocks.
Each block starts with a block comment, which is often followed by some code.
The blocks can be read in sequence from start to end inside this file.
*/
/* #example_block
Add two integers.
int add(int a, int b)
Algorithm:
- Return the sum of the arguments.
*/
int add(int a, int b)
{
    return a + b;
}
/* #example_refs @example_block @config_fields
Here's a block that contains block references as an example of how these features work.
The id of this block itself is #example_refs, and the #example_block and #config_fields blocks are also pulled in as context.
These context blocks typically are used to provide context that is relevant to the current block.
The projfiles reference below is an inline reference and will be expanded inline.
It must be on a line by itself to count, so you can write @projfiles in a sentence, like this one, and it will not be replaced.
You can also use #hashtag syntax like we use above to refer to a block in NL text, and this will not be replaced either.
However, the #hashtag block id part of the block's top line will be included when expanding context blocks.
This means if you do use #hashtag to refer to some previous block, and you also include the same @hashtag block as context, then the expansion will include the same token sequence starting with the hash in both the place where the block is provided as context and later in the place where it is mentioned.
Both humans and LLMs can use this to connect the two locations quickly.
@projfiles
Here's our add function, as an example of a transformation applied to a block---in this case, getting the code part instead of the comment part:
@example_block:code
Note that the code reference is on a line by itself, but here we didn't include blank lines around it.
The blank lines are optional, and when we perform the replacement, we will maintain whitespace that you added.
You can also have ":code" or ":all" suffixes on context block references; for example, we could have written "@example_block:code" above to bring only the code part of the example block into the context, or "@example_block:all" for the full block.
The ex command :expand shows the expansion of the block.
*/
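/* Sketch: recognizing an inline reference line
The rule above says an inline reference only counts when it is on a line by itself.
Here is a minimal sketch of that check, assuming NUL-terminated strings instead of spans, and assuming block ids are made of letters, digits, and underscores.
The helper is_inline_ref_line is hypothetical and is not part of cmpr; the actual expansion code may differ.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

// Hypothetical helper: returns 1 if `line` consists solely of an inline
// reference such as "@projfiles" or "@example_block:code" (surrounding
// whitespace allowed), and 0 otherwise.
int is_inline_ref_line(const char *line) {
    while (*line == ' ' || *line == '\t') line++;
    if (*line != '@') return 0;
    line++;
    // The block id: assumed to be letters, digits, and underscores.
    const char *id_start = line;
    while (isalnum((unsigned char)*line) || *line == '_') line++;
    if (line == id_start) return 0;
    // Optional ":code" or ":all" suffix.
    if (strncmp(line, ":code", 5) == 0) line += 5;
    else if (strncmp(line, ":all", 4) == 0) line += 4;
    // Only trailing whitespace may follow.
    while (*line == ' ' || *line == '\t' || *line == '\r' || *line == '\n') line++;
    return *line == '\0';
}
```

With this rule, a line holding only "@example_block:code" would be expanded, while "@projfiles" in the middle of a sentence is left alone.
*/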
/* import libraries
*/
#include "spanio.c"
/* #langtable
Here we have a table of languages that we support.
(The columns are written as numbered items, for easier editing.)
1. Supported Language:
C, Python, JavaScript, Markdown
2. Filename extension:
C: .c
Python: .py
JavaScript: .js
Markdown: .md
3. Find blocks implementation:
C: find_blocks_language_c
Python: find_blocks_language_python
JavaScript: find_blocks_language_c
Markdown: find_blocks_language_markdown
4. Find blocks description string:
C: Blocks start with a C-style block comment at the beginning of a line (no leading whitespace).
Python: Blocks start with a triple-quoted string, also at the beginning of a line.
JavaScript: Uses the same rules as C (block comment flush left starts a block).
Markdown: Blocks start with a heading of any level.
5. Block comment end description:
C: Comment part ends with a C-style block comment that can end anywhere on a line.
Python: The triple-quote end also has to be at the start of a line.
JavaScript: Same as C.
Markdown: There is no comment part, markdown blocks are often all prose.
6. Markdown code block tag (langtag):
C: c
Python: py
JavaScript: js
Markdown: md
*/
/* #config_fields #conftable ->gpt3.5
Here we define the CONFIG_FIELDS macro, a list of X macro calls, e.g. X(cmprdir), for each config setting.
The config settings are:
- cmprdir, the directory for project state, usually .cmpr
- buildcmd, the command to do a build (e.g. by the "B" key)
- bootstrap, a command that creates an initial prompt (see README)
- cbcopy, the command to pipe data to the clipboard on the user's platform
- cbpaste, the same but for getting data from the clipboard
- curlbin, the path to curl (or just "curl" if unspecified)
- ollamas, a comma-separated list of ollama models to use
- model, the LLM currently in use, or "clipboard" for browser chat models
- debug, a set of single-character flags that turn on debugging features when present
*/
#define CONFIG_FIELDS \
X(cmprdir) \
X(buildcmd) \
X(bootstrap) \
X(cbcopy) \
X(cbpaste) \
X(curlbin) \
X(ollamas) \
X(model) \
X(debug)
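/* Sketch: how an X macro list like CONFIG_FIELDS gets used
A minimal, self-contained sketch of the X macro pattern, for readers who have not seen it before.
The list is expanded several times, each time with a different definition of X.
Everything named demo_* or DEMO_* is hypothetical and exists only for this illustration; const char* stands in for span, and the list is shortened to three fields.

```c
#include <assert.h>
#include <string.h>

// A shortened, illustrative field list; the real CONFIG_FIELDS has nine entries.
#define DEMO_FIELDS \
    X(cmprdir) \
    X(buildcmd) \
    X(model)

// Expansion 1: one struct member per field.
typedef struct {
#define X(name) const char *name;
    DEMO_FIELDS
#undef X
} demo_config;

// Expansion 2: a table of the field names as strings.
static const char *demo_field_names[] = {
#define X(name) #name,
    DEMO_FIELDS
#undef X
};

// Expansion 3: set a field by name; returns 1 on success, 0 if the key is unknown.
int demo_set(demo_config *c, const char *key, const char *value) {
#define X(name) if (strcmp(key, #name) == 0) { c->name = value; return 1; }
    DEMO_FIELDS
#undef X
    return 0;
}

// Number of fields in the list, derived from the name table.
int demo_field_count(void) {
    return (int)(sizeof demo_field_names / sizeof demo_field_names[0]);
}
```

Adding a setting then means adding one X(...) line to the list; the struct member, the name table, and the setter all pick it up.
*/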
/* #pragmas
A new feature called pragmas lets us give some blocks a special function.
A pragma must appear in the top line of a block, just like a block id, but uses "!" instead of "#".
We give our pragmas long and descriptive names so they can be interpreted easily when reading source code that uses them.
We can have a :pragmas ex command to quickly select from a menu of defined pragmas and add them to the current block, so it won't be necessary to remember them in order to use them.
Here are the defined pragmas:
1. Pragma name.
use_as_global_context
2. Description.
use_as_global_context: Causes this block to be a context reference of every other block, without having to be explicitly referenced as such.
@- context_of:block_id: injection
*/
/* #checksums
We have a checksum type and a checksums generic array type.
To get the checksum for a block, we use selected_checksum with the block contents.
Checksums are used as an opaque id for the contents of a span, such as a file or a block.
*/
/* #checksum_setup @checksums
Similar to projfiles, we have a generic arena-allocated array type for checksums, which we make by MAKE_ARENA with E = checksum, T = checksums, and 256 for the stack size.
Prior to this we typedef checksum as a struct containing only a u64.
Usually this will not be accessed, so we call it __u.
*/
typedef struct {
    u64 __u;
} checksum;
MAKE_ARENA(checksum, checksums, 256)
/* #projfiles
A project usually contains multiple files.
We have a projfile type which contains for each file:
- the path as a span
- the language, also a span
- the contents of the file, also a span
- a checksum of the contents, called cksum
Here we have a typedef for the projfile.
We also call our generic macro MAKE_ARENA with E = projfile, T = projfiles, choosing 256 for the stack size.
*/
typedef struct {
    span path;
    span language;
    span contents;
    checksum cksum;
} projfile;
MAKE_ARENA(projfile, projfiles, 256)
/* #rope
To support a particular memory allocation and access pattern we have the following interface:
@- Note: currently a very simple version of a "rope", since the use case is read-only data; it's really just a linked list of blocks of memory.
rope rope_new(size_t);
int rope_isnull(rope); // 1 if null (uninitialized or released), 0 otherwise
void rope_release(rope*);
span rope_alloc_atleast(rope*,size_t);
This was added to support our undo feature, which reads in and indexes all the revs.
The rope is opaque from the perspective of the caller, but it will have a u8 pointer to the allocated memory and two sizes, "used" and "cap", and a pointer to the next segment of the rope.
The memory is only ever extended in our use case until it is freed, so we have a way to ensure that there is at least some quantity of memory available in a contiguous block.
We only ever write into the last segment, so this method simply scans the segments until it reaches the last one (which has the next pointer set to NULL) and then checks if it has enough unused space for the request.
If not, it allocates a new segment and extends the rope.
Regardless, the call then returns a span that points to the uninitialized memory.
(This is different from our usual use of spans as strings.)
Each segment will be a minimum of 32MiB, so when allocating we either allocate a block of this size, or if a larger size was requested, we allocate the requested size.
When the rope is no longer needed, we walk the list and free all the allocated segments in turn.
*/
#define SEGMENT_SIZE (32 * 1024 * 1024)

typedef struct rope_segment {
    u8 *memory;
    size_t used;
    size_t cap;
    struct rope_segment *next;
} rope_segment;

typedef struct rope {
    rope_segment *head;
} rope;

rope rope_new(size_t initial_size) {
    rope r;
    r.head = (rope_segment *)malloc(sizeof(rope_segment));
    if (!r.head) return r; // allocation failed; rope_isnull(r) reports 1
    r.head->cap = initial_size > SEGMENT_SIZE ? initial_size : SEGMENT_SIZE;
    r.head->used = 0;
    r.head->memory = (u8 *)malloc(r.head->cap);
    if (!r.head->memory) {
        free(r.head);
        r.head = NULL;
        return r;
    }
    r.head->next = NULL;
    return r;
}

int rope_isnull(rope r) {
    return r.head ? 0 : 1;
}

void rope_release(rope *r) {
    rope_segment *current = r->head;
    while (current) {
        rope_segment *next = current->next;
        free(current->memory);
        free(current);
        current = next;
    }
    r->head = NULL;
}

span rope_alloc_atleast(rope *r, size_t size) {
    // We only ever write into the last segment, so find it first.
    rope_segment *current = r->head;
    while (current->next) {
        current = current->next;
    }
    // Not enough room left: append a new segment of at least SEGMENT_SIZE.
    if (current->cap - current->used < size) {
        size_t new_cap = size > SEGMENT_SIZE ? size : SEGMENT_SIZE;
        rope_segment *new_segment = (rope_segment *)malloc(sizeof(rope_segment));
        if (!new_segment) return (span){0};
        new_segment->memory = (u8 *)malloc(new_cap);
        if (!new_segment->memory) {
            free(new_segment);
            return (span){0};
        }
        new_segment->used = 0;
        new_segment->cap = new_cap;
        new_segment->next = NULL;
        current->next = new_segment;
        current = new_segment;
    }
    // Hand out the next `size` bytes of uninitialized memory.
    span result;
    result.buf = current->memory + current->used;
    current->used += size;
    result.end = current->memory + current->used;
    return result;
}
/* #rev_info
Revs are older revisions of files in the project.
They are stored in <cmprdir>/revs, and used to support the undo feature.
The rev_info structure holds metadata on our revs, and has the following elements:
- filenames, a spans holding paths (under revs/) in lexicographic order (also chronological order)
- fnbuf, a char * holding allocated memory for the filenames
- revblocks, a chronologically partially-ordered list of historical blocks with metadata
- n_revblocks, the size of that array
- cap_revblocks, the allocated capacity of revblocks, also in rev_block-sized units (i.e. not in bytes).
- revrope, a rope holding the rev contents
For the revblocks we also need a type, so we'll have a rev_block struct as well.
Then revblocks can be a pointer to a rev_block, and we'll manually manage the memory for it, doubling the cap when necessary.
On the rev_block struct we need:
- a span for the actual block contents
- a checksums, sorted_line_cksums, for the sorted line checksums which we use for similarity determination
- a spans for the zero or more IDs this block may have
- a timestamp, which we will store in seconds since the epoch as a time_t
Because of the inclusion, we declare rev_block first, then rev_info.
*/
typedef struct {
    span contents;
    checksums sorted_line_cksums;
    spans ids;
    time_t timestamp;
} rev_block;
typedef struct {
    spans filenames;
    char *fnbuf;
    rev_block *revblocks;
    size_t n_revblocks;
    size_t cap_revblocks;
    rope revrope;
} rev_info;
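/* Sketch: growing revblocks by doubling
The #rev_info comment says we manage the revblocks memory manually, doubling the cap when necessary.
Here is a minimal sketch of that growth policy, assuming a stand-in element type and an arbitrary starting capacity of 16.
The names demo_block, demo_blocks, and demo_blocks_push are hypothetical; the code that actually fills in rev_info is not shown here and may differ.

```c
#include <assert.h>
#include <stdlib.h>

// Stand-in element type; the real rev_block holds spans and checksums.
typedef struct { long timestamp; } demo_block;

typedef struct {
    demo_block *items;  // manually managed storage
    size_t n;           // number of elements in use
    size_t cap;         // capacity in elements, not bytes
} demo_blocks;

// Append one element, doubling the capacity when the array is full.
// Returns 1 on success, 0 if allocation failed (the array is left unchanged).
int demo_blocks_push(demo_blocks *a, demo_block b) {
    if (a->n == a->cap) {
        size_t new_cap = a->cap ? a->cap * 2 : 16;
        demo_block *p = realloc(a->items, new_cap * sizeof *p);
        if (!p) return 0;
        a->items = p;
        a->cap = new_cap;
    }
    a->items[a->n++] = b;
    return 1;
}
```

Doubling keeps the number of reallocations logarithmic in the final size, which matters when indexing a long rev history.
*/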
/* #ui_state @config_fields:all
We define a struct, ui_state, which we can use to store any information about the UI and the project data model in a single place.
This includes, so far:
- files, a projfiles array holding the files in the project
- current language, a span, used by the file/language config handler functions (and so probably shouldn't be here)
- the blocks, a spans
- the curr_block_idx (into .blocks), the number of blocks prior to the current one, or -1 if there is no current block (e.g. empty file state)
- the curr_file_idx (into .files), the number of files prior to the current one, or -1 if there is no current file (i.e. empty project state)
- a marked_index, which represents the "other end" (from curr_block_idx) of a selected range
@- block_cksums, a checksums holding checksums for each block -- currently unused
- the lines, a spans
@- line_cksums, a checksums for the lines
- revs, a rev_info structure which stores metadata about our revision history
- block_idx, a spans which is an index of block ids
- the search span which will contain "/" followed by some search if in search mode, otherwise will be empty()
- the previous_search span, used for n/N
- the ex_command which similarly contains ":" if in ex command entry mode, otherwise empty()
- config_file_path, a span
- terminal_rows and terminal_cols, which store the terminal dimensions
- scrolled_lines, the number of physical lines that have been scrolled off the screen upwards, supporting pagination of blocks
- openai_key, an OpenAI API key, or an empty span
- anthropic_key, an Anthropic API key, or an empty span
- bootstrapprompt, either empty or contains the bootstrap prompt (set by :bootstrap)
- ollama_models, a spans of the configured ollama model names if any
- now, a struct timespec, used by main_loop to give a consistent timestamp per loop iteration
- outputs_filenames, a spans, temporary place to hold outputs filenames until the feature is further along
Additionally, we include a span for each of the config fields, with an X macro inside the struct, using CONFIG_FIELDS defined above.
Below the ui_state struct/typedef, we declare a global ui_state* state, which will be initialized below by main().
*/
typedef struct ui_state {
    projfiles files;
    span current_language;
    spans blocks;
    int curr_block_idx;
    int curr_file_idx;
    int marked_index;
    spans lines;
    rev_info revs;
    spans block_idx;
    span search;
    span previous_search;
    span ex_command;
    span config_file_path;
    int terminal_rows;
    int terminal_cols;
    int scrolled_lines;
    span openai_key;
    span anthropic_key;
    span bootstrapprompt;
    spans ollama_models;
    struct timespec now;
    spans outputs_filenames;
#define X(name) span name;
    CONFIG_FIELDS
#undef X
} ui_state;
ui_state* state;
/* #network_ret network return type, used by LLM API functions
Contains a response, generally json, if success; an error, a human readable string, otherwise.
@- This is really an Either type, we might generalize the name of it if we want the same interface anywhere else.
*/
typedef struct {
    int success;
    span response;
    span error;
} network_ret;
/* #sbv_state @SBV_design
The state struct for the SBV feature.
@- SBV is the "select block version" feature available by the "U" keybinding.
@- The rest of the code and documentation comes later, but we need the struct here.
*/
typedef struct {
    int *revblock_indices;
    int max_index;
    int current_index;
    spans curr_block_ids;
    checksums sorted_line_cksums;
} sbv_state;
/* #partials
Needed by the LLM callback machinery.
See #make_output_saver, below, for more.
Here we set up a tagged union type for partial applications.
We have an enum with the tagged union types, PARTIAL_SP_SP and PARTIAL_SP, called PartialType.
The tagged union type, Partial, holds one of:
PARTIAL_SP_SP
This holds a function pointer and a span.
The name indicates that when it is partially applied, a span is stored, and then when it is called, another span is provided.
The underlying function takes both spans and returns void.
So the struct has two members, f and a, of types void(*)(span,span) and span.
PARTIAL_SP
This case is similar, but has no partially-applied arguments, as indicated by the 0 in the name of its constructor, partial_0_sp.
The underlying function just takes one span, and the tagged union only holds the function pointer.
In addition to the PartialType and Partial typedefs we also write some functions:
Partial partial_sp_sp(span, void(*)(span,span));
Partial partial_0_sp(void(*)(span));
void apply_partial(Partial,span);
Finally, we typedef llm_message_handler as an alias for Partial.
*/
typedef enum {
    PARTIAL_SP_SP,
    PARTIAL_SP
} PartialType;

typedef struct {
    PartialType type;
    union {
        struct {
            void (*f)(span, span);
            span a;
        } sp_sp;
        void (*f)(span);
    } value;
} Partial;

Partial partial_sp_sp(span a, void(*f)(span, span)) {
    Partial p;
    p.type = PARTIAL_SP_SP;
    p.value.sp_sp.f = f;
    p.value.sp_sp.a = a;
    return p;
}

Partial partial_0_sp(void(*f)(span)) {
    Partial p;
    p.type = PARTIAL_SP;
    p.value.f = f;
    return p;
}

void apply_partial(Partial p, span arg) {
    switch (p.type) {
        case PARTIAL_SP_SP:
            p.value.sp_sp.f(p.value.sp_sp.a, arg);
            break;
        case PARTIAL_SP:
            p.value.f(arg);
            break;
        default:
            // handle error
            break;
    }
}
typedef Partial llm_message_handler;
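/* Sketch: binding one argument now and supplying the other later
To make the idea behind the PARTIAL_SP_SP case concrete, here is a minimal, self-contained sketch.
It assumes a stand-in span type (a [buf, end) byte range) and reduces Partial to its two-span case.
All demo_* names are hypothetical; the real handlers, such as the output saver mentioned above, are defined elsewhere.

```c
#include <assert.h>
#include <string.h>

// Stand-in for the span type from spanio.c (assumed: a [buf, end) byte range).
typedef struct { const char *buf, *end; } demo_span;

// A one-case version of Partial: a two-span function plus its stored first argument.
typedef struct {
    void (*f)(demo_span, demo_span);
    demo_span a;
} demo_partial;

static char demo_log[64];

// Hypothetical handler: records "<prefix><message>" in demo_log.
void demo_save(demo_span prefix, demo_span msg) {
    size_t np = (size_t)(prefix.end - prefix.buf);
    size_t nm = (size_t)(msg.end - msg.buf);
    if (np + nm >= sizeof demo_log) return;  // too long for the demo buffer
    memcpy(demo_log, prefix.buf, np);
    memcpy(demo_log + np, msg.buf, nm);
    demo_log[np + nm] = '\0';
}

// Build a span over a NUL-terminated string.
demo_span demo_s(const char *s) {
    demo_span r = { s, s + strlen(s) };
    return r;
}

// Bind the first argument now ...
demo_partial demo_bind(demo_span a, void (*f)(demo_span, demo_span)) {
    demo_partial p = { f, a };
    return p;
}

// ... and supply the second one later, e.g. when an LLM reply arrives.
void demo_apply(demo_partial p, demo_span arg) {
    p.f(p.a, arg);
}
```

The caller that eventually invokes the handler only needs to know it takes one span; whatever was bound earlier travels along inside the struct.
*/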
/* #all_functions
Our functions are declared, usually in comments, and we have a Python script that extracts those decls into a header file.
This is convenient, because it means we can write a block as a self-contained unit, but call that function from anywhere, without having to manually maintain a header file.
Below these we have a list that we used to manually maintain; these should gradually be moved into their actual blocks.
(Note that even though the below is commented out, because of the mentioned Python script, it does affect our build!)
*/
#include "fdecls.h"
/* #op_includes @optable
We will generate these includes from optable later.
Currently, this wouldn't work, since not all our ops actually have corresponding ops/ files.
*/
#include "ops/nl2pl_rewrite.c"
#include "ops/pl2nl_rewrite.c"
#include "ops/nl2algo.c"
#include "ops/summarize_block.c"
#include "ops/agreement_to_pl_diff.c"
/*
// search
void start_search();
void perform_search();
void finalize_search();
void search_forward();
void search_backward();
int find_block(span); // find first block containing text
int block_by_id(span); // find a block by id (without hash char)
// ex commands
void start_ex();
void handle_ex_command();
void bootstrap();
void addfile(span);
void addlib(span);
void ex_help();
void set_highlight();
void reset_highlight();
void select_model();
int select_menu(spans opts, int sel); // allows selecting from a short list of options
void print_menu(spans, int);
// pagination and printing
void page_down();
void page_up();
void print_current_blocks();
void render_block_range(int,int);
void print_physical_lines(span, int);
int print_matching_physical_lines(span, span);
span count_physical_lines(span, int*);
void print_multiple_partial_blocks(int,int);
void print_single_block_with_skipping(int,int);
// supporting functions, CLI flags
void cmpr_init(); // handles --init
void print_block(int);
void print_comment(int);
void print_code(int);
int count_blocks();
void clear_display();
*/
/* #ingest_functions
At startup, get_code reads and indexes everything in the current project files.
The ingest() function is used whenever code is changed, and it redoes all of that indexing.
get_revs handles all the historical revisions in revs/, which can be a significant amount of data, so it runs only when it is needed (currently for the "U" feature only).
*/
void get_code(); // read and index current code
void get_revs(); // read and index revs
spans find_blocks(span); // find the blocks in a file
spans find_blocks_language(span file, span language); // find_blocks helper function dispatching on language
void find_all_lines(); // like find_all_blocks, but for lines; applies to the whole project
void index_block_ids();
void ingest(); // updates everything that needs to be updated after code has changed
/* #set_default_clipboard_commands
*/
char* detect_os() {
#ifdef _WIN32
    return "Windows";
#elif __APPLE__
    return "MacOS";
#elif __linux__
    return "Linux";
#else
    return "Unknown";
#endif
}

int is_wsl() {
    char buffer[256];
    FILE* fp = fopen("/proc/version", "r");
    if (fp != NULL) {
        if (fgets(buffer, sizeof(buffer), fp)) {
            fclose(fp);
            return strstr(buffer, "Microsoft") != NULL || strstr(buffer, "WSL") != NULL;
        }
        fclose(fp);
    }
    return 0;
}

void set_default_clipboard_commands() {
    char* os = detect_os();
    // WSL is checked first, since it also reports itself as Linux.
    if (is_wsl()) {
        if (empty(state->cbcopy)) state->cbcopy = S("clip.exe");
        if (empty(state->cbpaste)) state->cbpaste = S("powershell.exe Get-Clipboard");
    } else if (strcmp(os, "MacOS") == 0) {
        if (empty(state->cbcopy)) state->cbcopy = S("pbcopy");
        if (empty(state->cbpaste)) state->cbpaste = S("pbpaste");
    } else if (strcmp(os, "Linux") == 0) {
        if (empty(state->cbcopy)) state->cbcopy = S("xclip -i -selection clipboard");
        if (empty(state->cbpaste)) state->cbpaste = S("xclip -o -selection clipboard");
    } else if (strcmp(os, "Windows") == 0) {
        if (empty(state->cbcopy)) state->cbcopy = S("clip.exe");
        if (empty(state->cbpaste)) state->cbpaste = S("powershell.exe Get-Clipboard");
    }
}
/*
In the following block we write the main function, but here, we surround it by cpp directives so that if -D PROMPT_LIST is used, this main() function won't be included.
(Later we might have more alternate main functions if we continue generating different binaries from the same file, and we'll do something fancier here.)
Here we just write the directive that comes before the main function.
*/
#ifndef PROMPT_LIST
/* #main
In main,
We set up a stack_state variable, a zeroed ui_state, and point the state global to its address.
Then we initialize, then we read, then we go into the main loop.
To initialize, we will set up our i/o library, memory areas, and globals.
These decisions are recorded in the init() function which we define later.
There we group things that happen the same way every time the program is run.
In read_(), which we also define later, we read in project data and input that is different from run to run.
This starts with the arguments and configuration files, so we pass our argc and argv to it.
Then we read in the project's code.
@- The underscore in the name avoids collision with the C library function read()
Finally, we go into main_loop().
This function does not return, so that's also the end of main.
However, since main returns int, we indicate to the compiler that the line after main_loop() is unreachable.
This prevents us having to add a return 0 line which will never be used.
Below we write just the main function itself.
int main(int, char**);
*/
int main(int argc, char** argv) {
    ui_state stack_state = (ui_state){0};
    state = &stack_state;
    init();
    read_(argc, argv);
    main_loop();
    return 0;
}
/*
...and here we simply close the ifndef.
*/
#endif
/* #init @generic_array_initialization
void init();
We call init_spans_ioc which takes three size_t arguments, i, o, and c, the sizes of the input, output, and cmp spaces.
We use input space for the current contents of the files in the project.
We use output space for buffering output that is intended for stdout or stderr, including all our terminal output.
We use cmp space for all other miscellaneous uses, including other small files that we read and write to, such as conf files.
However, the rev files which we also read in have their own "rope" data structure.
We might use this for input files as well in the future, as we only need each file to be in contiguous memory.
Current sizes here:
inp space: 2^30
out space: 2^30
cmp space: 2^30
@- these are just the previous defaults that we brought over into direct arguments here; we might give some thought to the actual numbers in the future.
implementation limits
----
projfiles arena: 2^14
spans arena: 2^20
checksums arena: 2^30
These are all generic array types with separate arena allocation for each one.
We may replace this with some other approach in the future.
T and E types for the generic arrays:
projfiles projfile
spans span
checksums checksum
So here we call each T_arena_alloc(size_t) function (each of these has already been created by our generic array macro), with T replaced by the T type (which is the name of the array type).
@- projfiles doesn't need any of the machinery, since there's only ever one projfiles array; could be specialized away like we did for revs
Our spans arena size is a binary million.
We will hit this limit soon with prompt template expansion and other features, so we need to add instrumentation and give back spans memory.
Our checksums size, a "binary billion," is an overestimate, but we are still adding checksum-related features.
We set config_file_path on state to ".cmpr/conf", which is the default (but may be changed later by handle_args).
We set state->files, allocating space for 1024 files.
This means we only handle 1024 files in a project, which we might need to revisit if there are some larger projects out there (e.g. the linux kernel).
We then call:
- set_default_clipboard_commands
- read_openai_key
- read_anthropic_key
*/
void init() {
    init_spans_ioc(1UL<<30, 1UL<<30, 1UL<<30);
    projfiles_arena_alloc(1UL<<14);
    spans_arena_alloc(1UL<<20);
    checksums_arena_alloc(1UL<<30);
    state->config_file_path = S(".cmpr/conf");
    state->files = projfiles_alloc(1024);
    state->files.n = 0;
    set_default_clipboard_commands();
    read_openai_key();
    read_anthropic_key();
}
/* #read_
This is one of the setup functions called from main().
void read_(int argc, char** argv);
In init() we set up things that are the same on every run, but here, we read information from the environment that could change every time.
This starts with the current working directory.
We call find_cmprdir which checks ".cmpr" to see if we are in the top level directory of a cmpr project.
If not, it will chdir up the filesystem hierarchy to see if a parent directory is a cmpr project, and will leave us in that directory if so.
We projfiles_alloc the files array on the state.
We just set the capacity to the full capacity of the projfiles arena, which we can find in the init() function, above, since there won't be any other projfiles arrays allocated.
We call a function handle_args to handle argc and argv.
This function will not do anything directly, but sets indicators on state (possibly spans in cmp space) that define what we should do next.
@- This function will also read our config file (if any).
We call check_conf_vars() once after this; we will also call it in the main loop but we need it before trying to get the code.
We call find_cmprdir, which looks for the .cmpr directory that makes a directory into a cmpr project, and sets it on state.
We set config_file_path on the state to the default configuration file path which is ".cmpr/conf", i.e. always relative to the CWD.
This is important because in handle_args we will already want to either print the args in the default file, or otherwise in a non-standard file location, so this is really where we set the default configuration.
We call check_dirs() which creates any missing directories.
Next we call a function get_code().
This function reads the files indicated by our config file, populates inp, and handles any code indexing steps.
*/
void read_(int argc, char** argv) {
    handle_args(argc, argv);
    check_conf_vars();
    check_dirs();
    get_code();
}
/* delete feature
20240707
Deleting a block is a feature we already have, by using 'e' and simply deleting the block's contents.
So the 'd' feature just needs to do this exact same operation.
However, what we really want is undelete, especially since we now have 'U'ndo.
It would be strange if we had undo for changes to a block, but not for block deletions.
So we can have 'D' for a menu of deleted blocks.
This will be any block that was completely removed, unless it was so similar to a current block that it is an undo match for it.
In other words, if you delete a function in another editor, we will create a new rev without that function.
If you open the delete list, we will then show you that function as of the time that it was deleted.
However, if a block was a duplicate of another block, then we would not consider it as deleted.
Also, if it has the same block id as an existing block, we would consider that the block still exists and has not been deleted.
However, we could have lists of changed (or deleted) lines, just like we currently do for blocks.
We could have, in fact, a continuous unified-diff-like display of the updates to the project, as timestamps and unified diffs for each rev.
In other words, as we go back through the revs, we could use diff itself to tell us which current file the rev is closest to.
We can then represent the revs as this compressed representation.
It is strange that we don't have any way to actually know the filename that the rev corresponds to.
Probably we should have some kind of manifest representation, which can be a Merkle tree, something like a hash of (path, hash) pairs, one per file in the project, similar to how git represents trees.
Or, we should just start keeping the conf file in revs every time it is modified.
(Which we actually currently do if it is edited with 'e', but most people won't have added it.)
If a file is removed from the project, we will detect all the blocks in that file as not currently in the project (unless they were moved into other files).
However, we will not know what file it was.
There is another clever way to get back a deleted block.
If you know the block id, you can simply create a new block, give it that id, and then use 'U' to select the previous version of that block.
Just as Enter in 'U' resets the content of the block, Enter in 'D' might set a register, like what 'dd' or 'yy' does in vi.
Then 'p' or 'P' can be used to put the block in place in the current project.
As a side note, we imagine a "C natural" filetype, intended for existing C code that doesn't use our style of blocks defined by block comments.
This should blockize in a deterministic way similar to what a human would do, and should make use of ctags.
Alternatively, we could have a single filetype that would automatically detect the language as it goes and blockize accordingly.
In existing codebases, we should get blocks for free and also ids for most of those blocks for free from ctags.
This means we should allow using tags such as "main()" to identify the block that defines the main function in the current namespace as per ctags.
That means we can do block references as "@main()" or "@main():all", without needing to do any extra work to define block ids.
*/
/* #call_llm @complain_and_prompt @network_ret:code
void call_llm(span model, json messages, llm_message_handler cb);
Here we get a model name, a json object containing chat messages, and a callback function to handle LLM output from a successful API call.
We dispatch on model name.
- if it starts with "gpt" or matches "llama.cpp" we will use call_gpt and handle_openai_response
- if it starts with "claude" we use call_anthropic and handle_anthropic_response
- otherwise we use call_ollama and handle_ollama_response
In any case we call the appropriate function and get back a network_ret object.
In the case of error we #complain_and_prompt, printing only the error as-is.
Otherwise we pass on the .response and the callback function to the appropriate handle_*_response function.
Implementation:
1. Call the model.
2. Handle the error case.
3. Handle the successful case.
First set an indicator variable to distinguish between the three model types, since only steps 1 and 3 branch on the model type.
*/
void call_llm(span model, json messages, llm_message_handler cb) {
network_ret ret;
int is_gpt = starts_with(model, S("gpt")) || span_eq(model, S("llama.cpp"));
int is_claude = starts_with(model, S("claude"));
if (is_gpt) {
ret = call_gpt(messages, model);
} else if (is_claude) {
ret = call_anthropic(messages, model);
} else {
ret = call_ollama(messages, model);
}
if (!ret.success) {
wrs(ret.error);
prt("\nPress any key to continue...");
flush();
getch();
return;
}
if (is_gpt) {
handle_openai_response(ret.response, cb);
} else if (is_claude) {
handle_anthropic_response(ret.response, cb);
} else {
handle_ollama_response(ret.response, cb);
}
}
/* #read_openai_key
void read_openai_key();
In this function, we check that the file ~/.cmpr/openai-key exists and can be read.
We use getenv to get the HOME directory, and construct the path from there (using a char buffer of size PATH_MAX).
If the file does not exist at all, we assume that the user is not using the feature, and we want to return silently in that case.
Therefore, we use stat to check if the file exists, and if not, return.
(This will leave openai_key empty, indicating to the rest of the code that the API is not available.)
Otherwise, we will read the file into cmp space using our library method.
(This will complain and exit on any failure to read the file, as we intend.)
We set state.openai_key to point to the file contents.
However, we want to trim whitespace (such as the newline that typically ends the file), since we use the key as a string (such as in an HTTP header), so we call trim() on it.
*/
void read_openai_key() {
char path[PATH_MAX];
struct stat st;
char *home = getenv("HOME");
if (!home) return;
snprintf(path, PATH_MAX, "%s/.cmpr/openai-key", home);
if (stat(path, &st) != 0) return;
state->openai_key = trim(read_file_into_cmp(S(path)));
}
/* #read_anthropic_key @read_openai_key:all
void read_anthropic_key();
This is the same as #read_openai_key, with s/openai/anthropic/ everywhere.
*/
void read_anthropic_key() {
char path[PATH_MAX];
struct stat st;
char *home = getenv("HOME");
if (!home) return;
snprintf(path, PATH_MAX, "%s/.cmpr/anthropic-key", home);
if (stat(path, &st) != 0) return;
state->anthropic_key = trim(read_file_into_cmp(S(path)));
}
/* #filename_template @template_language_design @assoc_spans
The filename_template pattern lets us reliably generate file paths across the codebase using a simple syntax.
@- Usage: include @filename_template as a context reference to simply use the filename_template function.
@- Or, include @filename_template:all and special instructions to inline the interpretation of a static (compile-time) pattern into direct concat() calls, etc.
span filename_template(span);
We use helper functions to parse the template, set up a context (or variable dictionary) and then evaluate the template.
We return a span in cmp space, and we only extend cmp space by the length of this span.
Note that users of this function will often write a pattern like this: "<cmprdir>/revs/<timestamp>", as an example.
However, our template language uses curly braces, so we would translate this into S("{cmprdir}/revs/{timestamp}") before calling filename_template (or expand_template).
The difference between filename_template and expand_template is that filename_template already has the variables in scope and therefore takes only one argument (the template), whereas expand_template requires the caller to also provide the variable dictionary.
@- We can do this either by only using spans that were already allocated, or if we allocate new spans, by copying our final output into place at the previous end of cmp space.
@- This can also be abstracted into a library pattern, where we push cmp.end onto a stack, and then pop with a span argument.
@- The span argument will be copied to the previous end of cmp unless it is already at that location, and will be returned from the pop call, which can then be the return value of the function that's returning the cmp-allocated span.
@- In this case, all our filename components are going to be static (at least per main loop iteration) which makes our filename templates have the properties we want.
@- In particular, the {timestamp} variable will be set once per main loop iteration, which means that we can rely on multiple files (as under api_calls) having the same timestamp, even though we are making separate calls to the filename_template function.
Implementation:
We call filename_variables to get a spans containing a dictionary.
Then we simply return the result of expand_template on the template with this dictionary.
@- Partial inlined implementation:
@- In cases where the pattern is partly inlined, instead of calling filename_template, we can expand the filename pattern into a static spans.
@- We can then call expand_template directly inline, with the static spans and a filename_variables() call as arguments.
@- Fully inlined implementation:
@- Using concat and the NL part of filename_variables, we could inline the simplest cases directly as spanio library calls and references to state values.
@- This pattern would match most of the current places where <cmprdir> is used.
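Illustrative sketch only, not the actual expand_template:
The curly-brace expansion described above can be shown on plain C strings.
The names ex_lookup and ex_expand are hypothetical, and the error behavior (returning -1 for an unknown variable, an unterminated brace, or a full buffer) is an assumption made for this example.
The real function works on spans and allocates in cmp space instead of a caller-supplied buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: find name[0..len) in a NULL-terminated array of
// alternating keys and values, returning the value or NULL if absent.
static const char *ex_lookup(const char **kv, const char *name, size_t len) {
  for (size_t i = 0; kv[i] && kv[i + 1]; i += 2) {
    if (strlen(kv[i]) == len && memcmp(kv[i], name, len) == 0) return kv[i + 1];
  }
  return NULL;
}

// Expand "{name}" references in tmpl into out (capacity cap, terminator included).
// Returns the expanded length, or -1 on overflow, an unterminated brace,
// or an unknown variable (these failure modes are assumptions of the sketch).
static long ex_expand(const char *tmpl, const char **kv, char *out, size_t cap) {
  assert(tmpl && kv && out);
  size_t o = 0;
  if (cap == 0) return -1;
  for (const char *p = tmpl; *p; ) {
    if (*p == '{') {
      const char *close = strchr(p + 1, '}');
      if (!close) return -1;
      const char *val = ex_lookup(kv, p + 1, (size_t)(close - (p + 1)));
      if (!val) return -1;
      size_t n = strlen(val);
      if (n >= cap - o) return -1; // keep one byte free for the terminator
      memcpy(out + o, val, n);
      o += n;
      p = close + 1;
    } else {
      if (1 >= cap - o) return -1; // need room for this byte and the terminator
      out[o++] = *p++;
    }
  }
  out[o] = '\0';
  return (long)o;
}
```

With the variables cmprdir and timestamp defined, ex_expand on "{cmprdir}/revs/{timestamp}" joins the two values around the literal "/revs/" text.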
*/
span filename_template(span template) {
spans vars = filename_variables();
return expand_template(template, vars);
}
/* #assoc_spans @gcb
Similar to assoc lists in Lisp.
An "assoc spans" is just a regular spans (which is a generic array of span elements) with a particular interpretation.
Much like in Lisp, we use them to represent maps for small, code-like datasets such as variable environments.
To set one up, we simply create a spans and append (variable name, value) pairs onto it.
To look up values in it, we just examine the even-offset spans until we find a match and then return or otherwise use the value at the following offset.
I.e. (i*2) and (i*2 + 1) are the offsets (0-based array indices) of the i'th key and value, respectively.
We have a function, assoc_spans_lookup(spans, span) which returns the value for a given key, or the null span if it is not found.
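Illustrative sketch only, not the actual assoc_spans_lookup:
The even/odd offset layout above can be shown with simplified stand-in types.
The ex_span and ex_spans structs and the ex_* functions are hypothetical; the real span and spans types come from the spanio library and may be laid out differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Simplified stand-ins for illustration (assumed, not the spanio definitions).
typedef struct { const char *buf; size_t len; } ex_span;
typedef struct { ex_span *s; size_t n; } ex_spans;

// Wrap a C string literal as a span, loosely mirroring S("...").
static ex_span ex_S(const char *z) {
  ex_span r = { z, strlen(z) };
  return r;
}

// Two spans are equal when they have the same length and the same bytes.
static int ex_span_eq(ex_span a, ex_span b) {
  return a.len == b.len && (a.len == 0 || memcmp(a.buf, b.buf, a.len) == 0);
}

// Scan the even offsets (keys); on a match return the value at the next offset.
// Returns a null span (buf == NULL, len == 0) when the key is not present.
static ex_span ex_assoc_lookup(ex_spans kv, ex_span key) {
  assert(kv.n % 2 == 0); // keys and values always come in pairs
  for (size_t i = 0; i < kv.n / 2; i++) {
    if (ex_span_eq(kv.s[i * 2], key)) return kv.s[i * 2 + 1];
  }
  ex_span none = { NULL, 0 };
  return none;
}
```

The scan is linear, which fits the small, code-like datasets mentioned above.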
*/