# Georges Hatem

## Requirements
 - mORMot2 library
 - 64-bit compilation

## Hardware + Environment
Host:
 - Dell XPS 15 (9560, 2017)
 - OS: Arch Linux
 - CPU: Intel i7-7700HQ (4 cores, 8 threads @ 2.80-3.80 GHz, Kaby Lake)
 - 32 GB DDR4 RAM (2400 MHz)
 - 1 TB NVMe SSD

VM (VirtualBox):
 - OS: Windows
 - CPU count: 4 out of 8 (threads, probably)
 - 20 GB RAM

Note about the hash:
run with the DEBUG compiler directive to write from the stream directly to the file, otherwise the hash will not match.

## Baseline
The initial implementation (the Delphi baseline found in /baseline) aimed to produce correct output, regardless of performance:
"Make it work, then make it work better".
It turns out even the baseline caused some trouble, namely the `Ceil` implementation yielded different results between FPC and Delphi (and different results between Delphi Win32 and Win64).
After input from several peers including gcarreno, abouchez and paweld (thanks!), this last detail was ironed out, and the baseline yielded a matching hash.

## Single-Threaded Attempt (2024-04-03)

In this first attempt, the implementation is broken down into 3 major steps:
 1. read the input file
 2. process it
 3. output to file/stdout

Key points:
 - the reading / writing (steps 1 and 3) are done on the main thread.
 - the processing (step 2) is where a future submission will attempt to parallelize the work.

## 1. Read The Input File

#### v1. File Stream
In the baseline implementation, a file stream is used to read line by line, which is far from optimal.

#### v2. Memory-Mapped File
An improvement was to read the whole file into memory in one shot, using memory mapping.
In this implementation, I use `GetFileSize` and `CreateFileMapping`, following a procedure found online (need to find the URL).
First thing to note: the memory usable by a Win32 process is limited to ~1.5-2 GB of RAM. Exceeding this limit yields an out-of-memory exception, so we must compile for Win64.
Some issues with this implementation (see unit FileReader.pas):
 - `GetFileSize` was returning a size of ~3.9 billion bytes, while we know the real input is ~16 billion bytes.
 - `CreateFileMapping` was taking 2.5 seconds to read the file into a `TArray<Utf8Char>` the first time, and subsequent reads were down to 1.7 seconds. But this is for only a quarter of the input size.
 - using `GetFileSizeEx` instead, we now get the real file size, ~16 billion bytes.
 - however, `CreateFileMapping` takes `Cardinal` parameters, so a value of ~16 billion (an Int64) yields a `range check` error.
 - if we wanted to move forward with this implementation, we would need to call `CreateFileMapping` in 4 or 5 batches, which would take 1.7 x 5 ~= 8.5 seconds just to read the data.
 - attempt aborted, see v3.
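
For completeness, a hypothetical sketch (not the code used in this repo, and the helper name is made up) of what a whole-file mapping could look like on Win64 through the raw API: leaving both size DWORDs of `CreateFileMapping` at 0 makes the mapping default to the current file size, and `MapViewOfFile` with a length of 0 maps the entire file. It assumes `Winapi.Windows` declares `GetFileSizeEx` with an Int64 parameter, as in recent Delphi versions. The v3 approach below was used instead.

```
uses
  Winapi.Windows, System.SysUtils;

// Hypothetical helper: map an entire >4 GB file read-only in a Win64
// process and return a pointer to its first byte plus its 64-bit size.
function MapWholeFile(const FileName: string; out Size: Int64): PAnsiChar;
var
  hFile, hMap: THandle;
begin
  hFile := CreateFile(PChar(FileName), GENERIC_READ, FILE_SHARE_READ, nil,
    OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
  if hFile = INVALID_HANDLE_VALUE then
    RaiseLastOSError;
  if not GetFileSizeEx(hFile, Size) then  // 64-bit size, unlike GetFileSize
    RaiseLastOSError;
  // 0/0 for the size DWORDs means "use the current size of the file"
  hMap := CreateFileMapping(hFile, nil, PAGE_READONLY, 0, 0, nil);
  if hMap = 0 then
    RaiseLastOSError;
  CloseHandle(hFile);                     // the mapping keeps the file open
  Result := MapViewOfFile(hMap, FILE_MAP_READ, 0, 0, 0); // 0 = whole file
  if Result = nil then
    RaiseLastOSError;
  CloseHandle(hMap);                      // the view keeps the mapping alive
end;
```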

#### v3. Memory-Mapped File, Provided by `mORMot2`
A third attempt at reading the file used a ready-made implementation of file memory-mapping, provided by synopse/mORMot, big thanks @abouchez!
The function returns a PAnsiChar and the size (as an Int64) of the mapped data. Performance-wise, it all happens in under 0.1 seconds, but now we must delve into pointers.
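
The underlying mORMot2 primitive is, as far as I recall, the `TMemoryMap` record in `mormot.core.os`; below is a minimal sketch of using it directly, assuming `Map`/`UnMap` methods and `Buffer`/`Size` members (the exact wrapper used in this repo may differ, check the unit for the real declaration).

```
uses
  System.SysUtils,
  mormot.core.os;   // TMemoryMap (assumed location)

procedure ProcessFile(const FileName: TFileName);
var
  Map: TMemoryMap;
  Data: PAnsiChar;
  DataSize: Int64;
begin
  // Maps the whole file into the address space: no copy is made, the OS
  // pages the bytes in on demand, which is why it returns almost instantly.
  if not Map.Map(FileName) then
    raise Exception.CreateFmt('Cannot map %s', [FileName]);
  try
    Data := Map.Buffer;     // pointer to the first byte of the file
    DataSize := Map.Size;   // full 64-bit size, no Cardinal limitation
    // ... scan Data[0 .. DataSize - 1] here ...
  finally
    Map.UnMap;              // release the view and the handles
  end;
end;
```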


## 2. Process the File

Well, at a glance this is straightforward:
 - look for new-line characters to delimit each line, and split it to extract StationName / Temperature (see the loop skeleton below)
 - decode the temperature into a numerical value (a Double, or an Integer x 10)
 - accumulate the information into a dictionary of StationName -> record of data
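
As a rough illustration of that outer loop (a sketch with a hypothetical `TLineHandler` callback, not the exact code of this repo): the mapped buffer is scanned once for line feeds, and each line is handed over as a pointer + length, so nothing is copied at this stage.

```
type
  // Hypothetical callback: receives the first char of a line and its
  // length (excluding the terminating LF).
  TLineHandler = procedure(Line: PAnsiChar; Len: Integer);

procedure ProcessBuffer(Data: PAnsiChar; DataSize: Int64; Handle: TLineHandler);
var
  LineStart, I: Int64;
begin
  LineStart := 0;
  for I := 0 to DataSize - 1 do
    if Data[I] = #10 then                       // LF ends every line
    begin
      Handle(@Data[LineStart], I - LineStart);  // pointer + length, no copy
      LineStart := I + 1;
    end;
end;
```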

A few optimizations were done here, to the best of my knowledge:

#### For Each Line, Iterate Backwards
`Length(StationName) > Length(Temperature)`, so for each line it is better to look for the `;` starting from the end.
Given the input below:
```
Rock Hill;-54.3
            ^^^
```
the last 3 characters are guaranteed to be present, so we can skip them while iterating.
I tried unrolling the loop over the last 2-3 characters that must be checked, but when measured it turned out to be slower; I don't know why.
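
A sketch of that backward scan (hypothetical helper name, operating on the pointer + length view of a line): the temperature is at least 3 characters (`X.X`), so the search can start 4 characters from the end.

```
// Returns the 0-based index of the ';' within the line.
// The last 3 chars are always "digit . digit", so they are skipped.
function FindSeparator(Line: PAnsiChar; Len: Integer): Integer;
begin
  Result := Len - 4;              // earliest possible position of ';'
  while Line[Result] <> ';' do
    Dec(Result);
end;
```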

#### Extract the Strings Using `SetString`
Manual string concatenation and splitting proved to be very slow.
Using `SetString` yielded a noticeable improvement. Staying entirely in the realm of pointers would probably be even faster, but I haven't ventured there (yet; maybe if time allows).
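
A sketch of the `SetString` extraction, assuming the line is still only a pointer + length and `Sep` is the index of the `;` found above: `SetString` allocates the AnsiString and performs a single `Move`, with no concatenation involved.

```
// Hypothetical helper: copy the two slices out of the mapped buffer.
procedure SplitLine(Line: PAnsiChar; Len, Sep: Integer;
  out Name, Temp: AnsiString);
begin
  SetString(Name, Line, Sep);                      // chars 0 .. Sep-1
  SetString(Temp, Line + Sep + 1, Len - Sep - 1);  // everything after ';'
end;
```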

#### Decode the Temperature Into a Numerical Value
The first attempt was to use `StrToFloat`, which was pretty catastrophic. Using `Val` was still slow, but definitely a step up. `Val` with a SmallInt proved to be faster than with a Double, even though it requires extra operations.
So now we need to get rid of the `.` character.

Again, string functions being very slow, replicating the last character at position length-1 and then reducing the length of the AnsiString with `SetLength` yielded faster results. I tried doing the same with a PAnsiChar but got some range-check errors; I might get back to it later on.
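
A sketch of that trick (hypothetical helper, assuming every temperature has exactly one decimal digit): the `.` is always the second-to-last character, so it can be overwritten by the last digit before shortening the string, and `Val` then parses directly into a SmallInt holding tenths of a degree.

```
function DecodeTemp(Temp: AnsiString): SmallInt;
var
  L, Code: Integer;
begin
  L := Length(Temp);
  Temp[L - 1] := Temp[L];     // "-54.3" -> "-5433" (copy-on-write local copy)
  SetLength(Temp, L - 1);     // "-5433" -> "-543"
  Val(Temp, Result, Code);    // -543 = -54.3 in tenths of a degree
end;
```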

Finally, assuming temperatures between `-100` and `+100` with 1 decimal digit, there should be at most 2000 different temperature values.
Instead of decoding the same temperature strings over and over with `Val`, decode each one once and store the result in a TDictionary (TemperatureString -> TemperatureSmallInt). There were, I believe, 1998 distinct temperature values, so we only call `Val` 1998 times instead of 1 billion times. Over an input size of 100M rows, the gain was 4-5 seconds (total 28s -> 23s).
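
Roughly, the memoisation looks like the sketch below (hypothetical names; `DecodeTemp` is the `Val`-based decoder sketched above). The cache is created once at startup and ends up holding only ~2000 entries.

```
uses
  System.Generics.Collections;

var
  TempCache: TDictionary<AnsiString, SmallInt>;  // created once at startup

function LookupTemp(const Temp: AnsiString): SmallInt;
begin
  // Decode each distinct temperature string only once (~2000 of them),
  // then reuse the cached SmallInt for the other ~1 billion occurrences.
  if not TempCache.TryGetValue(Temp, Result) then
  begin
    Result := DecodeTemp(Temp);
    TempCache.Add(Temp, Result);
  end;
end;
```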

#### Accumulate Data Into a Dictionary of Records
 - the records are packed, with minimal size
 - the dictionary maps StationName -> pointer to record, to avoid passing full records around (see the sketch below)
 - records are pre-allocated in an array of 45,000, instead of being allocated on the fly
 - when a station is not found in the dictionary, we point it to the next free element in the records array
 - with an input of 100M rows, this accumulation step takes a considerable amount of time (9 seconds out of 23 total). I haven't identified yet whether it is the `dict.Add` that takes time, the `dict.TryGetValue`, or just the dictionary's hash collisions in general. The dictionary is pre-allocated with a capacity of 45,000, but that did not seem to improve much. I also tried the dictionary implementation from Spring4D, with no improvement either.
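
A sketch of the accumulation structures described above (hypothetical names, field layout simplified): the dictionary values are pointers into a pre-allocated pool, so a lookup hit only updates one record in place.

```
uses
  System.Generics.Collections;

type
  PStationData = ^TStationData;
  TStationData = packed record
    Min, Max: SmallInt;   // tenths of a degree
    Count: Integer;
    Sum: Int64;
  end;

var
  Stations: TDictionary<AnsiString, PStationData>;  // capacity 45,000 upfront
  Pool: array of TStationData;                      // SetLength(Pool, 45000)
  PoolUsed: Integer;

procedure Accumulate(const Name: AnsiString; Temp: SmallInt);
var
  P: PStationData;
begin
  if not Stations.TryGetValue(Name, P) then
  begin
    P := @Pool[PoolUsed];        // take the next pre-allocated record
    Inc(PoolUsed);
    P^.Min := Temp;              // Count and Sum start at 0 in the pool
    P^.Max := Temp;
    Stations.Add(Name, P);
  end
  else
  begin
    if Temp < P^.Min then P^.Min := Temp;
    if Temp > P^.Max then P^.Max := Temp;
  end;
  Inc(P^.Count);
  Inc(P^.Sum, Temp);
end;
```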

## 3. Output the Results
Since I started using pointers (PAnsiChar), getting a matching hash was a bit of a pickle:
some Unicode characters were messed up in their display, or messed up in their ordering.
Eventually, the ordering issue was resolved by using `AnsiStrings.CompareStr` instead of `SysUtils.CompareStr`. This step will clearly remain single-threaded, but it takes 0.15 seconds for all 45,000 stations, so it is not a big deal.
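
The fix amounts to sorting the keys with a byte-wise comparison; a minimal sketch (hypothetical helper) of doing this with `System.AnsiStrings.CompareStr` rather than the locale/Unicode-aware `SysUtils` overload:

```
uses
  System.AnsiStrings, System.Generics.Collections, System.Generics.Defaults;

procedure SortStationNames(var Names: TArray<AnsiString>);
begin
  // Byte-wise ordering, so the output order (and hence the hash) does not
  // depend on the current locale.
  TArray.Sort<AnsiString>(Names, TComparer<AnsiString>.Construct(
    function(const L, R: AnsiString): Integer
    begin
      Result := System.AnsiStrings.CompareStr(L, R);
    end));
end;
```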

## Conclusion

If there are any improvements to be made to this single-threaded version, they would be the following, from most impactful to least impactful (performance numbers measured over an input of 100M rows, not 1B):
 - try to improve storage in / retrieval from the dictionary of station data (cost: 9 sec out of 23 sec)
 - try to improve the extraction of string data, maybe using pointers (6.5 sec out of 23 sec)
 - try to improve the type conversion, though I'm not sure how at this point (4.5 sec out of 23 sec)
 - somehow, incrementing an integer 1B times takes 1.2 seconds, while incrementing the main input index (16B times) takes only 0.5 seconds. It's just 1.2 seconds, but I don't understand why it behaves that way.


# Delphi Port of My FPC Implementation, to Compare Performance on Craig Chapman's PC

Somehow, on Windows x64, Craig and Gus noticed very poor performance compared to Gus' setup on Linux with FPC.
Is it a Windows vs Linux problem? Or a Delphi vs FPC problem?
After discussing the matter with Gus, here is a port (as close as possible) of my FPC code to Delphi, so we can compare the executables generated by both compilers.