First write a token generator that creates a file with 10 million random tokens, one per line, each consisting of seven lowercase letters a-z. Then write a token reader that reads the file and stores the tokens in your DB. Naturally some tokens will occur more than once, so take care that these aren't duplicated in the DB, but do produce a list of all non-unique tokens and their frequencies. Find a clever way to do it efficiently in terms of network I/O, memory, and time and include documentation inline with your code or as txt file, describing your design decisions.
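A minimal sketch of how the generator step could look in Go; the output file name `tokens.txt` and the use of plain `math/rand` are my own assumptions, not requirements from the task:

```go
package main

import (
	"bufio"
	"log"
	"math/rand"
	"os"
)

const (
	numTokens   = 10_000_000 // 10 million tokens
	tokenLength = 7          // seven lowercase letters a-z
)

func main() {
	f, err := os.Create("tokens.txt") // assumed output file name
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Buffered writer so we do not issue one syscall per token.
	w := bufio.NewWriter(f)
	defer w.Flush()

	line := make([]byte, tokenLength+1)
	line[tokenLength] = '\n'
	for i := 0; i < numTokens; i++ {
		// math/rand is enough here; crypto-strength randomness is not needed.
		for j := 0; j < tokenLength; j++ {
			line[j] = byte('a' + rand.Intn(26))
		}
		if _, err := w.Write(line); err != nil {
			log.Fatal(err)
		}
	}
}
```

Writing through a `bufio.Writer` avoids one write syscall per line, which matters for 10 million lines.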
- $10 \cdot 10^6 \cdot 7 = 7 \cdot 10^7$ total bytes, so roughly 66 MiB. Should fit into main memory.
- $|\{a,\dots,z\}| = 26 \implies 26^7$ combinations. Thinking of the birthday problem, there is a good chance that two random strings will collide.
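A quick back-of-the-envelope sanity check: with $n = 10^7$ draws from $26^7 \approx 8.03 \cdot 10^9$ possible tokens, the expected number of colliding pairs is

$$\binom{n}{2} \cdot \frac{1}{26^7} \approx \frac{(10^7)^2}{2 \cdot 8.03 \cdot 10^9} \approx 6.2 \cdot 10^3,$$

so duplicates are practically guaranteed.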
For a simple implementation, a line-by-line read should do the job.
- Further I/O/performance improvement with `bufio`, but needs testing (see the sketch below)
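A minimal sketch of the buffered line-by-line read; the file name and the `handle` callback are placeholders of mine, not taken from the actual `main.go`:

```go
package main

import (
	"bufio"
	"log"
	"os"
)

// readTokens streams the token file line by line; handle is called once per token.
func readTokens(path string, handle func(string)) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f) // buffered reads instead of one syscall per line
	for scanner.Scan() {
		handle(scanner.Text()) // one 7-letter token per line
	}
	return scanner.Err()
}

func main() {
	count := 0
	if err := readTokens("tokens.txt", func(string) { count++ }); err != nil {
		log.Fatal(err)
	}
	log.Printf("read %d tokens", count)
}
```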
Checking for duplicates, $n = 10^7$:

- A `map[string]int` can do the job with low programming complexity, because access is O(1) and filling it is O(n) (see the sketch after this list)
  - We can even hint `make` with $n$ to reduce the number of map resizes
- Radix sort with 7 iterations will be O(n * 7)
  - But it should be very memory intensive
- Counting sort will be O(n + 26^7), which does not look promising because of the huge possible input space

$\implies$ https://www.youtube.com/watch?v=kVgy1GSDHG8
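A sketch of the map-based approach, showing the `make` hint and the extraction of non-unique tokens with their frequencies; the function and variable names are mine:

```go
package main

import "fmt"

// countTokens returns every distinct token once, plus the frequencies of all
// tokens that occurred more than once. sizeHint would be 10_000_000 in the
// real run to avoid repeated map growth.
func countTokens(tokens []string, sizeHint int) (unique []string, duplicates map[string]int) {
	counts := make(map[string]int, sizeHint) // hint reduces map resizes
	for _, tok := range tokens {
		counts[tok]++
	}

	duplicates = make(map[string]int)
	for tok, c := range counts {
		unique = append(unique, tok) // each token exactly once
		if c > 1 {
			duplicates[tok] = c // non-unique token and its frequency
		}
	}
	return unique, duplicates
}

func main() {
	uniq, dups := countTokens([]string{"aaaaaaa", "bbbbbbb", "aaaaaaa"}, 3)
	fmt.Println(len(uniq), dups) // 2 map[aaaaaaa:2]
}
```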
My goal is to do most of the work in main memory after loading the tokens from the file.
I have never used Postgres, so I will give it a try. The lib [pgx](https://github.com/jackc/pgx) seems to be up to date and maintained, so I'll use it.
We only have to save unique tokens. Documentation and Stack Overflow suggest using `character varying` to store the tokens. We do not need to mark the token column as PRIMARY KEY, because that implies UNIQUE, which we already check beforehand. In addition, filtering out duplicates reduces the number of insert calls to the database.
```sql
CREATE TABLE tokens(
    token VARCHAR(7) NOT NULL
)
```
We issue one `INSERT` whenever a new token pops up. Therefore, for $n$ unique tokens we end up with $n$ calls to `conn.Exec`. Also, the call seems to be blocking, so it should be put into a goroutine.
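For reference, a minimal sketch of that naive per-token insert; the helper is my own illustration, with the SQL following the schema above:

```go
package db

import (
	"context"

	"github.com/jackc/pgx/v4"
)

// insertToken issues one blocking INSERT per new token; with n unique tokens
// this means n round trips to the database.
func insertToken(ctx context.Context, conn *pgx.Conn, token string) error {
	_, err := conn.Exec(ctx, "INSERT INTO tokens(token) VALUES($1)", token)
	return err
}
```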
The cost of these per-token inserts becomes visible when we compare the time spent with and without the SQL insertions:
```
2021/11/26 19:44:54 Start scanning file
2021/11/26 20:25:35 Finish scanning file

2021/11/26 22:46:47 Start scanning file
2021/11/26 22:46:49 Finish scanning file
```

The second run takes only 2 seconds without the SQL inserts, compared to roughly 41 minutes with them.
Two ideas to speed this up:

- Issue the `INSERT`s in batches to reduce I/O and increase speed (see the sketch after this list)
- Spin up multiple connections to the database
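A sketch of the batching idea using `pgx.Batch`; the batch size and the connection string are assumptions and would need tuning:

```go
package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v4"
)

const batchSize = 10_000 // assumed batch size, would need tuning

func main() {
	ctx := context.Background()

	// Placeholder connection string; the real credentials come from setup.sh.
	conn, err := pgx.Connect(ctx, "postgres://user:password@localhost:5432/postgres")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	tokens := []string{"aaaaaaa", "bbbbbbb"} // stand-in for the unique tokens

	batch := &pgx.Batch{}
	for i, tok := range tokens {
		batch.Queue("INSERT INTO tokens(token) VALUES($1)", tok)
		// Flush a full (or final) batch: one round trip instead of one per token.
		if batch.Len() == batchSize || i == len(tokens)-1 {
			if err := conn.SendBatch(ctx, batch).Close(); err != nil {
				log.Fatal(err)
			}
			batch = &pgx.Batch{}
		}
	}
}
```

pgx also offers `CopyFrom` (Postgres `COPY`) for bulk loads, which could be worth benchmarking against batched inserts.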
Implementing the first idea leads to a total runtime of roughly 4 seconds.
```
2021/11/27 00:39:43 Start scanning file
2021/11/27 00:39:43 Start writing to DB
2021/11/27 00:39:46 Finish scanning file
2021/11/27 00:39:46 Writing duplicates to file
2021/11/27 00:39:46 Finish writing to DB
```
However, pooling, i.e. using multiple connections, did not increase performance. Therefore, batching alone solves our problem of slow DB writes by reducing I/O.
I just wanted to test it out; most of the time is spent in the hash map.
- Be sure to have Docker installed
- Navigate to the main dir: `cd interview-task`
- Execute `setup.sh`, which will pull a `postgres` image, add a user, and create a table
- Run `go run main.go`