An implementation of Git from scratch. To understand the fundamentals of Git, this CLI tool was developed using Python and basic shell. The idea comes from Thibault Polge's article.
The entire application is based on a guit command (git with an "extra u").
A very simplified version of Git core commands was implemented.
The list of commands is displayed with guit --help.
To initialize a new, empty repository:
guit init [path]
This command initializes the repository by creating the necessary directories and configuration files:
.git
├───config: configuration file (repositoryformatversion, filemode, bare)
├───description: free-form description of repository (rarely used)
├───HEAD: the reference to the current HEAD (e.g. refs/heads/master)
│
├───branches
├───objects: the object store
└───refs: the reference store
    ├───heads
    └───tags
The config file is set to:
[core]
# the version of the gitdir format.
# 0 means the initial format, 1 the same with extensions.
# guit will only accept 0.
repositoryformatversion = 0
# disable tracking of file modes (permissions) changes in the work tree.
filemode = false
# indicates that this repository has a worktree.
# guit does not support optional worktree key
bare = false
Two commands are implemented: cat-file and hash-object. They are not very well known, but they are quite simple. hash-object converts a file to a Git object, and cat-file prints the raw content of an object, uncompressed and without the Git header.
If you are very confused about Git objects, read a basic understanding of Git objects.
For example, if you use:
guit cat-file blob d110cf2ee6b39b1224e6919d26aac168533289d7
You will see the contents of the first version of README.
To write a file, you use:
guit hash-object --t blob -w README.md
The parameter --w is used to actually write the object into the git repository.
To display the history of a given commit, you can use:
guit log 3b2193d574be54e31f9c24a8e9478a2eeb307617
You will see a Mermaid-compatible directed graph, with nodes representing commits and edges showing parent-child relationships. You can paste the code below into the Mermaid live editor to visualize it.
graph TD
c_3b2193d574be54e31f9c24a8e9478a2eeb307617["3b2193d: lint: black"]
c_3b2193d574be54e31f9c24a8e9478a2eeb307617 --> c_b91bee0ff2768ebb9a6ee26a2074f30b10440a19
c_b91bee0ff2768ebb9a6ee26a2074f30b10440a19["b91bee0: feat: cat-file and hash-file"]
c_b91bee0ff2768ebb9a6ee26a2074f30b10440a19 --> c_740589fb4ad82835afbc5cbb28b141547ac844a6
c_740589fb4ad82835afbc5cbb28b141547ac844a6["740589f: refactor: ran isort"]
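As a rough illustration of how such a graph could be produced, here is a minimal sketch. It assumes the commits have already been read into a hypothetical in-memory dictionary (`commits`, mapping each SHA-1 to its message and list of parents); this is not necessarily how guit stores or walks them.

```python
def log_mermaid(commits: dict, start: str) -> str:
    """Render the history reachable from `start` as a Mermaid graph.

    `commits` is a hypothetical mapping: sha -> (message, [parent shas]).
    """
    lines = ["graph TD"]
    seen = set()
    stack = [start]
    while stack:
        sha = stack.pop()
        if sha in seen:  # a commit reachable through two paths is printed once
            continue
        seen.add(sha)
        message, parents = commits[sha]
        # Node label: short hash followed by the commit message.
        lines.append(f'c_{sha}["{sha[:7]}: {message}"]')
        for parent in parents:
            lines.append(f"c_{sha} --> c_{parent}")
            stack.append(parent)
    return "\n".join(lines)


commits = {
    "3b2193d574be54e31f9c24a8e9478a2eeb307617": (
        "lint: black", ["b91bee0ff2768ebb9a6ee26a2074f30b10440a19"]),
    "b91bee0ff2768ebb9a6ee26a2074f30b10440a19": (
        "feat: cat-file and hash-file", ["740589fb4ad82835afbc5cbb28b141547ac844a6"]),
    "740589fb4ad82835afbc5cbb28b141547ac844a6": ("refactor: ran isort", []),
}
graph = log_mermaid(commits, "3b2193d574be54e31f9c24a8e9478a2eeb307617")
print(graph)
```

Messages containing double quotes would need escaping before being placed in a node label; the sketch leaves that out.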
To display the files in a tree, you can use:
guit ls-tree f478a2a96fcfd0c71231f126948d6608ca83591b
100644 blob a8b41082206759a2a6564088573b329861a039ae classes.py
100644 blob d8c2e471a1f1eb92e5945fbb4edceda554d8d491 cli.py
100644 blob 6c2e1c32a72cec30a8ad56582be59651f15c741d create.py
100644 blob a1c191d4a251b5ddff07c3af545e5f18ef637874 io.py
100644 blob bbdfbeb1b7cb277dbc756c3f145665e03ce4d6bc utils.py
A very simple checkout command is implemented to instantiate a commit in the worktree. It instantiates the tree in a directory ONLY if the directory is empty (Git itself has several safeguards to avoid deleting data; this check is the simple substitute).
guit checkout d0abf88de4d39d2dbf9e6a586f921e405bb1f645 test
References are text files in .git/refs that hold the SHA-1 identifier of an object, or a reference to another reference that ultimately resolves to a SHA-1.
To show all references use:
guit show-refs
8c31299f8f1a53c4a02c1d88f6980731f308a005 refs/heads/main
8c31299f8f1a53c4a02c1d88f6980731f308a005 refs/remotes/origin/main
The tag command lets you create tags as regular refs to a commit, tree or blob. You can create a new tag or list existing tags with the same command:
guit tag
To create a tag object (not only a reference) named "first readme" that points to the first README object:
guit tag --a --name "first readme" --object d110cf2ee6b39b1224e6919d26aac168533289d7
The rev-parse command is used to resolve references (parse revisions).
For example you can parse HEAD:
guit rev-parse --guit-type commit HEAD
Git is a “content-addressed filesystem”, which means the name of a file is derived mathematically from its contents.
This implies that every modification to a file in Git creates a new file at a different path.
The path where git stores a given object is computed by calculating the SHA-1 hash of its contents.
The computation is done by a hash function, which is a kind of one-way mathematical function: it is easy to compute the hash of a value, but there is no practical way to recover the value that produced a given hash.
Git renders the hash as a lowercase hexadecimal string and splits it in two parts: the first two characters (used as the directory name) and the rest (used as the file name).
The object with SHA-1 equal to d110cf2ee6b39b1224e6919d26aac168533289d7 is stored in .git/objects/d1/10cf2ee6b39b1224e6919d26aac168533289d7.
Git’s method creates 256 possible intermediate directories, hence dividing the average number of files per directory by 256.
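The split described above can be sketched in a few lines. The function name `object_path` and its parameters are illustrative, not taken from guit's source.

```python
from pathlib import Path


def object_path(gitdir: str, sha: str) -> Path:
    """Return where the object named by a full hex SHA-1 would be stored."""
    sha = sha.lower()
    if len(sha) != 40 or any(c not in "0123456789abcdef" for c in sha):
        raise ValueError(f"not a full SHA-1: {sha!r}")
    # First two hex characters name the directory, the other 38 name the file.
    return Path(gitdir) / "objects" / sha[:2] / sha[2:]


path = object_path(".git", "d110cf2ee6b39b1224e6919d26aac168533289d7")
print(path.as_posix())
```

Two hexadecimal characters give 16 × 16 = 256 possible directory names, which is where the figure of 256 comes from.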
An object starts with a header that specifies its type: blob, commit, tag or tree (blobs are the simplest of them, since they have no actual format).
This header is followed by an ASCII space (0x20), then the size of the object in bytes as an ASCII number, then a null byte (0x00), then the contents of the object.
header + b' ' + str(len(data)).encode() + b'\x00' + data
Writing an object is reading it in reverse: we insert the header, compute the hash of the result, zlib-compress everything and write it to the computed location.
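The write and read paths can be sketched as follows. This is a minimal illustration under the format described above, not guit's actual code; the function names (`serialize`, `write_object`, `read_object`) are assumptions, and a temporary directory stands in for the .git directory.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def serialize(obj_type: bytes, data: bytes) -> bytes:
    """Prepend the header: type, space, size in bytes, null byte."""
    return obj_type + b" " + str(len(data)).encode() + b"\x00" + data


def write_object(gitdir: str, data: bytes, obj_type: bytes = b"blob") -> str:
    """Hash header + data, then store the zlib-compressed result."""
    store = serialize(obj_type, data)
    sha = hashlib.sha1(store).hexdigest()
    path = Path(gitdir) / "objects" / sha[:2] / sha[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))
    return sha


def read_object(gitdir: str, sha: str):
    """Decompress an object and split it into (type, contents)."""
    path = Path(gitdir) / "objects" / sha[:2] / sha[2:]
    raw = zlib.decompress(path.read_bytes())
    space = raw.index(b" ")
    null = raw.index(b"\x00", space)
    size = int(raw[space + 1:null])
    data = raw[null + 1:]
    if size != len(data):
        raise ValueError(f"malformed object {sha}: bad length")
    return raw[:space], data


with tempfile.TemporaryDirectory() as gitdir:
    sha = write_object(gitdir, b"hello\n")
    obj_type, data = read_object(gitdir, sha)
print(sha, obj_type, data)
```

Because the header is part of what gets hashed, the same bytes stored as a blob and as another object type would end up under different names.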
A commit object, uncompressed and without its header, has this format:
- A tree object;
- Zero, one or more parents;
- An author identity (name and email), and a timestamp;
- A committer identity (name and email), and a timestamp;
- An optional PGP signature;
- A message.
tree 29ff16c9c14e2652b22f8b78bb08a5a07930c147
parent 206941306e8a8af65b66eaaaea388a7ae24d49a0
author Thibault Polge <thibault@thb.lt> 1527025023 +0200
committer Thibault Polge <thibault@thb.lt> 1527025044 +0200
gpgsig -----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEExwXquOM8bWb4Q2zVGxM2FxoLkGQFAlsEjZQACgkQGxM2FxoL
kGQdcBAAqPP+ln4nGDd2gETXjvOpOxLzIMEw4A9gU6CzWzm+oB8mEIKyaH0UFIPh
rNUZ1j7/ZGFNeBDtT55LPdPIQw4KKlcf6kC8MPWP3qSu3xHqx12C5zyai2duFZUU
wqOt9iCFCscFQYqKs3xsHI+ncQb+PGjVZA8+jPw7nrPIkeSXQV2aZb1E68wa2YIL
3eYgTUKz34cB6tAq9YwHnZpyPx8UJCZGkshpJmgtZ3mCbtQaO17LoihnqPn4UOMr
V75R/7FjSuPLS8NaZF4wfi52btXMSxO/u7GuoJkzJscP3p4qtwe6Rl9dc1XC8P7k
NIbGZ5Yg5cEPcfmhgXFOhQZkD0yxcJqBUcoFpnp2vu5XJl2E5I/quIyVxUXi6O6c
/obspcvace4wy8uO0bdVhc4nJ+Rla4InVSJaUaBeiHTW8kReSFYyMmDCzLjGIu1q
doU61OM3Zv1ptsLu3gUE6GU27iWYj2RWN3e3HE4Sbd89IFwLXNdSuM0ifDLZk7AQ
WBhRhipCCgZhkj9g2NEk7jRVslti1NdN5zoQLaJNqSwO1MtxTmJ15Ksk3QP6kfLB
Q52UWybBzpaP9HEd4XnR+HuQ4k2K0ns2KgNImsNvIyFwbpMUyUWLMPimaV1DWUXo
5SBjDB/V/W2JBFR+XKHFJeFwYhj7DD/ocsGr4ZMx/lgc8rjIBkI=
=lgTX
-----END PGP SIGNATURE-----

Create first draft
All of this is hashed together into a unique SHA-1 identifier.
It is important to note that the dictionaries holding the parsed fields preserve insertion order. This is essential in Git: if we changed the order (e.g. putting tree after parent), we would modify the SHA-1 hash of the commit, producing two equivalent but distinct commits.
Since a commit's hash depends on its contents, including its parents, commits are immutable and each one carries the whole history.
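To make the point about field order concrete, here is a small sketch of a commit parser and serializer. It is an illustration only: the helper names are hypothetical, the author identity is made up, and it relies on plain Python dicts preserving insertion order (guaranteed since Python 3.7). It assumes multi-line values such as gpgsig are stored with a leading space on each continuation line.

```python
import hashlib


def parse_commit(raw: bytes):
    """Split a commit body into ordered fields and a message."""
    header, _, message = raw.partition(b"\n\n")
    fields = {}  # key -> list of values, in the order they appear
    key = None
    for line in header.split(b"\n"):
        if line.startswith(b" ") and key is not None:
            # Continuation line: belongs to the previous value.
            fields[key][-1] += b"\n" + line[1:]
        else:
            key, _, value = line.partition(b" ")
            fields.setdefault(key, []).append(value)
    return fields, message


def serialize_commit(fields: dict, message: bytes) -> bytes:
    """Inverse of parse_commit; emits fields in dictionary order."""
    lines = []
    for key, values in fields.items():
        for value in values:
            lines.append(key + b" " + value.replace(b"\n", b"\n "))
    return b"\n".join(lines) + b"\n\n" + message


def commit_sha(raw: bytes) -> str:
    """Hash the body together with the object header."""
    return hashlib.sha1(b"commit " + str(len(raw)).encode() + b"\x00" + raw).hexdigest()


raw = (
    b"tree 29ff16c9c14e2652b22f8b78bb08a5a07930c147\n"
    b"parent 206941306e8a8af65b66eaaaea388a7ae24d49a0\n"
    b"author Jane Doe <jane@example.com> 1527025023 +0200\n"
    b"committer Jane Doe <jane@example.com> 1527025044 +0200\n"
    b"\n"
    b"Create first draft\n"
)
fields, message = parse_commit(raw)
same = serialize_commit(fields, message)

# Same fields, but with parent placed before tree.
reordered = {k: fields[k] for k in (b"parent", b"tree", b"author", b"committer")}
different = serialize_commit(reordered, message)

print(commit_sha(same) == commit_sha(raw))       # round trip keeps the hash
print(commit_sha(different) == commit_sha(raw))  # reordering changes it
```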
A tree is an array of three-element tuples made of a file mode, a path (relative to the worktree) and a SHA-1. The SHA-1 is stored as 20 raw bytes rather than as hexadecimal text. The format of each entry is given by:
[mode] + b' ' + [path] + b'\x00' + [sha-1]
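A parser for this layout could look like the sketch below. The function name `parse_tree` is illustrative, and the sample bytes are built by hand. The sketch also pads the mode to six digits, since Git writes directory modes as 40000 without the leading zero.

```python
def parse_tree(raw: bytes):
    """Return a list of (mode, path, sha) tuples from a raw tree body."""
    entries = []
    pos = 0
    while pos < len(raw):
        space = raw.index(b" ", pos)       # end of the mode
        null = raw.index(b"\x00", space)   # end of the path
        mode = raw[pos:space].decode("ascii").zfill(6)
        path = raw[space + 1:null].decode("utf-8")
        sha_bytes = raw[null + 1:null + 21]  # 20 raw bytes, not hex
        if len(sha_bytes) != 20:
            raise ValueError("truncated tree entry")
        entries.append((mode, path, sha_bytes.hex()))
        pos = null + 21
    return entries


raw_tree = (
    b"100644 README.md\x00"
    + bytes.fromhex("bab489c4f4600a38ce6dbfd652b90383a4aa3e45")
    + b"40000 tests\x00"
    + bytes.fromhex("e7445b03aea61ec801b20d6ab62f076208b7d097")
)
for mode, path, sha in parse_tree(raw_tree):
    print(mode, sha, path)
```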
One example is:
Mode | SHA-1 | Path
------------- | ------------- | -------------
100644 | 894a44cc066a027465cd26d634948d56d13af9af | .gitignore
100644 | 94a9ed024d3859793618152ea559a168bbcbb5e2 | LICENSE
100644 | bab489c4f4600a38ce6dbfd652b90383a4aa3e45 | README.md
100644 | 6d208e47659a2a10f5f8640e0155d9276a2130a9 | src
040000 | e7445b03aea61ec801b20d6ab62f076208b7d097 | tests
040000 | d5ec863f17f3a2e92aa8f6b66ac18f7b09fd1b38 | main.c
A branch is a reference to a commit. Wait, but what's the difference between a branch and a tag?
There are, of course, differences between a branch and a tag:
- Branches are references to a commit (tags can refer to any object);
- The branch ref is updated at each commit.
The current branch is recorded in a ref file outside of the refs folder, in .git/HEAD, which is an indirect reference.
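Following an indirect reference down to a SHA-1 can be sketched like this. It assumes the usual on-disk convention where an indirect ref file contains `ref: ` followed by the path of another ref; the function name `resolve_ref` and the depth limit are illustrative choices, and a temporary directory stands in for .git.

```python
import tempfile
from pathlib import Path


def resolve_ref(gitdir: str, ref: str, max_depth: int = 10):
    """Follow `ref` until a SHA-1 is found; return None if a file is missing."""
    for _ in range(max_depth):
        path = Path(gitdir) / ref
        if not path.is_file():
            return None  # e.g. HEAD points at a branch with no commits yet
        content = path.read_text().strip()
        if not content.startswith("ref: "):
            return content  # a direct reference: the SHA-1 itself
        ref = content[len("ref: "):]  # indirect: keep following
    raise ValueError(f"too many levels of indirection for {ref!r}")


with tempfile.TemporaryDirectory() as gitdir:
    heads = Path(gitdir) / "refs" / "heads"
    heads.mkdir(parents=True)
    (heads / "main").write_text("8c31299f8f1a53c4a02c1d88f6980731f308a005\n")
    (Path(gitdir) / "HEAD").write_text("ref: refs/heads/main\n")
    resolved = resolve_ref(gitdir, "HEAD")
    missing = resolve_ref(gitdir, "refs/heads/unknown")
print(resolved)
```

The depth limit guards against reference files that point at each other in a loop.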
"Staging area" is actually an abstraction of Git, since it's all based index file mechanism. The index is a binary file (.git/index) that tracks file metadata and staged content.
Index file uses a structured format with file paths, SHA-1 hashes, and mode bits.
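As a small taste of that binary format, the sketch below reads only the fixed 12-byte header that opens the index file: the signature `DIRC`, a version number and the number of entries, all big-endian. Parsing the entries themselves is left out, and the function name is hypothetical.

```python
import struct


def parse_index_header(raw: bytes):
    """Return (version, entry_count) from the first 12 bytes of an index file."""
    if len(raw) < 12:
        raise ValueError("index file too short")
    # 4-byte signature, then two unsigned 32-bit big-endian integers.
    signature, version, count = struct.unpack(">4sII", raw[:12])
    if signature != b"DIRC":
        raise ValueError("not a Git index file")
    return version, count


# Hand-built header for illustration: version 2, three entries.
sample = b"DIRC" + struct.pack(">II", 2, 3)
header = parse_index_header(sample)
print(header)
```

On a real repository the same function could be fed the bytes of .git/index.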
Contributions are welcome! Feel free to fork the repository and submit a pull request.