I started writing a git client this week and have finished support for the following commands:

  • init
  • log
  • cat-file
  • hash-object

In this post I’ll detail what I learned and how it relates to each command. The repo can be found here.

Where I Learned From

The official git-scm documentation was a very useful source, particularly for details such as the structure of git objects. I also followed parts of Write Yourself A Git, a guide to writing a git implementation, which was especially helpful for writing code that correctly parses Git ‘Objects’.

Git Init & The Core Of A Git Repo

Init creates a new git repo at a specified path. Implementing it teaches us the core of a git repo:

  • The .git folder contains all of the actual repository files
  • There are three key folders (branches, objects and refs)
  • A description file is created but is largely vestigial
  • The HEAD file holds a reference to the current branch (‘ref: refs/heads/master’ by default)
  • Every git repo contains a config file, which specifies details such as the repository format version
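
The layout above can be sketched in a few lines of Python. This is a minimal illustration rather than the code from my repo: the function name init_repo, the refs/heads and refs/tags sub-folders, and the filemode and bare config keys are assumptions based on what a freshly initialised repository usually contains.

```python
import configparser
from pathlib import Path


def init_repo(path):
    """Create a minimal .git layout under `path` and return the .git directory."""
    git_dir = Path(path) / ".git"

    # The three key folders, with the sub-folders refs is assumed to hold.
    for sub in ("branches", "objects", "refs/heads", "refs/tags"):
        (git_dir / sub).mkdir(parents=True, exist_ok=True)

    # Largely vestigial description file.
    (git_dir / "description").write_text(
        "Unnamed repository; edit this file 'description' to name the repository.\n"
    )

    # HEAD points at the default branch.
    (git_dir / "HEAD").write_text("ref: refs/heads/master\n")

    # Minimal config carrying the repository format version.
    config = configparser.ConfigParser()
    config["core"] = {
        "repositoryformatversion": "0",
        "filemode": "false",
        "bare": "false",
    }
    with open(git_dir / "config", "w") as f:
        config.write(f)

    return git_dir
```

Calling init_repo("some/path") leaves a .git folder that can be inspected by hand to check each of the points above.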

Git Objects & Their Structure

Git stores information about the repository using ‘Objects’. There are four types of Git Object:

  • Blob (raw data such as main.c’s contents)
  • Commit (a commit’s information)
  • Tag (a named reference to a specific commit or other object, containing information such as the tag date and tagger)
  • Tree (maps file and folder names to blobs and other trees)

Git objects are compressed using zlib.

When decompressed their format (in order) is:

  • A header identifying their type in plain ascii text (blob, commit etc.)
  • A space
  • The object size in ascii text
  • A null character
  • The object data (file contents, commit data etc.)
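
As a rough sketch of how that layout can be read back, the snippet below decompresses an object and splits it at the space and the null character. The function name parse_object and the size check are my own illustration, not necessarily how any particular client does it.

```python
import zlib


def parse_object(compressed):
    """Split a zlib-compressed git object into (type, data)."""
    raw = zlib.decompress(compressed)

    # Layout: b"<type> <size>\x00<data>"
    space = raw.index(b" ")
    null = raw.index(b"\x00", space)

    obj_type = raw[:space].decode("ascii")
    size = int(raw[space + 1:null].decode("ascii"))
    data = raw[null + 1:]

    # The declared size should match the length of the data that follows.
    if size != len(data):
        raise ValueError(f"malformed object: expected {size} bytes, got {len(data)}")
    return obj_type, data


# Usage: build a blob by hand and read it back.
example = zlib.compress(b"blob 6\x00hello\n")
print(parse_object(example))  # ('blob', b'hello\n')
```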

Using this information, both cat-file and hash-object could be implemented (see here for their implementation).
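
To give a feel for the hash-object side, here is a hedged sketch. It relies on two facts about git that are not covered above: an object’s ID is the SHA-1 of the uncompressed header plus data (in a default SHA-1 repository), and loose objects are stored under objects/ in a folder named after the first two hex characters of that ID. The function name hash_object and its parameters are hypothetical.

```python
import hashlib
import zlib
from pathlib import Path


def hash_object(data, obj_type="blob", git_dir=None):
    """Return the object ID for `data`; optionally write it under git_dir/objects."""
    # Header, space, size, null character, then the data itself.
    store = obj_type.encode("ascii") + b" " + str(len(data)).encode("ascii") + b"\x00" + data
    sha = hashlib.sha1(store).hexdigest()

    if git_dir is not None:
        # Loose objects live at objects/<first 2 hex chars>/<remaining 38>.
        path = Path(git_dir) / "objects" / sha[:2] / sha[2:]
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(zlib.compress(store))
    return sha


# Usage: the well-known ID of the empty blob.
print(hash_object(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

cat-file is then the reverse journey: find the file from the ID, decompress it, and strip the header.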

Git Commits & The Git Log Command

Log lists the commit history of the repository, for example when run on the PyGit repository itself:

To implement log we first have to parse the object for the specific starting commit. The format is quite complex, but good documentation on it can be found here.
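
For a rough idea of what that parsing involves, a commit body is a run of "key value" lines, then a blank line, then the message. The sketch below is a simplified illustration and not the parser from my repo or from Write Yourself A Git. It assumes continuation lines start with a space, collects repeated keys such as parent into a list, and stores the message under the None key.

```python
def parse_commit(raw):
    """Parse raw commit bytes into a dict; the message is stored under None."""
    fields = {}
    # The first blank line separates the header fields from the message.
    header, _, message = raw.partition(b"\n\n")

    key = None
    for line in header.split(b"\n"):
        if not line:
            continue
        if line.startswith(b" ") and key is not None:
            # Continuation line (e.g. a multi-line gpgsig): extend the last value.
            last = fields[key]
            if isinstance(last, list):
                last[-1] += b"\n" + line[1:]
            else:
                fields[key] = last + b"\n" + line[1:]
            continue

        key, _, value = line.partition(b" ")
        if key in fields:
            # Repeated keys (two parents on a merge) become a list.
            existing = fields[key]
            if isinstance(existing, list):
                existing.append(value)
            else:
                fields[key] = [existing, value]
        else:
            fields[key] = value

    fields[None] = message
    return fields
```

A commit with one parent ends up with a single bytes value under b'parent', while a merge commit ends up with a list, which is why code that reads the parents has to handle both cases.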

Once parsed, printing the commit log is as simple as displaying the commit information and then repeating the process for each parent until there are no parent commits left:

def print_log(self, hash, seen):
    """Git log display"""
    if hash in seen:
        return
    seen.add(hash)

    commit = self.read_object(hash)
    # Check the object type before touching commit-only fields.
    assert commit.fmt==b'commit'

    # The message is stored under the None key of the parsed commit.
    message = commit.commit[None].decode("utf8").strip()

    if "\n" in message: # Keep only the first line
        message = message[:message.index("\n")]

    print(f"{hash}: {message}")

    if b'parent' not in commit.commit:
        # Base case: the initial commit.
        return

    parents = commit.commit[b'parent']

    if not isinstance(parents, list):
        parents = [ parents ]

    for p in parents:
        p = p.decode("ascii")
        self.print_log(p, seen)
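
One caveat with the recursive version is that each commit adds a stack frame, so a long linear history can run into Python’s default recursion limit. A possible alternative, sketched here with a hypothetical walk_history helper and a toy dictionary standing in for real commit objects, is to keep an explicit stack instead:

```python
def walk_history(start, get_parents):
    """Yield each reachable commit hash once, starting from `start`."""
    seen = set()
    stack = [start]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        yield current
        # Push parents in reverse so the first parent is visited next.
        stack.extend(reversed(get_parents(current)))


# Usage with a toy history: c3 merges c2 and c1b, which both build on c1.
history = {"c3": ["c2", "c1b"], "c2": ["c1"], "c1b": ["c1"], "c1": []}
print(list(walk_history("c3", history.__getitem__)))  # ['c3', 'c2', 'c1', 'c1b']
```

This visits commits in the same depth-first order as the recursive function; in a real client get_parents would read and parse the commit object for the given hash.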