A grasp of Git
Git is simple but different – it is probably the most popular version control system (VCS) used in IT projects around the world. It is also often misused: although its commands look similar to those of other VCSs, it works very differently underneath. Let’s take a look at the nuts and bolts of Git to understand it better and, as a result, to use it properly and effectively.
A stream of snapshots
The first thing that you should know about Git is how it “thinks about data” and stores changes. In Git we do not store a set of files tracked separately, but snapshots of an entire project (repository). We’re not going to go into technical details here: it is important to remember that Git stores changes in our project as a stream of snapshots. Snapshots are recorded as commits – with some metadata (author, committer, date, message) and link(s) to parent(s).
Key-value store with a handy API
No, this is not a mistake. In fact, Git is nothing more than a key-value store with a handy API and additional features that allow it to work as a version control system. In a Git repository, you can find a hidden .git folder. Repository data is stored within it, in an object database (the objects directory). Every stored object (the value) is identified by the SHA-1 checksum (the key) computed from its contents. This rule applies to all types of objects stored by Git:
- Blob – to store file data
- Tree – to map directory structure
- Commit – to identify repository snapshots
- Tag – to attach a name (and, optionally, a message) to another object, usually a commit
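To make the "key computed from contents" idea concrete, here is a minimal sketch in Python of how a blob identifier can be derived. It assumes the classic SHA-1 object format, in which Git hashes a short header followed by the file bytes; the function name git_blob_id and the sample contents are my own, chosen only for illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object id Git assigns to a blob with this content."""
    # Git hashes a short header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


readme = b"# My project\n"          # hypothetical file contents
copy_of_readme = b"# My project\n"  # a second file with the same contents

# Identical contents always produce the same key.
print(git_blob_id(readme) == git_blob_id(copy_of_readme))  # True

# The empty file has the same id in every SHA-1 based repository.
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Note that the file name is not part of the hashed data: names live in tree objects, which is why two files with equal contents can share one blob.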
Let’s assume that we want to create a snapshot of the simplest project ever containing the following:
- readme.md file,
- code.js file in src directory.
To store this data, Git will use the following objects:
- tree that represents root directory,
- tree that represents src directory,
- blob for readme.md,
- blob for code.js.
The whole structure is presented in the diagram above. Please note that every object has its own identifier, computed with SHA-1 from the object contents. Identifiers are effectively unique, and identical contents always yield the same identifier. As a result, Git stores only one blob if we have two identical files in our project (you can try to create the same repository contents on your machine and compare identifiers. Note that they will be the same for the same contents.)
Storing many snapshots of an entire repository may seem redundant and inefficient, but Git does it in a smart way. Details are beyond the scope of this overview. Please note that in most cases a large amount of data from a previous snapshot is reused to create the following one.
Let’s assume that after an initial commit to our repository (with the readme.md and code.js files) we modify readme.md and make a second commit. For the second snapshot, a new blob for the readme file and a new root tree are created, while the tree for src and the blob for code.js are reused.
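The reuse described above can be sketched with a toy in-memory object store. This is a simplified model, not Git's implementation: the store dictionary stands in for the object database, commits are left out, and the helper names (put, put_tree, snapshot) and file contents are hypothetical.

```python
import hashlib

store = {}  # object id -> serialized object; a stand-in for the object database


def put(kind: str, body: bytes) -> str:
    """Store an object under the SHA-1 of its header and body; return the id."""
    data = kind.encode("ascii") + b" " + str(len(body)).encode("ascii") + b"\0" + body
    oid = hashlib.sha1(data).hexdigest()
    store[oid] = data  # writing the same content twice keeps a single entry
    return oid


def put_tree(entries: dict) -> str:
    """Store a tree built from {name: (mode, object id)} entries."""
    body = b""
    for name in sorted(entries):
        mode, oid = entries[name]
        body += f"{mode} {name}".encode("utf-8") + b"\0" + bytes.fromhex(oid)
    return put("tree", body)


def snapshot(readme: bytes, code: bytes) -> dict:
    """Store a project with readme.md and src/code.js; return the ids used."""
    readme_blob = put("blob", readme)
    code_blob = put("blob", code)
    src_tree = put_tree({"code.js": ("100644", code_blob)})
    root_tree = put_tree({
        "readme.md": ("100644", readme_blob),
        "src": ("40000", src_tree),
    })
    return {"readme.md": readme_blob, "code.js": code_blob,
            "src": src_tree, "root": root_tree}


first = snapshot(b"# Demo\n", b"console.log('hi');\n")
objects_after_first = len(store)
second = snapshot(b"# Demo\nMore docs.\n", b"console.log('hi');\n")

print(objects_after_first)               # 4: two blobs and two trees
print(len(store) - objects_after_first)  # 2: new readme blob and new root tree
print(first["src"] == second["src"])     # True: the src tree is reused
```

Because every key is derived from contents, reuse needs no special bookkeeping: storing an unchanged file or directory simply lands on a key that already exists.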
If you try to create a repository with the structure shown in the diagram above, you can start by creating an empty src directory. After calling git status, you may be surprised that this action is irrelevant to Git. It ignores the change because it tracks contents, not directory structure. Once the first file, readme.md, is added, git status will tell us that there is one untracked file. To commit it, you need to add it to the index (staging area). The index is a state (stored in the binary file index in .git) that will be used to create the snapshot of the entire repository for the next commit. In other words, a commit records the current state of our index.
If you modify a file after adding it to the index, git status reports it in both the “changes to be committed” and the “changes not staged for commit” sections. Git always tries to be helpful: in this case it provides not only a summary of changes, but also suggestions for commands that could be used to manage them.
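Why can one file show up in two sections at once? A rough way to picture it is as three separate states that are compared pairwise. The sketch below is a toy model under loud assumptions: each state is a plain dictionary from path to content (the real index stores blob identifiers and file metadata), deletions are ignored, and the status function is my own invention rather than anything Git exposes.

```python
def status(head: dict, index: dict, working: dict) -> dict:
    """Classify paths roughly the way the sections of git status do."""
    # Staged: the index differs from the last commit.
    staged = sorted(p for p in index if index[p] != head.get(p))
    # Unstaged: the working directory differs from the index.
    unstaged = sorted(p for p in index if p in working and working[p] != index[p])
    # Untracked: present on disk but unknown to the index.
    untracked = sorted(p for p in working if p not in index)
    return {"staged": staged, "unstaged": unstaged, "untracked": untracked}


head = {}                                  # no commits yet
index = {}                                 # nothing staged
working = {"readme.md": "first draft"}     # a new file on disk

before = status(head, index, working)
print(before)  # {'staged': [], 'unstaged': [], 'untracked': ['readme.md']}

index["readme.md"] = working["readme.md"]  # roughly what "git add readme.md" does
working["readme.md"] = "second draft"      # edit the file after staging it

after = status(head, index, working)
print(after)   # {'staged': ['readme.md'], 'unstaged': ['readme.md'], 'untracked': []}
```

A commit made at this point would contain "first draft", because that is what the index holds.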
The index is a very convenient way to prepare a commit exactly the way we want it, without making changes in the working directory. The decision on what should be moved to the index can be made not only at the file level but also at the level of a single change (using the --patch or --interactive options for git add).
Possible states of changes and transitions in the Git repository are presented in the diagram below. It is important to remember how the index works and to be aware that Git tracks content rather than individual files, which is what allows it to take snapshots of an entire repository.
Branches: a killer feature
Why is a branch called a “killer feature” of Git? Because it is extremely lightweight. It is nothing more than a simple pointer to a given commit. Physically it is stored as a file where:
- filename is a name of a branch,
- content is an identifier of a given commit.
There is also another important pointer, namely HEAD. It contains information about the branch we are currently on. It is very easy to create a branch using one of the forms of the git branch command. As a result, a new pointer is created. The process of switching branches is also very simple. After calling git checkout:
- HEAD is updated to point to the given branch,
- the content of the working directory is updated to match the snapshot of the selected commit (the one pointed to by the branch).
Because a branch in Git is so lightweight, operations on branches are very fast and convenient.
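The pointer files can be imitated in a few lines. The sketch below builds a throwaway directory that mimics the loose-ref layout (a HEAD file plus one file per branch under refs/heads) and is not a real repository. The commit ids are hypothetical placeholders, packed refs are ignored, and the helper current_commit is my own.

```python
import tempfile
from pathlib import Path


def current_commit(git_dir: Path) -> str:
    """Follow HEAD to a branch file and return the commit id stored there."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        return (git_dir / head[len("ref: "):]).read_text().strip()
    return head  # detached HEAD: the file holds a commit id directly


git_dir = Path(tempfile.mkdtemp()) / ".git"  # stand-in directory, not a real repo
heads = git_dir / "refs" / "heads"
heads.mkdir(parents=True)

commit_a = "a" * 40  # hypothetical commit ids
commit_b = "b" * 40

(heads / "main").write_text(commit_a + "\n")
(git_dir / "HEAD").write_text("ref: refs/heads/main\n")

# Creating a branch: one small file whose content is a commit id.
(heads / "feature").write_text(commit_a + "\n")

# Switching branches: HEAD now names the other branch
# (real Git also updates the index and the working directory).
(git_dir / "HEAD").write_text("ref: refs/heads/feature\n")

# Committing on the branch moves only that branch's pointer.
(heads / "feature").write_text(commit_b + "\n")

print(current_commit(git_dir) == commit_b)               # True
print((heads / "main").read_text().strip() == commit_a)  # True: main did not move
```

Nothing is copied when a branch is created, which is why the operation costs about as much as writing 41 bytes to disk.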
Configuration: levels and power
There are three configuration levels in Git:
- system – applicable to every user on a given machine
- global – applicable to a given user
- repository – applicable to a given repository
The names can be confusing. It is important to remember that the system level is more global than global. Each more specific level overrides the configuration settings from a more general one. You can check and change configuration using the git config command or by editing the config files directly.
Git config is powerful. Do you want to change a default editor or modify the push command behaviour? No problem. Just set a proper value for core.editor or push.default. The same rule applies to many other options.
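The override order between the three levels can be pictured as a chain of lookups. This is only a sketch of the precedence rule, not of how Git parses its files: the three dictionaries and all of their values are made-up examples, although core.editor and push.default are the real keys mentioned above.

```python
from collections import ChainMap

# Hypothetical settings at each level.
system_level = {"core.editor": "vi", "push.default": "simple"}
global_level = {"core.editor": "nano", "user.name": "Example User"}
repository_level = {"push.default": "current"}

# ChainMap searches its mappings left to right, so the most specific
# level is listed first and wins whenever a key is defined twice.
effective = ChainMap(repository_level, global_level, system_level)

print(effective["core.editor"])   # nano: global overrides system
print(effective["push.default"])  # current: repository overrides system
print(effective["user.name"])     # Example User: defined at one level only
```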
In the Git configuration, you can define aliases. This feature gives you much more than suggested by its name. You can not only define simple aliases for Git commands but also combine many Git commands with shell commands. This way you can create new – even very complex – commands that fulfill your specific needs.
Remember that aliases, configuration, as well as hooks (scripts executed when given operations like commit or rebase are performed), are local: specific to a given machine and not shared with other repository contributors.
Git is a very handy tool that you can use as you want. Working with Git means working with branches, but you can use them in many ways, defining your own workflows or adopting ones described by other users. You can decide how to integrate changes: via merge or rebase. You can easily integrate Git with continuous integration or ticket tracking systems.
Everything is local
Git is a distributed VCS which means that an entire repository is cloned on every machine you work on. As a result, every operation is local until you fetch or push changes. But remember that you need to do it explicitly. There is no automatic synchronisation so if someone in your team has pushed changes and you are not able to see them, then you need to fetch them. It’s worth pointing out that it may not be obvious to people who are new to Git. To avoid such frustrating problems, it is necessary to have at least a basic knowledge of how Git works.
Local operations are fast. An additional advantage of a distributed system is that you can continue your work with a repository even if you do not have a connection with a remote repository.
Shortcuts, shortcuts everywhere
I have mentioned before that Git is, in fact, a key-value store with a handy API. It exposes two sets of commands:
- plumbing – to do low-level work, manipulate data directly,
- porcelain – used to work with Git as VCS.
You probably don’t have to know anything about the plumbing API to work with Git effectively. But it is important to be aware of how polished the porcelain API is. Git provides you with a lot of useful information about each command. You can access it using git help. It is important to know exactly what the commonly used commands do. Let’s take a look at the very popular pull command. By default, it is a shortcut for two commands performed in this order:
- fetch – downloads changes from a remote repository and updates the remote-tracking branches
- merge – integrates the changes from the remote-tracking branch into the current local branch
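The two steps touch different pointers, which a small model makes visible. The sketch below is a deliberately simplified picture: repositories are dictionaries from branch name to commit id, only the fast-forward case is modelled (a real merge may create a merge commit or stop on conflicts), and the function names and commit ids c1 and c2 are hypothetical.

```python
def fetch(local: dict, remote: dict, branch: str) -> None:
    """Copy the remote branch position into the local remote-tracking branch."""
    local["origin/" + branch] = remote[branch]


def merge_fast_forward(local: dict, branch: str) -> None:
    """Move the local branch to the position of its remote-tracking branch."""
    local[branch] = local["origin/" + branch]


def pull(local: dict, remote: dict, branch: str) -> None:
    """Run both steps in order, as the default pull does."""
    fetch(local, remote, branch)
    merge_fast_forward(local, branch)


remote = {"main": "c2"}                      # a teammate has pushed commit c2
local = {"main": "c1", "origin/main": "c1"}  # our clone has not noticed yet

fetch(local, remote, "main")
print(local)  # {'main': 'c1', 'origin/main': 'c2'}

merge_fast_forward(local, "main")
print(local)  # {'main': 'c2', 'origin/main': 'c2'}
```

After fetch alone, our own branch is untouched; this is a safe way to look at incoming changes before deciding how to integrate them.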
Listen to Git
I recommend using Git via console (with the editor and diff/merge tools configured as you like). This way you know exactly what you are doing in each step. And Git can tell you what you are doing wrong or what options you have in the current situation.
Git tries to be very verbose and helpful. If you have a problem, then a solution, or at least a useful tip, will almost certainly be suggested to you in your console. Do not ignore this help.
I am aware that this article is only a very high-level (and incomplete) overview of Git. My main goal is to encourage you to get familiar with Git. You can use Git’s power only when you know what possibilities it gives you and how exactly it works.
Let’s call it “four steps to get a better Git experience” and start doing it.