Nanorepositories
I recently encountered a microservice antipattern called nanoservice that is described in the following manner:
Nanoservice is an Anti-pattern where a service is too fine grained. Nanoservice is a service whose overhead (communications, maintenance etc.) out-weights its utility.
I have encountered many similar situations with source code repositories, where different parts of a program or a system that was shipped as a single entity were divided into smaller repositories. The code in these repositories was not used by anything other than the product (the unit of release) that the code was part of.
In the most extreme case there were thousands of smaller repositories. Many of those repositories held fewer than 10 files in a deep directory hierarchy, implementing some tiny functionality of the whole, plus some boilerplate for building and repository management that could have been just a couple of extra lines in the build system of a larger entity.
In more common cases, one product consists of dozens of smaller repositories where one or two repositories get over 90% of the weekly commit traffic and the other repositories just get one commit here and there. Or there are multiple interlinked repositories (see an example in the How many Git repos article) that depend on each other, and very often all of them need to go through the same interface changes.
Sometimes all the work is done in smaller repositories and there is one superproject[1, 2, 3] that is automatically updated when a commit happens in any of its child projects. So basically you have one big repository that just consists of pointers to smaller repositories and adds one unneeded layer of indirection. Also, instead of making one commit that would reveal integration problems immediately, you now need multiple commits to reveal these issues, with some extra unneeded delay.
I would suggest calling these types of repositories nanorepositories. A nanorepository is a repository that holds a subsystem that is too limited to stand on its own and needs a bigger system to be part of. This bigger system is usually also the only system that uses the nanorepository. The nanorepository is also owned by the same organization as the larger entity it’s part of, so it doesn’t give any advantages in, for example, access control. Nanorepositories can hold just a couple of files, but they can also be relatively large applications that are, however, tightly coupled with the system they are part of.
Downsides of premature repository division
Nanorepositories are a case of premature optimization for code sharing where there is no real need for it. There are articles (Advantages of monolithic version control, On Monolithic Repositories) and presentations (Why Google Stores Billions of Lines of Code in a Single Repository, F8 2015 - Big Code: Developer Infrastructure at Facebook’s Scale) talking about the advantages of monolithic repositories, but those advantages can be hard to grasp without knowing the disadvantages of the other end.
I’ll list some issues that I have encountered when working with independent repositories. They all lead to a situation where developers need to spend extra time on repository management that could be avoided by grouping all software components that form the final product into one repository.
Expensive interface changes
Interface changes between components become expensive. You also need cumbersome interface deprecation policies and a way to support the old and new interface versions between repositories until the change has propagated everywhere. It can take months or years to ensure that all interface users have made the intended interface change. And if you don’t have a good search at your disposal, you still can’t be sure about it until you actually remove the old interface.
With separate repositories it’s often the case that you can’t easily search for where the interface you are deprecating is used. This means that you don’t know beforehand what actually depends on the interface. There naturally are search engines that span multiple repositories, but they very rarely beat a simple grep -r (or git grep) command whose output you can further filter with simple command line tools, especially if there are hundreds of small repositories that you need to include in your search.
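As a rough illustration of what such a cross-repository search involves, the sketch below scans a directory of checked-out repositories for an identifier and groups the hits by repository. The layout (one sub-directory per repository under a common root) and the name find_interface_users are assumptions made for this example only. In a single repository, one git grep invocation covers the same ground.

```python
import os
from collections import defaultdict


def find_interface_users(root, identifier):
    """Map repository name -> sorted list of files that mention `identifier`.

    Assumption for this sketch: every direct sub-directory of `root`
    is one checked-out repository.
    """
    hits = defaultdict(list)
    for repo in sorted(os.listdir(root)):
        repo_path = os.path.join(root, repo)
        if not os.path.isdir(repo_path):
            continue
        for dirpath, dirnames, filenames in os.walk(repo_path):
            # Do not descend into version control metadata directories.
            dirnames[:] = [d for d in dirnames if d not in (".git", ".hg")]
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, encoding="utf-8", errors="ignore") as f:
                        if identifier in f.read():
                            hits[repo].append(os.path.relpath(path, repo_path))
                except OSError:
                    continue  # Unreadable file: skipped in this sketch.
    return {repo: sorted(files) for repo, files in hits.items()}
```

Every repository has to be cloned and kept up to date before a search like this can be trusted, which is part of the overhead described above.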
Ignore file duplication
Often you need to add ignore files (.gitignore, .hgignore, etc…) to prevent junk from getting into the repository by accident. Situations that can generate junk next to your code include, for example:
- In-source builds (versus separate build directories) that leave generated files behind (*.a, *.o, *.exe, *.class, *.jar…).
- Using any editor that creates backup and other files next to the file you are editing (*~, *.swp, *.bak…).
- Using interpreted languages, like Python, whose default implementation byte-compiles the scripts for faster start-up (*.pyc, *.pyo…).
- Using integrated development environments that require their own project directories.
All these generic ignore rules need to be included in every repository in addition to project specific ignores. The other possibility is forcing these ignore rules on the developers themselves instead of taking care of them centrally. In the case of nanorepositories there likely are just one or two languages used per repository, so the number of ignore rules mostly depends on the development environment that the developers work with. But it’s still needless duplication that you could get by without.
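To get a feel for how much of this is duplication, a small sketch like the following counts the ignore rules that show up in more than one repository. The function name shared_ignore_rules and the repository names in the usage example are hypothetical, and the input is assumed to be the ignore file contents already read into strings.

```python
from collections import Counter


def shared_ignore_rules(ignore_files):
    """Return {rule: repository count} for rules used in more than one repository.

    `ignore_files` maps a repository name to the text of its ignore file.
    """
    counts = Counter()
    for text in ignore_files.values():
        rules = set()
        for line in text.splitlines():
            line = line.strip()
            # Blank lines and comments are not rules.
            if line and not line.startswith("#"):
                rules.add(line)
        counts.update(rules)  # Each rule counts once per repository.
    return {rule: n for rule, n in counts.items() if n > 1}


example = {
    "parser": "*.o\n*.pyc\nbuild/\n",
    "net": "*.o\n*~\n",
    "ui": "# editor backups\n*~\n*.o\n",
}
duplicated = shared_ignore_rules(example)
```

With the example input above, *.o is reported for three repositories and *~ for two. In a single repository each of those rules would be written once.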
Re-inventing inefficient build system rules
Small repositories force you to reinvent build system rules for every repository from scratch if you want to test and build your component in isolation. The alternatives are a lot of code duplication or references to a separate repository that holds the common build rules. This also applies to test runners, as different levels of testing for different languages usually have their own test runners. Sometimes there are multiple test runners per language, all with some non-default options that provide various advantages in test result reporting.
Modern build systems, like Ninja and Bazel, usually work on knowing the whole build graph of the system that they are trying to build. This makes it easier to discover dependencies and to rebuild only the parts that actually need rebuilding. Building every repository independently from each other leads to recursive build systems that treat their inputs as black boxes (Bitbake, npm, Maven, Make…). Changes in these black boxes are communicated either with version number changes or by always rebuilding the component in question. This leads to wasteful processes and resource usage when compared to trunk based development in monolithic repositories.
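The benefit of knowing the whole build graph can be sketched in a few lines. The code below is not how any particular build system is implemented. It is a minimal illustration, with hypothetical target and file names, of how a complete dependency graph lets a tool work out exactly which targets a change affects.

```python
from collections import defaultdict, deque


def targets_to_rebuild(deps, changed):
    """Return every target that depends, directly or indirectly, on `changed`.

    `deps` maps a target to the list of inputs it is built from.
    """
    # Invert the graph: input -> targets that are built from it.
    users = defaultdict(set)
    for target, inputs in deps.items():
        for item in inputs:
            users[item].add(target)

    dirty = set()
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for target in users[node]:
            if target not in dirty:
                dirty.add(target)
                queue.append(target)
    return dirty


# Hypothetical build graph for a small product.
deps = {
    "libparser.a": ["parser.c"],
    "libnet.a": ["net.c"],
    "app": ["main.c", "libparser.a", "libnet.a"],
    "tool": ["tool.c", "libnet.a"],
}
after_parser_change = targets_to_rebuild(deps, ["parser.c"])
```

Here a change to parser.c marks only libparser.a and app for rebuilding, while libnet.a and tool are left alone. When each component lives in its own repository and is consumed as a black box, this information is not available across repository boundaries.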
Overly complicated continuous integration machinery
One of the defining principles of modern software development is having a working continuous integration system in place. This ensures that independent changes also work when they leave the developer’s machine and when they are integrated with the rest of the product that the change is part of. And this is done multiple times every day. Combined with trunk based development, this keeps integration issues short-lived (minutes to days) and avoids the many-month release freezes of branched or forked development methods.
Nanorepositories likely end up with repository specific checks in the continuous integration machinery that only verify that the component in the repository itself works well. And if this continuous integration machinery generates check jobs automatically per repository, every repository likely needs to have an entry point (like make test or a test.sh script) to execute those tests. The same applies to compilation. Not to mention the extra work when trying to compile and test against different systems and runtime instrumentation (like AddressSanitizer).
When the component finally gets integrated with everything else and the system breaks, figuring out the exact commit where the breakage was introduced (besides the integrating one) can be really painful. This is because there can easily be dozens to thousands of commits between component releases. See a physical world example where components work perfectly together, but fail when integrated. And its hotfix.
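When every commit can be tested against the whole product, finding the breaking commit is a binary search over the history, which is the idea behind git bisect. The sketch below shows that search under stated assumptions: the commits are ordered from oldest to newest, the newest one is known to be bad, and once the breakage is introduced it stays. The names first_bad_commit and is_good are hypothetical.

```python
def first_bad_commit(commits, is_good):
    """Return (first bad commit, number of test runs needed to find it).

    Assumes `commits` is ordered oldest to newest, the last commit is bad,
    and every commit after the first bad one is bad as well.
    """
    runs = 0
    low, high = 0, len(commits) - 1  # commits[high] is known to be bad.
    while low < high:
        mid = (low + high) // 2
        runs += 1
        if is_good(commits[mid]):
            low = mid + 1  # The breakage is after mid.
        else:
            high = mid  # mid is bad, so the first bad commit is mid or earlier.
    return commits[low], runs


# 1000 commits between two releases; commit 613 introduced the breakage.
history = list(range(1000))
culprit, runs = first_bad_commit(history, lambda commit: commit < 613)
```

For a thousand commits the search needs about ten test runs. The difficulty with separately released components is that the intermediate commits were never built or tested against the rest of the system, so each step of the search may first require getting an old combination of repositories to build at all.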
A case for small repositories
Nanorepositories should not be confused with small independent repositories, as not everything needs to aim to be a part of a bigger product. A very common reason for small repositories is the combination of ownership management with shareable components. Most open source projects are owned by certain people, or by an organization, and there is just no good case for them to be part of anything bigger, especially if they provide independent components that really are used by multiple external entities. The same downsides, however, generally apply to a collection of open source projects as to products consisting of multiple repositories.