• Taking a look at python-afl

    I have been using american fuzzy lop to fuzz various C and C++ programs and libraries. It is a wonderfully fast fuzzer that is generally easy to get started with and it usually finds new bugs in programs that have not been fuzzed previously. Support for american fuzzy lop instrumentation has also been added for other languages and I decided to try out how it works with Python. More specifically with the reference CPython implementation of it.

    Fuzzing Python programs with american fuzzy lop

    American fuzzy lop generally works by running a program that is compiled with american fuzzy lop instrumentation built in. It executes the program with afl-fuzz command that modifies the input data that is fed to the program, monitors how the program behaves, and registers everything that causes abnormal program behavior. This works well for natively compiled programs, but causes various issues with interpreted programs.

    Python is by default an interpreted language, so to execute Python programs, you need to start a Python interpreter before executing your code. This means that if you would instrument the Python interpreter with american fuzzy lop instrumentation and run the interpreter with afl-fuzz, it would mostly fuzz the inner workings of the interpreter, not the actual Python program.

    Fortunately there is python-afl module that enables american fuzzy lop instrumentation for just the Python code instead of instrumenting the Python interpreter. In native programs american fuzzy lop compiler wrapper (afl-gcc, afl-clang, afl-clang-fast) adds the necessary instrumentation and the connection to afl-fuzz. Python-afl is, however, designed in such way that it doesn’t try to wrap the whole program, but requires you to create a wrapper module that initializes fuzzing.

    The most simple way to wrap a Python program with python-afl is to initialize python-afl and then run the program (assuming that main() function exists):

    import afl
    import fuzzable_module

    This script, saved to, can be then run with py-afl-fuzz command that wraps afl-fuzz for Python programs:

    py-afl-fuzz -m 400 -i initial-inputs/ -o fuzzing-results/ -- \
        python @@

    More details about these command line switches can be found from AFL readme file. This then brings out the famous american fuzzy lop status screen, but now for Python programs:

    afl-fuzz status screen with python-afl. Yes, I use white background.

    Figure 1: afl-fuzz status screen with python-afl. Yes, I use white background.

    Next sections will explain in more details how to make fuzzing these programs more efficient and what pitfalls there could be in Python programs from fuzzing efficiency point of view.

    Afl-fuzz modes and their python-afl equivalents

    Generally afl-fuzz provides 4 fuzzing modes that differ in how the program execution between different fuzzing inputs behaves:

    • Dumb mode that just executes the program by doing fork() and execv(). This is the slowest mode that does not rely on any fancy tricks to speed up program execution and also does not provide any insights how the program behaves with different inputs.
    • Basic fork server mode where the fuzzed binary does all the initialization steps that happen before calling the main() function and then program is repeatedly forked from that point on. This also includes instrumentation that is compiled in to the program so there already is some insight on what is happening inside the program when a specific input is processed. There exists QEMU mode for afl-fuzz that technically enables fork server mode for uninstrumented binaries, but with some performance penalty.
    • Deferred instrumentation that works in similar fashion as the basic fork server mode. Instead forking just before calling main() function, this enables to move the fork point further down the line and enables heavy program initialization steps to be avoided if they can be executed independently of the input.
    • Persistent mode where the fuzzable part of the program is repeatedly executed without resetting the program memory every time the program is called. This only works in practice if the program does not have a modifiable global state that can not be reset to the previous state.

    Afl-fuzz generates new inputs and analyzes the program execution results roughly at the same speed regardless of the mode. So these modes are in the order of efficiency in a sense that how much overhead there is for fuzzing one input. They are also in the order of complexity on how easy they are to integrate into an existing program that has not been made to be fuzzed. Especially as the fastest modes require clang to be available as a compiler and the fuzzable program needs to be able to be compiled and linked with it.

    Python-afl, fortunately, provides equivalent modes without having to use special tools. These are also very fast to try out, as you don’t need to compile Python programs from scratch.

    The dumb mode would be just equivalent of running Python interpreter directly with afl-fuzz without any instrumentation, so we will skip over it. The more interesting part is to use the deferred instrumentation. The code in the introductory section called afl.init() before the fuzzable module was imported. This is the most safe approach, as the fuzz target might do something with the input at import time. But more realistically, Python programs generally only call import statements, possibly conditionally, during the start-up and don’t handle any user provided data yet. So in this case, we can do imports first and move the afl.init() function just before where the actual work happens:

    import afl, fuzzable_module

    We can gain some speed-ups with this by calling os._exit() function instead of letting Python to exit in the usual fashion where all the destructors and other functions that are called at exit:

    import afl, fuzzable_module, os

    Previous examples assume that the input file generated by the fuzzer comes as the first parameter on the command line. This is quite a good assumption, as many data processing modules for Python include a command line interface where they read and process files given on the command line. But if we can directly call the data processing function, we can instead use the standard input to feed the data:

    import afl, fuzzable_module, os, sys

    With Python 3 comes additional complexity. Python 3 processes the standard input using the encoding specified in the environment. Often in Unix environments it is UTF-8. As afl-fuzz mostly does bit manipulation, input is going to end up with broken UTF-8 data and results in exception when reading from the standard input file object. To work around this, you can use sys.stdin.buffer instead of sys.stdin in Python 3 based programs. Or create a shim that always results in raw bytes:

    import afl, fuzzable_module, os, sys
        # Python 3:
        stdin_compat = sys.stdin.buffer
    except AttributeError:
        # There is no buffer attribute in Python 2:
        stdin_compat = sys.stdin

    The fastest persistent mode requires that the program should not have a global state where the previous program execution affects the next one. There unfortunately is a surprising amount of global state in Python programs. It is not that uncommon to initialize some specific variables only during the program execution and then re-use the results later. This usually is harmless, but negatively affects the program stability that afl-fuzz is going to show in its status screen.

    Persistent mode code for a fuzzable program could look like following including the learnings from the deferred instrumentation and its speedups:

    import afl, fuzzable_module, os
        # Python 3:
        stdin_compat = sys.stdin.buffer
    except AttributeError:
        # There is no buffer attribute in Python 2:
        stdin_compat = sys.stdin
    while afl.loop(10000):

    Benchmarking different afl-fuzz modes

    I wanted to measure how these different afl-fuzz modes behave with Python. So I created a small fuzz target whose main algorithm does some conditional computation based on the input data and prints out the result:

    def fuzz_one(stdin, values):
        data =
        total = 0
        for key in data:
            # This only includes lowercase ASCII letters:
            if key not in values:
            value = values[key]
            if value % 5 == 0:
                total += value * 5
                total += ord(key)
            elif value % 3 == 0:
                total += value * 3
                total += ord(key)
            elif value % 2 == 0:
                total += value * 2
                total += ord(key)
                total += value + ord(key)

    This is just to exercise the fuzzer a little bit more than a trivial function that does nothing would do. I also created an equivalent fuzz target in C++ to give numbers to compare what kind of an overhead different fuzzing modes incur for both Python and native applications. The approximate results are summarized in table 1. The actual scripts used to generate this data are available from following links:,, and target-simple.cpp.

      Python 2 Python 3 Native
    dumb mode 110/s 47/s 1200/s
    pre-init 130/s 46/s 5800/s
    deferred 560/s 260/s 6800/s
    quick exit 2700/s 2100/s 8700/s
    persistent mode 17000/s 15000/s 44000/s
    Table 1: afl-fuzz benchmarks for various fuzzing modes for Python 2, Python 3, and for C++ versions of the example fuzz target.

    What these results show, it is possible to make fuzzable program in Python in such way that the fuzzing start-up overhead is only three to four times larger than for a native one with afl-fuzz. This is an excellent result considering that generally algorithms implemented in Python can be considered 10-100 times slower than ones implemented in C family languages. But if you want to use Python for performance critical tasks, you are anyways using Cython or write performance critical parts in C.

    There is a clear performance difference between Python 2.7.14 and Python 3.6.4 especially when all start-up and exit optimization tricks are not in use. This difference is also visible in Python start-up benchmarks at This difference gets smaller when the persistent mode is used as Python executable is not shut down immediately after processing one input. What can also help Python 3 in the persistent fuzzing mode is the fact that the tracing function that python-afl sets with sys.settrace() is called only half as often with Python 3 is it is called with Python 2 for this fuzz target.

    More speed for repeated Python executions

    Python enables imports and other type of code loading at any phase of the program execution. This makes it quite possible that program has not fully loaded before it is executed for the first time. This is usually used as a configuration mechanism to configure the program based on the runtime environment and other type of configuration. It also provides the possibility to write plugins, similarly to what dynamically loaded libraries provide.

    You can see from table 1 how the fuzzing overhead decreases a lot when the fuzzing start point moved after import statements for this simple example program. So when I was fuzzing an old version of flake8, I tried to see if it had any hidden configuration statements that would execute and cache their results for repeated calls. And it did!

    Initially I used following type of fuzzing wrapper for flake8:

    import afl, sys,

    It is basically a simple wrapper that imports all what it needs and then fuzzes what is fed from the standard input to the program. But the performance of this was horrible, around 15 executions/second. So I tried to see what happens when I change the code a little bit by calling the function with an empty string before setting the fuzzing starting point:

    import afl, sys,
    # Make sure that the hidden runtime configuration is executed:"")

    This doubled the execution speed to around 30 executions/second. It is still quite slow for the small inputs that afl-fuzz initially creates, but an improvement nonetheless. I looked at what flake8 does when it is executed and a following line popped up:

    from pkg_resources import iter_entry_points

    There basically is a hidden import statement in the execution path whose result is cached after the first time it is encountered. Also this pkg_resources.iter_entry_points() function is used to configure the program at runtime and that also adds some extra overhead to the process.

    Flake8 also by default tries to execute checks in parallel with multiprocessing module. This might be a good idea when you have multiple files to verify at once, but during fuzzing it just adds unneeded overhead. Also the fact that it starts a new process makes the fuzzer lose all information about what is happening in the subprocess. Fortunately in flake8 it was possible to override the detected multiprocessing support by just setting one variable into False and then flake8 will act as there is no multiprocessing support. This increased the average fuzzing speed of flake8 by threefold.

    The final speedup with flake8 came when looking at how flake8 is constructed. It is basically a wrapper around mccabe, pycodestyle, and pyflakes packages. So rather than fuzz flake8, it is much more productive to create a fuzz target for each one of those packages individually. I did this for pycodestyle and ended up finally executing it with around 420 executions/second for trivial data and around 200 executions/second for more realistic. So basic recommendations on how to fuzz native programs also apply for Python.

    Monkey patching around fuzzing barriers

    As american fuzzy lop does not know almost anything about the input, it can encounter various impediments (see section 13 from afl’s README.txt) when the file format includes any values that depend on the previously encountered data. This is especially problematic with checksums and is also an issue with other mutation based fuzzers. To work around this the C world, you can generally use C preprocessor to turn off checksum verification when such thing is encountered. This also applies for all other types of barriers that might skew fuzzing results, like random number usage.

    Unfortunately Python does not have preprocessor support by default, so this type of conditional compiling is out of the question. Fortunately Python provides the possibility to do monkey patching where you can replace functions or methods at runtime. So to make a library more fuzzer friendly, you can monkey patch all functions related to data verification to always return True, or return some constant value when checksums are considered.

    I used this approach to fuzz python-evxt 0.6.1 library. Python-evxt is a library to parse Windows event log files and is one of the first hits when you search Python Package Index with “pure python parser” keyword. The file format includes CRC32 checksum that will prevent the fuzzer from being able to meaningfully modify the input file, as almost all modifications will create an incorrect checksum in the file.

    To monkey patch around this issue, I searched the source code for all functions that potentially have anything to do with checksum generation and made them always to return a constant value:

    import Evtx.Evtx
    checksum_patches = (
        (Evtx.Evtx.ChunkHeader, "calculate_header_checksum"),
        (Evtx.Evtx.ChunkHeader, "header_checksum"),
        (Evtx.Evtx.ChunkHeader, "calculate_data_checksum"),
        (Evtx.Evtx.ChunkHeader, "data_checksum"),
        (Evtx.Evtx.FileHeader, "checksum"),
        (Evtx.Evtx.FileHeader, "calculate_checksum"),
    for class_obj, method_name in checksum_patches:
        setattr(class_obj, method_name, lambda *args, **kw: 0)

    The checksum_patches variable holds all the functions that you need to overwrite to ignore checksums. You can use setattr() to overwrite these methods on class level with an anonymous function lambda *args, **kw: 0 that always returns 0 and takes any arguments. Taking any number of arguments in any fashion is enabled with *args and **kw and this syntax is explained in keyword arguments section on Python’s control flow tools manual.


    I was impressed on how low fuzzing overhead Python programs can have when compared to C family languages. When looking at how python-afl is implemented, it becomes quite clear that the deferred mode of afl-fuzz has a great part in this where the whole Python environment and program initialization is skipped between different fuzzing runs.

    Making a Python module fuzzable was also more easy than in C family languages. Although american fuzzy lop is already the easiest fuzzing engine to get started with, the fact that it uses clang to do the more advanced tricks often gives headaches when trying to get software to compile. The fact that I did not have to use any modified Python interpreter to get started but only had to import afl module made me to realize how many steps I skipped that I normally have to do when using american fuzzy lop on new systems.

    Thanks to Python’s dynamic execution and monkey patching capabilities I could also try out fuzz target creation with external libraries without having to actually modify the original code. Especially selecting some specific functions to fuzz and overriding checksum generation would generally require nasty software patches with C family languages. Especially if there is main() function to override.

    It was also nice to realize that the os._exit(0) equivalent in standard C library function call _Exit(0) could help make light native fuzz targets even faster. The process cleanup in this relatively trivial C++ program adds almost 30% overhead for repeated program execution. Although it will likely break any sanitizer that does any verification work during the exit, like searching for dynamically allocated memory that was not freed.

  • Nanorepositories

    I recently encountered a microservice antipattern called nanoservice that is described in following manner:

    Nanoservice is an Anti-pattern where a service is too fine grained. Nanoservice is a service whose overhead (communications, maintenance etc.) out-weights its utility.

    I have encountered a lot of similar situations with source code repositories where different parts of a program or a system that was shipped as a single entity were divided into smaller repositories. The code in these repositories was not used by anything else than the product (the unit of release) that the code was part of.

    In the most extreme case there were thousands of smaller repositories. Many of those repositories were just holding less than 10 files in a deep directory hierarchy implementing some tiny functionality of the whole. Plus some boilerplate functionality for building and repository management that could just have been a couple of extra lines in a build system for a larger entity.

    In more common cases, one product consists of dozens of smaller repositories where one or two repositories get over 90% of the whole weekly commit traffic and other repositories just get one commit here and there. Or that there are multiple interlinked repositories (see an example in How many Git repos article) that depend on each other and very often all of them need to go through the same interface changes.

    Sometimes there also is a situation that all the work is done in smaller repositories and then there is one superproject[1, 2, 3] that is automatically updated when a commit happens in any of its child projects. So basically you have one big repository that just consists of pointers to smaller repositories and has one unneeded layer of indirection. Also instead of making one commit that would reveal integration problems immediately, you now need to have multiple commits to reveal these issues. With some extra unneeded delay.

    I would suggest calling these types of repositories nanorepositories. A nanorepository is a repository that holds a subsystem that is too limited to stand on its own and needs a bigger system to be part of. This bigger system usually is also the only system that is using this nanorepository. The nanorepository is also owned by the same organization as the larger entity it’s part of. Therefore it doesn’t give any advantages, for example, in access control. Nanorepositories can hold just couple of files, but they can also be relatively large applications that are, however, tightly coupled with the system they are part of.

    Downsides of premature repository division

    Nanorepositories are a case of premature optimization for code sharing where there is no real need for it. There are articles (Advantages of monolithic version control, On Monolithic Repositories) and presentations (Why Google Stores Billions of Lines of Code in a Single Repository, F8 2015 - Big Code: Developer Infrastructure at Facebook’s Scale) talking about the advantages of monolithic repositories, but those advantages can be hard to grasp without knowing the disadvantages of the other end.

    I’ll list some issues that I have encountered when working with independent repositories. These all lead to a situation where developers need to spend extra time in repository management that could be avoided by grouping all software components that form the final product into one repository.

    Expensive interface changes

    Interface changes between components become expensive. You also need cumbersome interface deprecation policies and a way to support the old and new interface versions between repositories until the change has propagated everywhere. It can take months or years to ensure that all interface users have done the intended interface change. And if you don’t have a good search at your disposal, you still can’t be sure about it before you really remove the old interface.

    With separate repositories it’s often the case that you can’t easily search where the interface you are deprecating is used at. This means that you don’t beforehand know what is actually depending on the interface. There naturally are search engines that span over multiple repositories, but they very rarely beat a simple grep -r (or git grep) command whose output you can further filter with simple command line tools. Especially if there are hundreds of small repositories that you need to include in your search.

    Ignore file duplication

    Often you need to add ignore files (.gitignore, .hgignore, etc…) to prevent junk going in to the repository by accident. Situations that can generate junk next to your code can be for example:

    • In-source build (versus separate build directories) generated files (*.a, *.o, *.exe, *.class, *.jar…).
    • Using any editor that creates backup and other files next to the file you are editing (*~, *.swp, *.bak…).
    • Using interpreted languages, like Python, whose default implementation byte compile the scripts for faster start-up (*.pyc, *.pyo…).
    • Using integrated development environments that require their own project directories.

    All these generic ignore rules need to be included in every project in addition to project specific ignores. Other possibility is forcing these ignore rules on developers themselves instead of taking care of them centrally. In case of nanorepositories there likely is just one or two languages used per repository, so the amount of ignore rules likely depends on the development environment that the developers work with. But it’s still needless duplication when you could get by without.

    Re-inventing inefficient build system rules

    Small repositories lead into having to reinvent build system rules for every repository from scratch if you want to test and build your component in isolation. Or doing a lot of code duplication or including references to a repository including common build rules. This also applies for test runners, as different levels of testing for different languages usually have their own test runners. Sometimes multiple test runners per language, that all have some non-default options that provide various advantages in test result reporting.

    Modern build systems like, ninja and Bazel, usually work on knowing the whole build graph of the system that they are trying to build. This makes it more easy to discover dependencies that only rebuild parts that are necessary to rebuild. Building every repository independently from each other leads into recursive build systems that treat their inputs as black boxes (Bitbake, npm, Maven, Make…). Changes in these black boxes are either communicated with version number changes or always rebuilding the component in question. This leads into a wasteful process and resource usage when compared to trunk based development of monolithic repositories.

    Overly complicated continuous integration machinery

    One of the defining principles of modern software development is having a working continuous integration system in place. This ensures that independent changes also work when they leave developer’s machine and also work when they are integrated with the rest of the product that the change is part of. And this is done multiple times every day. This, combined with trunk based development, keeps integration issues short (minutes to days) and avoids many-month release freezes compared to branched or forked development methods.

    Nanorepositories likely end up in a repository specific checks in the continuous integration machinery that only verify that the component in the repository itself works well. And if this continuous integration machinery has an automatic per repository check job generation, it likely needs to have an entry point (like make test or script) to execute those tests. And the same applies for compilation. Not to mention the extra work when trying to compile and test against different systems and runtime instrumentation (like AddressSanitizer).

    When the component finally gets integrated with everything else and the system breaks, figuring out the exact commit where the breakage happens (besides the integrating one) can be really painful. This is because it is easily possible to have dozens to thousands of commits between component releases. See a physical world example where components work perfectly together, but fail when integrated. And its hotfix.

    A case for small repositories

    Nanorepositories should not be confused with small independent repositories, as not everything needs to aim to be a part of a bigger product. A very common reason for small repositories is the combination of ownership management with shareable components. Most open source projects are such that they are owned by a certain people, or an organization, and it’s just not a good case for them to be part of anything bigger. Especially if they provide independent components that really are used by multiple external entities. Same downsides, however, generally apply to a collection of open source projects as to products consisting of multiple repositories.

  • Static code analysis and compiler warnings

    Compiler generated warnings are one form of static code analysis that provides a codified form of certain types of beneficial programming practices. Nowadays modern compilers used to compile C family languages (C, C++, and Objective-C) provide hundreds of different warnings whose usefulness varies depending on project and its aims.

    In this article I will examine what level of issues compiler warnings can find, what is the cost of enabling warnings and analyze compiler warning flag lists for both clang and GCC compilers.

    Levels of static code analysis

    Compiling C family languages usually involves preprocessor, compiler, assembler, and a linker. This also leads to situation that static code analysis can be done in various phases of program construction. These generally are:

    • Analysis on plain source files.
    • Analysis on preprocesses source files.
    • Analysis on compilation unit level.
    • Link-time analysis.

    This multi-stage program construction results in difficulties for tools that are not called with the exact same arguments providing information about preprocessor definitions, and include and library directories. For example tools like splint, Cppcheck, and many editor front-ends work outside the build system and can result in false warnings because they can not see inside some macro definitions that were not included in the simple static analysis setup. This becomes an issue with larger projects that do not necessarily have the most straightforward build setups and the most trivial header file inclusion policies. This does not mean that such tools are useless, but they will result in false positive warnings that can be really annoying unless they are silenced or ignored in some way.

    Analysis on preprocessed source files already provides pretty accurate picture of what kind of issues there can be in the program, but it necessarily is not enough. In the compilation phase compilers constantly transform the program into new, functionally equivalent, forms during optimization phases that can even result in unexpected code removal that is not necessarily trivial to notice. Compilation phase also gives more opportunities for target platform specific static code analysis. For example pipeline stalls or value overflows due to incorrect assumptions on data type sizes can usually be noticed only after the target platform is known.

    Final phase in program construction, that provides options for static analysis, is the linking phase. In the linking phase linker takes care that all the functions and global variables that the program calls come from somewhere and that there are no conflicting duplicate names defined. This should also enable some automatic detection capabilities for memory leaks and such that come from calling functions defined in different compilation units. I’m not sure if any freely available static analyzer does this.

    Compiler warning flags

    Compiler warning flags are one way to do static code analysis that cover all possible phases of program construction. This assumes that the compiler is involved in all phases of program construction. And they usually are, as in all phases from preprocessing to linking compiler front-end is used as a wrapper to all the tools that do the actual hard work.

    Warning flags and compilation time

    Using static code analysis in form of compiler warnings incurs some penalty, as they need to execute some extra code in addition to normal code related to compilation. To measure the penalty and to contrast it with some more advanced static analysis tools,

    I did some benchmarks by compiling Cppcheck 1.73 and FFTW 3.3.4 with clang 3.8, GCC 6.1, and Infer 0.8.1 by using -O3 optimization level. Cppcheck is a program mainly written in C++ and FFTW is mainly written in C. Infer has some experimental checks for C++ enabled with --cxx command line option, so I ran Infer twice for Cppcheck, with and without C++ checks. Clang had all warnings enabled -Weverything and GCC had all warning options that did not require any special values. This resulted in following minimum execution times of 3 runs:

    Compiler Program No warnings All warnings
    clang Cppcheck 59.3 s 1 min 1.1 s (+ 3.0 %)
    GCC Cppcheck 1 min 32.7 s 1 min 38.8 s (+ 6.6 %)
    Infer Cppcheck - 17 min 50 s (18x slower)
    Infer --cxx Cppcheck - 1 h 36 min (97x slower)
    clang FFTW 40.5 s 40.9 s (+ 1 %)
    GCC FFTW 42.7 s 58.1 s (+ 36 %)
    Infer FFTW - 4 min 43 s (10x slower)

    We can see that for clang and GCC the extra processing time added even by all warnings flags is pretty small compared to all the other compilation and optimization steps for a C++ application (Cppcheck). But for mostly C based application (FFTW) GCC gets surprisingly heavy, although build times still remain within the same order of magnitude.

    If we then compare the time that a more heavy static code analyzer takes, these compiler warnings are extremely cheap way to add static code analysis. They may not catch all the same bugs as these more advanced methods do, but they do offer a cheap way to avoid the basic mistakes.

    Warning flag lists

    I have created a project that can automatically parse compiler warning flags from command line option definition files in clang and GCC. This came partially from a necessity and partially from curiosity to examine what kind of options clang and GCC provide in easy to digest format. Although both compiler provide some kind of lists of warning flags as part of their documentation, they are pretty cumbersome to go through when the main interest is first figure what there is available and then just look at the details.

    Warning options and deprecation

    Different compilers have different policies about backwards compatibility and deprecation. When looking at how warning options have evolved, GCC has not removed between versions 3.4 and 6.1 a single switch, it has just switched them to do nothing (-Wimport, -Wunreachable-code, and -Wmudflap switches). Clang on the other hand has removed multiple switches between versions and for example there is no references to -Wcxx98-cxx11-compat in the current codebase even if clang 3.3 had such switch.

    Examining differences visually

    Generating large purely textual differences between different files becomes quite cumbersome quite soon if you want to do anything more complicated than a simple difference of unique command line options between two subsequent versions. For example if we look at figure 1 that shows what other warnings -Wall flag enables in GCC 6 when compared to GCC 5. We can see that there are quite many extra warnings added to -Wall switch so newer compiler versions provide extra analysis capabilities even without adding all the new options individually.

    Meld showing differences what flags
-Wall enables between GCC 5 and 6.

    Figure 1: Meld showing differences what flags -Wall enables between GCC 5 and 6.

    From figure 2 we can also see that GCC 6 uses -Wc++11-compat as the default warning flag indicating differences between ISO C++ 1998 and ISO C++ 2011 for constructs that have the same name instead of -Wc++0x-compat, that refers to a draft standard. So GCC has basically deprecated -Wc++0x-compat switch in favor of a switch that refers to the actual standard.

    -Wc++0x-compat is an alias of -Wc++11-compat in GCC 6 instead the other way around.

    Figure 2: -Wc++0x-compat is an alias of -Wc++11-compat in GCC 6 instead the other way around.

    Suggestions for usable warning options

    I won’t be giving any specific suggestions here for warning flags, as there seem to be new options for each subsequent compiler release. A good place to start is NASA’s JPL Institutional Coding Standard for the C Programming Language that includes a very short list of rudimentary warning flags for GCC. It also includes a short list of coding standards of which each one would have prevented a mission failure for NASA. SEI CERT coding standards for secure coding also provide various automatically generated lists for clang warning flags and GCC warning flags based on the issues that these standards take into account.

    And finally, check out the warning flag lists for clang and GCC and make your own combinations that bring the most benefit for whatever you are working with. Not all of them are appropriate for your project and some of them may be even working against the useful development patterns that you have.

    Cautionary tales about compiler warnings flags

    Even though it might sound like a good idea to rush and fix all the issues that these new compiler warning flags uncover, it might actually cause some new bugs to pop up. Specifically SQLite database engine has had its own take on compiler warnings and their fixing and they have concluded that fixing compiler warnings actually has produced some extra bugs that would not have come into light if there would have not been tries to fix compiler warnings.

    I have also had my own take on compiler warning fixes and sometimes I have screwed up and messed up with a perfectly working code while fixing a misleading warning. But generally my own experience has lead to more fixes than what there have been bugs. And the coolest thing is, that having these warnings enabled as the standard development process prevent some bugs from ever creeping up to the application in the first place.

  • Being a good CPU neighbor

    Very often computational tasks are roughly divided into real-time and batch processing tasks. Sometimes you might want to run some tasks that take a large amount of computation resources, get the result as fast as possible, but you don’t really care exactly when you get the result out.

    In larger organizations there are shared computers that usually have a lot of CPU power and memory available and that are mostly idle. But they are also used by other users that will be quite annoyed if they suffer from long delays to key presses or from similar issues because your batch processing tasks take all the resources that are available away from the system. Fortunately *nix systems provide different mechanisms available to non-root users to avoid these kind of issues.

    These issue can also raise in situations where you have a system that can only allocate static resource requirements to tasks, like in Jenkins. As an example, you might have some test jobs that need to finish within certain time limit mixed with compilation jobs that should finish as soon as possible, but that can yield some resources to these higher priority test jobs, as they don’t have as strict time limitation requirements. And you usually don’t want to limit the resources that compilation jobs can use, if they are run on otherwise idle machine.

    Here I’m specifically focusing on process priority settings and other CPU scheduling settings provided by Linux, as it currently happens to be probably the most used operating system kernel for multi-user systems. These settings affect how much CPU time a process gets relative to other processes and are especially useful on shared overloaded systems where there are more processes running than what there are CPU cores available.

    Different process scheduling priorities in Linux

    Linux and POSIX interfaces provide different scheduling policies that define how Linux scheduler allocates CPU time to a process. There are two main scheduling policies for processes: real-time priority policies and normal scheduling policies. Real-time policies are usually accessible only to root user and are not the point of interest of here, as they can not usually be used by normal users on shared systems.

    At the time of the writing this article there are three normal scheduling policies available to normal users for process priorities:

    1. SCHED_OTHER: the default scheduling policy in Linux with the default dynamic priority of 0.
    2. SCHED_BATCH: policy for scheduling batch processes. This schedules processes in similar fashion as SCHED_OTHER and is affected by the dynamic priority of a process. This makes the scheduler assume that the process is CPU intensive and makes it harder to wake up from a sleep.
    3. SCHED_IDLE: the lowest priority scheduling policy. Not affected by dynamic priorities. This equals to nice level priority of 20.

    The standard *nix way to change process priorities is to use nice command. By default in Linux, process can get nice values of -20–19 where the default nice value for a process is 0. When process is started with nice command, the nice value will be

    1. Values 0–19 are available to a normal user and -20–-1 are available to the root user. The lowest priority nice value of 19 gives process 5% of CPU time with the default scheduler in Linux.

    chrt command can be used to give a process access to SCHED_BATCH and SCHED_IDLE scheduling policies. chrt can be used with combination of nice command to lower the priority of SCHED_BATCH scheduling policy. And using SCHED_IDLE (= nice level 20) policy should give around 80% of the CPU time that nice level of 19 has, as nice level weights increase the process priority by 1.25 compared to lower priority level.

    Benchmarking with different competing workloads

    I wanted to benchmark the effect of different scheduling policies on a work done by a specific benchmark program. I used a system with Linux 3.17.0 kernel with Intel Core i7-3770K CPU at 3.5 GHz processor with hyperthreading and frequency scaling turned on and 32 gigabytes of RAM. I didn’t manage to make the CPU frequency constant, so results vary a little between runs. I used John the Ripper’s bcrypt benchmark started with following command as the test program for useful work:

    $ /usr/sbin/john --format=bcrypt -test

    I benchmarked 1 and 7 instances of John the Ripper for 9 iterations (9 * 5 = 45 seconds) with various amounts of non-benchmarking processes running at the same time. 1 benchmarking process should by default not have any cache contention resulting from other benchmarking processes happening and 7 benchmarking processes saturate all but 1 logical CPU core and can have cache and computational unit contention with each other.

    Non-bechmarking processes were divided into different categories and there were 0–7 times CPU cores instances of them started to make them fight for CPU time with benchmarking processes. The results of running different amounts of non-benchmarking processes with different scheduling policies can be found from tables 1, 2, 3, and 4. Different scheduling policies visible in those tables are:

    • default: default scheduling policy. Command started without any wrappers. Corresponds to SCHED_OTHER with nice value of 0.
    • chrt-batch: the default SCHED_BATCH scheduling policy. Command started with chrt --batch 0 prefix.
    • nice 10: SCHED_OTHER scheduling policy with the default nice value of 10. Command started with nice prefix.
    • nice 19: SCHED_OTHER scheduling policy with nice value of 19. Command started with nice -n 19 prefix. This should theoretically take around 5 % of CPU time out from the useful work.
    • nice 19 batch: SCHED_BATCH scheduling policy with nice value of 19. Command started with nice -n 19 chrt --batch 0 prefix.
    • sched idle: SCHED_IDLE scheduling policy. Command started with chrt --idle 0 prefix. This should theoretically take around 80 % of CPU time compared to nice level 19 out from useful work.

    The results in those tables reflect the relative average percentage of work done compared to the situation when there are no additional processes disturbing the benchmark. These only show the average value and do not show, for example, variance that could be useful to determine how close those values are of each other. You can download get the raw benchmarking data and the source code to generate and analyze this data from following links:

    CPU looper

    CPU looper applications consists purely of an application that causes CPU load. Its main purpose is just to test the scheduling policies without forcing CPU cache misses. Of course there will be context switches that can replace data cache entries related to process state and instruction cache entries for the process with another, but there won’t be any extra intentional cache flushing happening here.

    The actual CPU looper application consists code shown in figure 1 compiled without any optimizations:

    int main() {
        while (1) {}
        return 0;
    Figure 1: source code for CPU looper program.

    The results for starting 0–7 times CPU cores of CPU looper processes with the default priority and then running benchmark with 1 or 7 cores can be found from tables 1 and 2.

    1 worker 0 1 2 3 4 5 6 7
    default 100 64.6 36.4 25.5 18.2 15.4 12.5 11.0
    nice 0 vs. sched batch 100 66.5 37.0 24.0 16.3 14.8 12.2 10.9
    nice 0 vs. nice 10 100 93.3 90.2 82.7 75.4 75.2 75.0 74.9
    nice 0 vs. nice 19 100 92.9 90.4 91.0 91.8 89.8 89.4 90.9
    nice 0 vs. nice 19 batch 100 94.0 89.2 91.4 89.8 86.4 89.4 90.7
    nice 0 vs. sched idle 100 95.0 91.3 92.2 92.0 92.6 91.0 92.4
    nice 10 vs. nice 19 100 80.0 74.7 68.0 67.7 67.5 64.5 60.7
    nice 10 vs. nice 19 batch 100 92.5 83.8 75.9 75.0 74.3 74.1 69.8
    nice 10 vs. sched idle 100 89.8 89.7 90.9 89.0 87.8 87.4 87.2
    nice 19 vs. sched idle 100 77.9 69.7 66.9 66.3 64.6 63.4 58.1
    Table 1: the relative percentage of average work done for one benchmarking worker when there are 0–7 times the logical CPU cores of non-benchmarking CPU hogging jobs running with different scheduling policies.
    7 workers 0 1 2 3 4 5 6 7
    default 100 49.8 33.6 24.6 19.4 16.4 13.8 12.2
    nice 0 vs. sched batch 100 51.1 34.0 25.1 20.0 16.6 14.2 12.4
    nice 0 vs. nice 10 100 92.9 87.2 80.1 75.0 71.0 66.3 61.1
    nice 0 vs. nice 19 100 94.1 94.9 95.4 95.2 96.2 95.7 96.0
    nice 0 vs. nice 19 batch 100 95.7 95.2 95.7 95.5 95.2 94.6 95.4
    nice 0 vs. sched idle 100 96.3 95.7 96.2 95.6 96.7 95.4 96.7
    nice 10 vs. nice 19 100 93.6 84.9 78.0 71.4 65.7 60.4 56.2
    nice 10 vs. nice 19 batch 100 92.4 84.1 77.6 69.9 65.6 60.7 56.4
    nice 10 vs. sched idle 100 95.5 94.9 94.5 93.9 94.3 93.7 91.9
    nice 19 vs. sched idle 100 91.4 80.9 71.4 64.3 58.3 52.5 48.5
    Table 2: the relative percentage of average work done for seven benchmarking workers when there are 0–7 times the logical CPU cores of non-benchmarking CPU hogging jobs running with different scheduling policies.

    Tables 1 and 2 show that the effect is not consistent between different scheduling policies and when there is 1 benchmarking worker running it suffers more from the lower priority processes than what happens when there are 7 benchmarking workers running. But with higher priority scheduling policies for background processes the average amount of work done for 1 process remains higher for light loads than with 7 worker processes. These discrepancies can be probably explained by how hyperthreading works by sharing the same physical CPU cores and by caching issues.

    Memory looper

    Processors nowadays have multiple levels of cache and no cache isolation and memory access from one core can wreak havoc for other cores with just moving data from and to memory. So I wanted to see what happens with different scheduling policies when running multiple instances of a simple program that run on the background when John the Ripper is trying to do some real work.

    #include <stdlib.h>
    int main() {
        // 2 * 20 megabytes should be enough to spill all caches.
        const size_t data_size = 20000000;
        const size_t items = data_size / sizeof(long);
        long* area_source = calloc(items, sizeof(long));
        long* area_target = calloc(items, sizeof(long));
        while (1) {
            // Just do some memory operations that access the whole memory area.
            for (size_t i = 0; i < items; i++) {
                area_source[i] = area_target[i] + 1;
            for (size_t i = 0; i < items; i++) {
                area_target[i] ^= area_source[i];
        return 0;
    Figure 2: source code for memory bandwidth hogging program.

    Program shown in figure 2 is compiled without any optimizations and it basically reads one word from memory, adds 1 to it and stores the result to some other memory area and then XORs the read memory area with the written memory area. So it basically does not do anything useful, but reads and writes a lot of data into memory every time the program gets some execution time.

    Tables 3 and 4 show the relative effect on John the Ripper benchmarking program when there are various amounts of the program shown in figure 2 running at the same time. If you compare these numbers to values shown in tables 1 and 2 a program that only uses CPU cycles is running, the numbers for useful work can be in some cases around 10 percentage points lower. So there is apparently some cache contention ongoing with this benchmarking program and the effect of lower priority scheduling policies is not the same that could be theoretically expected just from the allocated time slices.

    1 worker 0 1 2 3 4 5 6 7
    default priority 100 61.7 32.6 21.8 16.6 13.3 11.2 9.8
    nice 0 vs. sched batch 100 60.2 32.6 22.7 16.2 12.9 10.9 10.2
    nice 0 vs. nice 10 100 85.2 82.1 75.8 70.8 69.6 70.5 68.1
    nice 0 vs. nice 19 100 85.4 83.0 79.7 81.7 78.2 81.9 84.7
    nice 0 vs. nice 19 batch 100 83.8 81.2 80.5 83.4 80.4 79.4 84.3
    nice 0 vs. sched idle 100 82.9 80.9 81.2 82.1 80.6 82.3 81.8
    nice 10 vs. nice 19 100 80.0 74.7 68.0 67.7 67.5 64.5 60.7
    nice 10 vs. nice 19 batch 100 80.8 74.1 67.6 67.1 67.9 65.5 63.3
    nice 10 vs. sched idle 100 83.9 81.6 79.2 77.0 75.9 76.8 77.1
    nice 19 vs. sched idle 100 77.9 69.7 66.9 66.3 64.6 63.4 58.1
    Table 3: the relative percentage of average work done for one benchmarking worker when there are 0–7 times the logical CPU cores of non-benchmarking CPU and memory bandwidth hogging jobs running with different scheduling policies.
    7 workers 0 1 2 3 4 5 6 7
    default 100 48.4 32.6 24.5 19.7 16.3 14.2 12.3
    nice 0 vs. sched batch 100 48.6 32.2 23.7 19.4 16.2 14.0 12.6
    nice 0 vs. nice 10 100 89.3 81.5 74.6 69.2 64.6 60.4 56.3
    nice 0 vs. nice 19 100 92.0 90.5 90.4 91.2 91.3 90.6 90.5
    nice 0 vs. nice 19 batch 100 91.5 91.8 92.5 91.8 91.9 92.1 91.9
    nice 0 vs. sched idle 100 92.1 91.9 91.9 91.5 92.0 92.4 92.1
    nice 10 vs. nice 19 100 85.4 77.3 70.4 63.3 58.3 54.2 50.3
    nice 10 vs. nice 19 batch 100 86.8 77.7 70.0 63.4 58.8 54.4 50.6
    nice 10 vs. sched idle 100 90.6 89.0 88.9 88.9 88.0 87.4 86.3
    nice 19 vs. sched idle 100 82.8 72.9 62.9 57.3 52.2 48.2 44.1
    Table 4: the relative percentage of average work done for seven benchmarking workers when there are 0–7 times the logical CPU cores of non-benchmarking CPU and memory bandwidth hogging jobs running with different scheduling policies.


    These are just results of one specific benchmark with two specific workloads on one specific machine with 4 hyperthreaded CPU cores. They should anyways give you some kind of an idea how different CPU scheduling policies under Linux affect the load that you do not want to disturb when your machine is more or less overloaded. Clearly when you are reading and writing data from and to memory, the lower priority background process has bigger impact on the actual worker process than what the allocated time slices would predict. But, unsurprisingly, a single user can have quite a large impact on how much their long running CPU hogging processes affect rest of the machine.

    I did not do any investigation how much work those processes that are supposed to disturb the benchmarking get done. In this case SCHED_BATCH with nice value of 19 could probably be the best scheduling policy if we want to get the most work done and at the same time avoid disturbing other users. Otherwise, it looks like the SCHED_IDLE policy, taken into use with chrt --idle 0 command, that is promised to have the lowest impact on other processes has the lowest impact. Especially when considering processes started with lower nice values than the default one.

  • C, for(), and signed integer overflow

    Signed integer overflow is undefined behavior in C language. This enables compilers to do all kinds of optimization tricks that can lead to surprising and unexpected behavior between different compilers, compiler versions, and optimization levels.

    I encountered something interesting when I tried different versions of GCC and clang with following piece of code:

    #include <limits.h>
    #include <stdio.h>
    int main(void)
        int iterations = 0;
        for (int i = INT_MAX - 2; i > 0; i++) {
        printf("%d\n", iterations);
        return 0;

    When compiling this with GCC 4.7, GCC 4.9, and clang 3.4 and compilation options for x86_64 architecture, we get following output variations:

    $ gcc-4.9 --std=c11 -O0 tst.c && ./a.out
    $ gcc-4.9 --std=c11 -O3 tst.c && ./a.out
    $ clang-3.4 --std=c11 -O3 tst.c && ./a.out
    $ gcc-4.7 --std=c11 -O3 tst.c && ./a.out asdf
    <infinite loop...></code>

    So let’s analyze these results. The loop for (int i = INT_MAX - 2; i > 0; i++) { should only iterate 3 times until variable i wraps around to negative value due to two’s complement arithmetic. Therefore we would expect iterations variable to be 3. But this only happens when optimizations are turned off. So let’s see why these outputs result in these different values. Is there some clever loop unrolling that completely breaks here or what?

    GCC 4.9 -O0

    There is nothing really interesting going in the assembly code produced by GCC without optimizations. It looks like what you would expect from this kind of loop for() on the lower level. So the main interest is in optimizations that different compilers do.

    GCC 4.9 -O3 and clang 3.4 -O3

    GCC 4.9 and clang 3.4 both seem to result in similar machine code. They both optimize the for() loop out and replace the result of it with a (different) constant. GCC 4.9 is just creating code that assigns value 2 to the second parameter for printf() function (see System V Application Binary Interface AMD64 Architecture Processor Supplement section 3.2.3 about parameter passing). Similarly clang 3.4 assigns value 4 to the second parameter for printf().

    GCC 4.9 -O3 assembler code for main()

    Here we can see how value 2 gets passed as iterations variable:

    Dump of assembler code for function main:
    5 {
    0x00000000004003f0 <+0>: sub $0x8,%rsp
    6     int iterations = 0;
    7     for (int i = INT_MAX - 2; i > 0; i++) {
    8         iterations++;
    9     }
    10     printf("%d\n", iterations);
    0x00000000004003f4 <+4>: mov $0x2,%esi
    0x00000000004003f9 <+9>: mov $0x400594,%edi
    0x00000000004003fe <+14>: xor %eax,%eax
    0x0000000000400400 <+16>: callq 0x4003c0 <printf@plt>
    11     return 0;
    12 }
    0x0000000000400405 <+21>: xor %eax,%eax
    0x0000000000400407 <+23>: add $0x8,%rsp
    0x000000000040040b <+27>: retq

    clang 3.4 -O3 assembler code for main()

    Here we can see how value 4 gets passed as iterations variable:

    Dump of assembler code for function main:
    5 {
    6     int iterations = 0;
    7     for (int i = INT_MAX - 2; i > 0; i++) {
    8         iterations++;
    9     }
    10     printf("%d\n", iterations);
    0x00000000004004e0 <+0>: push %rax
    0x00000000004004e1 <+1>: mov $0x400584,%edi
    0x00000000004004e6 <+6>: mov $0x4,%esi
    0x00000000004004eb <+11>: xor %eax,%eax
    0x00000000004004ed <+13>: callq 0x4003b0 <printf@plt>
    0x00000000004004f2 <+18>: xor %eax,%eax
    11     return 0;
    0x00000000004004f4 <+20>: pop %rdx
    0x00000000004004f5 <+21>: retq

    GCC 4.7 -O3

    GCC 4.7 has an interesting result from optimizer that is completely different that could be expected to happen. It basically replaces main() function with just one instruction that does nothing else than jumps at the same address that it resides in and thus creates an infinite loop:

    Dump of assembler code for function main:
    5 {
    0x0000000000400400 <+0>: jmp 0x400400 <main>


    This is an excellent example of nasal demons happening when there is undefined behavior ongoing and compilers have free hands to do anything they want. Especially as GCC 4.7 produces really harmful machine code that instead of producing unexpected value as a result, it just makes the program unusable. Other compilers and other architectures could result in different behavior, but let that be something for the future investigation.

Older articles • Page 1/2

subscribe via RSS